Data Exercise

Author

Rachel Robertson

Published

February 1, 2024

Data Exercise

Option 2

For this data exercise, I will produce a synthetic data set and then use plots and tables to explore the synthetic data. Lastly, I will use a linear model to test the associations built within the data and use a logistic model to test if it is able to capture the same patterns within the synthetic data.

Generating Synthetic Data

I will start by opening the packages that I need to create multiple types of data.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(purrr)
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(ggplot2)
library(here)
here() starts at C:/Users/rrsta/OneDrive/Desktop/MADAcourseexercises/MADAcourserepo
library(readr)
library(broom)
library(parsnip)

I will then set a seed, which makes a random string reproducible when ran on another system. I will also give R a number of values to produce.

set.seed(3) #Set seed to 3
n_patients <- 150 #Make 150 the number of patients

To create variables, I will begin with an empty data frame. In this example I will use the example of a vaccine trial. This data is not based on a real vaccine trial, but it will have variables that one would find in a clinical trial.

vac_data <- data.frame(
  patient_id = numeric(n_patients),
  age = numeric(n_patients),
  gender = character(n_patients),
  vac_group = character(n_patients),
  vac_date = lubridate::as_date(character(n_patients)),
  disease = integer(n_patients),
  antibody = numeric(n_patients),
  adverse_event = integer(n_patients)
) #Set a blank data frame called vac_data with the variables, patient_id, age, gender, vac_group, vac_date, disease, antibody, and adverse_event

Next, I will begin to generate the data for each of the aforementionned variables in the blank data frame.

vac_data$patient_id <- 1:n_patients #Set random values for patient ID
vac_data$age <- round(rnorm(n_patients, mean = 25, sd = 5), 1) #Set age to be normally distributed with a mean of 25 and sd of 5
vac_data$gender <- purrr::map_chr(sample(c("male", "female", "other"), n_patients, replace= TRUE), as.character) #Set gender to be male, female, or other
vac_data$vac_group <- purrr::map_chr(sample(c("a", "b", "none"), n_patients, replace= TRUE), as.character) #Set the vaccine group to a theoretical dose a (a), dose value b (b), or to no vaccine (none)
vac_data$vac_date <- lubridate::as_date(sample(seq(from = lubridate::as_date("2023-01-01"), to = lubridate::as_date("2023-12-31"), by = "days"), n_patients, replace = TRUE)) #Set the date values to be randomly assigned by day between the first day and last day of the year 2023
vac_data$disease[vac_data$vac_group == "a"] <- purrr::map_int(sample(0:1, sum(vac_data$vac_group == "a"), replace = TRUE, prob = c(0.9, 0.1)), as.integer) #Set disease as 1 and no disease as 0, with a 90% vaccine efficacy for vaccine a
vac_data$disease[vac_data$vac_group == "b"] <- purrr::map_int(sample(0:1, sum(vac_data$vac_group == "b"), replace = TRUE, prob = c(0.85, 0.15)), as.integer) #Set vaccine b efficacy to be 85%
vac_data$disease[vac_data$vac_group == "none"] <- purrr::map_int(sample(0:1, sum(vac_data$vac_group == "none"), replace = TRUE, prob = c(0.5, 0.5)), as.integer) #Set the disease outcome to be 50% chances that one will get the disease (1) or not (0) without recieving a vaccine (none)

I initially coded it with the number of items being n_patients, but I received an error because there were less “a”, “b”, and “none” values than the number of patients. To fix this, I asked ChatGPT to show me how not to create extra values in this scenario and it output the sum() function. Each variable would have a total number of values that equals the sum number of vaccine types given. This will not allow it to exceed the maximum and values can be replaced. Additionally, because the disease outcome (0=no, 1=yes) should correspond to the vaccine (a,b, or none), I assigned each vaccine group was assigned an efficacy probability.

When adding an antibody titer variable (antibody) we will assume that the antibody titer is normally distributed in the patients and that vaccine efficacy is positively correlated with antibody titer. This means that vaccine a will have a higher mean than vaccine b. None have very low levels with a high sd to reflect that some individuals may ahev had past exposures to the disease.

vac_data$antibody [vac_data$vac_group == "a"] <- round(rnorm(sum(vac_data$vac_group == "a"), mean = 40, sd = 5), 1) #The first vaccine has the highest probability of preventing disease so it has the highest mean antibody titer of 1:40 and a small standard deviation
vac_data$antibody [vac_data$vac_group == "b"] <- round(rnorm(sum(vac_data$vac_group == "b"), mean = 32, sd = 5), 1) #The second vaccine has the second highest probability of preventing disease so it has the second highest mean antibody titer of 1:32
vac_data$antibody [vac_data$vac_group == "none"] <- round(rnorm(sum(vac_data$vac_group == "none"), mean = 10, sd = 10), 1) #No vaccine has a mean of 10 with a high standard deviation, of 10, to account for past exposures at different time points

Lastly, adverse events will only occur if the individual has recieved the vaccine. Other conditions seen in the “none” group will not be resorded as adverse events because it would jsut be the placebo effect. Because of this the none group will always have 0 for adverse events.

vac_data$adverse_event [vac_data$vac_group == "a"] <- purrr::map_int(sample(0:1, sum(vac_data$vac_group == "a"), replace = TRUE, prob = c(0.6, 0.4)), as.integer) #Setting adverse event probability for vaccine a to 40%
vac_data$adverse_event [vac_data$vac_group == "b"] <- purrr::map_int(sample(0:1, sum(vac_data$vac_group == "b"), replace = TRUE, prob = c(0.8, 0.2)), as.integer) #Setting adverse event probability for vaccine b to 20%
vac_data$adverse_event [vac_data$vac_group == "none"] <- purrr::map_int(sample(0:1, sum(vac_data$vac_group == "none"), replace = TRUE, prob = c(1.0, 0.0)), as.integer)  #Setting adverse event probability for no vaccine to 0%

I will now save the synthetic dataset as a cvs file

write.csv(vac_data, here("data-exercise","vac_data.csv"), row.names = FALSE)

Checking the Synthetic Data Set

I will now check the synthetic data set with summary stats and then examine the structure.

summary(vac_data) #checking summary stats
   patient_id          age           gender           vac_group        
 Min.   :  1.00   Min.   :13.00   Length:150         Length:150        
 1st Qu.: 38.25   1st Qu.:21.32   Class :character   Class :character  
 Median : 75.50   Median :25.15   Mode  :character   Mode  :character  
 Mean   : 75.50   Mean   :24.82                                        
 3rd Qu.:112.75   3rd Qu.:28.48                                        
 Max.   :150.00   Max.   :34.70                                        
    vac_date             disease          antibody      adverse_event
 Min.   :2023-01-01   Min.   :0.0000   Min.   :-11.20   Min.   :0.0  
 1st Qu.:2023-04-03   1st Qu.:0.0000   1st Qu.: 22.32   1st Qu.:0.0  
 Median :2023-06-22   Median :0.0000   Median : 32.00   Median :0.0  
 Mean   :2023-06-29   Mean   :0.2467   Mean   : 28.52   Mean   :0.2  
 3rd Qu.:2023-09-30   3rd Qu.:0.0000   3rd Qu.: 37.77   3rd Qu.:0.0  
 Max.   :2023-12-26   Max.   :1.0000   Max.   : 48.00   Max.   :1.0  
str(vac_data) #checking data structure
'data.frame':   150 obs. of  8 variables:
 $ patient_id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age          : num  20.2 23.5 26.3 19.2 26 25.2 25.4 30.6 18.9 31.3 ...
 $ gender       : chr  "female" "male" "male" "other" ...
 $ vac_group    : chr  "a" "none" "b" "none" ...
 $ vac_date     : Date, format: "2023-05-05" "2023-05-15" ...
 $ disease      : int  0 0 0 0 0 1 0 0 0 0 ...
 $ antibody     : num  44.2 20.4 36.7 10.9 35.6 33.2 31.9 39.6 38.1 44.4 ...
 $ adverse_event: int  1 0 0 0 1 1 1 0 0 1 ...

I now use a table to check the number if individuals in each vaccine group that had the disease outcome.

table(vac_data$disease,vac_data$vac_group)#Create table with disease versus vaccine group
   
     a  b none
  0 42 52   19
  1  2 12   23

I will now check the distributuion of antibody titer values depending on vaccine group by using a box plot.

ggplot(vac_data, aes(x = vac_group, y = antibody)) +
  geom_boxplot() +
  labs(x = "Vaccine Administerred", y = "Antibody Titer (1:y)") +
  theme_bw()

The data should not include values of negative values, so this does not reflect a real-life scenerio. However, I was unable to add a minimum value to the normally distributed, generated data. So these values must be removed through data-cleaning steps before analysis. ## Cleaning synthetic data

vac_data2 <- vac_data %>%
  filter(antibody >= 0) #Removing zero and negative data
summary(vac_data2) #Checking total number of values and that negative values of antibody were removed
   patient_id          age           gender           vac_group        
 Min.   :  1.00   Min.   :13.00   Length:146         Length:146        
 1st Qu.: 39.25   1st Qu.:21.32   Class :character   Class :character  
 Median : 76.00   Median :25.15   Mode  :character   Mode  :character  
 Mean   : 75.84   Mean   :24.80                                        
 3rd Qu.:112.75   3rd Qu.:28.40                                        
 Max.   :150.00   Max.   :34.70                                        
    vac_date             disease         antibody     adverse_event   
 Min.   :2023-01-01   Min.   :0.000   Min.   : 0.10   Min.   :0.0000  
 1st Qu.:2023-04-02   1st Qu.:0.000   1st Qu.:24.02   1st Qu.:0.0000  
 Median :2023-06-19   Median :0.000   Median :32.70   Median :0.0000  
 Mean   :2023-06-27   Mean   :0.226   Mean   :29.44   Mean   :0.2055  
 3rd Qu.:2023-09-29   3rd Qu.:0.000   3rd Qu.:38.02   3rd Qu.:0.0000  
 Max.   :2023-12-26   Max.   :1.000   Max.   :48.00   Max.   :1.0000  

Three negative values were removed from the antibody column, which reduced the sample. The remaining sample will be used to check for the trends that I built into the data set.