── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ psych::%+%() masks ggplot2::%+%()
✖ psych::alpha() masks ggplot2::alpha(), scales::alpha()
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ stringr::fixed() masks recipes::fixed()
✖ dplyr::lag() masks stats::lag()
✖ readr::spec() masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot) # creating correlation plots
corrplot 0.92 loaded
Next, I will read the csv file to import the Mavoglurant data into R.
trial_data <- read_csv(here("fitting-exercise", "Mavoglurant_A2121_nmpk.csv"),
                       show_col_types = FALSE) # read the csv from the relative path to the fitting-exercise folder, without printing column types
str(trial_data) # examine the structure of the trial_data data frame
I want to create a line plot showing DV over time for each individual in the data set, stratified by total dose amount.
ggplot(trial_data, aes(x = TIME, y = DV, group = ID, color = DOSE)) +
  geom_line() +
  labs(title = "Mavoglurant DV over time by individual, stratified by dose",
       y = "DV", x = "Time in hours")
Because an OCC (occasion) value of 2 reflects a second dose, and we are only examining cases given a single dose, we will drop all rows where the factor variable OCC equals '2'.
trial_data <- droplevels(trial_data[trial_data$OCC != '2', ]) # drop rows where OCC is '2' and remove the unused factor level
str(trial_data) # examine data frame structure
levels(trial_data$OCC) # examine levels of the OCC variable
[1] "1"
All factor levels of OCC except '1' were dropped. Next, we will drop all rows where the numeric variable TIME equals 0.
trial_data2 <- trial_data[trial_data$TIME != 0, ] # exclude all rows where TIME has a value of zero
summary(trial_data2$TIME) # check the minimum value of TIME
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.200 0.583 3.200 6.930 8.250 48.217
Now we will use dplyr to find the sum of the DV values for each individual (ID).
sum_DV <- trial_data2 %>% # create data frame for the summed DV
  group_by(ID) %>% # group by individual
  summarise(Y = sum(DV)) # make a summary column of DV using dplyr
str(sum_DV) # check structure of sum_DV
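Note that the merge below uses a data frame called trial_data3, which is not created in the chunks shown. Presumably it holds one row per individual with the covariates; a hypothetical reconstruction, assuming it kept the TIME == 0 rows of the original data, would be:

trial_data3 <- trial_data[trial_data$TIME == 0, ] # assumed: keep the dosing (TIME == 0) rows, one per individual, carrying the covariates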
trial_data4 <- merge(x = trial_data3, y = sum_DV, by = "ID", all = TRUE) # outer join to preserve all rows, joining by the ID variable
str(trial_data4) # check structure of the joined data frame
Because I have already converted the variables SEX and RACE to factors, I will not repeat this step. I will drop the unnecessary columns so that my final data frame only includes the variables Y, DOSE, AGE, SEX, RACE, WT, and HT.
trial_data4 <- trial_data4 %>%
  select("Y", "DOSE", "AGE", "SEX", "RACE", "WT", "HT") # select only the aforementioned variables
summary(trial_data4) # summarize the final data frame
Y DOSE AGE SEX RACE
Min. : 826.4 Min. :25.00 Min. :18.00 1:104 1 :74
1st Qu.:1700.5 1st Qu.:25.00 1st Qu.:26.00 2: 16 2 :36
Median :2349.1 Median :37.50 Median :31.00 7 : 2
Mean :2445.4 Mean :36.46 Mean :33.00 88: 8
3rd Qu.:3050.2 3rd Qu.:50.00 3rd Qu.:40.25
Max. :5606.6 Max. :50.00 Max. :50.00
WT HT
Min. : 56.60 Min. :1.520
1st Qu.: 73.17 1st Qu.:1.700
Median : 82.10 Median :1.770
Mean : 82.55 Mean :1.759
3rd Qu.: 90.10 3rd Qu.:1.813
Max. :115.30 Max. :1.930
I will now check for any missing values.
which(is.na(trial_data4)) # find location of missing values
integer(0)
sum(is.na(trial_data4)) # count total missing values
[1] 0
path <- here("./fitting-exercise/cleandata.rds") # create path using the here package
saveRDS(trial_data4, file = path) # save as an rds at the specified path
No missing values were found, so I will stop cleaning the data and move on to exploratory data analysis. I saved the clean data as an rds called cleandata.rds.

## Exploratory Data Analysis

### Summary Table

I will create a summary table of the data frame in R by using the describe() function from the "psych" package.
table1 <- describe(trial_data4[ , c("Y", "DOSE", "AGE", "WT", "HT")],
                   fast = TRUE) # summary stats of the numeric columns of the cleaned data
print(table1)
vars n mean sd min max range se
Y 1 120 2445.41 961.64 826.43 5606.58 4780.15 87.78
DOSE 2 120 36.46 11.86 25.00 50.00 25.00 1.08
AGE 3 120 33.00 8.98 18.00 50.00 32.00 0.82
WT 4 120 82.55 12.52 56.60 115.30 58.70 1.14
HT 5 120 1.76 0.09 1.52 1.93 0.41 0.01
In this summary table, we see the numeric variables. They all have a relatively small SE and reasonable min and max values. The range of the summed DV (Y) is large, but this is likely because it is the sum of the concentration values for each individual. Now I will bring in the factor variables: I want to summarize Y, DOSE, and AGE stratified by SEX and RACE. I will be using the tidyverse and gtsummary packages (the latter of which I found on Stack Overflow) to produce this table.
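The chunk that produced table2 is not shown above; a minimal sketch of how such a grouped summary could be built with dplyr (the exact call used is an assumption, reconstructed from the output columns below):

table2 <- trial_data4 %>%
  group_by(SEX, RACE) %>% # stratify by the two factor variables
  summarise(mean_Y = mean(Y), mean_DOSE = mean(DOSE), mean_AGE = mean(AGE),
            sd_Y = sd(Y), sd_DOSE = sd(DOSE), sd_AGE = sd(AGE)) # means and SDs per stratum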
`summarise()` has grouped output by 'SEX'. You can override using the `.groups`
argument.
print(table2)
# A tibble: 8 × 8
# Groups: SEX [2]
SEX RACE mean_Y mean_DOSE mean_AGE sd_Y sd_DOSE sd_AGE
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2468. 38.1 32.9 995. 11.8 9.14
2 1 2 2497. 35.6 31.0 910. 12.2 7.91
3 1 7 1491. 25 41 NA NA NA
4 1 88 2612. 35.7 24.6 974. 13.4 3.64
5 2 1 2239. 34.1 40.7 1099. 11.3 6.08
6 2 2 2087. 25 39 1010. 0 10.1
7 2 7 2790. 50 49 NA NA NA
8 2 88 2093. 25 39 NA NA NA
The means of Y and dosage seem to be similar and within the ranges specified above, with a couple of exceptions. SEX(1) with RACE(7) has a lower mean Y. There is also a high mean DOSE for SEX(2) with RACE(7); this value of 50 is the max, so all of the values in that cell must be 50 (the NA standard deviations suggest those cells hold a single observation). The mean age also varies quite a bit across race and sex. It is also strange that the SD of DOSE for SEX(2) RACE(2) is 0.

### Data Distribution

#### Scatter Plots

To examine these strange summary stats in more detail, we will visualize the data distribution using ggplot2. First, I will produce some scatterplots to see if there are any associations or if the data are randomly distributed.
# Y versus AGE scatter plot by SEX
plot1 <- ggplot(trial_data4, aes(x = AGE, y = Y, color = SEX)) +
  geom_point() +
  labs(x = "AGE", y = "Y", title = "Scatter Plot of Y versus Age by Sex") +
  theme_minimal()
print(plot1)
# Y versus AGE scatter plot by RACE
plot2 <- ggplot(trial_data4, aes(x = AGE, y = Y, color = RACE)) +
  geom_point() +
  labs(x = "AGE", y = "Y", title = "Scatter Plot of Y versus Age by Race") +
  theme_minimal()
print(plot2)
# WT versus HT scatter plot by Sex
plot3 <- ggplot(trial_data4, aes(x = HT, y = WT, color = SEX)) +
  geom_point() +
  labs(x = "Height", y = "Weight", title = "Scatter Plot of Weight versus Height by Sex") +
  theme_minimal()
print(plot3)
# WT versus HT scatter plot by Race
plot4 <- ggplot(trial_data4, aes(x = HT, y = WT, color = RACE)) +
  geom_point() +
  labs(x = "Height", y = "Weight", title = "Scatter Plot of Weight versus Height by Race") +
  theme_minimal()
print(plot4)
The distributions of Age and Y seem to be fairly random, with the exception of one extremely high outlier in the Y variable. I will leave this outlier in, as there is no indication of whether it holds importance (we don't know the definition of many of these variables). For sex, ages look higher for sex 2, but the Y values look random. In general, Race seems fairly random for Age and Y, except that Race 7 also seems to be predominantly higher ages. Weight and height show a slight trend, as they normally do, without any extreme outliers. The sex distribution of height and weight also makes sense, as females tend to be shorter and weigh less than males. For this reason, we might guess that sex 2 is female, but we should not make assumptions. Races 1, 2, and 7 seem to be randomly distributed for weight and height. However, race 88 seems to be on the lower end for both height and weight, which may indicate some correlation between race and sex.

#### Box Plots

Now I will use box plots to examine the distributions of Dose and Y stratified by the two factor variables, Race and Sex.
# Distribution of Y stratified by SEX and RACE
plot6 <- trial_data4 %>%
  ggplot(aes(x = RACE, y = Y, color = SEX)) +
  geom_boxplot() +
  labs(x = "RACE", y = "Drug concentration",
       title = "Race and drug concentration distribution, stratified by sex") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) # rotate x-axis text by 90 degrees
plot(plot6)
# Distribution of DOSE stratified by SEX and RACE
plot7 <- trial_data4 %>%
  ggplot(aes(x = RACE, y = DOSE, color = SEX)) +
  geom_boxplot() +
  labs(x = "RACE", y = "Dose", # y-axis shows DOSE here, not concentration
       title = "Race and Dose distribution, stratified by sex") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) # rotate x-axis text by 90 degrees
plot(plot7)
There are two outliers for Y, in race 1. Each outlier is of a different sex, so sex might not drive the outliers. There are only 2 values for race 7, making their distribution look wonky. In general, sex 1 has a higher Y and dose than sex 2. No outliers in dose were seen, as there are only 3 possible values for dose. Again, because there are only two observations in race 7, its distribution is wonky. I will choose to keep Race 7, because I do not know whether it represents an important minority group, despite there being only 2 values.

### Correlation Plots

Now I will create a correlation plot of the numeric variables. To do this, I asked ChatGPT to provide me with the functions used to make correlation plots for numeric variables.
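The plotting chunk itself is not shown above; a minimal sketch using the corrplot package loaded earlier (the exact call is an assumption):

num_vars <- trial_data4[ , c("Y", "DOSE", "AGE", "WT", "HT")] # numeric columns only
corr_mat <- cor(num_vars) # pairwise Pearson correlations
corrplot(corr_mat) # visualize the correlation matrix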
From the correlation plot, it seems that weight and height, and Y and Dose, are the most highly correlated pairs. Age and height are moderately negatively correlated, and weight and Y are slightly negatively correlated.

## Model Fitting

First, I will fit a linear model to the outcome, Y, using the predictor DOSE.
set.seed(333)
data_split <- trial_data4 %>%
  initial_split(prop = 3/4) # split the data so that 3/4 is in the training set and 1/4 in the testing set
train_data <- training(data_split) # make the training data object
test_data <- testing(data_split) # make the testing data object
trial_linear_fit_1 <- recipe(Y ~ DOSE, data = train_data)
lm_mod <- linear_reg() %>% # make an object called lm_mod for the linear regression model
  set_engine("lm") %>% # set the linear regression engine to lm
  set_mode("regression") # set to regression mode
trials_workflow1 <- workflow() %>% # create a workflow
  add_model(lm_mod) %>% # apply the linear regression model
  add_recipe(trial_linear_fit_1) # then, apply the recipe
lm_fit <- lm_mod %>%
  fit(Y ~ DOSE, data = trial_data4) # fit Y to DOSE
tidy(lm_fit) # produce a tibble for the linear regression fit
The intercept has a relatively large standard error, while the DOSE estimate has a more reasonable SE. This linear regression reflects a positive trend between dose and Y with a slope of 58.2. I will now use this model to make predictions, which will allow me to assess its predictive performance using the RMSE and R-squared metrics.
trial_fit1 <- trials_workflow1 %>%
  fit(data = train_data) # fit the model to the training data using the aforementioned workflow
predict(trial_fit1, test_data) # predict the outcome Y
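The metrics chunk below uses a data frame called trial_aug that is not created in the chunks shown; presumably an augment step along these lines (an assumption) produced it:

trial_aug <- augment(trial_fit1, test_data) # assumed: attach the .pred column to the test data for metric evaluation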
eval_metrics <- metric_set(rmse, rsq) # use yardstick to bundle multiple regression metrics (RMSE and R-squared)
eval_metrics(data = trial_aug, truth = Y, estimate = .pred) %>%
  select(-2) # evaluate RMSE and R-squared based on the results
The RMSE is very large and the R-squared is close to 0.5, which shows that this model is not a very good fit for the data. Next, I will fit a linear model to the outcome, Y, using all of the predictors.
trial_linear_fit_2 <- recipe(Y ~ ., data = train_data)
lm_mod2 <- linear_reg() %>% # make an object called lm_mod2 for the linear regression model
  set_engine("lm") %>% # set the linear regression engine to lm
  set_mode("regression") # set to regression mode
trials_workflow2 <- workflow() %>% # create a workflow
  add_model(lm_mod2) %>% # apply the linear regression model
  add_recipe(trial_linear_fit_2) # then, apply the recipe
lm_fit2 <- lm_mod2 %>%
  fit(Y ~ ., data = trial_data4) # fit Y to all of the predictors
tidy(lm_fit2) # produce a tibble for the linear regression fit
The standard errors for the intercept, SEX2, RACE2, RACE7, and HT are extremely large. This may be partly due to the large spread of the Y (outcome) variable. DOSE, AGE, and RACE2 have a positive association with Y, while SEX2, RACE7, RACE88, WT, and HT have a negative association with Y.
trial_fit2 <- trials_workflow2 %>%
  fit(data = train_data) # fit the model to the training data using the aforementioned workflow
predict(trial_fit2, test_data) # predict the outcome Y
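As before, the metrics chunk assumes an augmented data frame that is not created in the chunks shown; presumably:

trial_aug <- augment(trial_fit2, test_data) # assumed: attach .pred from the full model to the test data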
eval_metrics <- metric_set(rmse, rsq) # use yardstick to bundle multiple regression metrics (RMSE and R-squared)
eval_metrics(data = trial_aug, truth = Y, estimate = .pred) %>%
  select(-2) # evaluate RMSE and R-squared based on the results
The R-squared value for the model including all of the predictors is slightly better than for the previous model (at around 0.6), but still not good. The RMSE has not changed much, reflecting a large impact of outliers on the error of the predictions from the linear model.
Now we will make a model with SEX to practice fitting with a categorical outcome variable. We will fit a logistic model to the SEX outcome using the predictor, DOSE.
trial_logreg_rec <- recipe(SEX ~ DOSE, data = train_data) %>% # create recipe for the logistic regression model using SEX as the outcome
  prep() # prepare the recipe
lr_mod <- logistic_reg() %>% # make an object called lr_mod for the logistic regression model
  set_engine("glm") # set the logistic regression engine to glm
lr_trials_workflow <- workflow() %>% # create a workflow
  add_model(lr_mod) %>% # apply the logistic regression model
  add_recipe(trial_logreg_rec) # then, apply the recipe
lr_fit <- lr_mod %>%
  fit(SEX ~ DOSE, data = trial_data4) # fit SEX to DOSE
tidy(lr_fit) # produce a tibble for the logistic regression fit
There is a relatively high standard error for both the DOSE and intercept estimates. Both estimates are negative, meaning there is a negative association between dose and the odds of the second level of SEX.
We will now compute accuracy and ROC-AUC for this model.
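The evaluation chunk is not shown above; a minimal sketch of how accuracy and ROC-AUC could be computed with yardstick (the augment step and the use of the test split are assumptions):

lr_aug <- augment(lr_fit, test_data) # assumed: adds .pred_class and the class probability columns
accuracy(lr_aug, truth = SEX, estimate = .pred_class) # proportion of correct class predictions
roc_auc(lr_aug, truth = SEX, .pred_1) # AUC based on the probability of the first factor level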
The AUC metric for the first factor level is 0.62 and for the second is 0.38. They are both the same distance from 0.5, so both predictions are equally informative, but neither is great because neither is very close to 1 or 0. This means that this model doesn't have great predictive power either.
Next, we will fit a logistic model to SEX using all of the other predictors.
trial_logreg_rec2 <- recipe(SEX ~ ., data = train_data) %>% # create recipe for the logistic regression model using SEX as the outcome
  prep() # prepare the recipe
lr_mod2 <- logistic_reg() %>% # make an object called lr_mod2 for the logistic regression model
  set_engine("glm") # set the logistic regression engine to glm
lr_trials_workflow2 <- workflow() %>% # create a workflow
  add_model(lr_mod2) %>% # apply the logistic regression model
  add_recipe(trial_logreg_rec2) # then, apply the recipe
lr_fit2 <- lr_mod2 %>%
  fit(SEX ~ ., data = trial_data4) # fit SEX to all of the predictors
tidy(lr_fit2) # produce a tibble for the logistic regression fit
The intercept now has a much higher estimate, of about 60, and a smaller SE than in the previous logistic regression model. Almost all of the variables have a negative association with the SEX outcome, except RACE7. In addition, all of the predictor estimates have very high SEs, meaning they are highly variable.
We will also compute accuracy and ROC-AUC for this model.
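The corresponding chunk is again not shown; the same yardstick calls would presumably apply to the full logistic model (an assumption):

lr_aug2 <- augment(lr_fit2, test_data) # assumed: predictions from the all-predictor logistic model
accuracy(lr_aug2, truth = SEX, estimate = .pred_class)
roc_auc(lr_aug2, truth = SEX, .pred_1)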
With an AUC of 0.62 and 0.38, both predictions are equally good, but neither is great. An AUC close to 0.5 means the model behaves like a random model, whereas an AUC close to 0 or 1 would reflect good predictive power.
# Part 2

## Model Improvement
For this exercise, we will be using the data previously used to fit our linear and logistic models to create new models and improve them. We will be using tidymodels and the packages already loaded for the first part of this exercise. We will start by eliminating the RACE variable, because it has values (7 and 88) with too few observations, which might act as outliers.
trial_data5 <- subset(trial_data4, select = -RACE) # use subset() to drop the RACE column
str(trial_data5) # check structure of the new data frame
'data.frame': 120 obs. of 6 variables:
$ Y : num 2691 2639 2150 1789 3126 ...
$ DOSE: num 25 25 25 25 25 25 25 25 25 25 ...
$ AGE : num 42 24 31 46 41 27 23 20 23 28 ...
$ SEX : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 1 1 1 1 ...
$ WT : num 94.3 80.4 71.8 77.4 64.3 ...
$ HT : num 1.77 1.76 1.81 1.65 1.56 ...
Next, we will set a seed for reproducibility and split the data, with 75% in the training set and 25% in the testing set.
rngseed <- 1234
set.seed(rngseed) # set seed according to the variable rngseed (which equals 1234)
# Data Splitting
library(rsample)
trial5_split <- initial_split(trial_data5, prop = 3/4) # create the 3/4 split
train_data2 <- training(trial5_split) # create data frame for the training data
test_data2 <- testing(trial5_split) # create data frame for the testing data
Next, I will produce a linear model with just one predictor, DOSE, for the outcome Y. I will then produce a full model including all of the predictors: DOSE, AGE, SEX, WT, and HT.
# Single Predictor Model (Y v. DOSE)
single_rec <- recipe(Y ~ DOSE, data = train_data2) # specify the recipe with the formula for the single-predictor linear model
single_linreg <- linear_reg() %>% # make an object called single_linreg for the linear regression model
  set_engine("lm") %>% # set the linear regression engine to lm
  set_mode("regression") # set to regression mode
workflow_single <- workflow() %>% # create a workflow for the linear model called workflow_single
  add_model(single_linreg) %>% # apply the linear regression model
  add_recipe(single_rec) # then, apply the recipe
single_lmfit <- workflow_single %>%
  fit(data = train_data2) # fit Y to DOSE using ONLY the training data
tidy(single_lmfit) # produce a tibble for the linear regression fit
# Full model (Y v. all predictors)
full_rec <- recipe(Y ~ ., data = train_data2) # specify the recipe with the formula for the linear model with all the predictors
full_linreg <- linear_reg() %>% # make an object called full_linreg for the linear regression model
  set_engine("lm") %>% # use the same linear regression engine as the single-predictor model
  set_mode("regression") # and also set to regression mode
workflow_full <- workflow() %>% # create a workflow for the full linear model called workflow_full
  add_model(full_linreg) %>% # apply the new linear regression model
  add_recipe(full_rec) # then, apply the recipe
full_lmfit <- workflow_full %>% # add the workflow to the fit
  fit(data = train_data2) # fit Y to all predictors using ONLY the training data
tidy(full_lmfit) # produce a tibble to organize the resulting linear regression fit
Now I will compute the predictions made by both the single-predictor model and the full model. RMSE will be calculated from the observed versus the predicted values, as defined below. Lastly, we will compare these RMSE values to the RMSE of the null model as fitted using the null_model() function.
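For reference, the RMSE used throughout is the square root of the average squared difference between the observed outcomes $y_i$ and the model predictions $\hat{y}_i$:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$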
# Edited by Taylor Glass: I noticed that the reason your RMSE values were not as expected is that you were using test_data2 instead of train_data2. This entire process should use train_data2 only, so I made those changes. The RMSE values are now 948, 702, and 627 respectively.
# Make predictions with the single predictor linear model
predict(single_lmfit, train_data2) # predict the outcome Y from the single-predictor linear model (Y v. DOSE)
single_aug <- augment(single_lmfit, train_data2) # add columns of interest to a new data frame
single_aug %>%
  select("Y", "DOSE") # select columns to display
# Calculate the RMSE of the single predictor model
rmse_sing <- rmse(data = single_aug, truth = Y, estimate = .pred)
print(rmse_sing) # evaluate RMSE based on the results
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 703.
# Make predictions with the full predictor linear model
predict(full_lmfit, train_data2) # predict the outcome Y from the FULL linear model (Y v. all predictors)
full_aug <- augment(full_lmfit, train_data2) # add predictions to a new data frame
full_aug %>%
  select("Y", "DOSE", "AGE", "SEX", "WT", "HT") # select columns to display
# Calculate the RMSE of the full model
rmse_full <- rmse(data = full_aug, truth = Y, estimate = .pred)
print(rmse_full) # evaluate RMSE based on the results
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 627.
# Make predictions using a null model
blank_model <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression") # use the null_model() function and the parsnip engine to create a blank model
blank_rec <- recipe(Y ~ 1, data = train_data2) # produce recipe for the blank model
workflow_blank <- workflow() %>% # create a workflow for the null model called workflow_blank
  add_model(blank_model) %>% # apply the new null model
  add_recipe(blank_rec) # then, apply the recipe
blank_fit <- workflow_blank %>% # add the workflow to the fit
  fit(data = train_data2) # fit the null model using ONLY the training data
tidy(blank_fit) # produce a tibble for the resulting null model fit
# A tibble: 1 × 1
value
<dbl>
1 2509.
predict(blank_fit, train_data2) # predict the outcome Y from the null model
blank_aug <- augment(blank_fit, train_data2) # add predictions to a new data frame
blank_aug %>%
  select("Y", "DOSE", "AGE", "SEX", "WT", "HT") # select columns to display
names(blank_aug) # check the names for the columns
[1] "Y" "DOSE" "AGE" "SEX" "WT" "HT" ".pred"
# Calculate the RMSE of the blank model
rmse_blank <- rmse(data = blank_aug, truth = Y, estimate = .pred)
print(rmse_blank) # evaluate RMSE based on the results
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 948.
The RMSE values for the single-predictor model, full model, and null model are 703, 627, and 948, respectively. These are not equal to the values found by Dr. Handel, even though I did set the seed in the initial step. The RMSEs of the full and single-predictor models are similar, with a difference of 76, while the RMSE for the null model is much greater than both models including predictors: differences of 245 and 321 relative to the single-predictor and full models, respectively. This reflects that both models containing predictors are better at predicting the observed values than the null model is. In other words, both models including predictors are better than a random (mean-only) model, and capture some trend using linear regression.

### Using cross validation to evaluate linear models

I will now use cross validation to evaluate the linear models and null model that were produced. Cross validation is a re-sampling method that measures model performance while accounting for the bias-variance trade-off. Here, we will use 10-fold cross validation, which fits and evaluates the model 10 times (on different subsamples) and uses RMSE to check for consistency in the model predictions. I start by specifying the workflow for the cross validation. Then I will use the fit_resamples() function to perform the CV fit, and I will use RMSE to compare each model's CV fit.
# Single Predictor Linear Model Cross Validation
folds <- vfold_cv(trial_data5, v = 10)
single_cv_wf <- workflow() %>% # define an object for the cv workflow
  add_model(single_linreg) %>% # add in the previous linear model
  add_recipe(single_rec) # specify the formula for the first linear model
single_fit_cv <- single_cv_wf %>% # apply the cv workflow to an object for the cv fit
  fit_resamples(folds) # fit on the resampled folds
print(single_fit_cv)
# Examine rmse of the cv fit
rmse_values1 <- single_fit_cv %>%
  collect_metrics() %>% # get the metrics for all of the re-sampled folds
  filter(.metric == "rmse") # filter by the rmse values only
print(rmse_values1)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 663. 10 52.6 Preprocessor1_Model1
# Full Linear Model Cross Validation
full_cv_wf <- workflow() %>% # define an object for the cv workflow
  add_model(full_linreg) %>% # add in the previous linear model
  add_recipe(full_rec) # specify the formula for the full linear model
full_fit_cv <- full_cv_wf %>% # apply the cv workflow to an object for the cv fit
  fit_resamples(folds) # fit on the resampled folds
# Examine rmse of the cv fit
rmse_values2 <- full_fit_cv %>%
  collect_metrics() %>% # get the metrics for all of the re-sampled folds
  filter(.metric == "rmse") # filter by the rmse values only
print(rmse_values2)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 619. 10 46.7 Preprocessor1_Model1
# Null Model Cross Validation
null_cv_wf <- workflow() %>% # define an object for the cv workflow
  add_model(blank_model) %>% # add the blank model in
  add_recipe(blank_rec)
null_fit_cv <- null_cv_wf %>% # apply the cv workflow to an object for the cv fit
  fit_resamples(folds) # fit on the resampled folds
→ A | warning: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
There were issues with some computations A: x1
There were issues with some computations A: x10
# Examine rmse of the cv fit
rmse_values3 <- null_fit_cv %>%
  collect_metrics() %>% # get the metrics for all of the re-sampled folds
  filter(.metric == "rmse") # filter by the rmse values only
print(rmse_values3)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 948. 10 63.7 Preprocessor1_Model1
The mean CV RMSE for the single-predictor model (663) and the full model (619) are slightly lower than the corresponding training-set RMSEs (703 and 627), while the mean RMSE for the null model stayed essentially the same (948). The standard errors for all three models (52.6, 46.7, and 63.7) are of similar size, which demonstrates consistency across folds despite the varying numbers of predictors being used.
Lastly, I will run the code again with a different seed to see if the cross validation RMSE values change.
set.seed(333)
# Single Predictor Linear Model Cross Validation
folds <- vfold_cv(trial_data5, v = 10)
single_cv_wf <- workflow() %>% # define an object for the cv workflow
  add_model(single_linreg) %>% # add in the previous linear model
  add_recipe(single_rec) # specify the formula for the first linear model
single_fit_cv <- single_cv_wf %>% # apply the cv workflow to an object for the cv fit
  fit_resamples(folds) # fit on the resampled folds
print(single_fit_cv)
# Examine rmse of the cv fit
rmse_values1 <- single_fit_cv %>%
  collect_metrics() %>% # get the metrics for all of the re-sampled folds
  filter(.metric == "rmse") # filter by the rmse values only
print(rmse_values1)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 669. 10 28.3 Preprocessor1_Model1
# Full Linear Model Cross Validation
full_cv_wf <- workflow() %>% # define an object for the cv workflow
  add_model(full_linreg) %>% # add in the previous linear model
  add_recipe(full_rec) # specify the formula for the full linear model
full_fit_cv <- full_cv_wf %>% # apply the cv workflow to an object for the cv fit
  fit_resamples(folds) # fit on the resampled folds
# Examine rmse of the cv fit
rmse_values2 <- full_fit_cv %>%
  collect_metrics() %>% # get the metrics for all of the re-sampled folds
  filter(.metric == "rmse") # filter by the rmse values only
print(rmse_values2)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 606. 10 47.7 Preprocessor1_Model1
# Null Model Cross Validation
null_cv_wf <- workflow() %>% # define an object for the cv workflow
  add_model(blank_model) %>% # add the blank model in
  add_recipe(blank_rec)
null_fit_cv <- null_cv_wf %>% # apply the cv workflow to an object for the cv fit
  fit_resamples(folds) # fit on the resampled folds
→ A | warning: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
There were issues with some computations A: x1
There were issues with some computations A: x7
There were issues with some computations A: x10
# Examine rmse of the cv fit
rmse_values3 <- null_fit_cv %>%
  collect_metrics() %>% # get the metrics for all of the re-sampled folds
  filter(.metric == "rmse") # filter by the rmse values only
print(rmse_values3)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 953. 10 40.7 Preprocessor1_Model1
The RMSE values for this cross validation run are similar to the previous run: 669 for the single-predictor model, 606 for the full model, and 953 for the null model. The standard errors of the three fits are now 28.3, 47.7, and 40.7, respectively, noticeably lower for the single-predictor and null models. RMSE values with lower standard errors are likely a better reflection of model performance than RMSE values from the previous CV (which had higher standard errors).
# This part added by Taylor Glass
## Model Predictions
I will plot the observed and predicted values from the first 3 models with all of the training data. I will create a data frame with these values first and add a label to indicate the model.
# set seed for reproducibility
set.seed(rngseed)
# create data frame with the observed values and 3 sets of predicted values
plotdata <- data.frame(observed = c(train_data2$Y),
                       predicted_null = c(blank_aug$.pred),
                       predicted_model1 = c(single_aug$.pred),
                       predicted_model2 = c(full_aug$.pred),
                       model = rep(c("null model", "model 1", "model 2"), each = nrow(train_data2))) # add label indicating the model; the shorter columns are recycled to match the 270 labels
plotdata
observed predicted_null predicted_model1 predicted_model2 model
1 3004.21 2509.171 3206.650 3303.028 null model
2 1346.62 2509.171 1871.052 1952.556 null model
3 2771.69 2509.171 2538.851 2744.878 null model
4 2027.60 2509.171 1871.052 2081.182 null model
5 2353.40 2509.171 3206.650 2894.205 null model
6 826.43 2509.171 1871.052 1264.763 null model
... [output truncated: 270 rows in total; the same 90 observed and predicted values repeat under the "model 1" and "model 2" labels]
I will use the new data frame to create a visualization.
# create a visual representation
ggplot(plotdata, aes(x = observed)) +
  geom_point(aes(y = predicted_null, color = "null model"), shape = 1) +
  geom_point(aes(y = predicted_model1, color = "model 1"), shape = 2) +
  geom_point(aes(y = predicted_model2, color = "model 2"), shape = 3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # add the 45 degree line
  xlim(0, 5000) + # limit the x-axis values
  ylim(0, 5000) + # limit the y-axis values
  labs(x = "Observed Values", y = "Predicted Values")
The predictions from the null model form a horizontal line, because each prediction is just the mean of the observations. The predictions from model 1, with the DOSE variable only, fall on three horizontal lines, which makes sense because DOSE takes 3 discrete values: every observation with the same dose gets the same prediction. The predictions from model 2, with all variables, are the best; they show a cloud of scatter that follows the 45 degree line.
To determine if there is a pattern in the residuals, I will plot the predicted values against the residuals. Here the residual is computed as (predicted - observed). I will first create a new data frame with the predicted values and residuals.
# create data frame with predictions and residuals for model 2
plotresiduals <- data.frame(predicted_model2 = c(full_aug$.pred),
                            residuals = c(full_aug$.pred - full_aug$Y)) # calculate residuals as predicted - observed
plotresiduals
I will plot the new residual values to look for any patterns.
# plot predictions versus residuals for model 2
ggplot(plotresiduals, aes(x = predicted_model2, y = residuals)) +
  geom_point() +
  geom_abline(slope = 0, intercept = 0, color = "pink", linewidth = 1.5) + # add a straight line at 0 (linewidth replaces the deprecated size aesthetic)
  ylim(-2000, 2000) + # make the y-axis symmetric about zero
  labs(x = "Predicted Values", y = "Residuals")
As Dr. Handel suggested, I see a pattern in these residuals: there are more, and larger, negative residuals than positive ones. There are 6 observations below -1000, but only 1 observation above 1000. The positive residuals are grouped more tightly, while the negative residuals are more spread out. This could be because important information is missing or the model is too simple.
## Model predictions and uncertainty
To determine the uncertainty in the predictions, I will use the bootstrap: resample the training data 100 times with replacement, fit the model to each bootstrap sample, and generate a set of predictions from each fit.
# set the seed for reproducibility
set.seed(rngseed)
# create 100 bootstraps with the training data
bootstraps <- bootstraps(train_data2, times = 100)
# create empty list to store the predictions
preds_bs <- vector("list", length = nrow(bootstraps)) # note: nrow(), not length(), gives the number of bootstrap samples in the rset
# write a loop to fit the model to each bootstrap sample and make predictions
for (i in 1:nrow(bootstraps)) {
  bootstrap_sample <- analysis(bootstraps$splits[[i]]) # isolate the bootstrap sample
  model <- lm(Y ~ DOSE + AGE + HT + WT + SEX, data = bootstrap_sample) # fit the model using the bootstrap sample
  predictions <- predict(model, newdata = train_data2) # make predictions with the training data
  preds_bs[[i]] <- predictions # store the predictions in the list
}
# inspect an individual bootstrap sample
sample <- analysis(bootstraps$splits[[i]])
sample
Now that I have all of my predictions stored in a list, I need to convert it to an array so that I can compute the median and a 90% interval (the 5th and 95th percentiles) for each prediction.
# create an empty array to store the predictions
num_samples <- length(preds_bs)
num_datapoints <- length(preds_bs[[1]])
preds_array <- array(NA, dim = c(num_samples, num_datapoints))
# fill the array with predictions from bootstrapping
for (i in 1:num_samples) {
  preds_array[i, ] <- unlist(preds_bs[[i]])
}
# find the median and 90% interval of each prediction (the 0.05 and 0.95 quantiles)
preds <- preds_array %>%
  apply(2, quantile, c(0.05, 0.5, 0.95)) %>%
  t()
Finally, I will graph the observed values on the x-axis and the point estimates from the training data on the y-axis. I will add the medians and 90% intervals for the predictions from the bootstrap sampling to the y-axis as well.
# create a data frame with all the necessary variables
finaldata <- data.frame(observed = c(train_data2$Y),
                        point_estimates = c(full_aug$.pred),
                        medians = preds[, 2],
                        lower_bound = preds[, 1],
                        upper_bound = preds[, 3])
# create visualization with all 5 variables
ggplot(finaldata, aes(x = observed, y = point_estimates)) +
  geom_point(color = "black") +
  geom_point(aes(y = medians), color = "green") +
  geom_errorbar(aes(ymin = lower_bound, ymax = upper_bound), width = 0.1, color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # add 45 degree line
  xlim(0, 5000) +
  ylim(0, 5000) +
  labs(x = "Observed Values", y = "Predicted Values")
This plot shows that our model does a good job of predicting the observed values, because most of the point estimates lie around the 45 degree line. I think some of the variation shown in the bootstrap estimates is accounted for by the fact that the DOSE variable has 3 discrete values, so those predicted values will follow 3 horizontal lines as shown in a previous graph. All of the median values from the bootstrap sampling, shown in green, are farther away from the 45 degree line than the actual point estimates, shown in black. There are many instances where the interval from the bootstrapping does not contain the point estimate, but some of the intervals do contain the point estimates, especially where the point estimates are clustered. I think this graph is evidence that our model is a decent option for this data, because the general pattern of point estimates follows the expected 45 degree line.

# Rachel continued this section

Now we will use the fitted full model to make predictions for the test data. This will allow us to assess the generalizability of the fitted full model. I will start by generating predictions from the full linear regression model for both the training and the testing data, using the seed set for the previous analysis.
# predictions based on testing data
test_pred_aug <- augment(full_lmfit, test_data2) %>% # use augment to make predictions
  select(Y, .pred) %>% # select columns for plotting
  rename(test_pred = .pred) # change the name of the predictions column
# predictions based on training data
train_pred_aug <- augment(full_lmfit, train_data2) %>%
  select(Y, .pred) %>%
  rename(train_pred = .pred)
# combine prediction columns into one data frame
combined_df <- full_join(trial_data5, test_pred_aug, by = "Y") %>%
  full_join(train_pred_aug, by = "Y") # use full joins to retain all of the original Y values from trial_data5
str(combined_df) # check the structure
'data.frame': 120 obs. of 8 variables:
$ Y : num 2691 2639 2150 1789 3126 ...
$ DOSE : num 25 25 25 25 25 25 25 25 25 25 ...
$ AGE : num 42 24 31 46 41 27 23 20 23 28 ...
$ SEX : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 1 1 1 1 ...
$ WT : num 94.3 80.4 71.8 77.4 64.3 ...
$ HT : num 1.77 1.76 1.81 1.65 1.56 ...
$ test_pred : num NA NA NA NA NA NA NA NA NA NA ...
$ train_pred: num 1628 1962 2097 1576 1976 ...
Now, we will plot these two sets of predictions against the original Y observations to visualize whether the testing data produced predictions similar to the training data (and whether the model is generalizable).
# graph both sets of predictions
train_test_plot <- ggplot(combined_df, aes(x = Y)) +
  geom_point(aes(y = test_pred, color = "testing predictions"), shape = 1) +
  geom_point(aes(y = train_pred, color = "training predictions"), shape = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # add the 45 degree line
  xlim(0, 5000) + # limit the x-axis values
  ylim(0, 5000) + # limit the y-axis values
  labs(x = "Observed Y Values", y = "Predicted Values for Train and Test data")
print(train_test_plot)
The chart above shows that the training-set predictions and testing-set predictions follow the same pattern. This means that the model captures the same trends on the testing data as it did when fitted on the training data, so the model generalizes to "new" data. However, the pattern in both sets of predictions shows that some trend was not captured. Even though the model is generalizable, it is also under-fitted: it may not be flexible enough. More tuning, additional predictors, or a larger sample size would help the model fit the unrepresented trend.
## Conclusions
We want to make sure that any model we have performs better than the null model. Is that the case?
According to the RMSE computations for the single-predictor model, full model, and null model, which are 703, 627, and 948 respectively, the models including any predictor are both better than the null model. However, we see in the plot of the predicted values of both models versus the null that the model including the single predictor is not much better than the null model. The null model shows a single horizontal line of predictions, while the single-predictor model shows three horizontal lines of predictions. The points in the single-predictor model follow the 45 degree trend line slightly better than the null model, but with only three possible values for the predictor (DOSE), the overall trends in Y are not captured. Adding in all of the predictors allows more trends to be captured, as shown in the graph of the full model predictions, where the points more closely follow the 45 degree trend line.
Does model 1 with only dose in the model improve results over the null model? Do the results make sense? Would you consider this model “usable” for any real purpose?
Model 1, with only DOSE as a predictor, slightly improves the results compared to the null model. As discussed before, model 1 has a much lower RMSE value and follows the trend line slightly better than the null model. However, because there are only 3 options for dose, it leaves out many trends contributing to the distribution of Y. This is why the single-predictor model would not be usable for any real purpose: although one could use it to predict Y given DOSE, we have learned that many other predictors impact the trends in Y, and DOSE might not be the best predictor.
Does model 2 with all predictors further improve results? Do the results make sense? Would you consider this model “usable” for any real purpose?
Model 2, with all of the predictors, does improve the predictions compared to model 1 and the null model. First, the RMSE is lowest for the full model. Second, in the graph you can see a clear cloud of points following the 45 degree trend line. The full model's predictions follow the trend line more closely than those of the other models, showing that the full model fits best. However, there is still a pattern, seen in the residual plot, where higher predicted values are more scattered and contain more outliers. This means that the full model is still not able to capture all of the trends contributing to the observed Y values. This model is the most usable for predicting Y from the predictors in the model, because it is the most accurate, but it must be kept in mind that the predictive power of the full model likely decreases for larger Y values. It can be used to make predictions for small Y values, but predictions for larger ones should be treated with caution.