CDC Data Exercise

Author

Rachel Robertson

Published

February 6, 2024

Introduction

Data on Botulism in the U.S.

This data contains suspected botulism cases in the United States dating back to the year 1899. This data is provided by the CDC Botulism Consultation Service, the Alaska Division of Public Health, and California Department of Public Health.These organizations are the only sources of non-infant botulism anti-toxin in the U.S. and they collect epidemiological and medical information from suspected botulism cases all over the U.S. This data set contains 5 columns and 2280 rows. The variables in this data set include state, year, botulism transmission type (BotType), botulism toxin type (ToxinType), and frequency count (Count).

Link to the data set: https://data.cdc.gov/Foodborne-Waterborne-and-Related-Diseases/Botulism/66i6-hisz/about_data

Openning the Data

I will start by openning the libraries of the packages that I will need for data processing/cleaning and exploratory analysis. install.packages(readxl)

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(purrr)
library(tidyr)
library(skimr)
library(naniar)


Attaching package: 'naniar'

The following object is masked from 'package:skimr':

    n_complete

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.4     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(here)

here() starts at C:/Users/rrsta/OneDrive/Desktop/MADAcourseexercises/MADAcourserepo

library(readr)
library(broom)
library(readxl)

Next, I will write code to read the csv file.

list.files() #confirm that the file is in the working directory

[1] "Botulism.csv"               "cdcdata-exercise.qmd"      
[3] "cdcdata-exercise.rmarkdown"

cdcdata <- read_csv("Botulism.csv") #read the csv file

Rows: 2280 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): State, BotType, ToxinType
dbl (2): Year, Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Next I will look at the data structure and summary.

summary(cdcdata)

    State                Year        BotType           ToxinType        
 Length:2280        Min.   :1899   Length:2280        Length:2280       
 Class :character   1st Qu.:1976   Class :character   Class :character  
 Mode  :character   Median :1993   Mode  :character   Mode  :character  
                    Mean   :1986                                        
                    3rd Qu.:2006                                        
                    Max.   :2017                                        
     Count       
 Min.   : 1.000  
 1st Qu.: 1.000  
 Median : 1.000  
 Mean   : 3.199  
 3rd Qu.: 3.000  
 Max.   :59.000

str(cdcdata)

spc_tbl_ [2,280 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ State    : chr [1:2280] "Alaska" "Alaska" "Alaska" "Alaska" ...
 $ Year     : num [1:2280] 1947 1948 1950 1952 1956 ...
 $ BotType  : chr [1:2280] "Foodborne" "Foodborne" "Foodborne" "Foodborne" ...
 $ ToxinType: chr [1:2280] "Unknown" "Unknown" "E" "E" ...
 $ Count    : num [1:2280] 3 4 5 1 5 10 2 1 1 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   State = col_character(),
  ..   Year = col_double(),
  ..   BotType = col_character(),
  ..   ToxinType = col_character(),
  ..   Count = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Now I will check which unique variables are present in the BotType, ToxinType, and State Categories by using the unique function.

unique_bottype <- unique(cdcdata$BotType)
unique_toxintype <- unique(cdcdata$ToxinType)
unique_state <- unique(cdcdata$State) #Makes a new column of unique values for an existing column
print(unique_bottype)

[1] "Foodborne" "Infant"    "Wound"     "Other"

print(unique_toxintype)

 [1] "Unknown" "E"       "B"       "A"       "F"       "A&B"     "AB"     
 [8] "Ba"      "Bf"      "E,F"     "ABE"     "Ab"      "B/F"     "A/B/E"

print(unique_state) #Print the unique factors to read them

 [1] "Alaska"         "Alabama"        "Arkansas"       "Arizona"       
 [5] "California"     "Colorado"       "Connecticut"    NA              
 [9] "Delaware"       "Florida"        "Georgia"        "Hawaii"        
[13] "Iowa"           "Idaho"          "Illinois"       "Indiana"       
[17] "Kansas"         "Kentucky"       "Louisiana"      "Massachusetts" 
[21] "Maryland"       "Maine"          "Michigan"       "Minnesota"     
[25] "Missouri"       "Mississippi"    "Montana"        "North Carolina"
[29] "North Dakota"   "Nebraska"       "New Hampshire"  "New Jersey"    
[33] "New Mexico"     "Nevada"         "New York"       "Ohio"          
[37] "Oklahoma"       "Oregon"         "Pennsylvania"   "Texas"         
[41] "Rhode Island"   "South Carolina" "South Dakota"   "Tennessee"     
[45] "Utah"           "Virginia"       "Vermont"        "Washington"    
[49] "Wisconsin"      "West Virginia"  "Wyoming"

It seems that there are multiple values of the factor bottype that have the same meaning. These values that should be grouped together are “AB”, “Ba”, “A&B”, and “Ab”; “B/F” and “Bf”; “A/B/E” and “ABE”.

Cleaning the Data

First, I will rename the values of ToxinType that can be combined to a single value using mutate().

cdcdata2 <- cdcdata %>%
  mutate(ToxinType = case_when(
    ToxinType %in% c("AB", "Ba", "Ab", "A&B") ~ "AB",
    ToxinType %in% c("B/F", "Bf") ~ "BF",
    ToxinType %in% c("ABE", "A/B/E") ~ "ABE",
    TRUE ~ ToxinType
  )) #I found how to mutate multiple values of one variable by using the base function of mutate and then asking ChatGPT how to combine multiple values using mutate and not just one.
unique_toxintype2 <- unique(cdcdata2$ToxinType)
print(unique_toxintype2) #Use unique to check that the unique values have merged into a single value, in this case it was sucessful

[1] "Unknown" "E"       "B"       "A"       "F"       "AB"      "BF"     
[8] "E,F"     "ABE"

I will now change the character variables: State, BotType, and ToxinType into factor variables so they are simpler to analyze as levels.

cdcdata2$State <- as.factor(cdcdata2$State)
cdcdata2$ToxinType <- as.factor(cdcdata2$ToxinType)
cdcdata2$BotType <- as.factor(cdcdata2$BotType) #Use the as.factor function to convert each of these character variables to factors that will be simpler to work with
str(cdcdata2) #check the variable types of the data frame

tibble [2,280 × 5] (S3: tbl_df/tbl/data.frame)
 $ State    : Factor w/ 50 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Year     : num [1:2280] 1947 1948 1950 1952 1956 ...
 $ BotType  : Factor w/ 4 levels "Foodborne","Infant",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ToxinType: Factor w/ 9 levels "A","AB","ABE",..: 9 9 6 6 6 6 6 6 9 6 ...
 $ Count    : num [1:2280] 3 4 5 1 5 10 2 1 1 1 ...

Now I will rename the E,F column to EF for consistency.

cdcdata3 <- cdcdata2 %>%
  mutate(ToxinType = as.character(ToxinType), #first you must change the factor toxin type to a character variable
         ToxinType = ifelse(ToxinType == "E,F", "EF", ToxinType), #Then, you convert the value "E,F" into "EF"
         ToxinType = as.factor(ToxinType)) #convert toxin type back into a factor
print(levels(cdcdata3$ToxinType)) #check the levels of the toxintype factor by using the print(levels()) function

[1] "A"       "AB"      "ABE"     "B"       "BF"      "E"       "EF"     
[8] "F"       "Unknown"

Now I will check for missing values in all of the variables. The “Unknown” values for ToxinTupe, BotType, or State are NOT missing values because they can be analyzed as a factor level and correspond to important data of case counts. I will determine which variables have the most missing data using a nanair package function called gg_miss_var.

gg_miss_var(cdcdata3)

There are over 30 missing values for state, but this is a relatively small percentage compared to the total of 2850 values, so I will delete these missing values.

cdcdata4 <- cdcdata3 %>% #Create a new data frame called cdcdata4
  drop_na(State) #Drop values of state that are NA
skimr::skim(cdcdata4) #Check the number of rows

Data summary
Name	cdcdata4
Number of rows	2246
Number of columns	5
_______________________
Column type frequency:
factor	3
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
State	1	FALSE	50	Cal: 343, Was: 143, Tex: 107, Col: 98
BotType	1	FALSE	4	Inf: 1124, Foo: 899, Wou: 151, Oth: 72
ToxinType	1	FALSE	9	A: 958, B: 778, Unk: 369, E: 72

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	0	1	1985.50	26.60	1899	1976	1992	2006	2017	▁▂▂▅▇
Count	0	1	3.22	4.66	1	1	1	3	59	▇▁▁▁▁

34 values were deleted as the number fo rows changed from 2280 to 2246. Now since all of the missing values are taken care of, we will explore to data to find outliers.

Exploratory Analysis

I will use exploratory analysis and create figures to summarize the data distribution and to identify any outliers.

Because the outcome of interest is botulism case count (Count), I will check the normalcy and distribution of the variable count. I used ChatGPT to ask what kind of plot I can use to show me the distribution of Count. If output a code for a histogram that shows a density distribution. This shows that the data is highly right-skewed.

ggplot(cdcdata4, aes(x = Count)) +
  geom_histogram(binwidth = 10, aes(y = ..density..)) +
  geom_density(alpha = 0.2) +
  labs(title = "Distribution of Counts", x = "Counts", y = "Density")

Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

I will now make a simple boxplot using ggplot to confirm the results in the density distribution given above.

ggplot(cdcdata4, aes(y = Count)) + #Create a boxplot of count 
  geom_boxplot(fill = "skyblue", color = "black") + #Fill colors are used as box is far too compressed to notice otherwise
  labs(title = "Box Plot of Counts", y = "Counts")

Even though the plot is highly skewed, there is a single outlier that stands out, close to a count of 60. I will try and find which year and state values are associated with this maximum count and very that there was an unusual outbreak, using online literature.

summary(cdcdata4$Count)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   3.223   3.000  59.000

I found the max count value to be 59, so I will identify the row of this value.

max_row <- cdcdata4$Count == 59 #create a data frame just including the max value of count
max_states <- cdcdata4$State[max_row]
max_years <- cdcdata4$Year[max_row] #Produce data frames with the year and state corresponding to the max count
print(max_states)

[1] Michigan
50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

print(max_years) #print the data frames with the corresponding years and states

[1] 1977

Now seeing that this outlier is from 1977 Michigan, I will search for this outbreak. Reference: https://pubmed.ncbi.nlm.nih.gov/707476/ In 1977, there was the largest botulism outbreak in American history due to a foodborne outbreak at a Mexican restaurant, from improperly canned Jalapenos. This data point is important and will therefore be kept.

I will now check the frequency of the factor variables

Year

I will check the frequency of year using a histogram plot, similar to the distribution plot used for count.

ggplot(cdcdata4, aes(x = Year)) +
  geom_histogram(binwidth = 10, aes(y = ..density..)) +
  geom_density(alpha = 0.2) +
  labs(title = "Distribution of Years", x = "Year", y = "Density")

Most data has been collected in more recent years, so the data is left-skewed.

I will not plot count versus year. First I will make a total count variable that takes the sum of all state counts for a year

totcount_year <- aggregate(Count ~ Year, data = cdcdata4, FUN = sum) #Use aggregate() to find the sum count for each year value
ggplot(totcount_year, aes(x = Year, y = Count)) + #use geom_point() to create a scatterplot for the total year count data frame that was created
  geom_point() + 
  labs(title = "Botulism Count for each Year in the U.S.", x = "Year", y = "Total Case Count")

It looks like total botulism cases have greatly increased in recent years, botulism surveillance has greatly improved, or the suspected botulism case had changed around 1970 to become more broad. Whichever is the case, the total botulism case count per state has greatly increased starting around 1970.

Count versus State

First I will see the total cases per state. For this I will first aggregate the count values to get a total for each state. Next, I will make a histogram of the total case count versus state.

totcount_state <- aggregate(Count ~ State, data = cdcdata4, FUN = sum) #aggregate and sum the count by the state value
ggplot(totcount_state, aes(x = State, y = Count)) + #use geom_point() to create a scatterplot for the total year count data frame that was created
  theme(axis.text.x = element_text(angle = 90)) + #Rotate x axis labels
  geom_point() + 
  labs(title = "Botulism Count for each State in the U.S.", x = "State", y = "Total Case Count")

One state has an extremely high total case count. I will identify max value by using the same method I used when identifying the max count value.

summary(totcount_state) #summary stats for the total count by state data frame

        State        Count        
 Alabama   : 1   Min.   :   3.00  
 Alaska    : 1   1st Qu.:  29.25  
 Arizona   : 1   Median :  44.00  
 Arkansas  : 1   Mean   : 144.76  
 California: 1   3rd Qu.: 149.50  
 Colorado  : 1   Max.   :2598.00  
 (Other)   :44

I will find the row that this max takes place in.

max_row <- totcount_state$Count == 2598 #create a data frame just including the max value of count
max_state2 <- totcount_state$State[max_row] #find the row in whcih the state with the max total count occurs
print(max_state2) #print the data frames with the corresponding state

[1] California
50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

This extreme value takes place in California. I will now fact check this with online literature. Reference 2: https://www.cdph.ca.gov/Programs/CID/DCDC/CDPH%20Document%20Library/IDBGuidanceforCALHJs-Botulism.pdf According to the California DPH, CA reports the highest proportion of wound botulism cases in the U.S.(approx. 26/yr from 2016 to 2019) These are likely related to drug injection. They have also have had 24 foodborne illness cases during this time period. However, this only accounts for 180 of the 2598 reported, suspected cases. I am unsure about including CA in the final analysis for this reason, as the cases may be due to unequal distribution of botulism outbreaks rather than a reporting bias, but it is unknown which one. To decide whether to exclude CA I will explore the distribution of count values based on the year and state.

I will now investigate the aggregate values of state and year counts.

totcount_stateandyear <- aggregate(Count ~ Year + State, data = cdcdata4, FUN = sum) # Use aggregate to calculate the sum of counts for each state and year
ggplot(totcount_stateandyear, aes(x = State, y = Count, color = Year)) + #use geom_point() to create a scatterplot for the total year count data frame that was created
  theme(axis.text.x = element_text(angle = 90)) + #Rotate x axis labels
  geom_point() + 
  labs(title = "Botulism Count for each State in the U.S.", x = "State", y = "Total Case Count")

Based on the colors of the scatter plot, California has began reporting the largest total case counts of botulism in more recent years, which suggests a change in case definition or reporting bias.

Because of this, I will go back to processing the data. First I will identify if there are duplicate rows in the data

dupcdcdata4 <- cdcdata4[duplicated(cdcdata4),] #Check for duplicated data in the original dataframe and create a new dataframe with duplicates
print(dupcdcdata4) #Print the duplicate rows

# A tibble: 0 × 5
# ℹ 5 variables: State <fct>, Year <dbl>, BotType <fct>, ToxinType <fct>,
#   Count <dbl>

Because they are zero duplicate rows, I believe that there is not duplicate data present for the California data. In this case, I will remove all of the rows with the value California.

cdcdata5 <- cdcdata4[cdcdata4$State != "California", ] #remove California values from the state variable
print(levels(cdcdata5$State)) #Check the remaining values

 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"

California is missing from the levels of the State factor, therefore the removal of the state value, “California” was successful.

I will now remake the graph comparing the total count values for each state, to reassess outlier state values.

totcount_state <- aggregate(Count ~ State, data = cdcdata5, FUN = sum) #aggregate and sum the count by the state value
ggplot(totcount_state, aes(x = State, y = Count)) + #use geom_point() to create a scatterplot for the total year count data frame that was created
  theme(axis.text.x = element_text(angle = 90)) + #Rotate x axis labels
  geom_point() + 
  labs(title = "Botulism Count for each State (except CA) in the U.S.", x = "State", y = "Total Case Count")

There are a few higher count values, such as for Oregon, but there seems to be no outstanding outliers. Because of this, we will move on.

BotType

Next, I will examine the total number of cases for each Botulism Type. I will start by aggregating the total count for each type of outbreak. Then I will make a bar plot showing each category.

totcount_type <- aggregate(Count ~ BotType, data = cdcdata5, FUN = sum) #Aggregate the count sum by the type of botulism
ggplot(totcount_type, aes(x = BotType, y = Count)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Botulism Type", x = "Type of Botulism Outbreak", y = "Total Count") #Make a bar plot with each differing identity of bot type listed on the x axis

Infant botulism seems slightly more frequent than foodborne botulism. Wound botulism is much less common, but has a frequency close to “other” types of botulism.

Count versus ToxinType

Lastly, I will examine the total number of cases for each Toxin Type. This analysis will be done in a similar way as botulism type. The total count will be aggregated for each toxin type and then

totcount_toxin <- aggregate(Count ~ ToxinType, data = cdcdata5, FUN = sum) #Aggregate the count sum by the type of toxin
ggplot(totcount_toxin, aes(x = ToxinType, y = Count)) +
  geom_bar(stat = "identity") + #Make a bar plot with each differing identity of bot type listed on the x axis
  labs(title = "Distribution of Toxin Type", x = "Type of Toxin", y = "Total Count")

It seems that the toxin type B is responsible for the highest case count, followed by A, unknown, and lastly, E. The toxin type is unknown for a significant chunk of cases in comparison to known types. The known types AB, ABE, BF, and F contribute to a very small portion of botulism cases in the U.S. compared to A, B, and E.

The toxin type corresponds to the strain of bacteria that produced the outbreak. This may mean that toxin type is correlated to the botulism outbreak type. To test BotType and ToxinType correlation, I will use a logistic regression model with these two variables. I use a logistic model with the outcome being botulism outbreak type and the predictor being toxin type.

botvtoxin <- glm(BotType ~ ToxinType, data = cdcdata5, family = binomial) #use glm() to produce a logistic regression with the bottype versus the toxintype variables; regression is binomial
summary(botvtoxin) #ptint the results table


Call:
glm(formula = BotType ~ ToxinType, family = binomial, data = cdcdata5)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0544  -0.4265   0.5987   0.9397   2.5042  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)        0.58857    0.07426   7.926 2.27e-15 ***
ToxinTypeAB       13.97749  394.77485   0.035  0.97176    
ToxinTypeABE      13.97749  882.74338   0.016  0.98737    
ToxinTypeB         1.03981    0.12616   8.242  < 2e-16 ***
ToxinTypeBF       13.97749  394.77485   0.035  0.97176    
ToxinTypeE        -3.67962    0.59498  -6.184 6.23e-10 ***
ToxinTypeF         1.39243    0.53851   2.586  0.00972 ** 
ToxinTypeUnknown  -2.93995    0.21826 -13.470  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2543.3  on 1902  degrees of freedom
Residual deviance: 1881.1  on 1895  degrees of freedom
AIC: 1897.1

Number of Fisher Scoring iterations: 13

It seems that toxin types A, B, E, and unknown are associated with the type of outbreak. Assuming that foodborne is the reference, as it is listed at the first factor level, this might mean there is an association between toxins A, B, E, and unknown with foodborne botulism outbreaks.

This was not confirmed by myself, but current literature suggests that foodborne botulism illness is associated with botulin toxin types A, B, and E. Refrence 3: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2094941/#:~:text=Botulism%20is%20a%20neuroparalytic%20illness,A%2C%20B%20or%20E).

Summary Stats

The summary statistics of the final data set is shown here.The values for California have not been removed, due to significant they might hold. However, note that California has the highest case counts of any U.S. state.

summary(cdcdata5)

        State           Year           BotType       ToxinType  
 Washington: 143   Min.   :1910   Foodborne: 740   A      :790  
 Texas     : 107   1st Qu.:1977   Infant   :1028   B      :701  
 Colorado  :  98   Median :1993   Other    :  55   Unknown:299  
 Oregon    :  95   Mean   :1987   Wound    :  80   E      : 69  
 Alaska    :  88   3rd Qu.:2006                    F      : 33  
 New York  :  81   Max.   :2017                    AB     :  5  
 (Other)   :1291                                   (Other):  6  
     Count       
 Min.   : 1.000  
 1st Qu.: 1.000  
 Median : 1.000  
 Mean   : 2.438  
 3rd Qu.: 3.000  
 Max.   :59.000

skim(cdcdata5)

Data summary
Name	cdcdata5
Number of rows	1903
Number of columns	5
_______________________
Column type frequency:
factor	3
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
State	1	FALSE	49	Was: 143, Tex: 107, Col: 98, Ore: 95
BotType	1	FALSE	4	Inf: 1028, Foo: 740, Wou: 80, Oth: 55
ToxinType	1	FALSE	8	A: 790, B: 701, Unk: 299, E: 69

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	0	1	1986.67	25.54	1910	1977	1993	2006	2017	▁▂▂▆▇
Count	0	1	2.44	3.21	1	1	1	3	59	▇▁▁▁▁

This section contributed by Cora Hirst

In this section, I will be generating a dataset that looks similar to the Botulism.csv dataset after Rachel’s processing.

Synthesis of Rachel’s description of the dataset

1903 cbservations across 5 variables:

State factor with 50 levels (States)
Year number with range 1910 - 2017
BotType factor with 4 levels (“Foodborne”, “Infant”, “Wound”, “Other”)
ToxType factor with 9 levels (“A”, “AB”, “ABE”, “B”, “BF”, “E”, “EF”, F”, “Unknown”) “Because the outcome of interest is botulism case count (Count), I will check the normalcy and distribution of the variable count. I used ChatGPT to ask what kind of plot I can use to show me the distribution of Count. If output a code for a histogram that shows a density distribution. This shows that the data is highly right-skewed.”

“I found the max count value to be 59, so I will identify the row of this value. Now seeing that this outlier is from 1977 Michigan, I will search for this outbreak. This data point is important and will therefore be kept.”

“I will check the frequency of year using a histogram plot, similar to the distribution plot used for count. Most data has been collected in more recent years, so the data is left-skewed.”

“It looks like total botulism cases have greatly increased in recent years, botulism surveillance has greatly improved, or the suspected botulism case had changed around 1970 to become more broad. Whichever is the case, the total botulism case count per state has greatly increased starting around 1970.”

“First I will see the total cases per state. For this I will first aggregate the count values to get a total for each state. Next, I will make a histogram of the total case count versus state. One state has an extremely high total case count. I will identify max value by using the same method I used when identifying the max count value. This extreme value takes place in California. I will now fact check this with online literature… I will remove all of the rows with the value California.”

“There are a few higher count values, such as for Oregon, but there seems to be no outstanding outliers. Because of this, we will move

“Infant botulism seems slightly more frequent than foodborne botulism. Wound botulism is much less common, but has a frequency close to”other” types of botulism.”

“It seems that the toxin type B is responsible for the highest case count, followed by A, unknown, and lastly, E. The toxin type is unknown for a significant chunk of cases in comparison to known types. The known types AB, ABE, BF, and F contribute to a very small portion of botulism cases in the U.S. compared to A, B, and E.”

“It seems that toxin types A, B, E, and unknown are associated with the type of outbreak. Assuming that foodborne is the reference, as it is listed at the first factor level, this might mean there is an association between toxins A, B, E, and unknown with foodborne botulism outbreaks.”

Generating synthetic dataset

First, I want to ensure that the distribution of the Counts variable is right-skewed, reflecting that there are many more yearly reports with few cases than reports with many.

set.seed(123)
#case_counts, right skewed 
Count = floor(rbeta(1900,1,160)*1000) #drawing random samples from a beta distribution rescaling, and using floor to ensure they are integers
ggplot() + geom_histogram(aes(x =Count))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot() + #Create a boxplot of count 
  geom_boxplot(aes(y = Count), fill = "skyblue", color = "black") + #Fill colors are used as box is far too compressed to notice otherwise
  labs(title = "Box Plot of Counts", y = "Counts")

Now, I’d like to ensure that the distribution of the “Year” variable is left-skewed, reflecting that the practice of reporting has increased with time; however, there appear to be two little “humps” - a flatter distribution between 1910 and 1960, and another thinner distribution between 1960 - 2017.

set.seed(123)
#approximately normal distribution before 1960
before_1960 = cdcdata5 %>%
  filter(Year <=1960) %>%
  summarise(mean = floor(mean(Year)), 
          sd = sd(Year), 
          obs = length(Year))

#approximately normal distribution after 1960 
after_1960 = cdcdata5 %>%
  filter(Year > 1960) %>%
  summarise(mean = floor(mean(Year)), 
          sd = sd(Year), 
          obs = length(Year))

#distribution of years 
Year = ceiling(c(rnorm(n = before_1960$obs, mean = before_1960$mean, sd = before_1960$sd), rnorm(n = after_1960$obs, mean = after_1960$mean, sd = after_1960$sd)))


# we will need to delete observations outside of the range we are looking for
Year = Year[(Year >= 1911) & (Year <= 2017)]

# but we need to 1) ensure that we have 1900 observations, and 2) add a few more observation to the range between our distributions, say, 1945-1970
n_obs = length(Year)
Year = c(Year, runif(n = (1900-n_obs), min = 1945, max = 1965))


# How's it looking, boys? 
ggplot() + geom_histogram(aes(x = Year, y = ..density..)) + geom_density(aes(x = Year), alpha = 0.2)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, we want to try to mimic the correlation between the total “Count” of the number of cases and “Year” of the reported counts - this correlation follows the shape of the distribution of “Year”, which makes sense. However, we may also want to associate lower count numbers with earlier years, as well.

## How close are we by nature of the year distribution alone?
df = data.frame(Year = Year,
                Count = Count)
totcount_year <- aggregate(Count ~ Year, data = df, FUN = sum) #Use aggregate() to find the sum count for each year value
ggplot() + geom_point(data = totcount_year, aes(x = Year, y = Count)) + labs(x = "Year", y = "total Count")

There’s a good amount of sums between 1925 and 1950 with low case counts, and a good number of low case counts in more recent years than Rachel observed. How do fix this? Lets take some of the case numbers from the middle (10-20) and randomly replace them with a random number between 0 and 5.

ggplot() + geom_point(data = df, aes(x = Year, y = Count)) + labs(title = "synthetic data")

ggplot() + geom_point(data = cdcdata5, aes(x = Year, y = Count)) + labs(title = "processed CDC data")

#lets take some of the case numbers from the middle (10-20) and randomly replace them with a random number between 0 and 5. 

range_toobig = range(which(sort(df$Count) >= 10 &  sort(df$Count) <= 40))
df$Count = sort(df$Count)
df$Count[sample(range_toobig[1]:range_toobig[2], size = diff(range_toobig)/1.5, replace = F)] = runif(diff(range_toobig)/1.5, 0,5)

df$Count = sample(df$Count)

ggplot() + geom_point(data = df, aes(x = Year, y = Count)) + labs(title = "synthetic data adjusted")

totcount_year <- aggregate(Count ~ Year, data = df, FUN = sum) #Use aggregate() to find the sum count for each year value
ggplot() + geom_point(data = totcount_year, aes(x = Year, y = Count)) + labs(y = "total Count", x = "year")

Now, I’d like to try to add in some state data. Without data from California, it looks like the total counts for states are somewhat uniform between 0 and 200 total counts. However, a few states have higher numbers of total cases - these are Alaska, Oregon, Washington, New Jersey, and Colorado.

set.seed(123)
#randomly generated a vector of state names whose frequencies are uniformly distributed
state_names = c(levels(factor(cdcdata5$State)))
State = sample(state_names, size = 1900, replace = T)

df$State = State

#reproduce plot
totcount_state <- aggregate(Count ~ State, data = df, FUN = sum) #aggregate and sum the count by the state value
ggplot(totcount_state, aes(x = State, y = Count)) + #use geom_point() to create a scatterplot for the total year count data frame that was created
  theme(axis.text.x = element_text(angle = 90)) + #Rotate x axis labels
  geom_point() + 
  labs(title = "Botulism Count for each State (except CA) in the U.S.", x = "State", y = "Total Case Count")

#so this clearly didn't work too well!

What i’d like to know is the frequencies of the observations of each state in the dataset, and to utilize these frequencies to determine whether the original observations were an artifact of reporting bias!

set.seed(123)
#frequencies of each state in a named vector
state_counts = cdcdata5 %>%
  group_by(State) %>%
  count()
state_freqs = state_counts$n/nrow(cdcdata5)
names(state_freqs) = state_counts$State

#use frequencies to generate a vector of states with the same frequencies
State = sample(names(state_freqs), size = nrow(df), prob = state_freqs, replace = T)

#add to df 
df$State = State

#let's plot frequency 
totcount_state <- aggregate(Count ~ State, data = df, FUN = sum) #aggregate and sum the count by the state value
ggplot(totcount_state, aes(x = State, y = Count)) + #use geom_point() to create a scatterplot for the total year count data frame that was created
  theme(axis.text.x = element_text(angle = 90)) + #Rotate x axis labels
  geom_point() + 
  labs(title = "Botulism Count for each State (except CA) in the U.S.", x = "State", y = "Total Case Count")

We’ve captured that states that report more have a higher OBSERVED total case count, and that doesn’t mean, necessarily, that botulism is more common there! So no further manipulation needed other than sampling states according to frequency of observations.

Lastly, I would like to add the BotTyp and ToxinType data to our synthetic data.

First, I will generate a vector of BotType using the frequencies of botulism types from the cdcdata5 processed dataset.

set.seed(123)
#determine counts of BotTypes
botType_counts = cdcdata5 %>%
  group_by(BotType) %>%
  count()

#determine frequencies of BotTypes
botType_freqs = botType_counts$n/nrow(cdcdata5)
names(botType_freqs) = botType_counts$BotType

#use frequencies to generate a vector of states with the same frequencies
BotType = sample(names(botType_freqs), size = nrow(df), prob = botType_freqs, replace = T) # for anyone reading this, notice how each of the arguments is dependent upon some other variable I've named, and I'm not direclty entering its value! It makes is a lot easier in the future to change something somewhere higher up in the pipeline, and be able to keep this code the same :)

#add to df 
df$BotType = BotType

#plot to be sure we've captured the nature of the true data!
totcount_type <- aggregate(Count ~ BotType, data = df, FUN = sum) #Aggregate the count sum by the type of botulism
ggplot(totcount_type, aes(x = BotType, y = Count)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Botulism Type", x = "Type of Botulism Outbreak", y = "Total Count") #Make a bar plot with each differing identity of bot type listed on the x axis

Infant type botulism is slightly greater in number than foodborne, but to a slightly greater degree than in the original dataset. However, our goal is grasp the NATURE of the dataset, not to recreate it in its entirety. Hense, this is a probable distribution of Botulism by type!

Similar to the frequency approach used to generate the entires for each State and BotType observation, I will use the frequencies of each toxin type in the cdcdata5 dataset to generate a synthetic, possible distribution of cases by toxin type.

set.seed(123)

#fount the number of each toxintype
toxType_counts = cdcdata5 %>%
  group_by(ToxinType) %>%
  count()

#determine frequencies of BotTypes
toxType_freqs = toxType_counts$n/nrow(cdcdata5)
names(toxType_freqs) = toxType_counts$ToxinType

#use frequencies to generate a vector of toxtypes with the same frequencies
ToxinType = sample(names(toxType_freqs), size = nrow(df), prob = toxType_freqs, replace = T)

#add to df 
df$ToxinType = ToxinType

#plot to be sure we've captured the nature of the true data!
totcount_type <- aggregate(Count ~ ToxinType, data = df, FUN = sum) #Aggregate the count sum by the type of toxin
ggplot(totcount_type, aes(x = ToxinType, y = Count)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Toxin Type", x = "Type of Toxin Outbreak", y = "Total Count") #Make a bar plot with each differing identity of bot type listed on the x axis

Here’s the potentially challenging bit - we’d like to be sure to capture the correlations between Botulism type and Toxin type (ToxinType). Rachel used a logistic regression to capture that toxin types A, B, E, and unknown are associated with the type of outbreak, and discovered in the literature that types A, B, and E are associated with foodborne outbreaks.

To try to capture this association, I will sort the BotType vector alphabetically. Then, I will randomly sample types “A”, “B”, and “E” according to their frequencies relative to the total number of these types, only, to generate a vector as long as the number of Foodboorne observations there are.

Then we will want to generate the rest of the Toxin_type data. They can be randomly selected according to their frequencies, but independent of the BotType with which they'll be associated.

However, now we need to adjust the probability of selecting “A”, “B”, or “E” according to however many were just selected, or their frequencies will be even higher!

set.seed(123)
#arrange BotType alphabetically - starting with Foodborne - in the dataframe, df
df = df %>%
  arrange(BotType)

#determine the range of observations with bottype "foodborne"
botType_range = range(which(df$BotType == "Foodborne"))

#now, we want to sample from A, B, and E with their frequencies relative to the total number of A, B, and E
ABE_freqs = toxType_counts %>%
  filter(ToxinType == "A" | ToxinType == "B" | ToxinType == "E") # | means "or"

foodborne_ToxTypes = sample(ABE_freqs$ToxinType, size = diff(botType_range), prob = ABE_freqs$n/sum(ABE_freqs$n), replace = T)

#Now, we want to generate the rest of the Toxin_type data. They can be randomly selected according to their frequencies, but independent of the bottype with which they'll be associated.


# first, we will determine their current frequencies with regard to the total number of observations:
foodborne_ToxTypes_counts = as.data.frame(table(foodborne_ToxTypes))
foodborne_ToxTypes_counts = foodborne_ToxTypes_counts %>%
  filter( foodborne_ToxTypes != "EF") #these toxtypes are not present in the processed cdc dataset and are an artifact of factoring 
foodborne_ToxTypes_counts = foodborne_ToxTypes_counts$Freq

#now we will determine how many As, Bs, and Es are included in the df$ToxinType variable 
ToxType_counts = df %>%
  group_by(ToxinType) %>%
  count()
ToxinType_counts = ToxType_counts$n
names(ToxinType_counts) = ToxType_counts$ToxinType

#now, we will recalutalte the expected frequencies of the toxin types for the remainder of the toxintype variable vector
remaining_freqs_to_prob = (ToxinType_counts - foodborne_ToxTypes_counts)/(nrow(df)-diff(botType_range)) #we are dividing the remaining number of each toxintype that need to be included in the dataset by the total number of "spaces" (non Foodborne observations) left to fill in the dataset

# finally! We get to replace the ToxinType variable with a vector of the foodborne probs (which will be listed first, next to the sorted foodborne recordings) and a sample for the other observations using these probabilities!
df$ToxinType = c(as.character(foodborne_ToxTypes), sample(names(ToxinType_counts), size = nrow(df)-diff(botType_range), prob = remaining_freqs_to_prob, replace = T))

#lets see if we've maintained the frequency distribution of the data: 
totcount_type <- aggregate(Count ~ ToxinType, data = df, FUN = sum) #Aggregate the count sum by the type of toxin
ggplot(totcount_type, aes(x = ToxinType, y = Count)) +
  geom_bar(stat = "identity") + 
  labs(title = "Distribution of Toxin Type", x = "Type of Toxin Outbreak", y = "Total Count") #Make a bar plot with each differing identity of bot type listed on the x axis

The distribution is nearly unchanged, but now most A, B, and E values occur in the same observations as foodborne BotTypes.