Data hunt and changed plans

Recently I came across a new R table package that was ideal for displaying categorical data across study arms and would be super helpful for my work with clinical trials data. I set out to find some light-hearted data to test the package out. I settled on using data from the General Social Survey (GSS) from the independent research organization NORC at the University of Chicago (full citation below). The GSS has a set of science related questions, including this gem: “Now, does the Earth go around the Sun, or does the Sun go around the Earth?” With the numerous other categorical questions asked during the GSS, I thought this would be a perfect set of data to use.

Unfortunately, I ran into some major roadblocks with the new table package. Since it is still in development, I decided to pivot to kableExtra and learn a few new formatting tweaks. Even so, I was feeling less than satisfied, so I took a step further and ran a random forest model.

Enough said, let’s dig in:

#Load data
GSS <- read_excel("GSS.xls")

#Check out dataframe
str(GSS)
## tibble [18,306 x 51] (S3: tbl_df/tbl/data.frame)
##  $ sei10   : num [1:18306] 24.2 52.4 35.4 38.8 -1 59.8 24.2 43.8 81.6 82.5 ...
##  $ toofast : chr [1:18306] "Agree" "Disagree" "Not applicable" "Not applicable" ...
##  $ nextgen : chr [1:18306] "Agree" "Strongly agree" "Not applicable" "Not applicable" ...
##  $ scifrom : chr [1:18306] "The internet" "The internet" "Not applicable" "Not applicable" ...
##  $ sprtprsn: chr [1:18306] "Very spiritual" "Slight spiritual" "Slight spiritual" "Very spiritual" ...
##  $ relpersn: chr [1:18306] "Very religious" "Slight religious" "Slight religious" "Slight religious" ...
##  $ reborn  : chr [1:18306] "Yes" "No" "No" "Yes" ...
##  $ god     : chr [1:18306] "Know god exists" "Believe but doubts" "Know god exists" "Know god exists" ...
##  $ comprend: chr [1:18306] "Fair" "Fair" "Fair" "Good" ...
##  $ class_  : chr [1:18306] "Working class" "Working class" "Middle class" "Working class" ...
##  $ consci  : chr [1:18306] "Only some" "A great deal" "Don't know" "Not applicable" ...
##  $ advfront: chr [1:18306] "Agree" "Agree" "Not applicable" "Not applicable" ...
##  $ astrosci: chr [1:18306] "Sort of scientific" "Not at all scientific" "Not applicable" "Not applicable" ...
##  $ scibnfts: chr [1:18306] "Dont know" "About equal if volunteered" "Not applicable" "Not applicable" ...
##  $ prestg10: num [1:18306] 48 43 31 47 0 46 48 44 63 64 ...
##  $ ballot  : chr [1:18306] "Ballot c" "Ballot b" "Ballot b" "Ballot a" ...
##  $ zodiac  : chr [1:18306] "No answer" "Leo" "Leo" "Libra" ...
##  $ hsphys  : chr [1:18306] "No" "No" "Not applicable" "Not applicable" ...
##  $ hschem  : chr [1:18306] "No" "Yes" "Not applicable" "Not applicable" ...
##  $ hsbio   : chr [1:18306] "Yes" "Yes" "Not applicable" "Not applicable" ...
##  $ hsmath  : chr [1:18306] "Pre-algebra" "General math, business, or vocational math" "Not applicable" "Not applicable" ...
##  $ colsci  : chr [1:18306] "No" "No" "Not applicable" "Not applicable" ...
##  $ coldeg1 : chr [1:18306] "Associate's" "Not applicable" "Not applicable" "Not applicable" ...
##  $ earthsun: chr [1:18306] "Earth around sun" "Sun around earth" "Not applicable" "Not applicable" ...
##  $ happy   : chr [1:18306] "Pretty happy" "Very happy" "Pretty happy" "Very happy" ...
##  $ bible   : chr [1:18306] "Word of god" "Don't know" "Word of god" "Inspired word" ...
##  $ postlife: chr [1:18306] "Yes" "Don't know" "No" "Yes" ...
##  $ sex     : chr [1:18306] "Female" "Male" "Female" "Female" ...
##  $ degree  : chr [1:18306] "High school" "High school" "Lt high school" "High school" ...
##  $ educ    : chr [1:18306] "13" "14" "9" "12" ...
##  $ age     : chr [1:18306] "50" "27" "67" "50" ...
##  $ childs  : chr [1:18306] "3" "1" "2" "1" ...
##  $ divorce : chr [1:18306] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
##  $ marital : chr [1:18306] "Never married" "Never married" "Divorced" "Never married" ...
##  $ wrkgovt : chr [1:18306] "Private" "Private" "Private" "Private" ...
##  $ hrs2    : chr [1:18306] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
##  $ id_     : num [1:18306] 1 2 3 4 5 6 7 8 9 10 ...
##  $ race    : chr [1:18306] "Black" "Other" "White" "Black" ...
##  $ reg16   : chr [1:18306] "Foreign" "Middle atlantic" "Middle atlantic" "Middle atlantic" ...
##  $ mobile16: chr [1:18306] "Different state" "Same city" "Same city" "Same city" ...
##  $ relig   : chr [1:18306] "None" "Catholic" "Catholic" "Protestant" ...
##  $ natsci  : chr [1:18306] "About right" "Too much" "Don't know" "Too little" ...
##  $ natenvir: chr [1:18306] "Too little" "Not applicable" "About right" "Not applicable" ...
##  $ natspac : chr [1:18306] "Too much" "Not applicable" "Too much" "Not applicable" ...
##  $ polviews: chr [1:18306] "Extremely liberal" "Slightly liberal" "Conservative" "Slightly liberal" ...
##  $ partyid : chr [1:18306] "Strong democrat" "Ind,near rep" "Strong democrat" "Strong democrat" ...
##  $ rincome : chr [1:18306] "$25000 or more" "$25000 or more" "Not applicable" "$1000 to 2999" ...
##  $ income  : chr [1:18306] "$25000 or more" "$25000 or more" "Refused" "$25000 or more" ...
##  $ born    : chr [1:18306] "No" "No" "No" "Yes" ...
##  $ family16: chr [1:18306] "Mother & father" "Mother & father" "Mother & father" "Mother" ...
##  $ year    : num [1:18306] 2006 2006 2006 2006 2006 ...
#Date range and distribution of downloaded data
GSS <- GSS %>% filter(!is.na(year))
GSS %>% count(year)
## # A tibble: 7 x 2
##    year     n
##   <dbl> <int>
## 1  2006  4510
## 2  2008  2023
## 3  2010  2044
## 4  2012  1974
## 5  2014  2538
## 6  2016  2867
## 7  2018  2348
#Count responses to "Now, does the Earth go around the Sun, or does the Sun go around the Earth?"
GSS %>% count(earthsun)
## # A tibble: 5 x 2
##   earthsun             n
##   <chr>            <int>
## 1 Dont know          615
## 2 Earth around sun  6600
## 3 No answer           34
## 4 Not applicable    9188
## 5 Sun around earth  1867

Reviewing the distribution of answers to the sun/earth question, I was honestly surprised that “Sun around the earth” was as high as it was. For those who chose definitively, 22% got it wrong. I would love to give the American public the benefit of the doubt and say that a lot of people where just tired or confused by the time they reached this question, but I fear that is not the case. To see if there are any differences between individuals who believe that the sun revolves around the earth and vice versa, I narrowed down the sample to include only those who provided definitive answers on this and other relevant questions (excluding records containing refusals and “don’t know” answers).

#Filter for complete responses for association testing and modeling
GSS <- GSS %>%  filter((earthsun == "Earth around sun" | 
                  earthsun == "Sun around earth") &
                    marital != "No answer" &
                    polviews != "Don't know" &
                    polviews != "No answer" &
                    born != "No answer" &
                    born != "Don't know" & 
                    family16 != "No answer" &
                    childs != "Dk na" &
                    class_ != "Don't know" &
                    class_ != "No answer" &  
                    mobile16 != "Don't know" &
                    mobile16 != "No answer" &
                    sprtprsn != "Dont know" &
                    sprtprsn != "No answer" &
                    relpersn != "Dont know" &
                    relpersn != "No answer" &
                    reborn != "Don't know" &
                    reborn != "No answer" &
                    god != "Don't know" &
                    god != "No answer" &
                    bible !="Don't know" & 
                    bible != "No answer" & 
                    postlife != "Don't know" &
                    postlife != "No answer" &
                    hsphys != "Dont know" & 
                    hsphys != "No answer" &
                    hschem != "Dont know" & 
                    hschem != "No answer" & 
                    hsbio != "Dont know" & 
                    hsbio != "No answer" &
                    colsci != "Dont know" &
                    colsci != "No answer" &
                    degree != "No answer" &
                    toofast != "Dont know" &
                    toofast != "No answer" &
                    toofast != "Not applicable" &
                    nextgen != "Dont know" &
                    nextgen != "No answer" &
                    nextgen != "Not applicable" &
                    scifrom != "Dont know" &
                    scifrom != "No answer" &
                    scifrom != "Not applicable" & 
                    consci != "Don't know" &
                    consci != "No answer" &
                    consci != "Not applicable" & 
                    advfront != "Dont know" &
                    advfront != "No answer" &
                    advfront != "Not applicable" &  
                    astrosci != "Dont know" &
                    astrosci != "No answer" &
                    astrosci != "Not applicable" &
                    scibnfts != "No answer" &
                    scibnfts != "Not applicable" & 
                    natsci != "No answer") 

Since the data downloaded from the GSS site came as character data, I set factor levels for each of the questions. This also allowed me to set the orders of the factors as I wished for the tables.

#Set factors for categorical variables of interest
GSS <- GSS %>% mutate(marital.factor = factor(marital, levels = 
                                                  c("Never married", "Married",
                                                    "Separated",  "Divorced", 
                                                    "Widowed"))) %>% 
  mutate(race.factor = factor(race, levels = 
                                c("White", "Black", "Other"))) %>% 
  mutate(polviews.factor = factor(polviews, levels = 
                                    c("Extrmly conservative", "Conservative", 
                                      "Slghtly conservative", "Moderate", "Slightly liberal", 
                                      "Liberal", "Extremely liberal"))) %>% 
  mutate(born.factor = factor(born, levels = 
                                c("Yes", "No"))) %>% 
  mutate(family16_rc = case_when(family16 == "Mother & father" ~ "Mother and father", 
                                     T ~ "Other")) %>% 
  mutate(family16_rc.factor = factor(family16_rc, levels = 
                                       c("Mother and father", "Other"))) %>% 
  mutate(age_rc = case_when(age >= 18 & age <= 34 ~ "18-34 yrs",
                            age >= 35 & age <= 49 ~ "35-49 yrs", 
                            age >= 50 & age <= 64 ~ "50-64 yrs",
                            age >= 65 & age <= 79 ~ "65-79 yrs",
                            age >= 80 ~ "80 yrs and above")) %>% 
  mutate(childs_rc = case_when(childs == "0" ~ "No children",
                               T ~ "Have children")) %>% 
  mutate(childs_rc.factor = factor(childs_rc, levels = 
                                     c("No children", "Have children"))) %>% 
  mutate(class_.factor = factor(class_, levels = 
                                  c("Lower class", "Working class", "Middle class",
                                    "Upper class"))) %>% 
  mutate(mobile16.factor = factor(mobile16, levels = 
                                    c("Same city", "Same st,dif city", "Different state"))) %>% 
  mutate(year = factor(year)) %>% 
  mutate(sprtprsn_rc = case_when(sprtprsn == "Modeate spirtual" ~ "Moderate spirtual",
                                  T ~ as.character(sprtprsn))) %>% 
  mutate(sprtprsn.factor = factor(sprtprsn_rc, levels = 
                                    c("Not spiritual", "Slight spiritual", "Moderate spirtual",
                                      "Very spiritual"))) %>% 
  mutate(relpersn_rc = case_when(relpersn == "Modrte religious" ~ "Moderate religious",
                                 T ~ as.character(relpersn))) %>% 
  mutate(relpersn.factor = factor(relpersn_rc, levels =
                                     c("Not religious", "Slight religious", "Moderate religious",
                                       "Very religious"))) %>% 
  mutate(reborn.factor = factor(reborn, levels=
                                  c("No", "Yes", "Don't know"))) %>% 
  mutate(god.factor = factor(god, levels = 
                               c("Dont believe", "No way to find out", 
                                 "Some higher power", "Believe sometimes", 
                                 "Believe but doubts", "Know god exists"))) %>% 
  mutate(bible.factor = factor(bible, levels = 
                                 c("Book of fables", "Inspired word", "Word of god",
                                   "Other"))) %>% 
  mutate(postlife.factor = factor(postlife, levels =
                                    c("No", "Yes", "Not applicable"))) %>% 
  mutate(hsphys.factor = factor(hsphys, levels = 
                                  c("No", "Yes", "Not applicable"))) %>% 
  mutate(hschem.factor = factor(hschem, levels = 
                                  c("No", "Yes", "Not applicable"))) %>% 
  mutate(hsbio.factor = factor(hsbio, levels = 
                                  c("No", "Yes", "Not applicable"))) %>% 
  mutate(colsci.factor = factor(colsci, levels = 
                                 c("No", "Yes", "Not applicable"))) %>% 
  mutate(degree.factor = factor(degree, levels = 
                                  c("Lt high school", "High school", "Junior college", 
                                    "Bachelor", "Graduate"))) %>% 
  mutate(toofast.factor = factor(toofast, levels =
                                   c("Strongly disagree", "Disagree", "Agree", 
                                     "Strongly agree"))) %>% 
  mutate(nextgen.factor = factor(nextgen, levels =
                                   c("Strongly disagree", "Disagree", "Agree", 
                                     "Strongly agree"))) %>% 
  mutate(scifrom_rc = case_when(scifrom == "Tv" ~ "TV",
                                scifrom == "The internet" ~ "Internet",
                                T ~ "Other")) %>% 
  mutate(scifrom_rc.factor = factor(scifrom_rc, levels = 
                                      c("TV", "Internet", "Other"))) %>% 
  mutate(consci.factor = factor(consci, levels = 
                                  c("Hardly any", "Only some", "A great deal"))) %>% 
  mutate(advfront.factor = factor(advfront , levels =
                                   c("Strongly disagree", "Disagree", "Agree", 
                                     "Strongly agree"))) %>% 
  mutate(astrosci.factor = factor(astrosci, levels = 
                                    c("Not at all scientific", "Sort of scientific",
                                      "Very scientific"))) %>% 
  mutate(scibnfts.factor = factor(scibnfts, levels = 
                                    c("Benefits greater", "About equal if volunteered",
                                      "Harmful results greater",  "Dont know"))) %>% 
  mutate(natsci.factor = factor(natsci, levels = 
                                  c("Too little", "About right", "Too much", "Don't know"))) %>% 
  mutate(earthsun = factor(earthsun, levels = 
                                  c("Earth around sun", "Sun around earth"))) %>% 
  mutate(sex = as.factor(sex)) %>% 
  mutate(age_rc = as.factor(age_rc))


#Narrow the dataset to just the variables of interest for modeling
GSS <- GSS %>% select(earthsun, sex, race.factor, age_rc, polviews.factor, 
                       class_.factor, marital.factor, childs_rc.factor, 
                       born.factor, family16_rc.factor, mobile16.factor,
                       sprtprsn.factor, relpersn.factor, postlife.factor, 
                       god.factor, reborn.factor, bible.factor,
                       degree.factor, hsphys.factor, hschem.factor, 
                       hsbio.factor, colsci.factor, 
                       toofast.factor, nextgen.factor, advfront.factor, natsci.factor, 
                       consci.factor, scibnfts.factor, astrosci.factor, scifrom_rc.factor)

Table Creation

My first step for putting together tables showing the differences between those who believe the earth orbits the sun and those who believe the sun orbits the earth is to setup a function using tabyl from the janitor package. This allowed me to get a simple n and % breakdown for each variable in each group of interest (finished tables are displayed further down.)

# Setup function to use for tables
tab_func <- function(df, var, var2) {
  var <- enquo(var)
  var2 <- enquo(var2)
  
  df %>% tabyl(!! var, earthsun) %>% 
    adorn_totals(c("col")) %>%
    adorn_percentages("col") %>%
    adorn_pct_formatting(digits = 2) %>%
    adorn_ns() %>% 
    rename("Variable" = !! var) %>% 
    add_row(Variable = !!var2, 'Earth around sun' = "", 'Sun around earth' = "", 
            Total = "",.before = 1)
}


##Setup Demographics, table 1
#Variables included: sex, race.factor, age_rc, polviews.factor, 
#class_.factor, marital.factor, childs_rc.factor, 
#born.factor, family16_rc.factor, mobile16.factor

Demographics <- tab_func(GSS, sex, "Sex") %>% 
  bind_rows(tab_func(GSS, race.factor, "Race")) %>% 
  bind_rows(tab_func(GSS, age_rc, "Age Group")) %>% 
  bind_rows(tab_func(GSS, polviews.factor, "Political Views")) %>% 
  bind_rows(tab_func(GSS, class_.factor, "Social Class")) %>% 
  bind_rows(tab_func(GSS, marital.factor, "Marital Status")) %>% 
  bind_rows(tab_func(GSS, childs_rc.factor, "Are Parents")) %>% 
  bind_rows(tab_func(GSS, born.factor, "Born in US")) %>% 
  bind_rows(tab_func(GSS, family16_rc.factor, "Lived w/ Both Parents at 16yrs Old")) %>% 
  bind_rows(tab_func(GSS, mobile16.factor, "Location at 16yrs Old"))

##Setup Religion, table 2
#Variables included: sprtprsn.factor, relpersn.factor, postlife.factor, 
#god.factor, reborn.factor, bible.factor
Religion <- tab_func(GSS, sprtprsn.factor, "Level of Spirtuality") %>% 
  bind_rows(tab_func(GSS, relpersn.factor, "Level of Religiousness")) %>% 
  bind_rows(tab_func(GSS, postlife.factor, "Life after Death")) %>%
  bind_rows(tab_func(GSS, god.factor, "Existence of God")) %>%
  bind_rows(tab_func(GSS, reborn.factor, "'Born Again' Experience")) %>%
  bind_rows(tab_func(GSS, bible.factor, "View of Bible")) 

##Setup Education, table 3
#Variables included: degree.factor, hsphys.factor, hschem.factor 
#hsbio.factor, colsci.factor 
Education <- tab_func(GSS, degree.factor, "Highest Degree") %>% 
  bind_rows(tab_func(GSS, hsphys.factor, "Taken High School Physics")) %>% 
  bind_rows(tab_func(GSS, hschem.factor, "Taken High School Chemistry")) %>% 
  bind_rows(tab_func(GSS, hsbio.factor, "Taken High School Biology")) %>% 
  bind_rows(tab_func(GSS, colsci.factor, "Taken Any College-Level Science")) 

##Setup Science, table 4
#Variables included: toofast.factor, nextgen.factor, advfront.factor, natsci.factor, 
#consci.factor, scibnfts.factor, astrosci.factor, scifrom_rc.factor 
Science <- tab_func(GSS, toofast.factor, "Science Changes Life Too Fast") %>% 
  bind_rows(tab_func(GSS, nextgen.factor, 
                     "Science Gives More Opportunities to Next Generation")) %>% 
  bind_rows(tab_func(GSS, advfront.factor, 
                     "Federal Goverment Should Support Science Research")) %>% 
  bind_rows(tab_func(GSS, natsci.factor, "Amount Spent Supporting Scientific Research")) %>% 
  bind_rows(tab_func(GSS, consci.factor, "Confidence in Scientific Community")) %>% 
  bind_rows(tab_func(GSS, scibnfts.factor, "Benifits of Science Research Outweigh Harms")) %>% 
  bind_rows(tab_func(GSS, astrosci.factor, "Astrology is Scientific")) %>% 
  bind_rows(tab_func(GSS, scifrom_rc.factor, "Source of Information about Science")) 

Next, I added in chi-square to see if the differences between the two groups for each variable was statistically significant. The p-values are added to the same tables for ease of reference. I included the regular chi-squared with simulation p-value, the Bonferroni corrected p-value, and the chi-squared without simulation p-value.

## Chi-Squared (29 tests used for Bonferroni correction)
#Demographics chisq 
Demo_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("sex", "race.factor", "age_rc", "polviews.factor", 
                                   "class_.factor", "marital.factor", "childs_rc.factor",
                                   "born.factor", "family16_rc.factor", "mobile16.factor" )], 
                    function(x) chisq.test(table(x,GSS$earthsun), 
                                           simulate.p.value = T)$p.value)))))) %>%
  mutate(bcpv = V1*29) %>% 
  mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
                          T ~ bcpv)) %>% 
  rename('Chisq p-value' = "V1") %>% 
  rename('Bonferroni corrected p-value' = "bcpv") %>% 
  left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
                   (lapply(GSS[,c("sex", "race.factor", "age_rc", "polviews.factor", 
                                  "class_.factor", "marital.factor", "childs_rc.factor", 
                                 "born.factor", "family16_rc.factor", "mobile16.factor")], 
                    function(x) chisq.test(table(x,GSS$earthsun), 
                                           simulate.p.value = F)$p.value)))))) %>% 
             rename('Chisq w/out simulated p-value' = "V1")) %>% 
  add_column(Variable = c("Sex", "Race", "Age Group", "Political Views", 
                       "Social Class", "Marital Status", "Are Parents", 
                       "Born in US", "Lived w/ Both Parents at 16yrs Old", 
                       "Location at 16yrs Old"), .before = 1) %>% 
  select(-rowname)

#Religion chisq
Rel_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                     (lapply(GSS[,c("sprtprsn.factor", "relpersn.factor", "postlife.factor", 
                                    "god.factor", "reborn.factor", "bible.factor")], 
                     function(x) chisq.test(table(x,GSS$earthsun), 
                                            simulate.p.value = T)$p.value)))))) %>%
  mutate(bcpv = V1*29) %>% 
  mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
                          T ~ bcpv)) %>% 
  rename('Chisq p-value' = "V1") %>% 
  rename('Bonferroni corrected p-value' = "bcpv") %>% 
  left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
                     (lapply(GSS[,c("sprtprsn.factor", "relpersn.factor", "postlife.factor", 
                                    "god.factor", "reborn.factor", "bible.factor")], 
                     function(x) chisq.test(table(x,GSS$earthsun), 
                                            simulate.p.value = F)$p.value)))))) %>% 
              rename('Chisq w/out simulated p-value' = "V1")) %>% 
  add_column(Variable = c("Level of Spirtuality", "Level of Religiousness", "Life after Death",
                          "Existence of God", "'Born Again' Experience", 
                          "View of Bible" ), .before = 1) %>% 
  select(-rowname)


#Education chisq
Edu_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("degree.factor", "hsphys.factor", "hschem.factor", 
                                   "hsbio.factor", "colsci.factor")], 
                    function(x) chisq.test(table(x,GSS$earthsun), 
                                           simulate.p.value = T)$p.value)))))) %>%
  mutate(bcpv = V1*29) %>% 
  mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
                          T ~ bcpv)) %>% 
  rename('Chisq p-value' = "V1") %>% 
  rename('Bonferroni corrected p-value' = "bcpv") %>% 
  left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("degree.factor", "hsphys.factor", "hschem.factor", 
                                   "hsbio.factor", "colsci.factor")], 
                    function(x) chisq.test(table(x,GSS$earthsun), 
                                           simulate.p.value = F)$p.value)))))) %>% 
              rename('Chisq w/out simulated p-value' = "V1")) %>% 
  add_column(Variable = c("Highest Degree", "Taken High School Physics", 
                          "Taken High School Chemistry", "Taken High School Biology", 
                          "Taken Any College-Level Science" ), .before = 1) %>% 
  select(-rowname)

#Science chisq
Sci_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("toofast.factor", "nextgen.factor", "advfront.factor",
                                   "natsci.factor", "consci.factor", "scibnfts.factor",
                                   "astrosci.factor", "scifrom_rc.factor")], 
                    function(x) chisq.test(table(x,GSS$earthsun), 
                                           simulate.p.value = T)$p.value)))))) %>%
  mutate(bcpv = V1*29) %>% 
  mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
                          T ~ bcpv)) %>% 
  rename('Chisq p-value' = "V1") %>% 
  rename('Bonferroni corrected p-value' = "bcpv") %>% 
  left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("toofast.factor", "nextgen.factor", "advfront.factor",
                                   "natsci.factor", "consci.factor", "scibnfts.factor",
                                   "astrosci.factor", "scifrom_rc.factor")], 
                    function(x) chisq.test(table(x,GSS$earthsun), 
                                           simulate.p.value = F)$p.value)))))) %>% 
              rename('Chisq w/out simulated p-value' = "V1")) %>% 
  add_column(Variable = c("Science Changes Life Too Fast", 
                          "Science Gives More Opportunities to Next Generation", 
                          "Federal Goverment Should Support Science Research", 
                          "Amount Spent Supporting Scientific Research", 
                          "Confidence in Scientific Community", 
                          "Benifits of Science Research Outweigh Harms", 
                          "Astrology is Scientific", "Source of Information about Science"),
             .before = 1) %>% 
  select(-rowname)

Science2 <- Science %>% left_join(Sci_chi)

When you review the percentage breakdown for the two groups, you can see a strong difference for many of the variables. Given the sample size, it isn’t surprising that many of the differences are statically significant. However, just because a difference is statistically significant does not mean that it is meaningful. To get a better gauge of the association level between the earth/sun category and the other variables, I added Cramers V (ranges from 0 to 1, with 1 being the strongest association).

##Cramer's V

#Demographics Cramer's V
Demo_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                   (lapply(GSS[,c("sex", "race.factor", "age_rc", "polviews.factor", 
                                  "class_.factor", "marital.factor", "childs_rc.factor",
                                   "born.factor", "family16_rc.factor", "mobile16.factor" )], 
                   function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
  rename('Cramers V' = "V1") %>% 
  add_column(Variable = c("Sex", "Race", "Age Group", "Political Views", 
                          "Social Class", "Marital Status", "Are Parents", 
                          "Born in US", "Lived w/ Both Parents at 16yrs Old", 
                          "Location at 16yrs Old"), .before = 1) %>% 
  select(-rowname)

Demographics2 <- Demographics %>% left_join(Demo_chi) %>% left_join(Demo_v)

#Religion Cramer's V
Rel_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("sprtprsn.factor", "relpersn.factor", "postlife.factor", 
                                   "god.factor", "reborn.factor", "bible.factor")], 
                     function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
  rename('Cramers V' = "V1") %>% 
  add_column(Variable = c("Level of Spirtuality", "Level of Religiousness", "Life after Death",
                          "Existence of God", "'Born Again' Experience", "View of Bible"),
             .before = 1) %>% 
  select(-rowname)

Religion2 <- Religion %>% left_join(Rel_chi) %>% left_join(Rel_v)

#Education Cramer's V
Edu_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                    (lapply(GSS[,c("degree.factor", "hsphys.factor", "hschem.factor", 
                                   "hsbio.factor", "colsci.factor")], 
                    function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
  rename('Cramers V' = "V1") %>% 
  add_column(Variable = c("Highest Degree", "Taken High School Physics", 
                          "Taken High School Chemistry", "Taken High School Biology", 
                          "Taken Any College-Level Science" ), .before = 1) %>% 
  select(-rowname)

Education2 <- Education %>% left_join(Edu_chi) %>% left_join(Edu_v)

#Science Cramer's V
Sci_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
                      (lapply(GSS[,c("toofast.factor", "nextgen.factor", "advfront.factor",
                                     "natsci.factor", "consci.factor", "scibnfts.factor",
                                     "astrosci.factor", "scifrom_rc.factor")], 
                      function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
  rename('Cramers V' = "V1") %>% 
  add_column(Variable = c("Science Changes Life Too Fast", 
                          "Science Gives More Opportunities to Next Generation", 
                          "Federal Goverment Should Support Science Research", 
                          "Amount Spent Supporting Scientific Research", 
                          "Confidence in Scientific Community", 
                          "Benifits of Science Research Outweigh Harms", 
                          "Astrology is Scientific", 
                          "Source of Information about Science"), .before = 1) %>% 
  select(-rowname)

Science2 <- Science %>% left_join(Sci_chi) %>% left_join(Sci_v)

Not surprisingly, although the differences between the groups was significant for many variables, the strength of the associations were weak overall. This makes intuitive sense as many complicated social factors and past experiences would likely contribute to a person’s beliefs.

Formatted Tables

Although I have used the kableExtra package before, I haven’t had the opportunity to try out some of the available formatting options. Below, you can see I made some custom alignments on column headings, added bold formatting, and indented the labels under each category. My favorite though is the option to keep the column header floating at top as you scroll down the chart, which is so helpful when reviewing long tables.

kable(Demographics2,
      align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                fixed_thead = T) %>%
  column_spec(1, bold = T) %>%
  add_indent(c(2:3, 5:7, 9:13, 15:21, 23:26, 28:32, 34:35, 37:38,
               40:41, 43:45))
Variable Earth around sun Sun around earth Total Chisq p-value Bonferroni corrected p-value Chisq w/out simulated p-value Cramers V
Sex 0.0004998 0.0144928 0.0000000 0.1397256
Female 50.68% (1705) 67.84% (599) 54.25% (2304)
Male 49.32% (1659) 32.16% (284) 45.75% (1943)
Race 0.0004998 0.0144928 0.0000000 0.1324984
White 79.25% (2666) 68.63% (606) 77.04% (3272)
Black 11.83% (398) 23.22% (205) 14.20% (603)
Other 8.92% (300) 8.15% (72) 8.76% (372)
Age Group 0.0004998 0.0144928 0.0000001 0.0957567
18-34 yrs 28.51% (959) 23.22% (205) 27.41% (1164)
35-49 yrs 29.07% (978) 25.37% (224) 28.30% (1202)
50-64 yrs 28.33% (953) 29.45% (260) 28.56% (1213)
65-79 yrs 10.85% (365) 16.42% (145) 12.01% (510)
80 yrs and above 3.24% (109) 5.55% (49) 3.72% (158)
Political Views 0.0009995 0.0289855 0.0000507 0.0832247
Extrmly conservative 3.60% (121) 4.42% (39) 3.77% (160)
Conservative 16.68% (561) 15.52% (137) 16.44% (698)
Slghtly conservative 16.23% (546) 14.50% (128) 15.87% (674)
Moderate 34.57% (1163) 43.15% (381) 36.36% (1544)
Slightly liberal 11.92% (401) 9.85% (87) 11.49% (488)
Liberal 12.78% (430) 8.83% (78) 11.96% (508)
Extremely liberal 4.22% (142) 3.74% (33) 4.12% (175)
Social Class 0.0004998 0.0144928 0.0000000 0.1011017
Lower class 7.10% (239) 13.36% (118) 8.41% (357)
Working class 44.11% (1484) 45.98% (406) 44.50% (1890)
Middle class 45.24% (1522) 37.26% (329) 43.58% (1851)
Upper class 3.54% (119) 3.40% (30) 3.51% (149)
Marital Status 0.0004998 0.0144928 0.0000000 0.1071902
Never married 26.96% (907) 25.82% (228) 26.72% (1135)
Married 47.86% (1610) 40.88% (361) 46.41% (1971)
Separated 3.30% (111) 3.96% (35) 3.44% (146)
Divorced 16.62% (559) 18.01% (159) 16.91% (718)
Widowed 5.26% (177) 11.33% (100) 6.52% (277)
Are Parents 0.0004998 0.0144928 0.0000000 0.1046251
No children 31.27% (1052) 19.59% (173) 28.84% (1225)
Have children 68.73% (2312) 80.41% (710) 71.16% (3022)
Born in US 0.3733133 1.0000000 0.3866775 0.0142401
Yes 89.54% (3012) 90.60% (800) 89.76% (3812)
No 10.46% (352) 9.40% (83) 10.24% (435)
Lived w/ Both Parents at 16yrs Old 0.0004998 0.0144928 0.0001108 0.0599425
Mother and father 70.15% (2360) 63.31% (559) 68.73% (2919)
Other 29.85% (1004) 36.69% (324) 31.27% (1328)
Location at 16yrs Old 0.0004998 0.0144928 0.0000011 0.0804230
Same city 36.33% (1222) 45.30% (400) 38.19% (1622)
Same st,dif city 26.25% (883) 25.25% (223) 26.04% (1106)
Different state 37.43% (1259) 29.45% (260) 35.77% (1519)
kable(Religion2,
      align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                fixed_thead = T) %>%
  column_spec(1, bold = T) %>%
  add_indent(c(2:5, 7:10, 12:14, 16:21, 23:24, 26:29))
Variable Earth around sun Sun around earth Total Chisq p-value Bonferroni corrected p-value Chisq w/out simulated p-value Cramers V
Level of Spirtuality 0.8950525 1.0000000 0.8964461 0.0118854
Not spiritual 8.68% (292) 9.40% (83) 8.83% (375)
Slight spiritual 19.65% (661) 20.05% (177) 19.73% (838)
Moderate spirtual 39.98% (1345) 39.18% (346) 39.82% (1691)
Very spiritual 31.69% (1066) 31.37% (277) 31.62% (1343)
Level of Religiousness 0.0004998 0.0144928 0.0000367 0.0739062
Not religious 18.64% (627) 12.00% (106) 17.26% (733)
Slight religious 22.80% (767) 23.56% (208) 22.96% (975)
Moderate religious 40.46% (1361) 43.04% (380) 40.99% (1741)
Very religious 18.10% (609) 21.40% (189) 18.79% (798)
Life after Death
No 19.17% (645) 19.48% (172) 19.24% (817)
Yes 80.83% (2719) 80.52% (711) 80.76% (3430)
Not applicable 0.00% (0) 0.00% (0) 0.00% (0)
Existence of God 0.0004998 0.0144928 0.0000000 0.1012334
Dont believe 3.75% (126) 2.60% (23) 3.51% (149)
No way to find out 4.52% (152) 2.83% (25) 4.17% (177)
Some higher power 12.87% (433) 7.47% (66) 11.75% (499)
Believe sometimes 3.48% (117) 3.51% (31) 3.48% (148)
Believe but doubts 18.16% (611) 15.06% (133) 17.52% (744)
Know god exists 57.22% (1925) 68.52% (605) 59.57% (2530)
‘Born Again’ Experience
No 61.30% (2062) 50.96% (450) 59.15% (2512)
Yes 38.70% (1302) 49.04% (433) 40.85% (1735)
Don’t know 0.00% (0) 0.00% (0) 0.00% (0)
View of Bible 0.0004998 0.0144928 0.0000000 0.1386902
Book of fables 20.60% (693) 13.48% (119) 19.12% (812)
Inspired word 49.73% (1673) 42.92% (379) 48.32% (2052)
Word of god 28.15% (947) 43.15% (381) 31.27% (1328)
Other 1.52% (51) 0.45% (4) 1.30% (55)
kable(Education2,
      align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                fixed_thead = T) %>%
  column_spec(1, bold = T) %>%
  add_indent(c(2:6, 8:10, 12:14, 16:18, 20:22))
Variable Earth around sun Sun around earth Total Chisq p-value Bonferroni corrected p-value Chisq w/out simulated p-value Cramers V
Highest Degree 0.0004998 0.0144928 0 0.2160748
Lt high school 8.17% (275) 19.25% (170) 10.48% (445)
High school 47.47% (1597) 57.08% (504) 49.47% (2101)
Junior college 8.71% (293) 8.95% (79) 8.76% (372)
Bachelor 22.98% (773) 11.55% (102) 20.60% (875)
Graduate 12.66% (426) 3.17% (28) 10.69% (454)
Taken High School Physics 0.0004998 0.0144928 0 0.1093564
No 61.68% (2075) 69.88% (617) 63.39% (2692)
Yes 34.54% (1162) 23.22% (205) 32.19% (1367)
Not applicable 3.78% (127) 6.91% (61) 4.43% (188)
Taken High School Chemistry 0.0004998 0.0144928 0 0.1448228
No 38.05% (1280) 52.32% (462) 41.02% (1742)
Yes 58.17% (1957) 40.77% (360) 54.56% (2317)
Not applicable 3.78% (127) 6.91% (61) 4.43% (188)
Taken High School Biology 0.0004998 0.0144928 0 0.1397895
No 14.71% (495) 25.71% (227) 17.00% (722)
Yes 81.51% (2742) 67.38% (595) 78.57% (3337)
Not applicable 3.78% (127) 6.91% (61) 4.43% (188)
Taken Any College-Level Science 0.0004998 0.0144928 0 0.2037968
No 46.94% (1579) 71.80% (634) 52.11% (2213)
Yes 51.78% (1742) 26.84% (237) 46.60% (1979)
Not applicable 1.28% (43) 1.36% (12) 1.30% (55)
kable(Science2,
      align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                fixed_thead = T) %>%
  column_spec(1, bold = T) %>%
  add_indent(c(2:5, 7:10, 12:15, 17:20, 22:24, 26:29, 31:33, 35:37))
Variable Earth around sun Sun around earth Total Chisq p-value Bonferroni corrected p-value Chisq w/out simulated p-value Cramers V
Science Changes Life Too Fast 0.0004998 0.0144928 0.0000000 0.1109579
Strongly disagree 8.44% (284) 4.87% (43) 7.70% (327)
Disagree 48.66% (1637) 39.18% (346) 46.69% (1983)
Agree 33.65% (1132) 42.70% (377) 35.53% (1509)
Strongly agree 9.24% (311) 13.25% (117) 10.08% (428)
Science Gives More Opportunities to Next Generation 0.6631684 1.0000000 0.6574115 0.0194626
Strongly disagree 1.16% (39) 1.36% (12) 1.20% (51)
Disagree 7.34% (247) 8.38% (74) 7.56% (321)
Agree 53.15% (1788) 53.34% (471) 53.19% (2259)
Strongly agree 38.35% (1290) 36.92% (326) 38.05% (1616)
Federal Goverment Should Support Science Research 0.0004998 0.0144928 0.0000024 0.0824596
Strongly disagree 1.28% (43) 1.93% (17) 1.41% (60)
Disagree 10.17% (342) 14.04% (124) 10.97% (466)
Agree 57.70% (1941) 61.27% (541) 58.44% (2482)
Strongly agree 30.86% (1038) 22.76% (201) 29.17% (1239)
Amount Spent Supporting Scientific Research 0.0024988 0.0724638 0.0019679 0.0590920
Too little 42.36% (1425) 37.83% (334) 41.42% (1759)
About right 44.00% (1480) 44.05% (389) 44.01% (1869)
Too much 9.81% (330) 13.93% (123) 10.67% (453)
Don’t know 3.83% (129) 4.19% (37) 3.91% (166)
Confidence in Scientific Community 0.0004998 0.0144928 0.0000000 0.1236839
Hardly any 5.08% (171) 9.29% (82) 5.96% (253)
Only some 48.01% (1615) 57.76% (510) 50.04% (2125)
A great deal 46.91% (1578) 32.96% (291) 44.01% (1869)
Benifits of Science Research Outweigh Harms 0.0004998 0.0144928 0.0000000 0.1429284
Benefits greater 77.76% (2616) 63.08% (557) 74.71% (3173)
About equal if volunteered 12.22% (411) 17.89% (158) 13.40% (569)
Harmful results greater 7.49% (252) 13.25% (117) 8.69% (369)
Dont know 2.53% (85) 5.78% (51) 3.20% (136)
Astrology is Scientific 0.0004998 0.0144928 0.0000000 0.1303334
Not at all scientific 67.51% (2271) 52.32% (462) 64.35% (2733)
Sort of scientific 26.66% (897) 37.83% (334) 28.99% (1231)
Very scientific 5.83% (196) 9.85% (87) 6.66% (283)
Source of Information about Science 0.0004998 0.0144928 0.0000000 0.1225353
TV 30.38% (1022) 43.37% (383) 33.08% (1405)
Internet 40.52% (1363) 28.20% (249) 37.96% (1612)
Other 29.10% (979) 28.43% (251) 28.96% (1230)

Machine Learning with Random Forest Modeling

Looking over the charts, you can see large number (29) of variables, most which do differ significantly between the two groups, but which likely only weakly affect the variable of interest. It sets up an interesting situation to see if machine learning could the use breadth of variables to help predict if a person would answer the question correctly.

Dipping my toes into the machine learning world, I went with a random forest modeling approach. I started by subsetting the data into a training set and a testing set.

##Random Forest Prediction


###Modeling

#Separate out training and test datasets
set.seed(135)

sep <- sample(2, nrow(GSS), replace = T, prob = c(.7, .3))

train <- GSS[sep==1,]
test <- GSS[sep==2,]

From here I setup a basic model with the training data and plotted the out-of-bag error rate.

#random forest

set.seed(246)

model <- randomForest(earthsun ~ ., train, proximity = T)

model
## 
## Call:
##  randomForest(formula = earthsun ~ ., data = train, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 20.72%
## Confusion matrix:
##                  Earth around sun Sun around earth class.error
## Earth around sun             2333               30  0.01269573
## Sun around earth              588               31  0.94991922
oob_error <- data.frame(Trees=rep(1:nrow(model$err.rate), times=3),
                        Type=rep(c("OOB", "Earth around sun", "Sun around earth"),
                                 each=nrow(model$err.rate)), 
                        Error=c(model$err.rate[,"OOB"],
                                model$err.rate[, "Earth around sun"], 
                                model$err.rate[, "Sun around earth"]))

oob_error$Type[oob_error$Type == "OOB"] <- "Overall"


ggplot(oob_error, aes(x=Trees, y=Error))+
  geom_line(aes(color=Type))+
  labs(title = "OOB Error Rate by Number of Trees",
       y = "Error Rate",
       x = "Number of Trees") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.line.x.bottom = element_line(color = "grey30"),
        axis.line.y.left = element_line(color = "grey30"),
        panel.background = element_blank())+ 
  guides(color = guide_legend(reverse = TRUE))

An out-of-bag error rate of 20.72% isn’t something to write home about, and even worse is the plot with the error rate for “Sun around earth” stabilizing at the very top. Thankfully, no one is actually in need of a model to predict the correct understanding of the setup of the solar system.

Next step is to run the model on the test data and measure the performance.

#Predict
predict2 <- predict(model, test)
confusionMatrix(table(predict2, test$earthsun)) 
## Confusion Matrix and Statistics
## 
##                   
## predict2           Earth around sun Sun around earth
##   Earth around sun              990              255
##   Sun around earth               11                9
##                                           
##                Accuracy : 0.7897          
##                  95% CI : (0.7662, 0.8119)
##     No Information Rate : 0.7913          
##     P-Value [Acc > NIR] : 0.5712          
##                                           
##                   Kappa : 0.035           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.98901         
##             Specificity : 0.03409         
##          Pos Pred Value : 0.79518         
##          Neg Pred Value : 0.45000         
##              Prevalence : 0.79130         
##          Detection Rate : 0.78261         
##    Detection Prevalence : 0.98419         
##       Balanced Accuracy : 0.51155         
##                                           
##        'Positive' Class : Earth around sun
## 

When I tested the model, I received an accuracy of 0.7897. More disheartening is the placement of almost the entire sample in the “Earth around sun” group. I could have placed them in this category before the whole process started since approximately 75% of the sample falls into it.

I ran the tuneRF function to see if any small improvement of the model could be made by adjusting the mtry (number of variables tried after each split). The previous, automatically selected mtry was 5. I set the limit to 300 trees as not much improvement was seen after that on the plot.

t <- tuneRF(train[,-1], train$earthsun,
       stepFactor = 0.5, 
       plot= T, 
       ntreeTry = 300,
       trace= T,
       improve = 0.05)
## mtry = 5  OOB error = 20.69% 
## Searching left ...
## mtry = 10    OOB error = 21.16% 
## -0.02269044 0.05 
## Searching right ...
## mtry = 2     OOB error = 20.66% 
## 0.001620746 0.05

It looks like the out-of-bag error rate was lowest when the mtry = 2, so I have updated the model with this and a tree limit of 300.

#updating the model again
model_v2 <- randomForest(earthsun ~ ., train, proximity = T,
                         ntree= 300, 
                         mtry = 2,
                         importance = T)

model_v2
## 
## Call:
##  randomForest(formula = earthsun ~ ., data = train, proximity = T,      ntree = 300, mtry = 2, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.42%
## Confusion matrix:
##                  Earth around sun Sun around earth class.error
## Earth around sun             2358                5 0.002115954
## Sun around earth              604               15 0.975767367

The out-of-bag error rate is now down to 20.42%. I will take any improvement I can get. Now to see how the updated model preforms with the testing data.

predict2_v2 <- predict(model_v2, test)
confusionMatrix(table(predict2_v2, test$earthsun))  
## Confusion Matrix and Statistics
## 
##                   
## predict2_v2        Earth around sun Sun around earth
##   Earth around sun              997              262
##   Sun around earth                4                2
##                                           
##                Accuracy : 0.7897          
##                  95% CI : (0.7662, 0.8119)
##     No Information Rate : 0.7913          
##     P-Value [Acc > NIR] : 0.5712          
##                                           
##                   Kappa : 0.0056          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.996004        
##             Specificity : 0.007576        
##          Pos Pred Value : 0.791898        
##          Neg Pred Value : 0.333333        
##              Prevalence : 0.791304        
##          Detection Rate : 0.788142        
##    Detection Prevalence : 0.995257        
##       Balanced Accuracy : 0.501790        
##                                           
##        'Positive' Class : Earth around sun
## 

Unfortunately, the accuracy rate stayed the same. Upside of all of this process is that one day when I run a model for work that may have some actual consequence to the world, I will be that much more excited when the model preforms well (a girl can dream).

This does still provide a great opportunity to dig into the model and understand it a bit more. Below is a frequency plot of the number of nodes for the random forest trees. The peak is hovering around 575 nodes.

#number of nodes
data_frame(treesize(model_v2)) %>% 
  ggplot() + 
  geom_bar(aes(x=treesize(model_v2))) +
  labs(title = "Number of Nodes Used in Random Forests",
       y = "Count",
       x = "Number of Nodes Used") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.line.x.bottom = element_line(color = "grey30"),
        axis.line.y.left = element_line(color = "grey30"),
        panel.background = element_blank())+ 
  guides(color = guide_legend(reverse = TRUE))

Another item to review is which variables were most important in the model. Below I have called the top variables.

#variables of importance 
varImpPlot(model_v2,
           sort = T,
           n.var = 10,
           main= "Top 10 of Variable Importance")

Ideally it would be great for models to be as simple as they can, and since the performance on this one is lower, why not try to at least simplify it by only using some of the top variables?

#simplified model 
#remove other variables
train2 <- train %>% select(earthsun, polviews.factor, age_rc, marital.factor, degree.factor,
                           god.factor, sprtprsn.factor, class_.factor, toofast.factor, 
                           relpersn.factor, scibnfts.factor)

test2 <- test %>% select(earthsun, polviews.factor, age_rc, marital.factor, degree.factor,
                           god.factor, sprtprsn.factor,  class_.factor, toofast.factor, 
                           relpersn.factor, scibnfts.factor)

#rerun model with less variables, remove mtry
model_v3 <- randomForest(earthsun ~ ., train2, proximity = T)

model_v3
## 
## Call:
##  randomForest(formula = earthsun ~ ., data = train2, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 22%
## Confusion matrix:
##                  Earth around sun Sun around earth class.error
## Earth around sun             2268               95  0.04020313
## Sun around earth              561               58  0.90630048
#Predict
predict3_v3 <- predict(model_v3, test2)
confusionMatrix(table(predict3_v3, test2$earthsun))  
## Confusion Matrix and Statistics
## 
##                   
## predict3_v3        Earth around sun Sun around earth
##   Earth around sun              962              245
##   Sun around earth               39               19
##                                           
##                Accuracy : 0.7755          
##                  95% CI : (0.7515, 0.7982)
##     No Information Rate : 0.7913          
##     P-Value [Acc > NIR] : 0.921           
##                                           
##                   Kappa : 0.0463          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.96104         
##             Specificity : 0.07197         
##          Pos Pred Value : 0.79702         
##          Neg Pred Value : 0.32759         
##              Prevalence : 0.79130         
##          Detection Rate : 0.76047         
##    Detection Prevalence : 0.95415         
##       Balanced Accuracy : 0.51650         
##                                           
##        'Positive' Class : Earth around sun
## 

The simplified model increased the out-of-bag error rate to 22% with the training data and brought the accuracy down to 0.7755 with the test data.

Wrapping up

All around this was an interesting exercise allowing me to try a few new things. Many thanks to my friend Phil, https://github.com/probertswsu, who helped me see the trees from the forest during this process (pun intended).

The General Social Survey (GSS) is a project of the independent research organization NORC at the University of Chicago, with principal funding from the National Science Foundation. Smith, Tom W., Davern, Michael, Freese, Jeremy, and Morgan, Stephen, General Social Surveys, 1972-2018 [machine-readable data file] /Principal Investigator, Smith, Tom W.; Co-Principal Investigators, Michael Davern, Jeremy Freese, and Stephen Morgan; Sponsored by National Science Foundation. –NORC ed.– Chicago: NORC, 2018: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org.