Recently I came across a new R table package that was ideal for displaying categorical data across study arms and would be super helpful for my work with clinical trials data. I set out to find some light-hearted data to test the package out. I settled on using data from the General Social Survey (GSS) from the independent research organization NORC at the University of Chicago (full citation below). The GSS has a set of science related questions, including this gem: “Now, does the Earth go around the Sun, or does the Sun go around the Earth?” With the numerous other categorical questions asked during the GSS, I thought this would be a perfect set of data to use.
Unfortunately, I ran into some major roadblocks with the new table package. Since it is still in development, I decided to pivot to kableExtra and learn a few new formatting tweaks. Even so, I was feeling less than satisfied, so I took a step further and ran a random forest model.
Enough said, let’s dig in:
#Load data
GSS <- read_excel("GSS.xls")
#Check out dataframe
str(GSS)
## tibble [18,306 x 51] (S3: tbl_df/tbl/data.frame)
## $ sei10 : num [1:18306] 24.2 52.4 35.4 38.8 -1 59.8 24.2 43.8 81.6 82.5 ...
## $ toofast : chr [1:18306] "Agree" "Disagree" "Not applicable" "Not applicable" ...
## $ nextgen : chr [1:18306] "Agree" "Strongly agree" "Not applicable" "Not applicable" ...
## $ scifrom : chr [1:18306] "The internet" "The internet" "Not applicable" "Not applicable" ...
## $ sprtprsn: chr [1:18306] "Very spiritual" "Slight spiritual" "Slight spiritual" "Very spiritual" ...
## $ relpersn: chr [1:18306] "Very religious" "Slight religious" "Slight religious" "Slight religious" ...
## $ reborn : chr [1:18306] "Yes" "No" "No" "Yes" ...
## $ god : chr [1:18306] "Know god exists" "Believe but doubts" "Know god exists" "Know god exists" ...
## $ comprend: chr [1:18306] "Fair" "Fair" "Fair" "Good" ...
## $ class_ : chr [1:18306] "Working class" "Working class" "Middle class" "Working class" ...
## $ consci : chr [1:18306] "Only some" "A great deal" "Don't know" "Not applicable" ...
## $ advfront: chr [1:18306] "Agree" "Agree" "Not applicable" "Not applicable" ...
## $ astrosci: chr [1:18306] "Sort of scientific" "Not at all scientific" "Not applicable" "Not applicable" ...
## $ scibnfts: chr [1:18306] "Dont know" "About equal if volunteered" "Not applicable" "Not applicable" ...
## $ prestg10: num [1:18306] 48 43 31 47 0 46 48 44 63 64 ...
## $ ballot : chr [1:18306] "Ballot c" "Ballot b" "Ballot b" "Ballot a" ...
## $ zodiac : chr [1:18306] "No answer" "Leo" "Leo" "Libra" ...
## $ hsphys : chr [1:18306] "No" "No" "Not applicable" "Not applicable" ...
## $ hschem : chr [1:18306] "No" "Yes" "Not applicable" "Not applicable" ...
## $ hsbio : chr [1:18306] "Yes" "Yes" "Not applicable" "Not applicable" ...
## $ hsmath : chr [1:18306] "Pre-algebra" "General math, business, or vocational math" "Not applicable" "Not applicable" ...
## $ colsci : chr [1:18306] "No" "No" "Not applicable" "Not applicable" ...
## $ coldeg1 : chr [1:18306] "Associate's" "Not applicable" "Not applicable" "Not applicable" ...
## $ earthsun: chr [1:18306] "Earth around sun" "Sun around earth" "Not applicable" "Not applicable" ...
## $ happy : chr [1:18306] "Pretty happy" "Very happy" "Pretty happy" "Very happy" ...
## $ bible : chr [1:18306] "Word of god" "Don't know" "Word of god" "Inspired word" ...
## $ postlife: chr [1:18306] "Yes" "Don't know" "No" "Yes" ...
## $ sex : chr [1:18306] "Female" "Male" "Female" "Female" ...
## $ degree : chr [1:18306] "High school" "High school" "Lt high school" "High school" ...
## $ educ : chr [1:18306] "13" "14" "9" "12" ...
## $ age : chr [1:18306] "50" "27" "67" "50" ...
## $ childs : chr [1:18306] "3" "1" "2" "1" ...
## $ divorce : chr [1:18306] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
## $ marital : chr [1:18306] "Never married" "Never married" "Divorced" "Never married" ...
## $ wrkgovt : chr [1:18306] "Private" "Private" "Private" "Private" ...
## $ hrs2 : chr [1:18306] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
## $ id_ : num [1:18306] 1 2 3 4 5 6 7 8 9 10 ...
## $ race : chr [1:18306] "Black" "Other" "White" "Black" ...
## $ reg16 : chr [1:18306] "Foreign" "Middle atlantic" "Middle atlantic" "Middle atlantic" ...
## $ mobile16: chr [1:18306] "Different state" "Same city" "Same city" "Same city" ...
## $ relig : chr [1:18306] "None" "Catholic" "Catholic" "Protestant" ...
## $ natsci : chr [1:18306] "About right" "Too much" "Don't know" "Too little" ...
## $ natenvir: chr [1:18306] "Too little" "Not applicable" "About right" "Not applicable" ...
## $ natspac : chr [1:18306] "Too much" "Not applicable" "Too much" "Not applicable" ...
## $ polviews: chr [1:18306] "Extremely liberal" "Slightly liberal" "Conservative" "Slightly liberal" ...
## $ partyid : chr [1:18306] "Strong democrat" "Ind,near rep" "Strong democrat" "Strong democrat" ...
## $ rincome : chr [1:18306] "$25000 or more" "$25000 or more" "Not applicable" "$1000 to 2999" ...
## $ income : chr [1:18306] "$25000 or more" "$25000 or more" "Refused" "$25000 or more" ...
## $ born : chr [1:18306] "No" "No" "No" "Yes" ...
## $ family16: chr [1:18306] "Mother & father" "Mother & father" "Mother & father" "Mother" ...
## $ year : num [1:18306] 2006 2006 2006 2006 2006 ...
#Date range and distribution of downloaded data
GSS <- GSS %>% filter(!is.na(year))
GSS %>% count(year)
## # A tibble: 7 x 2
## year n
## <dbl> <int>
## 1 2006 4510
## 2 2008 2023
## 3 2010 2044
## 4 2012 1974
## 5 2014 2538
## 6 2016 2867
## 7 2018 2348
#Count responses to "Now, does the Earth go around the Sun, or does the Sun go around the Earth?"
GSS %>% count(earthsun)
## # A tibble: 5 x 2
## earthsun n
## <chr> <int>
## 1 Dont know 615
## 2 Earth around sun 6600
## 3 No answer 34
## 4 Not applicable 9188
## 5 Sun around earth 1867
Reviewing the distribution of answers to the sun/earth question, I was honestly surprised that “Sun around the earth” was as high as it was. For those who chose definitively, 22% got it wrong. I would love to give the American public the benefit of the doubt and say that a lot of people where just tired or confused by the time they reached this question, but I fear that is not the case. To see if there are any differences between individuals who believe that the sun revolves around the earth and vice versa, I narrowed down the sample to include only those who provided definitive answers on this and other relevant questions (excluding records containing refusals and “don’t know” answers).
#Filter for complete responses for association testing and modeling
GSS <- GSS %>% filter((earthsun == "Earth around sun" |
earthsun == "Sun around earth") &
marital != "No answer" &
polviews != "Don't know" &
polviews != "No answer" &
born != "No answer" &
born != "Don't know" &
family16 != "No answer" &
childs != "Dk na" &
class_ != "Don't know" &
class_ != "No answer" &
mobile16 != "Don't know" &
mobile16 != "No answer" &
sprtprsn != "Dont know" &
sprtprsn != "No answer" &
relpersn != "Dont know" &
relpersn != "No answer" &
reborn != "Don't know" &
reborn != "No answer" &
god != "Don't know" &
god != "No answer" &
bible !="Don't know" &
bible != "No answer" &
postlife != "Don't know" &
postlife != "No answer" &
hsphys != "Dont know" &
hsphys != "No answer" &
hschem != "Dont know" &
hschem != "No answer" &
hsbio != "Dont know" &
hsbio != "No answer" &
colsci != "Dont know" &
colsci != "No answer" &
degree != "No answer" &
toofast != "Dont know" &
toofast != "No answer" &
toofast != "Not applicable" &
nextgen != "Dont know" &
nextgen != "No answer" &
nextgen != "Not applicable" &
scifrom != "Dont know" &
scifrom != "No answer" &
scifrom != "Not applicable" &
consci != "Don't know" &
consci != "No answer" &
consci != "Not applicable" &
advfront != "Dont know" &
advfront != "No answer" &
advfront != "Not applicable" &
astrosci != "Dont know" &
astrosci != "No answer" &
astrosci != "Not applicable" &
scibnfts != "No answer" &
scibnfts != "Not applicable" &
natsci != "No answer")
Since the data downloaded from the GSS site came as character data, I set factor levels for each of the questions. This also allowed me to set the orders of the factors as I wished for the tables.
#Set factors for categorical variables of interest
GSS <- GSS %>% mutate(marital.factor = factor(marital, levels =
c("Never married", "Married",
"Separated", "Divorced",
"Widowed"))) %>%
mutate(race.factor = factor(race, levels =
c("White", "Black", "Other"))) %>%
mutate(polviews.factor = factor(polviews, levels =
c("Extrmly conservative", "Conservative",
"Slghtly conservative", "Moderate", "Slightly liberal",
"Liberal", "Extremely liberal"))) %>%
mutate(born.factor = factor(born, levels =
c("Yes", "No"))) %>%
mutate(family16_rc = case_when(family16 == "Mother & father" ~ "Mother and father",
T ~ "Other")) %>%
mutate(family16_rc.factor = factor(family16_rc, levels =
c("Mother and father", "Other"))) %>%
mutate(age_rc = case_when(age >= 18 & age <= 34 ~ "18-34 yrs",
age >= 35 & age <= 49 ~ "35-49 yrs",
age >= 50 & age <= 64 ~ "50-64 yrs",
age >= 65 & age <= 79 ~ "65-79 yrs",
age >= 80 ~ "80 yrs and above")) %>%
mutate(childs_rc = case_when(childs == "0" ~ "No children",
T ~ "Have children")) %>%
mutate(childs_rc.factor = factor(childs_rc, levels =
c("No children", "Have children"))) %>%
mutate(class_.factor = factor(class_, levels =
c("Lower class", "Working class", "Middle class",
"Upper class"))) %>%
mutate(mobile16.factor = factor(mobile16, levels =
c("Same city", "Same st,dif city", "Different state"))) %>%
mutate(year = factor(year)) %>%
mutate(sprtprsn_rc = case_when(sprtprsn == "Modeate spirtual" ~ "Moderate spirtual",
T ~ as.character(sprtprsn))) %>%
mutate(sprtprsn.factor = factor(sprtprsn_rc, levels =
c("Not spiritual", "Slight spiritual", "Moderate spirtual",
"Very spiritual"))) %>%
mutate(relpersn_rc = case_when(relpersn == "Modrte religious" ~ "Moderate religious",
T ~ as.character(relpersn))) %>%
mutate(relpersn.factor = factor(relpersn_rc, levels =
c("Not religious", "Slight religious", "Moderate religious",
"Very religious"))) %>%
mutate(reborn.factor = factor(reborn, levels=
c("No", "Yes", "Don't know"))) %>%
mutate(god.factor = factor(god, levels =
c("Dont believe", "No way to find out",
"Some higher power", "Believe sometimes",
"Believe but doubts", "Know god exists"))) %>%
mutate(bible.factor = factor(bible, levels =
c("Book of fables", "Inspired word", "Word of god",
"Other"))) %>%
mutate(postlife.factor = factor(postlife, levels =
c("No", "Yes", "Not applicable"))) %>%
mutate(hsphys.factor = factor(hsphys, levels =
c("No", "Yes", "Not applicable"))) %>%
mutate(hschem.factor = factor(hschem, levels =
c("No", "Yes", "Not applicable"))) %>%
mutate(hsbio.factor = factor(hsbio, levels =
c("No", "Yes", "Not applicable"))) %>%
mutate(colsci.factor = factor(colsci, levels =
c("No", "Yes", "Not applicable"))) %>%
mutate(degree.factor = factor(degree, levels =
c("Lt high school", "High school", "Junior college",
"Bachelor", "Graduate"))) %>%
mutate(toofast.factor = factor(toofast, levels =
c("Strongly disagree", "Disagree", "Agree",
"Strongly agree"))) %>%
mutate(nextgen.factor = factor(nextgen, levels =
c("Strongly disagree", "Disagree", "Agree",
"Strongly agree"))) %>%
mutate(scifrom_rc = case_when(scifrom == "Tv" ~ "TV",
scifrom == "The internet" ~ "Internet",
T ~ "Other")) %>%
mutate(scifrom_rc.factor = factor(scifrom_rc, levels =
c("TV", "Internet", "Other"))) %>%
mutate(consci.factor = factor(consci, levels =
c("Hardly any", "Only some", "A great deal"))) %>%
mutate(advfront.factor = factor(advfront , levels =
c("Strongly disagree", "Disagree", "Agree",
"Strongly agree"))) %>%
mutate(astrosci.factor = factor(astrosci, levels =
c("Not at all scientific", "Sort of scientific",
"Very scientific"))) %>%
mutate(scibnfts.factor = factor(scibnfts, levels =
c("Benefits greater", "About equal if volunteered",
"Harmful results greater", "Dont know"))) %>%
mutate(natsci.factor = factor(natsci, levels =
c("Too little", "About right", "Too much", "Don't know"))) %>%
mutate(earthsun = factor(earthsun, levels =
c("Earth around sun", "Sun around earth"))) %>%
mutate(sex = as.factor(sex)) %>%
mutate(age_rc = as.factor(age_rc))
#Narrow the dataset to just the variables of interest for modeling
GSS <- GSS %>% select(earthsun, sex, race.factor, age_rc, polviews.factor,
class_.factor, marital.factor, childs_rc.factor,
born.factor, family16_rc.factor, mobile16.factor,
sprtprsn.factor, relpersn.factor, postlife.factor,
god.factor, reborn.factor, bible.factor,
degree.factor, hsphys.factor, hschem.factor,
hsbio.factor, colsci.factor,
toofast.factor, nextgen.factor, advfront.factor, natsci.factor,
consci.factor, scibnfts.factor, astrosci.factor, scifrom_rc.factor)
My first step for putting together tables showing the differences between those who believe the earth orbits the sun and those who believe the sun orbits the earth is to setup a function using tabyl from the janitor package. This allowed me to get a simple n and % breakdown for each variable in each group of interest (finished tables are displayed further down.)
# Setup function to use for tables
tab_func <- function(df, var, var2) {
var <- enquo(var)
var2 <- enquo(var2)
df %>% tabyl(!! var, earthsun) %>%
adorn_totals(c("col")) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
rename("Variable" = !! var) %>%
add_row(Variable = !!var2, 'Earth around sun' = "", 'Sun around earth' = "",
Total = "",.before = 1)
}
##Setup Demographics, table 1
#Variables included: sex, race.factor, age_rc, polviews.factor,
#class_.factor, marital.factor, childs_rc.factor,
#born.factor, family16_rc.factor, mobile16.factor
Demographics <- tab_func(GSS, sex, "Sex") %>%
bind_rows(tab_func(GSS, race.factor, "Race")) %>%
bind_rows(tab_func(GSS, age_rc, "Age Group")) %>%
bind_rows(tab_func(GSS, polviews.factor, "Political Views")) %>%
bind_rows(tab_func(GSS, class_.factor, "Social Class")) %>%
bind_rows(tab_func(GSS, marital.factor, "Marital Status")) %>%
bind_rows(tab_func(GSS, childs_rc.factor, "Are Parents")) %>%
bind_rows(tab_func(GSS, born.factor, "Born in US")) %>%
bind_rows(tab_func(GSS, family16_rc.factor, "Lived w/ Both Parents at 16yrs Old")) %>%
bind_rows(tab_func(GSS, mobile16.factor, "Location at 16yrs Old"))
##Setup Religion, table 2
#Variables included: sprtprsn.factor, relpersn.factor, postlife.factor,
#god.factor, reborn.factor, bible.factor
Religion <- tab_func(GSS, sprtprsn.factor, "Level of Spirtuality") %>%
bind_rows(tab_func(GSS, relpersn.factor, "Level of Religiousness")) %>%
bind_rows(tab_func(GSS, postlife.factor, "Life after Death")) %>%
bind_rows(tab_func(GSS, god.factor, "Existence of God")) %>%
bind_rows(tab_func(GSS, reborn.factor, "'Born Again' Experience")) %>%
bind_rows(tab_func(GSS, bible.factor, "View of Bible"))
##Setup Education, table 3
#Variables included: degree.factor, hsphys.factor, hschem.factor
#hsbio.factor, colsci.factor
Education <- tab_func(GSS, degree.factor, "Highest Degree") %>%
bind_rows(tab_func(GSS, hsphys.factor, "Taken High School Physics")) %>%
bind_rows(tab_func(GSS, hschem.factor, "Taken High School Chemistry")) %>%
bind_rows(tab_func(GSS, hsbio.factor, "Taken High School Biology")) %>%
bind_rows(tab_func(GSS, colsci.factor, "Taken Any College-Level Science"))
##Setup Science, table 4
#Variables included: toofast.factor, nextgen.factor, advfront.factor, natsci.factor,
#consci.factor, scibnfts.factor, astrosci.factor, scifrom_rc.factor
Science <- tab_func(GSS, toofast.factor, "Science Changes Life Too Fast") %>%
bind_rows(tab_func(GSS, nextgen.factor,
"Science Gives More Opportunities to Next Generation")) %>%
bind_rows(tab_func(GSS, advfront.factor,
"Federal Goverment Should Support Science Research")) %>%
bind_rows(tab_func(GSS, natsci.factor, "Amount Spent Supporting Scientific Research")) %>%
bind_rows(tab_func(GSS, consci.factor, "Confidence in Scientific Community")) %>%
bind_rows(tab_func(GSS, scibnfts.factor, "Benifits of Science Research Outweigh Harms")) %>%
bind_rows(tab_func(GSS, astrosci.factor, "Astrology is Scientific")) %>%
bind_rows(tab_func(GSS, scifrom_rc.factor, "Source of Information about Science"))
Next, I added in chi-square to see if the differences between the two groups for each variable was statistically significant. The p-values are added to the same tables for ease of reference. I included the regular chi-squared with simulation p-value, the Bonferroni corrected p-value, and the chi-squared without simulation p-value.
## Chi-Squared (29 tests used for Bonferroni correction)
#Demographics chisq
Demo_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("sex", "race.factor", "age_rc", "polviews.factor",
"class_.factor", "marital.factor", "childs_rc.factor",
"born.factor", "family16_rc.factor", "mobile16.factor" )],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = T)$p.value)))))) %>%
mutate(bcpv = V1*29) %>%
mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
T ~ bcpv)) %>%
rename('Chisq p-value' = "V1") %>%
rename('Bonferroni corrected p-value' = "bcpv") %>%
left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("sex", "race.factor", "age_rc", "polviews.factor",
"class_.factor", "marital.factor", "childs_rc.factor",
"born.factor", "family16_rc.factor", "mobile16.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = F)$p.value)))))) %>%
rename('Chisq w/out simulated p-value' = "V1")) %>%
add_column(Variable = c("Sex", "Race", "Age Group", "Political Views",
"Social Class", "Marital Status", "Are Parents",
"Born in US", "Lived w/ Both Parents at 16yrs Old",
"Location at 16yrs Old"), .before = 1) %>%
select(-rowname)
#Religion chisq
Rel_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("sprtprsn.factor", "relpersn.factor", "postlife.factor",
"god.factor", "reborn.factor", "bible.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = T)$p.value)))))) %>%
mutate(bcpv = V1*29) %>%
mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
T ~ bcpv)) %>%
rename('Chisq p-value' = "V1") %>%
rename('Bonferroni corrected p-value' = "bcpv") %>%
left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("sprtprsn.factor", "relpersn.factor", "postlife.factor",
"god.factor", "reborn.factor", "bible.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = F)$p.value)))))) %>%
rename('Chisq w/out simulated p-value' = "V1")) %>%
add_column(Variable = c("Level of Spirtuality", "Level of Religiousness", "Life after Death",
"Existence of God", "'Born Again' Experience",
"View of Bible" ), .before = 1) %>%
select(-rowname)
#Education chisq
Edu_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("degree.factor", "hsphys.factor", "hschem.factor",
"hsbio.factor", "colsci.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = T)$p.value)))))) %>%
mutate(bcpv = V1*29) %>%
mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
T ~ bcpv)) %>%
rename('Chisq p-value' = "V1") %>%
rename('Bonferroni corrected p-value' = "bcpv") %>%
left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("degree.factor", "hsphys.factor", "hschem.factor",
"hsbio.factor", "colsci.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = F)$p.value)))))) %>%
rename('Chisq w/out simulated p-value' = "V1")) %>%
add_column(Variable = c("Highest Degree", "Taken High School Physics",
"Taken High School Chemistry", "Taken High School Biology",
"Taken Any College-Level Science" ), .before = 1) %>%
select(-rowname)
#Science chisq
Sci_chi <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("toofast.factor", "nextgen.factor", "advfront.factor",
"natsci.factor", "consci.factor", "scibnfts.factor",
"astrosci.factor", "scifrom_rc.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = T)$p.value)))))) %>%
mutate(bcpv = V1*29) %>%
mutate(bcpv = case_when(bcpv >= 1 ~ .99999999,
T ~ bcpv)) %>%
rename('Chisq p-value' = "V1") %>%
rename('Bonferroni corrected p-value' = "bcpv") %>%
left_join(rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("toofast.factor", "nextgen.factor", "advfront.factor",
"natsci.factor", "consci.factor", "scibnfts.factor",
"astrosci.factor", "scifrom_rc.factor")],
function(x) chisq.test(table(x,GSS$earthsun),
simulate.p.value = F)$p.value)))))) %>%
rename('Chisq w/out simulated p-value' = "V1")) %>%
add_column(Variable = c("Science Changes Life Too Fast",
"Science Gives More Opportunities to Next Generation",
"Federal Goverment Should Support Science Research",
"Amount Spent Supporting Scientific Research",
"Confidence in Scientific Community",
"Benifits of Science Research Outweigh Harms",
"Astrology is Scientific", "Source of Information about Science"),
.before = 1) %>%
select(-rowname)
Science2 <- Science %>% left_join(Sci_chi)
When you review the percentage breakdown for the two groups, you can see a strong difference for many of the variables. Given the sample size, it isn’t surprising that many of the differences are statically significant. However, just because a difference is statistically significant does not mean that it is meaningful. To get a better gauge of the association level between the earth/sun category and the other variables, I added Cramers V (ranges from 0 to 1, with 1 being the strongest association).
##Cramer's V
#Demographics Cramer's V
Demo_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("sex", "race.factor", "age_rc", "polviews.factor",
"class_.factor", "marital.factor", "childs_rc.factor",
"born.factor", "family16_rc.factor", "mobile16.factor" )],
function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
rename('Cramers V' = "V1") %>%
add_column(Variable = c("Sex", "Race", "Age Group", "Political Views",
"Social Class", "Marital Status", "Are Parents",
"Born in US", "Lived w/ Both Parents at 16yrs Old",
"Location at 16yrs Old"), .before = 1) %>%
select(-rowname)
Demographics2 <- Demographics %>% left_join(Demo_chi) %>% left_join(Demo_v)
#Religion Cramer's V
Rel_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("sprtprsn.factor", "relpersn.factor", "postlife.factor",
"god.factor", "reborn.factor", "bible.factor")],
function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
rename('Cramers V' = "V1") %>%
add_column(Variable = c("Level of Spirtuality", "Level of Religiousness", "Life after Death",
"Existence of God", "'Born Again' Experience", "View of Bible"),
.before = 1) %>%
select(-rowname)
Religion2 <- Religion %>% left_join(Rel_chi) %>% left_join(Rel_v)
#Education Cramer's V
Edu_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("degree.factor", "hsphys.factor", "hschem.factor",
"hsbio.factor", "colsci.factor")],
function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
rename('Cramers V' = "V1") %>%
add_column(Variable = c("Highest Degree", "Taken High School Physics",
"Taken High School Chemistry", "Taken High School Biology",
"Taken Any College-Level Science" ), .before = 1) %>%
select(-rowname)
Education2 <- Education %>% left_join(Edu_chi) %>% left_join(Edu_v)
#Science Cramer's V
Sci_v <- rownames_to_column(as.data.frame(t(as.matrix(data.frame
(lapply(GSS[,c("toofast.factor", "nextgen.factor", "advfront.factor",
"natsci.factor", "consci.factor", "scibnfts.factor",
"astrosci.factor", "scifrom_rc.factor")],
function(x) assocstats(table(x,GSS$earthsun))$cramer)))))) %>%
rename('Cramers V' = "V1") %>%
add_column(Variable = c("Science Changes Life Too Fast",
"Science Gives More Opportunities to Next Generation",
"Federal Goverment Should Support Science Research",
"Amount Spent Supporting Scientific Research",
"Confidence in Scientific Community",
"Benifits of Science Research Outweigh Harms",
"Astrology is Scientific",
"Source of Information about Science"), .before = 1) %>%
select(-rowname)
Science2 <- Science %>% left_join(Sci_chi) %>% left_join(Sci_v)
Not surprisingly, although the differences between the groups was significant for many variables, the strength of the associations were weak overall. This makes intuitive sense as many complicated social factors and past experiences would likely contribute to a person’s beliefs.
Although I have used the kableExtra package before, I haven’t had the opportunity to try out some of the available formatting options. Below, you can see I made some custom alignments on column headings, added bold formatting, and indented the labels under each category. My favorite though is the option to keep the column header floating at top as you scroll down the chart, which is so helpful when reviewing long tables.
kable(Demographics2,
align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
fixed_thead = T) %>%
column_spec(1, bold = T) %>%
add_indent(c(2:3, 5:7, 9:13, 15:21, 23:26, 28:32, 34:35, 37:38,
40:41, 43:45))
Variable | Earth around sun | Sun around earth | Total | Chisq p-value | Bonferroni corrected p-value | Chisq w/out simulated p-value | Cramers V |
---|---|---|---|---|---|---|---|
Sex | 0.0004998 | 0.0144928 | 0.0000000 | 0.1397256 | |||
Female | 50.68% (1705) | 67.84% (599) | 54.25% (2304) | ||||
Male | 49.32% (1659) | 32.16% (284) | 45.75% (1943) | ||||
Race | 0.0004998 | 0.0144928 | 0.0000000 | 0.1324984 | |||
White | 79.25% (2666) | 68.63% (606) | 77.04% (3272) | ||||
Black | 11.83% (398) | 23.22% (205) | 14.20% (603) | ||||
Other | 8.92% (300) | 8.15% (72) | 8.76% (372) | ||||
Age Group | 0.0004998 | 0.0144928 | 0.0000001 | 0.0957567 | |||
18-34 yrs | 28.51% (959) | 23.22% (205) | 27.41% (1164) | ||||
35-49 yrs | 29.07% (978) | 25.37% (224) | 28.30% (1202) | ||||
50-64 yrs | 28.33% (953) | 29.45% (260) | 28.56% (1213) | ||||
65-79 yrs | 10.85% (365) | 16.42% (145) | 12.01% (510) | ||||
80 yrs and above | 3.24% (109) | 5.55% (49) | 3.72% (158) | ||||
Political Views | 0.0009995 | 0.0289855 | 0.0000507 | 0.0832247 | |||
Extrmly conservative | 3.60% (121) | 4.42% (39) | 3.77% (160) | ||||
Conservative | 16.68% (561) | 15.52% (137) | 16.44% (698) | ||||
Slghtly conservative | 16.23% (546) | 14.50% (128) | 15.87% (674) | ||||
Moderate | 34.57% (1163) | 43.15% (381) | 36.36% (1544) | ||||
Slightly liberal | 11.92% (401) | 9.85% (87) | 11.49% (488) | ||||
Liberal | 12.78% (430) | 8.83% (78) | 11.96% (508) | ||||
Extremely liberal | 4.22% (142) | 3.74% (33) | 4.12% (175) | ||||
Social Class | 0.0004998 | 0.0144928 | 0.0000000 | 0.1011017 | |||
Lower class | 7.10% (239) | 13.36% (118) | 8.41% (357) | ||||
Working class | 44.11% (1484) | 45.98% (406) | 44.50% (1890) | ||||
Middle class | 45.24% (1522) | 37.26% (329) | 43.58% (1851) | ||||
Upper class | 3.54% (119) | 3.40% (30) | 3.51% (149) | ||||
Marital Status | 0.0004998 | 0.0144928 | 0.0000000 | 0.1071902 | |||
Never married | 26.96% (907) | 25.82% (228) | 26.72% (1135) | ||||
Married | 47.86% (1610) | 40.88% (361) | 46.41% (1971) | ||||
Separated | 3.30% (111) | 3.96% (35) | 3.44% (146) | ||||
Divorced | 16.62% (559) | 18.01% (159) | 16.91% (718) | ||||
Widowed | 5.26% (177) | 11.33% (100) | 6.52% (277) | ||||
Are Parents | 0.0004998 | 0.0144928 | 0.0000000 | 0.1046251 | |||
No children | 31.27% (1052) | 19.59% (173) | 28.84% (1225) | ||||
Have children | 68.73% (2312) | 80.41% (710) | 71.16% (3022) | ||||
Born in US | 0.3733133 | 1.0000000 | 0.3866775 | 0.0142401 | |||
Yes | 89.54% (3012) | 90.60% (800) | 89.76% (3812) | ||||
No | 10.46% (352) | 9.40% (83) | 10.24% (435) | ||||
Lived w/ Both Parents at 16yrs Old | 0.0004998 | 0.0144928 | 0.0001108 | 0.0599425 | |||
Mother and father | 70.15% (2360) | 63.31% (559) | 68.73% (2919) | ||||
Other | 29.85% (1004) | 36.69% (324) | 31.27% (1328) | ||||
Location at 16yrs Old | 0.0004998 | 0.0144928 | 0.0000011 | 0.0804230 | |||
Same city | 36.33% (1222) | 45.30% (400) | 38.19% (1622) | ||||
Same st,dif city | 26.25% (883) | 25.25% (223) | 26.04% (1106) | ||||
Different state | 37.43% (1259) | 29.45% (260) | 35.77% (1519) |
kable(Religion2,
align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
fixed_thead = T) %>%
column_spec(1, bold = T) %>%
add_indent(c(2:5, 7:10, 12:14, 16:21, 23:24, 26:29))
Variable | Earth around sun | Sun around earth | Total | Chisq p-value | Bonferroni corrected p-value | Chisq w/out simulated p-value | Cramers V |
---|---|---|---|---|---|---|---|
Level of Spirtuality | 0.8950525 | 1.0000000 | 0.8964461 | 0.0118854 | |||
Not spiritual | 8.68% (292) | 9.40% (83) | 8.83% (375) | ||||
Slight spiritual | 19.65% (661) | 20.05% (177) | 19.73% (838) | ||||
Moderate spirtual | 39.98% (1345) | 39.18% (346) | 39.82% (1691) | ||||
Very spiritual | 31.69% (1066) | 31.37% (277) | 31.62% (1343) | ||||
Level of Religiousness | 0.0004998 | 0.0144928 | 0.0000367 | 0.0739062 | |||
Not religious | 18.64% (627) | 12.00% (106) | 17.26% (733) | ||||
Slight religious | 22.80% (767) | 23.56% (208) | 22.96% (975) | ||||
Moderate religious | 40.46% (1361) | 43.04% (380) | 40.99% (1741) | ||||
Very religious | 18.10% (609) | 21.40% (189) | 18.79% (798) | ||||
Life after Death | |||||||
No | 19.17% (645) | 19.48% (172) | 19.24% (817) | ||||
Yes | 80.83% (2719) | 80.52% (711) | 80.76% (3430) | ||||
Not applicable | 0.00% (0) | 0.00% (0) | 0.00% (0) | ||||
Existence of God | 0.0004998 | 0.0144928 | 0.0000000 | 0.1012334 | |||
Dont believe | 3.75% (126) | 2.60% (23) | 3.51% (149) | ||||
No way to find out | 4.52% (152) | 2.83% (25) | 4.17% (177) | ||||
Some higher power | 12.87% (433) | 7.47% (66) | 11.75% (499) | ||||
Believe sometimes | 3.48% (117) | 3.51% (31) | 3.48% (148) | ||||
Believe but doubts | 18.16% (611) | 15.06% (133) | 17.52% (744) | ||||
Know god exists | 57.22% (1925) | 68.52% (605) | 59.57% (2530) | ||||
‘Born Again’ Experience | |||||||
No | 61.30% (2062) | 50.96% (450) | 59.15% (2512) | ||||
Yes | 38.70% (1302) | 49.04% (433) | 40.85% (1735) | ||||
Don’t know | 0.00% (0) | 0.00% (0) | 0.00% (0) | ||||
View of Bible | 0.0004998 | 0.0144928 | 0.0000000 | 0.1386902 | |||
Book of fables | 20.60% (693) | 13.48% (119) | 19.12% (812) | ||||
Inspired word | 49.73% (1673) | 42.92% (379) | 48.32% (2052) | ||||
Word of god | 28.15% (947) | 43.15% (381) | 31.27% (1328) | ||||
Other | 1.52% (51) | 0.45% (4) | 1.30% (55) |
kable(Education2,
align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
fixed_thead = T) %>%
column_spec(1, bold = T) %>%
add_indent(c(2:6, 8:10, 12:14, 16:18, 20:22))
Variable | Earth around sun | Sun around earth | Total | Chisq p-value | Bonferroni corrected p-value | Chisq w/out simulated p-value | Cramers V |
---|---|---|---|---|---|---|---|
Highest Degree | 0.0004998 | 0.0144928 | 0 | 0.2160748 | |||
Lt high school | 8.17% (275) | 19.25% (170) | 10.48% (445) | ||||
High school | 47.47% (1597) | 57.08% (504) | 49.47% (2101) | ||||
Junior college | 8.71% (293) | 8.95% (79) | 8.76% (372) | ||||
Bachelor | 22.98% (773) | 11.55% (102) | 20.60% (875) | ||||
Graduate | 12.66% (426) | 3.17% (28) | 10.69% (454) | ||||
Taken High School Physics | 0.0004998 | 0.0144928 | 0 | 0.1093564 | |||
No | 61.68% (2075) | 69.88% (617) | 63.39% (2692) | ||||
Yes | 34.54% (1162) | 23.22% (205) | 32.19% (1367) | ||||
Not applicable | 3.78% (127) | 6.91% (61) | 4.43% (188) | ||||
Taken High School Chemistry | 0.0004998 | 0.0144928 | 0 | 0.1448228 | |||
No | 38.05% (1280) | 52.32% (462) | 41.02% (1742) | ||||
Yes | 58.17% (1957) | 40.77% (360) | 54.56% (2317) | ||||
Not applicable | 3.78% (127) | 6.91% (61) | 4.43% (188) | ||||
Taken High School Biology | 0.0004998 | 0.0144928 | 0 | 0.1397895 | |||
No | 14.71% (495) | 25.71% (227) | 17.00% (722) | ||||
Yes | 81.51% (2742) | 67.38% (595) | 78.57% (3337) | ||||
Not applicable | 3.78% (127) | 6.91% (61) | 4.43% (188) | ||||
Taken Any College-Level Science | 0.0004998 | 0.0144928 | 0 | 0.2037968 | |||
No | 46.94% (1579) | 71.80% (634) | 52.11% (2213) | ||||
Yes | 51.78% (1742) | 26.84% (237) | 46.60% (1979) | ||||
Not applicable | 1.28% (43) | 1.36% (12) | 1.30% (55) |
kable(Science2,
align = c("l", "c", "c", "c", "c", "c", "c", "c")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
fixed_thead = T) %>%
column_spec(1, bold = T) %>%
add_indent(c(2:5, 7:10, 12:15, 17:20, 22:24, 26:29, 31:33, 35:37))
Variable | Earth around sun | Sun around earth | Total | Chisq p-value | Bonferroni corrected p-value | Chisq w/out simulated p-value | Cramers V |
---|---|---|---|---|---|---|---|
Science Changes Life Too Fast | 0.0004998 | 0.0144928 | 0.0000000 | 0.1109579 | |||
Strongly disagree | 8.44% (284) | 4.87% (43) | 7.70% (327) | ||||
Disagree | 48.66% (1637) | 39.18% (346) | 46.69% (1983) | ||||
Agree | 33.65% (1132) | 42.70% (377) | 35.53% (1509) | ||||
Strongly agree | 9.24% (311) | 13.25% (117) | 10.08% (428) | ||||
Science Gives More Opportunities to Next Generation | 0.6631684 | 1.0000000 | 0.6574115 | 0.0194626 | |||
Strongly disagree | 1.16% (39) | 1.36% (12) | 1.20% (51) | ||||
Disagree | 7.34% (247) | 8.38% (74) | 7.56% (321) | ||||
Agree | 53.15% (1788) | 53.34% (471) | 53.19% (2259) | ||||
Strongly agree | 38.35% (1290) | 36.92% (326) | 38.05% (1616) | ||||
Federal Goverment Should Support Science Research | 0.0004998 | 0.0144928 | 0.0000024 | 0.0824596 | |||
Strongly disagree | 1.28% (43) | 1.93% (17) | 1.41% (60) | ||||
Disagree | 10.17% (342) | 14.04% (124) | 10.97% (466) | ||||
Agree | 57.70% (1941) | 61.27% (541) | 58.44% (2482) | ||||
Strongly agree | 30.86% (1038) | 22.76% (201) | 29.17% (1239) | ||||
Amount Spent Supporting Scientific Research | 0.0024988 | 0.0724638 | 0.0019679 | 0.0590920 | |||
Too little | 42.36% (1425) | 37.83% (334) | 41.42% (1759) | ||||
About right | 44.00% (1480) | 44.05% (389) | 44.01% (1869) | ||||
Too much | 9.81% (330) | 13.93% (123) | 10.67% (453) | ||||
Don’t know | 3.83% (129) | 4.19% (37) | 3.91% (166) | ||||
Confidence in Scientific Community | 0.0004998 | 0.0144928 | 0.0000000 | 0.1236839 | |||
Hardly any | 5.08% (171) | 9.29% (82) | 5.96% (253) | ||||
Only some | 48.01% (1615) | 57.76% (510) | 50.04% (2125) | ||||
A great deal | 46.91% (1578) | 32.96% (291) | 44.01% (1869) | ||||
Benifits of Science Research Outweigh Harms | 0.0004998 | 0.0144928 | 0.0000000 | 0.1429284 | |||
Benefits greater | 77.76% (2616) | 63.08% (557) | 74.71% (3173) | ||||
About equal if volunteered | 12.22% (411) | 17.89% (158) | 13.40% (569) | ||||
Harmful results greater | 7.49% (252) | 13.25% (117) | 8.69% (369) | ||||
Dont know | 2.53% (85) | 5.78% (51) | 3.20% (136) | ||||
Astrology is Scientific | 0.0004998 | 0.0144928 | 0.0000000 | 0.1303334 | |||
Not at all scientific | 67.51% (2271) | 52.32% (462) | 64.35% (2733) | ||||
Sort of scientific | 26.66% (897) | 37.83% (334) | 28.99% (1231) | ||||
Very scientific | 5.83% (196) | 9.85% (87) | 6.66% (283) | ||||
Source of Information about Science | 0.0004998 | 0.0144928 | 0.0000000 | 0.1225353 | |||
TV | 30.38% (1022) | 43.37% (383) | 33.08% (1405) | ||||
Internet | 40.52% (1363) | 28.20% (249) | 37.96% (1612) | ||||
Other | 29.10% (979) | 28.43% (251) | 28.96% (1230) |
Looking over the charts, you can see large number (29) of variables, most which do differ significantly between the two groups, but which likely only weakly affect the variable of interest. It sets up an interesting situation to see if machine learning could the use breadth of variables to help predict if a person would answer the question correctly.
Dipping my toes into the machine learning world, I went with a random forest modeling approach. I started by subsetting the data into a training set and a testing set.
##Random Forest Prediction
###Modeling
#Separate out training and test datasets
set.seed(135)
sep <- sample(2, nrow(GSS), replace = T, prob = c(.7, .3))
train <- GSS[sep==1,]
test <- GSS[sep==2,]
From here I setup a basic model with the training data and plotted the out-of-bag error rate.
#random forest
set.seed(246)
model <- randomForest(earthsun ~ ., train, proximity = T)
model
##
## Call:
## randomForest(formula = earthsun ~ ., data = train, proximity = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 20.72%
## Confusion matrix:
## Earth around sun Sun around earth class.error
## Earth around sun 2333 30 0.01269573
## Sun around earth 588 31 0.94991922
oob_error <- data.frame(Trees=rep(1:nrow(model$err.rate), times=3),
Type=rep(c("OOB", "Earth around sun", "Sun around earth"),
each=nrow(model$err.rate)),
Error=c(model$err.rate[,"OOB"],
model$err.rate[, "Earth around sun"],
model$err.rate[, "Sun around earth"]))
oob_error$Type[oob_error$Type == "OOB"] <- "Overall"
ggplot(oob_error, aes(x=Trees, y=Error))+
geom_line(aes(color=Type))+
labs(title = "OOB Error Rate by Number of Trees",
y = "Error Rate",
x = "Number of Trees") +
theme(plot.title = element_text(hjust = 0.5),
axis.line.x.bottom = element_line(color = "grey30"),
axis.line.y.left = element_line(color = "grey30"),
panel.background = element_blank())+
guides(color = guide_legend(reverse = TRUE))
An out-of-bag error rate of 20.72% isn’t something to write home about, and even worse is the plot with the error rate for “Sun around earth” stabilizing at the very top. Thankfully, no one is actually in need of a model to predict the correct understanding of the setup of the solar system.
Next step is to run the model on the test data and measure the performance.
#Predict
predict2 <- predict(model, test)
confusionMatrix(table(predict2, test$earthsun))
## Confusion Matrix and Statistics
##
##
## predict2 Earth around sun Sun around earth
## Earth around sun 990 255
## Sun around earth 11 9
##
## Accuracy : 0.7897
## 95% CI : (0.7662, 0.8119)
## No Information Rate : 0.7913
## P-Value [Acc > NIR] : 0.5712
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.98901
## Specificity : 0.03409
## Pos Pred Value : 0.79518
## Neg Pred Value : 0.45000
## Prevalence : 0.79130
## Detection Rate : 0.78261
## Detection Prevalence : 0.98419
## Balanced Accuracy : 0.51155
##
## 'Positive' Class : Earth around sun
##
When I tested the model, I received an accuracy of 0.7897. More disheartening is the placement of almost the entire sample in the “Earth around sun” group. I could have placed them in this category before the whole process started since approximately 75% of the sample falls into it.
I ran the tuneRF function to see if any small improvement of the model could be made by adjusting the mtry (number of variables tried after each split). The previous, automatically selected mtry was 5. I set the limit to 300 trees as not much improvement was seen after that on the plot.
t <- tuneRF(train[,-1], train$earthsun,
stepFactor = 0.5,
plot= T,
ntreeTry = 300,
trace= T,
improve = 0.05)
## mtry = 5 OOB error = 20.69%
## Searching left ...
## mtry = 10 OOB error = 21.16%
## -0.02269044 0.05
## Searching right ...
## mtry = 2 OOB error = 20.66%
## 0.001620746 0.05
It looks like the out-of-bag error rate was lowest when the mtry = 2, so I have updated the model with this and a tree limit of 300.
#updating the model again
model_v2 <- randomForest(earthsun ~ ., train, proximity = T,
ntree= 300,
mtry = 2,
importance = T)
model_v2
##
## Call:
## randomForest(formula = earthsun ~ ., data = train, proximity = T, ntree = 300, mtry = 2, importance = T)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.42%
## Confusion matrix:
## Earth around sun Sun around earth class.error
## Earth around sun 2358 5 0.002115954
## Sun around earth 604 15 0.975767367
The out-of-bag error rate is now down to 20.42%. I will take any improvement I can get. Now to see how the updated model preforms with the testing data.
predict2_v2 <- predict(model_v2, test)
confusionMatrix(table(predict2_v2, test$earthsun))
## Confusion Matrix and Statistics
##
##
## predict2_v2 Earth around sun Sun around earth
## Earth around sun 997 262
## Sun around earth 4 2
##
## Accuracy : 0.7897
## 95% CI : (0.7662, 0.8119)
## No Information Rate : 0.7913
## P-Value [Acc > NIR] : 0.5712
##
## Kappa : 0.0056
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.996004
## Specificity : 0.007576
## Pos Pred Value : 0.791898
## Neg Pred Value : 0.333333
## Prevalence : 0.791304
## Detection Rate : 0.788142
## Detection Prevalence : 0.995257
## Balanced Accuracy : 0.501790
##
## 'Positive' Class : Earth around sun
##
Unfortunately, the accuracy rate stayed the same. Upside of all of this process is that one day when I run a model for work that may have some actual consequence to the world, I will be that much more excited when the model preforms well (a girl can dream).
This does still provide a great opportunity to dig into the model and understand it a bit more. Below is a frequency plot of the number of nodes for the random forest trees. The peak is hovering around 575 nodes.
#number of nodes
data_frame(treesize(model_v2)) %>%
ggplot() +
geom_bar(aes(x=treesize(model_v2))) +
labs(title = "Number of Nodes Used in Random Forests",
y = "Count",
x = "Number of Nodes Used") +
theme(plot.title = element_text(hjust = 0.5),
axis.line.x.bottom = element_line(color = "grey30"),
axis.line.y.left = element_line(color = "grey30"),
panel.background = element_blank())+
guides(color = guide_legend(reverse = TRUE))
Another item to review is which variables were most important in the model. Below I have called the top variables.
#variables of importance
varImpPlot(model_v2,
sort = T,
n.var = 10,
main= "Top 10 of Variable Importance")
Ideally it would be great for models to be as simple as they can, and since the performance on this one is lower, why not try to at least simplify it by only using some of the top variables?
#simplified model
#remove other variables
train2 <- train %>% select(earthsun, polviews.factor, age_rc, marital.factor, degree.factor,
god.factor, sprtprsn.factor, class_.factor, toofast.factor,
relpersn.factor, scibnfts.factor)
test2 <- test %>% select(earthsun, polviews.factor, age_rc, marital.factor, degree.factor,
god.factor, sprtprsn.factor, class_.factor, toofast.factor,
relpersn.factor, scibnfts.factor)
#rerun model with less variables, remove mtry
model_v3 <- randomForest(earthsun ~ ., train2, proximity = T)
model_v3
##
## Call:
## randomForest(formula = earthsun ~ ., data = train2, proximity = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 22%
## Confusion matrix:
## Earth around sun Sun around earth class.error
## Earth around sun 2268 95 0.04020313
## Sun around earth 561 58 0.90630048
#Predict
predict3_v3 <- predict(model_v3, test2)
confusionMatrix(table(predict3_v3, test2$earthsun))
## Confusion Matrix and Statistics
##
##
## predict3_v3 Earth around sun Sun around earth
## Earth around sun 962 245
## Sun around earth 39 19
##
## Accuracy : 0.7755
## 95% CI : (0.7515, 0.7982)
## No Information Rate : 0.7913
## P-Value [Acc > NIR] : 0.921
##
## Kappa : 0.0463
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.96104
## Specificity : 0.07197
## Pos Pred Value : 0.79702
## Neg Pred Value : 0.32759
## Prevalence : 0.79130
## Detection Rate : 0.76047
## Detection Prevalence : 0.95415
## Balanced Accuracy : 0.51650
##
## 'Positive' Class : Earth around sun
##
The simplified model increased the out-of-bag error rate to 22% with the training data and brought the accuracy down to 0.7755 with the test data.
All around this was an interesting exercise allowing me to try a few new things. Many thanks to my friend Phil, https://github.com/probertswsu, who helped me see the trees from the forest during this process (pun intended).
The General Social Survey (GSS) is a project of the independent research organization NORC at the University of Chicago, with principal funding from the National Science Foundation. Smith, Tom W., Davern, Michael, Freese, Jeremy, and Morgan, Stephen, General Social Surveys, 1972-2018 [machine-readable data file] /Principal Investigator, Smith, Tom W.; Co-Principal Investigators, Michael Davern, Jeremy Freese, and Stephen Morgan; Sponsored by National Science Foundation. –NORC ed.– Chicago: NORC, 2018: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org.