Survival Analysis

aka Time-to-Event Analysis

My main goal for this mini-project was to learn more about survival analysis (also called time-to-event analysis). While considering what kind of data set I might look for, my partner and I got into a discussion about how many hours of our lives are spent waiting. We wait at the airport, wait in traffic, wait in line at the grocery store, and wait for test results. Working from home has added a new flavor to waiting: waiting for video calls to start.

Over the past few weeks, I have recorded the amount of time it took for a work call to begin after the official start time. This includes all of the time staring at a message that the host will start the meeting soon, as well as the time spent in polite conversation while the entire group waits for that one person who is most important for the planned discussion, but apparently forgot about the meeting.

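#Load packages used below
library(dplyr)      # %>% and mutate()
library(tidyr)      # drop_na()
library(survival)   # Surv(), survfit(), survdiff(), survreg(), coxph()
library(survminer)  # ggsurvplot()

#Load data
zoom <- read.csv("Surviving Waiting - Zoom.csv")

#Check out dataframe
str(zoom)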

## 'data.frame':    29 obs. of  6 variables:
##  $ Mins             : int  0 0 8 3 3 4 7 6 0 4 ...
##  $ Secs             : int  26 44 4 35 6 6 31 40 15 50 ...
##  $ Weekday          : chr  "Friday" "Monday" "Monday" "Monday" ...
##  $ Before_After_Noon: chr  "After" "Before" "After" "After" ...
##  $ Num_People       : int  2 9 5 2 10 8 4 4 4 7 ...
##  $ Meeting_Started  : int  1 1 1 1 1 1 1 1 1 1 ...

#Create a combined time variable
zoom <- zoom %>% 
  mutate(Time = Mins + (Secs/60))

#Look at the min and max for Time variable
summary(zoom$Time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.250   1.600   3.033   3.377   4.100  13.050

Kaplan Meier Model

The first step was to run a basic Kaplan Meier Model and plot the curve.

#Fit basic Kaplan Meier Model
kmm1 <- survfit(Surv(zoom$Time, zoom$Meeting_Started) ~ 1,
               type = "kaplan-meier")
kmm1

## Call: survfit(formula = Surv(zoom$Time, zoom$Meeting_Started) ~ 1, 
##     type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##   29.00   29.00    3.03    1.85    4.05

ggsurvplot(kmm1, 
           data = zoom, 
           conf.int = F,
           palette = c("#D41159"),
           xlab = "Minutes")

Reviewing the plot, you can see a rather steep drop in the curve, as calls tended to start within the first few minutes. However, several video calls were delayed much longer, with the longest wait at about 13 minutes.
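To make the curve less of a black box, here is a minimal sketch of the Kaplan-Meier calculation done by hand in base R. The `wait` vector is made up for illustration and is not the recorded data. At each event time, the estimate multiplies the running survival by the fraction of still-waiting calls that did not start at that moment. Because every call in this example eventually starts (no censoring), the result should match the plain fraction of calls still waiting.

```r
# Hypothetical wait times in minutes (illustrative only, not the recorded data)
wait <- c(0.5, 1.2, 1.2, 2.0, 3.5, 6.0)

# Distinct times at which at least one call started
event_times <- sort(unique(wait))

# Calls still waiting just before each time, and calls that started at that time
n_at_risk <- sapply(event_times, function(t) sum(wait >= t))
n_events  <- sapply(event_times, function(t) sum(wait == t))

# Kaplan-Meier estimate: running product of (1 - events / at risk)
km_surv <- cumprod(1 - n_events / n_at_risk)

# With no censoring this equals the fraction of calls still waiting after each time
empirical <- sapply(event_times, function(t) mean(wait > t))

data.frame(minutes = event_times, at_risk = n_at_risk,
           events = n_events, survival = km_surv)
```

The two calculations only diverge once some observations are censored, which never happened in this data set since every meeting eventually started.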

Next I wanted to see if the curve differed based on whether the call was held in the morning or afternoon (determined by my time zone at the beginning of the call).

#Fit Kaplan Meier Model for morning vs afternoon
kmm2 <- survfit(Surv(zoom$Time, zoom$Meeting_Started) ~ zoom$Before_After_Noon,
               type = "kaplan-meier")
kmm2

## Call: survfit(formula = Surv(zoom$Time, zoom$Meeting_Started) ~ zoom$Before_After_Noon, 
##     type = "kaplan-meier")
## 
##                                n events median 0.95LCL 0.95UCL
## zoom$Before_After_Noon=After  18     18   3.34    1.63    4.92
## zoom$Before_After_Noon=Before 11     11   1.98    1.77      NA

ggsurvplot(kmm2, 
           data = zoom, 
           conf.int = F,
           palette = c("#D41159", "#1A85FF"),
           xlab = "Minutes", 
           legend.labs = c("Afternoon", "Morning"))

#Seeing if wait time differs significantly for morning vs afternoon meetings
survdiff(Surv(zoom$Time, zoom$Meeting_Started) ~ zoom$Before_After_Noon)

## Call:
## survdiff(formula = Surv(zoom$Time, zoom$Meeting_Started) ~ zoom$Before_After_Noon)
## 
##                                N Observed Expected (O-E)^2/E (O-E)^2/V
## zoom$Before_After_Noon=After  18       18    20.67     0.344       1.3
## zoom$Before_After_Noon=Before 11       11     8.33     0.853       1.3
## 
##  Chisq= 1.3  on 1 degrees of freedom, p= 0.3

rm(kmm1, kmm2)

While the plot shows that afternoon calls were more likely to be delayed in starting, the log-rank test showed that the difference was not significant (chi-square = 1.3 on 1 degree of freedom, p = 0.3).
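For anyone curious where the Observed and Expected columns in the survdiff output come from, the following is a rough base R sketch of a two-group log-rank test. The `wait` and `group` vectors are hypothetical and chosen only for illustration. At each event time, the expected number of afternoon starts is the number of starts multiplied by the share of still-waiting calls that are afternoon calls.

```r
# Hypothetical data (illustrative only): wait times in minutes and time-of-day group
wait  <- c(0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 8.0)
group <- c("Before", "Before", "After", "Before",
           "After", "After", "Before", "After")

event_times <- sort(unique(wait))

# At each event time: calls still waiting (all / afternoon) and calls starting (all / afternoon)
n_total <- sapply(event_times, function(t) sum(wait >= t))
n_after <- sapply(event_times, function(t) sum(wait >= t & group == "After"))
d_total <- sapply(event_times, function(t) sum(wait == t))
d_after <- sapply(event_times, function(t) sum(wait == t & group == "After"))

# Observed afternoon starts vs. the number expected if both groups shared one curve
observed_after <- sum(d_after)
expected_after <- sum(d_total * n_after / n_total)

# Hypergeometric variance of the afternoon count at each event time
v <- d_total * (n_after / n_total) * (1 - n_after / n_total) *
  (n_total - d_total) / (n_total - 1)
v[n_total == 1] <- 0  # a single remaining call gives 0/0; it contributes no variance

# Log-rank statistic, compared against a chi-square with 1 degree of freedom
chisq   <- (observed_after - expected_after)^2 / sum(v)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

c(observed = observed_after, expected = expected_after,
  chisq = chisq, p = p_value)
```

The statistic here corresponds to the (O-E)^2/V column in the survdiff output, so running survdiff on the same toy vectors would be a reasonable way to check the sketch.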

Exponential Regression Model

For fun, I refit the data with a parametric exponential regression model, first with no predictors and then with time of day. Again, the time of day was not significant (p = 0.39).

#Fit Exponential Regression Model   
erm1 <- survreg(Surv(zoom$Time, zoom$Meeting_Started) ~ 1,
               dist="exponential")
erm1

## Call:
## survreg(formula = Surv(zoom$Time, zoom$Meeting_Started) ~ 1, 
##     dist = "exponential")
## 
## Coefficients:
## (Intercept) 
##    1.216991 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -64.3   Loglik(intercept only)= -64.3
## n= 29

summary(erm1)

## 
## Call:
## survreg(formula = Surv(zoom$Time, zoom$Meeting_Started) ~ 1, 
##     dist = "exponential")
##             Value Std. Error    z       p
## (Intercept) 1.217      0.186 6.55 5.6e-11
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -64.3   Loglik(intercept only)= -64.3
## Number of Newton-Raphson Iterations: 4 
## n= 29

#Fit Exponential Regression Model for morning vs afternoon
erm2 <- survreg(Surv(zoom$Time, zoom$Meeting_Started) ~ zoom$Before_After_Noon,
               dist="exponential")
summary(erm2)

## 
## Call:
## survreg(formula = Surv(zoom$Time, zoom$Meeting_Started) ~ zoom$Before_After_Noon, 
##     dist = "exponential")
##                               Value Std. Error     z       p
## (Intercept)                   1.330      0.236  5.64 1.7e-08
## zoom$Before_After_NoonBefore -0.331      0.383 -0.87    0.39
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -63.9   Loglik(intercept only)= -64.3
##  Chisq= 0.73 on 1 degrees of freedom, p= 0.39 
## Number of Newton-Raphson Iterations: 4 
## n= 29

rm(erm1, erm2)
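The survreg coefficients above are on the log-time scale, so they can be awkward to read directly. As a small sketch, the snippet below copies the reported coefficients by hand and converts them back to minutes. For an exponential model, exp(intercept) is the fitted mean wait, the constant hazard is its reciprocal, and the median is log(2) times the mean.

```r
# Coefficients copied by hand from the survreg output above
intercept_all   <- 1.216991  # intercept-only model
intercept_after <- 1.330     # afternoon calls (reference level)
coef_before     <- -0.331    # shift for morning calls

# Fitted mean wait in minutes
mean_all    <- exp(intercept_all)
mean_after  <- exp(intercept_after)
mean_before <- exp(intercept_after + coef_before)

# Exponential distribution: constant hazard = 1 / mean, median = log(2) * mean
hazard_all <- 1 / mean_all
median_all <- log(2) * mean_all

# Morning vs. afternoon hazard ratio implied by the exponential fit
hazard_ratio_before <- exp(-coef_before)

round(c(mean_all = mean_all, mean_after = mean_after,
        mean_before = mean_before), 2)
```

The fitted overall mean comes out at about 3.38 minutes, which matches the sample mean from summary(zoom$Time). That is expected when every event is observed. The negative coefficient for morning calls means shorter waits, which is the same direction as a hazard ratio above 1.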

Cox Proportional Hazards Model

More commonly used is the Cox Proportional Hazards Model, so I reran the comparison with this approach.

#Drop incomplete cases to be able to compare models
zoom2 <- zoom %>% drop_na()

#Fit Cox Proportional Hazards Model for morning vs afternoon
cphm2 <- coxph(Surv(zoom2$Time, zoom2$Meeting_Started) ~ zoom2$Before_After_Noon)
summary(cphm2)

## Call:
## coxph(formula = Surv(zoom2$Time, zoom2$Meeting_Started) ~ zoom2$Before_After_Noon)
## 
##   n= 28, number of events= 28 
## 
##                                 coef exp(coef) se(coef)     z Pr(>|z|)
## zoom2$Before_After_NoonBefore 0.5208    1.6834   0.4121 1.264    0.206
## 
##                               exp(coef) exp(-coef) lower .95 upper .95
## zoom2$Before_After_NoonBefore     1.683      0.594    0.7506     3.776
## 
## Concordance= 0.542  (se = 0.054 )
## Likelihood ratio test= 1.55  on 1 df,   p=0.2
## Wald test            = 1.6  on 1 df,   p=0.2
## Score (logrank) test = 1.63  on 1 df,   p=0.2

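The model gives a hazard ratio of 1.68 for morning calls relative to afternoon calls. This means that at any given moment, a morning call that has not yet started is estimated to start at 1.68 times the rate of an afternoon call that has not yet started. This is a 68% higher hazard (instantaneous rate), which is not the same as a 68% higher probability of starting. The 95% confidence interval (0.75 to 3.78) includes 1, so the difference was not significant (p = 0.21).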

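The concordance (c statistic) was 0.54, only slightly better than the 0.5 expected by chance. The likelihood ratio, Wald, and score (log-rank) tests all gave a p-value of 0.2.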
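The hazard ratio, its confidence interval, and the Wald p-value in the summary above all follow from the coefficient and its standard error. A minimal sketch, with the two values copied by hand from the coxph output rather than pulled from the model object:

```r
# Values copied by hand from the coxph summary above
coef_before <- 0.5208  # log hazard ratio, morning vs. afternoon
se_before   <- 0.4121  # standard error of the coefficient

# Hazard ratio is the exponentiated coefficient
hazard_ratio <- exp(coef_before)

# 95% Wald confidence interval, built on the log scale and then exponentiated
z_crit <- qnorm(0.975)
ci_95  <- exp(coef_before + c(-1, 1) * z_crit * se_before)

# Wald z statistic and two-sided p-value
wald_z  <- coef_before / se_before
p_value <- 2 * pnorm(-abs(wald_z))

round(c(hazard_ratio = hazard_ratio, lower = ci_95[1],
        upper = ci_95[2], z = wald_z, p = p_value), 3)
```

Small differences in the last digit relative to the printed summary are possible because the inputs here are already rounded.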

The next step was to add a continuous variable to see if the model could be improved. While collecting data, I noted the number of people who attended each meeting. I could imagine that the more people attended, the longer it would take to get the meeting started, so as to allow latecomers to join. Alternatively, I wondered if there would be more pressure to start on time when many people were waiting.

#Fit Cox Proportional Hazards Model for morning vs afternoon & for number of meeting participants
cphm3 <- coxph(Surv(zoom2$Time, zoom2$Meeting_Started) ~ zoom2$Before_After_Noon + zoom2$Num_People)

summary(cphm3)

## Call:
## coxph(formula = Surv(zoom2$Time, zoom2$Meeting_Started) ~ zoom2$Before_After_Noon + 
##     zoom2$Num_People)
## 
##   n= 28, number of events= 28 
## 
##                                   coef exp(coef) se(coef)     z Pr(>|z|)
## zoom2$Before_After_NoonBefore 0.502748  1.653258 0.462537 1.087    0.277
## zoom2$Num_People              0.004724  1.004735 0.054151 0.087    0.930
## 
##                               exp(coef) exp(-coef) lower .95 upper .95
## zoom2$Before_After_NoonBefore     1.653     0.6049    0.6678     4.093
## zoom2$Num_People                  1.005     0.9953    0.9036     1.117
## 
## Concordance= 0.524  (se = 0.064 )
## Likelihood ratio test= 1.56  on 2 df,   p=0.5
## Wald test            = 1.6  on 2 df,   p=0.4
## Score (logrank) test = 1.64  on 2 df,   p=0.4

anova(cphm2, cphm3)

## Analysis of Deviance Table
##  Cox model: response is  Surv(zoom2$Time, zoom2$Meeting_Started)
##  Model 1: ~ zoom2$Before_After_Noon
##  Model 2: ~ zoom2$Before_After_Noon + zoom2$Num_People
##    loglik  Chisq Df P(>|Chi|)
## 1 -67.113                    
## 2 -67.109 0.0076  1    0.9307

At any given instant, the hazard of the call starting was only 0.5% higher for each additional person attending (hazard ratio 1.005, 95% confidence interval 0.90 to 1.12, p = 0.93).

The anova comparing the two models also showed that the additional variable contributed almost nothing to the model (likelihood ratio chi-square = 0.0076 on 1 degree of freedom, p = 0.93).
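The anova table is a likelihood ratio test between the nested models. Its statistic is twice the difference in log-likelihoods, compared against a chi-square distribution with one degree of freedom per added term. A short sketch using the values printed above (copied by hand, so the log-likelihoods are rounded to three decimals):

```r
# Log-likelihoods copied by hand from the anova table above
loglik_time_only   <- -67.113  # Model 1: time of day only
loglik_with_people <- -67.109  # Model 2: time of day + number of people

# Likelihood ratio statistic from the rounded log-likelihoods
lr_chisq <- 2 * (loglik_with_people - loglik_time_only)

# p-value from the rounded statistic, and from the chi-square value reported in the table
p_from_rounded  <- pchisq(lr_chisq, df = 1, lower.tail = FALSE)
p_from_reported <- pchisq(0.0076, df = 1, lower.tail = FALSE)

c(lr_chisq = lr_chisq, p_from_rounded = p_from_rounded,
  p_from_reported = p_from_reported)
```

The rounded log-likelihoods give a statistic of about 0.008 rather than the reported 0.0076, which is only a rounding effect. Either way the p-value is close to 0.93.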

Regardless of the plots and the model results, one of the biggest insights I gathered from this project was how quickly many meetings do get started. Maybe it is just easier to remember the meetings that take a long time to get started, or maybe 1 minute waiting can feel like 3 minutes. Either way, I did not gather the evidence to support my casual complaining and I may need to divert my comments back to other traditional venues still requiring enormous wait times (I am looking at you, airport with only one security line open).

Surviving Work Video Calls

Kendra Blalock

5/26/2021
