A Shift in Trend

Recently I saw a presentation of a Joinpoint trend analysis using software put out by the National Cancer Institute: https://surveillance.cancer.gov/joinpoint/. I was curious to try a similar analysis myself, so I brainstormed recent trends that might have a specific point in time where a rate shifted.

The FDA has an API set up where you can pull adverse event reports for medications: https://open.fda.gov/apis/. I decided to pull adverse event reports for ivermectin. Given the increased clinical trials, prescribing, and off-label use of ivermectin during the COVID pandemic, I suspected it would be a good candidate for this type of analysis.
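Before looping over everything, a single request is a handy way to inspect the shape of the response: openFDA replies include a meta element whose results$total field reports how many records match the query. Here is a minimal sketch of that sanity check (the test_url and object names are mine, not part of the original analysis):

# One-off request to peek at the response structure and total match count
library(httr)
library(jsonlite)

test_url <- "https://api.fda.gov/drug/event.json?search=patient.drug.medicinalproduct:ivermectin&limit=1"
test_response <- GET(test_url)
test_json <- fromJSON(content(test_response, "text", encoding = "UTF-8"), flatten = TRUE)
test_json$meta$results$total  # Total number of matching reports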

# Load required packages
library(httr)       # API requests
library(jsonlite)   # JSON parsing
library(dplyr)      # Data wrangling
library(lubridate)  # Date handling
library(ggplot2)    # Plotting
library(segmented)  # Segmented regression

# Define API parameters
query <- "ivermectin"
base_url <- "https://api.fda.gov/drug/event.json?search=patient.drug.medicinalproduct:"
limit <- 100           # Records returned per request
max_records <- 1000000 # Upper bound on records to attempt
all_results <- list()  # Accumulator for paginated results

# Pagination loop (note: openFDA caps skip-based paging at 25,000 records)
for (skip in seq(0, max_records, by = limit)) {
  
  # Construct API URL with pagination
  url <- paste0(base_url, query, "&limit=", limit, "&skip=", skip)
  
  # Send API request
  response <- GET(url)
  
  # Check if the request is successful
  if (status_code(response) == 200) {
    # Parse JSON response
    data <- content(response, "text", encoding = "UTF-8")
    json_data <- fromJSON(data, flatten = TRUE)
    
    # Ensure results exist and are a data frame before appending
    if (!is.null(json_data$results) && length(json_data$results) > 0) {
      all_results <- append(all_results, list(as.data.frame(json_data$results)))
    } else {
      break  # Exit loop if no more data
    }
  } else {
    break  # Exit loop if the API request fails
  }
}

# Create a dataframe with results
adverse_events <- bind_rows(all_results)
adverse_events <- adverse_events %>% dplyr::select(safetyreportid, receivedate, patient.reaction, patient.drug)
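One caveat worth flagging: FAERS can contain duplicate or amended versions of the same case, so depending on the question it may be worth keeping a single row per report before counting. A sketch of what that could look like (adverse_events_dedup is a hypothetical object, not used below):

# Hypothetical de-duplication step: keep one row per safety report ID
adverse_events_dedup <- adverse_events %>%
  dplyr::distinct(safetyreportid, .keep_all = TRUE)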

With data in hand, I wanted to get a quick overview.

# Build time variables
adverse_events_2 <- adverse_events %>% 
  mutate(receivedate_dt = ymd(receivedate),
         year = year(receivedate_dt),
         year_month = format(receivedate_dt, "%Y-%m"),
         year_quarter = paste0(year(receivedate_dt), "0", quarter(receivedate_dt)))

# Get min and max dates
adverse_events_2 %>%  summarise(
  min_report_date = min(receivedate_dt), 
  max_report_date = max(receivedate_dt)
)
##   min_report_date max_report_date
## 1      2004-02-11      2024-12-31
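Since the mutate step also created year_month, a finer-grained monthly view is available if needed; a quick sketch (month_trend is a hypothetical object, not used in what follows):

# Optional monthly tabulation using the year_month variable built above
month_trend <- adverse_events_2 %>% count(year_month)
tail(month_trend)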

# Create yearly trend dataframe (year is already numeric, so no coercion is needed)
yr_trend <- adverse_events_2 %>% 
  count(year) %>% 
  arrange(year)

# Plot yearly trend
ggplot(yr_trend, aes(x = year, y = n)) +
  geom_point() +
  geom_line() +
  labs(title = "Ivermectin Adverse Event Reports by Year", x = "Year", y = "Count") +
  scale_x_continuous(limits = c(2004, 2024), breaks = seq(2004, 2024, 2)) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank()  
  )

Regression Options

Only a casual glance at the plot is needed to see that the number of adverse event reports jumped markedly around 2020. However, it appears that the rise in reports may have started a few years before that.

My aim was to run a segmented regression to see if there was a particular breakpoint where the trend changed. Before that, though, I wanted to see what a standard linear regression and a polynomial regression would look like.

# Standard lm
lm_fit <- lm(n ~ year, data = yr_trend)

# Linear regression plotted
ggplot(yr_trend, aes(x = year, y = n)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "darkred", 
              linetype = "dashed") +  
  labs(title = "Ivermectin Adverse Event Reports by Year", x = "Year", y = "Count") +
  scale_x_continuous(limits = c(2004, 2024), breaks = seq(2004, 2024, 2)) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank()  
  )

# Polynomial Regression
poly_fit <- lm(n ~ poly(year, degree = 3), data = yr_trend)

# Third-degree polynomial regression line
ggplot(yr_trend, aes(x = year, y = n)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = "lm", 
              formula = y ~ poly(x, degree = 3), 
              se = FALSE, 
              color = "darkred", 
              linetype = "dashed") +  
  labs(title = "Ivermectin Adverse Event Reports by Year", x = "Year", y = "Count") +
  scale_x_continuous(limits = c(2004, 2024), breaks = seq(2004, 2024, 2)) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank()  
  )

The polynomial regression line looks like a much better fit, which suggested the trend was not simply linear and that a segmented regression might be more appropriate.
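Because the linear model is nested within the third-degree polynomial, one quick formal check is a nested-model F-test via anova(); a sketch of that comparison (output not shown):

# Nested-model F-test: do the extra polynomial terms significantly improve fit?
anova(lm_fit, poly_fit)

With the non-linearity reasonably established, I went ahead and selected 2020 as an initial suggested breakpoint.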

# Perform segmented regression (psi supplies the initial breakpoint guess)
segmented_fit <- segmented(lm_fit, seg.Z = ~year, psi = 2020)

# Extract segmented regression lines
segmented_data <- data.frame(
  x = yr_trend$year,
  y_fit = predict(segmented_fit)
)

# Extract the estimated breakpoint
breakpoint <- segmented_fit$psi[, "Est."]
print(paste("Breakpoint Year =", floor(breakpoint)))
## [1] "Breakpoint Year = 2016"

# Segmented regression line
ggplot(yr_trend, aes(x = year, y = n)) +
  geom_point() +
  geom_line(data = segmented_data, aes(x, y_fit), color = "red", linewidth = 1) + 
  geom_vline(xintercept = breakpoint, linetype = "dashed", color = "darkred", linewidth = 1) +
  labs(title = "Ivermectin Adverse Event Reports by Year", 
       subtitle = "2016 Breakpoint",
       x = "Year",
       y = "Count") +
  scale_x_continuous(limits = c(2004, 2024), breaks = seq(2004, 2024, 2)) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5), 
    plot.subtitle = element_text(hjust = 0.5), 
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank()  
  )

It was interesting to see that the inflection point occurred much earlier than I had anticipated. The question that remained was whether the segmented line or the polynomial line better represented the data.

# Function to compute model performance metrics
evaluate_model <- function(model, data, response) {
  predictions <- predict(model, newdata = data)
  residuals <- data[[response]] - predictions
  r_squared <- summary(model)$r.squared
  aic <- AIC(model)
  bic <- BIC(model)
  rmse <- sqrt(mean(residuals^2))  

  return(data.frame(R2 = r_squared, AIC = aic, BIC = bic, RMSE = rmse))
}

# Evaluate all models
lm_metrics <- evaluate_model(lm_fit, yr_trend, "n")
poly_metrics <- evaluate_model(poly_fit, yr_trend, "n")

# Segmented model doesn't have direct summary R^2, so we calculate manually
seg_pred <- predict(segmented_fit, newdata = yr_trend)
seg_residuals <- yr_trend$n - seg_pred
seg_r_squared <- 1 - (sum(seg_residuals^2) / sum((yr_trend$n - mean(yr_trend$n))^2))
seg_rmse <- sqrt(mean(seg_residuals^2))

seg_metrics <- data.frame(R2 = seg_r_squared, AIC = AIC(segmented_fit), BIC = BIC(segmented_fit), RMSE = seg_rmse)

# Combine results
results <- rbind(
  data.frame(Model = "Linear Regression", lm_metrics),
  data.frame(Model = "Polynomial Regression (3rd)", poly_metrics),
  data.frame(Model = "Segmented Regression", seg_metrics)
)

# Print results
print(results)
##                         Model        R2      AIC      BIC     RMSE
## 1           Linear Regression 0.7121409 242.0621 245.1956 66.79224
## 2 Polynomial Regression (3rd) 0.8731443 228.8542 234.0768 44.33952
## 3        Segmented Regression 0.8943537 225.0122 230.2348 40.46347

Looking over the R², AIC, BIC, and RMSE, every metric points the same way: the segmented regression fits the trend best, with the highest R² and the lowest AIC, BIC, and RMSE. It is always so pleasing to have a project come together this nicely.