Skip to contents

This vignette demonstrates how to use multibias to adjust for uncontrolled confounding in a real-world dataset from the Evans County Heart Study. It specifically showcases how to reason and derive bias parameters that can be used to adjust for the uncontrolled confounding. We’ll examine the relationship between smoking and coronary heart disease (CHD), showing how failing to account for age as a confounder can bias our estimates and how multibias can be used to arrive at the unbiased effect estimate despite missing data on the confounder.

The Evans County Heart Study was a prospective cohort study conducted in Evans County, Georgia, from 1960 to 1969. The study aimed to investigate risk factors for cardiovascular disease in a rural population. For this example, we’ll use a subset of the data focusing on 609 participants aged 40 and older.

The key variables in our analysis are:

  • SMK: Smoking status (1 = smoker, 0 = non-smoker)
  • CHD: Coronary heart disease status (1 = present, 0 = absent)
  • HPT: Hypertension status (1 = present, 0 = absent)
  • AGE: Age in years

Let’s load and examine the data:

evans <- read.csv("evans.csv")

summary_stats <- evans %>%
  summarise(
    n = n(),
    mean_age = mean(AGE),
    sd_age = sd(AGE),
    prop_smokers = mean(SMK),
    prop_chd = mean(CHD),
    prop_hpt = mean(HPT)
  )

print(summary_stats)
#>     n mean_age   sd_age prop_smokers  prop_chd  prop_hpt
#> 1 609 53.70608 9.258388     0.635468 0.1165846 0.4187192
head(evans)
#>   ID CHD AGE CHL SMK ECG DBP SBP HPT
#> 1 21   0  56 270   0   0  80 138   0
#> 2 31   0  43 159   1   0  74 128   0
#> 3 51   1  56 201   1   1 112 164   1
#> 4 71   0  64 179   1   0 100 200   1
#> 5 74   0  49 243   1   0  82 145   0
#> 6 91   0  46 252   1   0  88 142   0

This is clearly not the most comprehensive data, but, for purposes of demonstrating multibias, let’s proceed by pretending that our data was missing information on AGE. Let’s inspect the resulting smoking-CHD effect estimate, adjusted for hypertension (HPT).

biased_model <- glm(CHD ~ SMK + HPT,
  family = binomial(link = "logit"),
  data = evans
)
or <- round(exp(coef(biased_model)[2]), 2)
or_ci_low <- round(
  exp(coef(biased_model)[2] - 1.96 * summary(biased_model)$coef[2, 2]), 2
)
or_ci_high <- round(
  exp(coef(biased_model)[2] + 1.96 * summary(biased_model)$coef[2, 2]), 2
)

print(paste0("Biased Odds Ratio: ", or))
#> [1] "Biased Odds Ratio: 1.99"
print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")"))
#> [1] "95% CI: (1.12, 3.55)"

Can we anticipate whether this odds ratio without age-adjustment is biased towards or away from the null? Let’s consider the association of the uncontrolled confounder with the exposure and outcome.

cor(evans$SMK, evans$AGE)
#> [1] -0.1391298
cor(evans$CHD, evans$AGE)
#> [1] 0.1393077

In our data, AGE has a negative association with SMK (older people are less likely to be smokers) and a positive association with CHD (older people are more likely to have CHD). These opposite associations must be biasing the odds ratio towards the null, creating a distortion where those who are less likely to smoke are more likely to experience the outcome.

We’ll treat AGE as a binary indicator of over (1) or under (0) age 60. To adjust for the uncontrolled confounding from AGE, let’s refer to the appropriate bias model for a binary uncontrolled confounder: logit(P(U=1)) = α0 + α1X + α2Y + α2+jCj.

To derive the necessary bias parameters, let’s make the following assumption:

  • the odds of an age>60 is half as likely in smokers than non-smokers
  • the odds of an age>60 is 2.5x in those with CHD than those without CHD
  • the odds of an age>60 is 2x in those with HPT than those without HPT

To convert these relationships as parameters in the model, we’ll log-transform them from odds ratios. For the model intercept, we can use the following reasoning: what is the probability that a non-smoker (X=0) without CHD (Y=0) and HPT (C=0) is over age 60 in this population? We’ll assume this is a 25% probability. We’ll use the inverse logit function qlogis() from the stats package to convert this from a probability to the intercept coefficient of the logistic regression model.

u_0 <- qlogis(0.25)
u_x <- log(0.5)
u_y <- log(2.5)
u_c <- log(2)

u_coefs <- list(u = c(u_0, u_x, u_y, u_c))

Now let’s plug these bias parameters into multibias_adjust() along with our data_observed object to obtain a bias-adjusted effect estimate.

df_obs <- data_observed(
  data = evans,
  bias = "uc",
  exposure = "SMK",
  outcome = "CHD",
  confounders = "HPT"
)

set.seed(1234)
multibias_adjust(
  data_observed = df_obs,
  bias_params = bias_params(coef_list = u_coefs),
  bootstrap = TRUE,
  bootstrap_reps = 100
)
#> $estimate
#> [1] 2.229563
#> 
#> $ci
#> [1] 1.364743 3.962324

We get an odds ratio of 2.2. This matches our expectation that the bias adjustment would pull the odds ratio away from the null. How does this result compare to the result we would get if age wasn’t missing in the data and was incorporated in the outcome regression?

full_model <- glm(CHD ~ SMK + HPT + AGE,
  family = binomial(link = "logit"),
  data = evans
)
or <- round(exp(coef(full_model)[2]), 2)
or_ci_low <- round(
  exp(coef(biased_model)[2] - 1.96 * summary(full_model)$coef[2, 2]), 2
)
or_ci_high <- round(
  exp(coef(biased_model)[2] + 1.96 * summary(full_model)$coef[2, 2]), 2
)

print(paste0("Odds Ratio: ", or))
#> [1] "Odds Ratio: 2.31"
print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")"))
#> [1] "95% CI: (1.1, 3.6)"

Based on these results, it appears that the bias-adjusted odds ratio obtained via multibias is close to this complete-data odds ratio of 2.3.