Topic 6: IP weighting in practice

Learning Goals

  1. Practice an analysis workflow from start to finish
  2. Formulate generalized causal research questions using marginal structural models
  3. Interpret results from fitting MSMs





Discussion

Last time, we derived the following result:

If treatment \(A\) and outcome \(Y\) are d-separated given \(Z\) under the null, we can estimate desired “do” probabilities with a weighted average of the outcomes:

\[ P(Y = 1 \mid \hbox{do}(A = a)) = \frac{\sum_i w_i y_i}{\sum_i w_i} \]

where the weights \(w_i\) are the inverse propensity scores:

  • \(w_i = 1/P(A = 1 \mid Z)\) if \(a = 1\)
  • \(w_i = 1/P(A = 0 \mid Z)\) if \(a = 0\)

When \(Z\) contains many variables, we need to model treatment as a function of the variables in \(Z\) to obtain estimates of the propensity scores. We can use:

  • Logistic regression
  • Other techniques from Statistical Machine Learning
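
For instance, here is a minimal sketch with a binary treatment a, binary outcome y, and two confounders z1 and z2 in a hypothetical data frame dat (placeholder names, not the NHEFS analysis below):

# Propensity score model: estimate P(A = 1 | Z) with logistic regression
ps_mod <- glm(a ~ z1 + z2, family = binomial, data = dat)
dat$ps <- predict(ps_mod, type = "response")

# Inverse probability weights: 1/P(A = 1 | Z) for the treated, 1/P(A = 0 | Z) for the untreated
dat$w <- ifelse(dat$a == 1, 1 / dat$ps, 1 / (1 - dat$ps))

# Weighted averages of the outcome among those with A = a estimate the "do" probabilities
sum(dat$w[dat$a == 1] * dat$y[dat$a == 1]) / sum(dat$w[dat$a == 1])  # P(Y = 1 | do(A = 1))
sum(dat$w[dat$a == 0] * dat$y[dat$a == 0]) / sum(dat$w[dat$a == 0])  # P(Y = 1 | do(A = 0))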





In particular, this weighted average

\[ P(Y = 1 \mid \hbox{do}(A = a)) = \frac{\sum_i w_i y_i}{\sum_i w_i} \]

is specific to the situation where \(Y\) is a binary outcome variable, and we want to know \(P(Y = 1)\) (which also happens to be equal to the expected value of \(Y\): \(E[Y] = P(Y = 1)\)).

Inverse probability weighting works more generally:

  • Treatment \(A\) does not have to be binary
    • A continuous treatment is generally more difficult to work with (the propensity score becomes a conditional density rather than a probability)
    • Categorical treatments with 3+ categories are naturally handled with extensions of logistic regression (e.g., multinomial regression, other statistical machine learning methods)
  • Outcome \(Y\) does not have to be binary





Recall from our tree diagrams (see our worksheet with solutions here) that inverse probability weighting:

  • upweights the treated to create a population where everyone had been treated
  • upweights the untreated to create a population where everyone had been untreated

In these two populations, we counted up the number of instances where \(Y=1\) to obtain our “do” probabilities of interest.

  • In essence, the IP weighting created these populations of all treated and all untreated.
    • Considered together, these two populations are called a pseudopopulation because everyone exists twice (as a treated and as an untreated individual).
  • In this pseudopopulation, we essentially fit the simple logistic regression model:

\[ \log\hbox{odds}(Y = 1) = \beta_0 + \beta_1 A \]

  • So in summary: we fit a logistic regression model using inverse probability weights.
    • Each individual received the appropriate \(1/P(A=1\mid Z)\) or \(1/P(A=0\mid Z)\) weight.
    • Normally with ordinary unweighted logistic regression, every individual receives a weight of 1. (They appear exactly once in the dataset.)
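
As a small sketch of what this looks like in R (assuming a hypothetical data frame dat with binary outcome y, binary treatment a, and IP weights w computed as above):

# Weighted logistic regression of the outcome on treatment alone
pseudo_fit <- glm(y ~ a, family = binomial, data = dat, weights = w)
coef(pseudo_fit)  # estimates of beta_0 and beta_1 on the log-odds scale

Note that glm() will warn about non-integer weights, and its standard errors do not properly account for the weighting; this is why we use the geepack package in the Analysis section below.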





We can make this idea more general:

  • The inverse probability weights effectively repeat certain individuals a number of times so that their data “represent” similar individuals who received a different value of the treatment variable.
    • The IP weights create the pseudopopulation where each individual receives every version of the treatment variable.
  • We can fit any model we want (e.g., linear regression, logistic regression) in this pseudopopulation to relate the outcome to the treatment variable.
    • The process of fitting such a model (i.e., fitting a standard regression model supplemented with IP weights) is the process of fitting marginal structural models.





New notation:

  • The \(Y\) that results when we \(\hbox{do}(A=1)\) can also be notated as \(Y^{a=1}\), called the potential outcome under treatment.
  • The \(Y\) that results when we \(\hbox{do}(A=0)\) can also be notated as \(Y^{a=0}\), called the potential outcome under no treatment.
  • In general the potential outcome when we \(\hbox{do}(A=a)\) is denoted \(Y^a\).
    • In words, this potential outcome means: “the outcome that would result if we intervened to set \(A=a\).”

A marginal structural model (MSM) is a model that relates potential outcomes \(Y^a\) to the treatment variable (as opposed to the naturally observed outcomes \(Y\)):

  • A linear regression MSM for continuous/quantitative outcomes \(Y\):

\[ E[Y^a] = \beta_0 + \beta_1 a \]

  • A logistic regression MSM for binary outcomes \(Y\):

\[ \log\left( \frac{P(Y^a = 1)}{1-P(Y^a = 1)} \right) = \beta_0 + \beta_1 a \]
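
In both MSMs, \(\beta_1\) compares potential outcomes under treatment and under no treatment. In the linear MSM,

\[ \beta_1 = E[Y^{a=1}] - E[Y^{a=0}], \]

which is the average causal effect; in the logistic MSM, \(\beta_1\) is the log of the causal odds ratio.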



Where does the name “marginal structural model” come from?

  • Marginal: refers to the fact that the marginal mean (as opposed to a conditional mean, which is a mean in a subgroup) is being modeled (much like how it’s used in “marginally dependent/independent”)
  • Structural: in causal inference, models that describe potential outcomes are generally referred to as “structural” to indicate that they describe causal relationships, not just associations





Analysis

A template Rmd is available here.

Our data come from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS).

library(readr)
library(dplyr)
library(ggplot2)

nhefs <- read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv")

Download the codebook for the dataset here, and have it open for reference.

Research goal: What is the average causal effect of smoking cessation on weight gain at a follow-up visit about 10 years later?

Part 1: Getting a feel for the data

Tabulate the treatment variable qsmk, visualize the outcome variable, and get summary statistics for the outcome variable.

How many missing values are there for these variables? What key implication does this have for our DAG?

# Tabulate with the count() function from dplyr
# count() does show NAs, if present
nhefs %>%
    count(qsmk)

# Construct a visualization of the weight gain variable
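# One possible option (just a sketch; other plots would also work):
# a histogram of the weight change variable wt82_71
ggplot(nhefs, aes(x = wt82_71)) +
    geom_histogram(bins = 30)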


# Summary statistics for the weight gain variable
summary(nhefs$wt82_71)

We will need to exclude cases with missing data for the outcome. Work with the nhefs_subs subset for the remainder of this analysis:

nhefs_subs <- nhefs %>%
    filter(!is.na(wt82_71))

Let’s look at the variable distributions within the treated (qsmk = 1: the quitters) and the untreated (qsmk = 0).

# Here we enumerate the quantitative and the categorical variables
quant_vars <- c("age", "alcoholhowmuch", "cholesterol", "ht", "pregnancies", "price71", "smokeintensity", "smokeyrs", "tax71", "wt71")

categ_vars <- c("active", "alcoholfreq", "alcoholpy", "alcoholtype", "allergies", "asthma", "boweltrouble", "bronch", "chroniccough", "colitis", "diabetes", "education", "exercise", "hayfever", "hbp", "hbpmed", "headache", "hepatitis", "hf", "hightax82", "income", "infection", "lackpep", "marital", "nerves", "nervousbreak", "otherpain", "pepticulcer", "pica", "polio", "race", "school", "sex", "tb", "tumor", "weakheart", "wtloss")

# Compare the means of the quantitative variables in the treated and untreated
# (na.rm = TRUE because some of these variables have missing values)
nhefs_subs %>%
    group_by(qsmk) %>%
    summarize_at(.vars = quant_vars, .funs = mean, na.rm = TRUE) %>%
    as.data.frame()

# Compare the distributions of the categorical variables in the treated and untreated
# First row numbers are P(covariate | qsmk=0)
# Second row numbers are P(covariate | qsmk=1)
for (c_var in categ_vars) {
    cat(c_var, ":\n")
    table(qsmk = nhefs_subs$qsmk, nhefs_subs[[c_var]], useNA = "ifany") %>% prop.table(margin = 1) %>% print()
    cat("\n")
}

How could you use these data summaries to help with DAG construction (coming next)?




Part 2: Causal DAG construction

Building the causal DAG from start to finish would normally be a much longer process. So that we can get to the data analysis practice, we’ll abbreviate this crucial step. We’ll start with the following d-separating set:

  • sex
  • age
  • race
  • education
  • smokeintensity
  • smokeyrs
  • active
  • exercise
  • wt71

Open up the web version of DAGitty, which we’ll use to draw our causal diagram.

  • In the top menu bar, click “Model” > “New model”.
  • Also go through the “How to …” menu to see how to add variables and edges.
  • Add the treatment and outcome nodes as well as the above variables. Also add the relevant edges.
  • Looking briefly through the codebook and your data summaries above, do you think that there are other key variables to include in the DAG?
  • There is potential selection bias at play. (There were missing values for the response variable.) Add a selection node and give it the “adjusted” status (under the left “Variable” menu). What factors do you think are related to not showing up for the second study visit (and thus not having weight gain measured)? Based on this add edges between the selection node and other variables.
  • What is your d-separating set?

So that you can come back to your DAG easily, copy and paste the “Model code” in the right side menu into a separate text document.




Part 3: Propensity score modeling

In order to fit marginal structural models (MSMs) that allow us to estimate average causal effects, we need to estimate the propensity scores.

Make visualizations to inform the nature of the relationship between treatment and the quantitative predictors. For example, you can use code like the following to see if the observed probabilities (in blue) match up with those predicted by a logistic regression model with a quadratic function of age (in red). The y~poly(x,2) could be changed to y~x to see the results of having age linearly related to the log odds of treatment.

State your conclusions from these visualizations.

ggplot(nhefs_subs, aes(x = age, y = qsmk)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE, color = "blue") +
    geom_smooth(formula = y~poly(x,2), method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "red")

Based on your visualizations, construct an appropriate propensity score model. To include polynomial relationships in the model you can use +poly(age, 2), for example, in the model formula.

Compute the estimated propensity scores and inverse probability weights as we did in our simulations. Call the weight variable weight1. Here you’ll want to check if nhefs_subs$qsmk==1 or nhefs_subs$qsmk==0. Store the IP weights in your nhefs_subs dataset.
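
A sketch of what this might look like follows. The model formula is only a placeholder based on the starter d-separating set from Part 2; your own propensity score model should reflect your DAG and the visualizations above.

# Propensity score model for P(qsmk = 1 | Z); covariates and polynomial terms are placeholders
ps_mod <- glm(
    qsmk ~ sex + race + factor(education) + factor(exercise) + factor(active) +
        poly(age, 2) + poly(smokeintensity, 2) + poly(smokeyrs, 2) + poly(wt71, 2),
    family = binomial,
    data = nhefs_subs
)

# Estimated propensity scores and the corresponding IP weights, stored in nhefs_subs
nhefs_subs <- nhefs_subs %>%
    mutate(
        prop_score = predict(ps_mod, newdata = nhefs_subs, type = "response"),
        weight1 = ifelse(qsmk == 1, 1 / prop_score, 1 / (1 - prop_score))
    )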




Part 4: Fitting marginal structural models

Install the geepack package, which we’ll be using to fit MSMs. This package fits linear and logistic regression models like lm() and glm() do, but it uses a robust (“sandwich”) estimate of the standard error that remains valid in the presence of weights.

The code below fits the MSM:

\[ E[Y^a] = \beta_0 + \beta_1 \hbox{qsmk} \]

library(geepack)
msm_fit1 <- geeglm(
    wt82_71 ~ qsmk,
    data = nhefs_subs,
    weights = weight1,
    id = seqn,
    corstr = "independence"
)
summary(msm_fit1)

  • Interpret both the intercept and qsmk coefficients.
  • Confidence intervals are not displayed in the output, so we’ll compute them by hand, as sketched after this list. (p-values are displayed, but confidence intervals are nice because they convey both the estimate itself and its uncertainty.) The coefficient estimates are approximately normally distributed with mean equal to the true population value and standard deviation equal to the standard error. Use the qnorm() function to obtain the 95% confidence interval for the qsmk coefficient.
  • Summarize your results from this analysis.
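
Here is a sketch of the by-hand confidence interval computation for the qsmk coefficient (it assumes msm_fit1 from above; the column names match the coefficient table printed by summary() for geeglm fits):

# Pull the estimate and standard error for the qsmk coefficient
coef_tab <- summary(msm_fit1)$coefficients
est <- coef_tab["qsmk", "Estimate"]
se <- coef_tab["qsmk", "Std.err"]

# 95% confidence interval: estimate +/- qnorm(0.975) * SE (qnorm(0.975) is about 1.96)
est + qnorm(c(0.025, 0.975)) * se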

Adapt the above code to fit an effect modification MSM:

\[ E[Y^a] = \beta_0 + \beta_1\,\hbox{qsmk} + \beta_2\,\hbox{sex} + \beta_3\,\hbox{qsmk}\times \hbox{sex} \]

Interpret both the qsmk and interaction coefficients. Compute confidence intervals for those coefficients, and summarize your results.
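
One way to adapt the code (a sketch; the formula qsmk * sex expands to the two main effects plus their interaction, and it reuses the weight1 variable from Part 3):

msm_fit2 <- geeglm(
    wt82_71 ~ qsmk * sex,
    data = nhefs_subs,
    weights = weight1,
    id = seqn,
    corstr = "independence"
)
summary(msm_fit2)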