Topic 6: IP weighting in practice
Learning Goals
- Practice an analysis workflow from start to finish
- Formulate generalized causal research questions using marginal structural models
- Interpret results from fitting MSMs
Discussion
Last time, we derived the following result:
If treatment \(A\) and outcome \(Y\) are d-separated given \(Z\) under the null, we can estimate desired “do” probabilities with a weighted average of the outcomes:
\[ P(Y = 1 \mid \hbox{do}(A = a)) = \frac{\sum_i w_i y_i}{\sum_i w_i} \]
where the weights \(w_i\) are the inverse propensity scores:
- \(w_i = 1/P(A = 1 \mid Z)\) if \(a = 1\)
- \(w_i = 1/P(A = 0 \mid Z)\) if \(a = 0\)
When \(Z\) contains many variables, we need to model treatment as a function of the variables in \(Z\) to obtain estimates of the propensity scores. We can use:
- Logistic regression
- Other techniques from statistical machine learning
In particular, this weighted average
\[ P(Y = 1 \mid \hbox{do}(A = a)) = \frac{\sum_i w_i y_i}{\sum_i w_i} \]
is specific to the situation where \(Y\) is a binary outcome variable, and we want to know \(P(Y = 1)\) (which also happens to be equal to the expected value of \(Y\): \(E[Y] = P(Y = 1)\)).
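To make this concrete, here is a minimal R sketch, assuming vectors y (0/1 outcomes), a (0/1 treatments), and p (estimated propensity scores \(P(A = 1 \mid Z)\)), all of the same length; these names are illustrative:

# IP weight: 1 / P(A = observed treatment | Z) for each individual
w <- ifelse(a == 1, 1 / p, 1 / (1 - p))

# Estimated P(Y = 1 | do(A = 1)): weighted average of y among the treated
sum(w[a == 1] * y[a == 1]) / sum(w[a == 1])

# Estimated P(Y = 1 | do(A = 0)): weighted average of y among the untreated
sum(w[a == 0] * y[a == 0]) / sum(w[a == 0])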
Inverse probability weighting works more generally:
- Treatment \(A\) does not have to be binary
  - A continuous treatment is generally difficult to work with
  - Categorical treatments with 3+ categories are naturally handled with extensions of logistic regression (e.g., multinomial regression, other statistical machine learning methods; see the sketch after this list)
- Outcome \(Y\) does not have to be binary
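For instance, propensity scores for a categorical treatment could be estimated with multinomial regression. A minimal sketch using the nnet package, where the data frame dat, treatment factor trt, and confounder z are all hypothetical:

library(nnet)

# Multinomial propensity model for a treatment with 3+ categories
fit <- multinom(trt ~ z, data = dat)
probs <- predict(fit, type = "probs")  # matrix: one column per treatment level

# Each individual's IP weight: 1 / P(observed treatment | z)
w <- 1 / probs[cbind(seq_len(nrow(dat)), as.integer(dat$trt))]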
Recall from our tree diagrams that inverse probability weighting (see our worksheet with solutions here)
- upweights the treated to create a population in which everyone was treated
- upweights the untreated to create a population in which everyone was untreated
In these two populations, we counted up the number of instances where \(Y=1\) to obtain our “do” probabilities of interest.
- In essence, the IP weighting created these populations of all treated and all untreated.
- Considered together, these two populations are called a pseudopopulation because everyone exists twice (as a treated and as an untreated individual).
- In this pseudopopulation, we essentially fit the simple logistic regression model:
\[ \log\hbox{odds}(Y = 1) = \beta_0 + \beta_1 A \]
- So in summary: we fit a logistic regression model using inverse probability weights.
- Each individual received the appropriate \(1/P(A=1\mid Z)\) or \(1/P(A=0\mid Z)\) weight.
- Normally, with ordinary unweighted logistic regression, every individual receives a weight of 1. (They appear exactly once in the dataset.) A sketch of the weighted fit follows below.
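In R, such a weighted fit could look like the following sketch (the data frame dat with outcome y, treatment a, and IP weights w is hypothetical). Note that glm()'s standard errors are not trustworthy in the presence of IP weights, which is why we use the geepack package later:

# IP-weighted logistic regression of the outcome on the treatment.
# quasibinomial() gives the same estimates as binomial() but avoids
# the "non-integer #successes" warning caused by non-integer weights.
fit <- glm(y ~ a, family = quasibinomial(), data = dat, weights = w)
coef(fit)  # beta_0 and beta_1 on the log odds scale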
We can make this idea more general:
- The inverse probability weights effectively repeat certain individuals a number of times so that their data “represents” the other individuals who had received another value of the treatment variable.
- The IP weights create the pseudopopulation where each individual receives every version of the treatment variable.
- We can fit any model we want (e.g., linear regression, logistic regression) in this pseudopopulation to relate the outcome to the treatment variable.
- Fitting such a model (i.e., a standard regression model supplemented with IP weights) is the process of fitting marginal structural models.
New notation:
- The \(Y\) that results when we \(\hbox{do}(A=1)\) can also be notated as \(Y^{a=1}\), called the potential outcome under treatment.
- The \(Y\) that results when we \(\hbox{do}(A=0)\) can also be notated as \(Y^{a=0}\), called the potential outcome under no treatment.
- In general the potential outcome when we \(\hbox{do}(A=a)\) is denoted \(Y^a\).
- In words, this potential outcome means: “the outcome that would result if we intervened to set \(A=a\)”
A marginal structural model (MSM) is a model that relates potential outcomes \(Y^a\) to the treatment variable (as opposed to the naturally observed outcomes \(Y\)):
- A linear regression MSM for continuous/quantitative outcomes \(Y\):
\[ E[Y^a] = \beta_0 + \beta_1 a \]
- A logistic regression MSM for binary outcomes \(Y\):
\[ \log\left( \frac{P(Y^a = 1)}{1-P(Y^a = 1)} \right) = \beta_0 + \beta_1 a \]
Where does the name “marginal structural model” come from?
- Marginal: refers to the fact that the marginal mean (as opposed to a conditional mean, which is a mean within a subgroup) is being modeled (much like how it’s used in “marginally dependent/independent”)
- Structural: models in causal inference that describe potential outcomes are generally called “structural” to indicate that they describe causal relationships rather than mere associations
Analysis
A template Rmd is available here.
Our data come from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS).
library(readr)
library(dplyr)
library(ggplot2)

nhefs <- read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv")
Download the codebook for the dataset here, and have it open for reference.
Research goal: What is the average causal effect of smoking cessation on weight gain at a follow-up visit about 10 years later?
Part 1: Getting a feel for the data
Tabulate the treatment variable qsmk, visualize the outcome variable, and get summary statistics for the outcome variable.
How many missing values are there for these variables? What key implication does this have for our DAG?
# Tabulate with the count() function from dplyr
# count() does show NAs, if present
nhefs %>%
  count(qsmk)

# Construct a visualization of the weight gain variable (one option is sketched below)

# Summary statistics for the weight gain variable
summary(nhefs$wt82_71)
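For the visualization, one possibility among many is a histogram of the weight gain variable:

# Histogram of the weight gain variable
ggplot(nhefs, aes(x = wt82_71)) +
  geom_histogram()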
We will need to exclude cases with missing data for the outcome. Work with the nhefs_subs dataset for the remainder of this analysis:

nhefs_subs <- nhefs %>%
  filter(!is.na(wt82_71))
Let’s look at the variable distributions within the treated (qsmk = 1: the quitters) and the untreated (qsmk = 0).
# Here we enumerate the quantitative and the categorical variables
quant_vars <- c("age", "alcoholhowmuch", "cholesterol", "ht", "pregnancies", "price71", "smokeintensity", "smokeyrs", "tax71", "wt71")
categ_vars <- c("active", "alcoholfreq", "alcoholpy", "alcoholtype", "allergies", "asthma", "boweltrouble", "bronch", "chroniccough", "colitis", "diabetes", "education", "exercise", "hayfever", "hbp", "hbpmed", "headache", "hepatitis", "hf", "hightax82", "income", "infection", "lackpep", "marital", "nerves", "nervousbreak", "otherpain", "pepticulcer", "pica", "polio", "race", "school", "sex", "tb", "tumor", "weakheart", "wtloss")

# Compare the means of the quantitative variables in the treated and untreated
nhefs_subs %>%
  group_by(qsmk) %>%
  summarize_at(.vars = quant_vars, .funs = mean) %>%
  as.data.frame()

# Compare the distributions of the categorical variables in the treated and untreated
# First row numbers are P(covariate | qsmk=0)
# Second row numbers are P(covariate | qsmk=1)
for (c_var in categ_vars) {
  cat(c_var, ":\n")
  table(qsmk = nhefs_subs$qsmk, nhefs_subs[[c_var]], useNA = "ifany") %>%
    prop.table(margin = 1) %>%
    print()
  cat("\n")
}
How could you use these data summaries to help with DAG construction (coming next)?
Part 2: Causal DAG construction
Building the causal DAG from start to finish would normally be a much longer process. So that we can get to the data analysis practice, we’ll abbreviate this crucial step. We’ll start with the following d-separating set:
sex, age, race, education, smokeintensity, smokeyrs, active, exercise, wt71
Open up the web version of DAGitty, which we’ll use to draw our causal diagram.
- In the top menu bar, click “Model” > “New model”.
- Also go through the “How to …”
- Add the treatment and outcome nodes as well as the above variables. Also add the relevant edges.
- Looking briefly through the codebook and your data summaries above, do you think that there are other key variables to include in the DAG?
- There is potential selection bias at play. (There were missing values for the response variable.) Add a selection node and give it the “adjusted” status (under the left “Variable” menu). What factors do you think are related to not showing up for the second study visit (and thus not having weight gain measured)? Based on this add edges between the selection node and other variables.
- What is your d-separating set?
So that you can come back to your DAG easily, copy and paste the “Model code” in the right side menu into a separate text document.
Part 3: Propensity score modeling
In order to fit marginal structural models (MSMs) that allow us to estimate average causal effects, we need to estimate the propensity scores.
Make visualizations to inform the nature of the relationship between treatment and the quantitative predictors. For example, you can use code like the following to see if the observed probabilities (in blue) match up with those predicted by a logistic regression model with a quadratic function of age (in red). The y~poly(x,2) could be changed to y~x to see the results of having age linearly related to the log odds of treatment.
State your conclusions from these visualizations.
ggplot(nhefs, aes(x = age, y = qsmk)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  geom_smooth(formula = y~poly(x,2), method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "red")
Based on your visualizations, construct an appropriate propensity score model. To include polynomial relationships in the model you can use +poly(age, 2), for example, in the model formula.
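As a sketch, a propensity score model built from the d-separating set above might look like the following; the particular polynomial terms and factor codings are illustrative assumptions, and yours should follow your DAG and visualizations:

# Propensity score model: treatment as a function of the d-separating set
ps_mod <- glm(
  qsmk ~ sex + race + poly(age, 2) + as.factor(education) +
    poly(smokeintensity, 2) + poly(smokeyrs, 2) +
    as.factor(exercise) + as.factor(active) + poly(wt71, 2),
  data = nhefs_subs,
  family = binomial
)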
Compute the estimated propensity scores and inverse probability weights as we did in our simulations. Call the weight variable weight1. Here you’ll want to check if nhefs_subs$qsmk==1 or nhefs_subs$qsmk==0. Store the IP weights in your nhefs_subs dataset.
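Continuing the sketch above (using the hypothetical ps_mod fit), the weights could be computed like this:

nhefs_subs <- nhefs_subs %>%
  mutate(
    ps = predict(ps_mod, newdata = nhefs_subs, type = "response"),  # P(qsmk = 1 | Z)
    weight1 = ifelse(qsmk == 1, 1 / ps, 1 / (1 - ps))
  )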
Part 4: Fitting marginal structural models
Install the geepack package, which we’ll be using to fit MSMs. This package fits linear and logistic regression models like lm() and glm() do, but it uses a better estimate of the standard error in the presence of weights.
The code below fits the MSM:
\[ E[Y^a] = \beta_0 + \beta_1 \hbox{qsmk} \]
library(geepack)

msm_fit1 <- geeglm(
  wt82_71 ~ qsmk,
  data = nhefs_subs,
  weights = weight1,
  id = seqn,
  corstr = "independence"
)
summary(msm_fit1)
- Interpret both the intercept and qsmk coefficients.
- Confidence intervals are not displayed in the output, so we’ll compute them by hand. (p-values are displayed, but it’ll be nice to have confidence intervals because they nicely convey both the estimate itself and its uncertainty.) The coefficient estimates are expected to be normally distributed with mean equal to the true population value and standard deviation equal to the standard error. Use the qnorm() function to obtain the 95% confidence interval for the qsmk coefficient. (A sketch follows this list.)
- Summarize your results from this analysis.
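A sketch of the confidence interval computation, assuming the coefficient table columns are named as printed by geepack’s summary():

# Pull the qsmk estimate and standard error from the coefficient table
coefs <- summary(msm_fit1)$coefficients
est <- coefs["qsmk", "Estimate"]
se <- coefs["qsmk", "Std.err"]

# 95% CI: estimate +/- qnorm(0.975) * SE (qnorm(0.975) is about 1.96)
est + c(-1, 1) * qnorm(0.975) * se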
Adapt the above code to fit an effect modification MSM:
\[ E[Y^a] = \beta_0 + \beta_1\,\hbox{qsmk} + \beta_2\,\hbox{sex} + \beta_3\,\hbox{qsmk}\times \hbox{sex} \]
Interpret both the qsmk and interaction coefficients. Compute confidence intervals for those coefficients, and summarize your results.
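One way to adapt the fitting code is sketched below; qsmk * sex expands to both main effects plus their interaction, matching the MSM above (this assumes sex is coded 0/1 in the dataset):

msm_fit2 <- geeglm(
  wt82_71 ~ qsmk * sex,
  data = nhefs_subs,
  weights = weight1,
  id = seqn,
  corstr = "independence"
)
summary(msm_fit2)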