Topic 7 IP weighting for censoring
Learning Goals
- Understand how IP weighting can be used to address the lack of generalizability caused by censoring
Discussion
Language for interpreting coefficients in MSMs
E[Ya]=β0+β1a
- Context: Y is cholesterol, A is medication (a = 1: on medication)
- β0: The average potential cholesterol level when all are not on medication
- β1: The average causal effect: the change in average potential cholesterol level if everyone were on medication compared to if no one were on medication
Do we really care about the ACE?
- Depends on research goals, but yes, sometimes ACE is of interest.
- Is it a nationwide policy? If so, then enacting the policy means “treating” everyone and not enacting means treating no one.
- May care about the ACE within subgroups. We’ve seen this with effect modification MSMs:
E[Ya]=β0+β1a+β2M+β3a×M
- Say M indicates history of heart disease (1: history. 0: no history)
- Question: What are the interpretation of β1 and β3 in this model?
What is the difference between regular and IP-weighted GLMs?
- Our tree diagrams showed that weighting allows us to create “populations” of all treated and all untreated.
- Our WHATIF book calls the combined population of the all treated and the all untreated a pseudopopulation
- We can model the outcomes in this pseudopopulation with standard regression models (marginal structural models).
- Allows us to estimate causal quantities: P(Y∣do(A=a)), E[Y∣do(A=a)] (in potential outcome notation: P(Ya), E[Ya])
- Standard regression models without IP weights do not estimate these causal quantities.
Censoring and selection bias
Reminder: NHEFS investigation from last time
Research goal: What is the average causal effect of smoking cessation on weight gain at a follow-up visit about 10 years later?
- There were some people who did not have their weight measured at Visit 2. They must be excluded from our analysis.
- Define a censoring indicator C where C=1 means censored/excluded and C=0 means uncensored/included.
- Our analysis only generalizes to uncensored individuals–the types of people who would stay in the study!
- Before: just wanted to find Z such that conditional on Z (within subsets defined by Z), the outcomes of the treated and untreated were comparable.
- Could appropriately upweight the treated and untreated to get “populations” of all treated and all untreated.
- Now: also want to ensure that the outcomes in the censored and uncensored are comparable.
- Can appropriately upweight the uncensored to represent everyone had they not been censored.
- The censored get zero weight. (We don’t observe them!)
- When are the outcomes in the censored and uncensored comparable?
- When variables in A and Z2 together d-separate C and the outcome Y.
- e.g., The DAG below. View C as the new “treatment” variable.
- Another way to put it: when we put a box around C in the DAG, Z must d-separate A and Y under the null.
Same condition as before, now explicit about paths with conditioned-on colliders (non-backdoor paths).
Questions: C is inherently conditioned on in our analysis (C=0).
- What set Z d-separates the treatment and outcome under the null?
- What is a set Z2 such that Z2∪A (read Z2 “union” A, or Z2 and A together) d-separate C and the outcome?
Conclusion:
- Z should include:
sex
,age
,race
,education
,smokeintensity
,smokeyrs
,active
,exercise
,wt71
, andVidGames
(Same as before with the addition ofVidGames
) - Z blocks spurious, non-causal paths between A (
qsmk
) and Y (wt82_71
)
- Z2 should include
VidGames
- Together with A, Z2 blocks spurious, non-causal paths between C and Y
- Note that Z2 is contained within Z.
- This will be the case when A is a cause of C (often true).
Ultimate goal: Want to upweight individuals using P(A,C=0∣Z)
- Why? Same rationale as our tree diagram with additional C=1 and C=0 branches.
- Allows us to estimate E[Ya,c=0]: expected potential outcome under treatment a and when no one has been censored (C=0)
We can express P(A,C=0∣Z) as the following:
P(A,C=0∣Z)=P(A,C=0,Z)/P(Z)=P(A∣Z)P(C=0∣A,Z)
- Based on the last line, we see (1) the propensity score and (2) another “propensity” of being uncensored.
- Will want models for both.
- (The censoring probability is not typically called a propensity score.)
A little extra theory (don’t worry about the math part below if you haven’t taken Probability):
Z may contain extra variables not in Z2 if there are, say, many common causes of A and Y, but only a few of these are a cause of C. Let’s say that these extra variables are in the set Z1. (Z=Z1∪Z2)
If the variables in Z1 are not causes of C, then C and Z1 are conditionally independent given A and Z2.
Going through the work below…
P(A,C=0∣Z)=P(A,C=0∣Z1,Z2)=P(A,C=0,Z1,Z2)/P(Z1,Z2)=P(A∣Z1,Z2)P(C=0∣A,Z1,Z2)=P(A∣Z1,Z2)P(C=0∣A,Z2)
…this implies that the propensity score model (for A) should depend on all variables in Z (the set Z1 and Z2 together) and that the censoring probability model only need depend on Z2.
Analysis plan:
Notation:
- Z is the set that d-separates A and Y when C is conditioned on.
- Z2 (a subset of Z) that, together with A, d-separates C and Y
Model the propensity scores P(A∣Z) using all variables Z in the d-separating set. Create inverse probability weights WA=1/P(A∣Z) (treatment weights).
Model the probability of censoring P(C∣A,Z2). Create inverse probability weights WC=1/P(C=0∣A,Z2) (censoring weights).
Create final weights (for treatment and censoring) WA,C=WA×WC.
Fit the desired marginal structural model with these final weights.
Exercises
A template Rmd is available here.
Setup
library(readr)
library(dplyr)
library(ggplot2)
library(geepack)
read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv") nhefs <-
Create a censoring indicator variable that will be TRUE
if the individual was censored and FALSE
otherwise.
$cens <- is.na(nhefs$wt82_71) nhefs
Throughout we’ll use the full nhefs
dataset, and we’ll work from the DAG below (C has been conditioned on):
Part 1: Modeling to obtain weights
Based on the DAG above, identify the set of variables Z that d-separates
qsmk
and weight gain.Fit an appropriate treatment propensity score model, and call this
ps_treat_mod
. For simplicity, assume that a quadratic relationship for the quantitative variables gives a good fit. Make sure to wrap categorical variables insidefactor()
in your model formula.Add a new variable to your dataset called
weights_treat
that contains the IP weights for treatment.Is it believable that quitting smoking (treatment) is a cause of censoring? How do the tabulations/calculations below help answer this?
# P(censored | quitters) and P(censored | nonquitter) table(qsmk = nhefs$qsmk, cens = nhefs$cens) table(qsmk = nhefs$qsmk, cens = nhefs$cens) %>% prop.table(margin = 1)
How would you see if the data support the
age --> C
andeducation --> C
arrows? Describe, but don’t actually perform this analysis.Fit an appropriate model for the probability of censoring, and call this
prob_cens_mod
. For simplicity, assume that a quadratic relationship for the quantitative variables gives a good fit. Make sure to wrap categorical variables insidefactor()
in your model formula.Add a new variable to your dataset called
weights_cens
that contains the IP weights for censoring. (Note that in R, the default for logistic regression is to model the probability of the outcome variable equaling 1.)Add a final weight variable to your dataset called
weight_TC
.
Part 2: Fitting MSMs
Fit the following two MSMs (A is qsmk
, and for sex
, 0 indicates males and 1 indicates females):
E[Ya,c=0]=β0+β1a E[Ya,c=0]=β0+β1a+β2sex+β3a×sex
Note: In our previous analysis where we worked with nhefs_subs
, we had actually fit the following:
E[Ya∣C=0]=β0+β1a E[Ya∣C=0]=β0+β1a+β2sex+β3a×sex
Refit these models here using the relevant weights. How do your results compare?