Topic 7: IP weighting for censoring

Learning Goals

  1. Understand how IP weighting can be used to address the lack of generalizability caused by censoring





Discussion

Language for interpreting coefficients in MSMs

\[ E[Y^a] = \beta_0 + \beta_1 a \]

  • Context: Y is cholesterol, A is medication (a = 1: on medication)
  • \(\beta_0\): The average potential cholesterol level if no one were on medication
  • \(\beta_1\): The average causal effect: the change in average potential cholesterol level if everyone were on medication compared to if no one were on medication
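
To see why \(\beta_1\) is the average causal effect, plug \(a = 1\) and \(a = 0\) into the model and subtract:

\[ E[Y^1] - E[Y^0] = (\beta_0 + \beta_1\times 1) - (\beta_0 + \beta_1\times 0) = \beta_1 \]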





Do we really care about the ACE?

  • Depends on research goals, but yes, sometimes ACE is of interest.
    • Is it a nationwide policy? If so, then enacting the policy means “treating” everyone and not enacting means treating no one.
  • May care about the ACE within subgroups. We’ve seen this with effect modification MSMs:

\[ E[Y^a] = \beta_0 + \beta_1 a + \beta_2 M + \beta_3 a\times M \]

  • Say \(M\) indicates history of heart disease (1: history, 0: no history)
  • Question: What are the interpretations of \(\beta_1\) and \(\beta_3\) in this model?





What is the difference between regular and IP-weighted GLMs?

  • Our tree diagrams showed that weighting allows us to create “populations” of all treated and all untreated.
    • Our WHATIF book calls the combined population of the all treated and the all untreated a pseudopopulation
  • We can model the outcomes in this pseudopopulation with standard regression models (marginal structural models).
    • Allows us to estimate causal quantities: \(P(Y \mid \hbox{do}(A=a))\), \(E[Y \mid \hbox{do}(A=a)]\) (in potential outcome notation: \(P(Y^a)\), \(E[Y^a]\))
    • Standard regression models without IP weights do not estimate these causal quantities (see the sketch below).
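
As a rough illustration of the mechanics (not the NHEFS analysis), here is a minimal sketch assuming a hypothetical data frame dat with outcome Y, binary treatment A, and confounders Z1 and Z2; all of these names are placeholders. It uses geeglm() from geepack (loaded in the exercises below) so that the weighted fit reports robust (sandwich) standard errors.

# Hypothetical data: outcome Y, binary treatment A (0/1), confounders Z1, Z2
# Ordinary (unweighted) GLM: estimates the associational E[Y | A = a], not E[Y^a]
assoc_mod <- glm(Y ~ A, data = dat)

# Propensity score model and IP weights for treatment
ps_mod <- glm(A ~ Z1 + Z2, data = dat, family = binomial)
p_A1   <- predict(ps_mod, type = "response")            # P(A = 1 | Z)
dat$w  <- ifelse(dat$A == 1, 1 / p_A1, 1 / (1 - p_A1))  # 1 / P(own treatment | Z)

# IP-weighted GLM in the pseudopopulation: a marginal structural model
dat$id <- seq_len(nrow(dat))
msm <- geeglm(Y ~ A, data = dat, weights = w, id = id)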





Censoring and selection bias

Reminder: NHEFS investigation from last time

Research goal: What is the average causal effect of smoking cessation on weight gain at a follow-up visit about 10 years later?

  • There were some people who did not have their weight measured at Visit 2. They must be excluded from our analysis.
    • Define a censoring indicator \(C\) where \(C = 1\) means censored/excluded and \(C = 0\) means uncensored/included.
    • Our analysis only generalizes to uncensored individuals: the types of people who would stay in the study!

  • Before: just wanted to find \(Z\) such that conditional on \(Z\) (within subsets defined by \(Z\)), the outcomes of the treated and untreated were comparable.
    • Could appropriately upweight the treated and untreated to get “populations” of all treated and all untreated.


  • Now: also want to ensure that the outcomes in the censored and uncensored are comparable.
    • Can appropriately upweight the uncensored to represent everyone, as if no one had been censored (see the small example below).
    • The censored get zero weight. (We don’t observe them!)
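
For example, if a certain type of individual has an 80% chance of remaining uncensored, then each uncensored individual of that type gets a censoring weight of \(1/0.8 = 1.25\): they stand in for themselves plus a fraction of a similar individual who was censored.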


  • When are the outcomes in the censored and uncensored comparable?
    • When \(A\) and the variables in \(Z_2\) together d-separate \(C\) and the outcome \(Y\).
    • e.g., the DAG below. View \(C\) as the new “treatment” variable.
    • Another way to put it: when we put a box around \(C\) in the DAG, \(Z\) must d-separate \(A\) and \(Y\) under the null.
      This is the same condition as before, but now explicit about paths with conditioned-on colliders (non-backdoor paths).

Questions (recall that \(C\) is inherently conditioned on in our analysis, since only those with \(C = 0\) are included):

  • What set \(Z\) d-separates the treatment and outcome under the null?
  • What is a set \(Z_2\) such that \(Z_2 \cup A\) (read \(Z_2\) “union” \(A\), or \(Z_2\) and \(A\) together) d-separates \(C\) and the outcome?





Conclusion:

  • \(Z\) should include:
    sex, age, race, education, smokeintensity, smokeyrs, active, exercise, wt71, and VidGames
    (Same as before with the addition of VidGames)
  • \(Z\) blocks spurious, non-causal paths between \(A\) (qsmk) and \(Y\) (wt82_71)


  • \(Z_2\) should include VidGames
  • Together with \(A\), \(Z_2\) blocks spurious, non-causal paths between \(C\) and \(Y\)
  • Note that \(Z_2\) is contained within \(Z\).
    • This will be the case when \(A\) is a cause of \(C\) (often true).





Ultimate goal: Want to upweight individuals by the inverse of \(P(A, C = 0 \mid Z)\)

  • Why? Same rationale as our tree diagram with additional \(C = 1\) and \(C = 0\) branches.
  • Allows us to estimate \(E[Y^{a, c = 0}]\): expected potential outcome under treatment \(a\) and when no one has been censored (\(C = 0\))





We can express \(P(A, C = 0 \mid Z)\) as the following:

\[ \begin{align*} P(A, C = 0 \mid Z) &= P(A, C = 0, Z)/P(Z) \\ &= P(A \mid Z) P(C = 0 \mid A, Z) \end{align*} \]

  • Based on the last line, we see (1) the propensity score and (2) another “propensity” of being uncensored.
  • Will want models for both.
  • (The censoring probability is not typically called a propensity score.)
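
Inverting this factorization shows how the overall weight splits into a treatment piece and a censoring piece (matching the analysis plan below):

\[ \frac{1}{P(A, C = 0 \mid Z)} = \frac{1}{P(A \mid Z)} \times \frac{1}{P(C = 0 \mid A, Z)} \]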





A little extra theory (don’t worry about the math part below if you haven’t taken Probability):

\(Z\) may contain extra variables not in \(Z_2\) if there are, say, many common causes of \(A\) and \(Y\), but only a few of these are a cause of \(C\). Let’s say that these extra variables are in the set \(Z_1\). (\(Z = Z_1 \cup Z_2\))

If the variables in \(Z_1\) are not causes of \(C\), then \(C\) and \(Z_1\) are conditionally independent given \(A\) and \(Z_2\).

Going through the work below…

\[ \begin{align*} P(A, C = 0 \mid Z) &= P(A, C = 0 \mid Z_1, Z_2) \\ &= P(A, C = 0, Z_1, Z_2)/P(Z_1, Z_2) \\ &= P(A \mid Z_1, Z_2) P(C = 0 \mid A, Z_1, Z_2) \\ &= P(A \mid Z_1, Z_2) P(C = 0 \mid A, Z_2) \end{align*} \]

…this implies that the propensity score model (for \(A\)) should depend on all variables in \(Z\) (the sets \(Z_1\) and \(Z_2\) together) and that the censoring probability model need only depend on \(A\) and \(Z_2\).





Analysis plan:

Notation:

  • \(Z\) is the set that d-separates \(A\) and \(Y\) when \(C\) is conditioned on.
  • \(Z_2\) is a subset of \(Z\) that, together with \(A\), d-separates \(C\) and \(Y\).
  1. Model the propensity scores \(P(A\mid Z)\) using all variables \(Z\) in the d-separating set. Create inverse probability weights \(W^A = 1/P(A\mid Z)\) (treatment weights).

  2. Model the probability of censoring \(P(C \mid A, Z_2)\). Create inverse probability weights \(W^C = 1/P(C = 0\mid A, Z_2)\) (censoring weights).

  3. Create final weights (for treatment and censoring) \(W^{A,C} = W^A\times W^C\).

  4. Fit the desired marginal structural model with these final weights. (A generic code sketch of these four steps follows below.)
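
A minimal sketch of these four steps, using a hypothetical data frame dat with outcome Y, binary treatment A, censoring indicator C, and placeholder confounders Z1, Z2, Z3 (none of these are the NHEFS answers; the exercises below walk through the real analysis):

# 1. Treatment propensity score model and weights (placeholder set Z = {Z1, Z2, Z3})
ps_treat <- glm(A ~ Z1 + Z2 + Z3, data = dat, family = binomial)
p_A1     <- predict(ps_treat, type = "response")
dat$w_A  <- ifelse(dat$A == 1, 1 / p_A1, 1 / (1 - p_A1))

# 2. Censoring model and weights (suppose Z2 plays the role of the censoring adjustment set)
cens_mod <- glm(C ~ A + Z2, data = dat, family = binomial)
p_C1     <- predict(cens_mod, type = "response")  # P(C = 1 | A, Z2)
dat$w_C  <- 1 / (1 - p_C1)                        # 1 / P(C = 0 | A, Z2)

# 3. Final weights
dat$w_AC <- dat$w_A * dat$w_C

# 4. Fit the MSM among the uncensored with the final weights
dat$id <- seq_len(nrow(dat))
msm <- geeglm(Y ~ A, data = subset(dat, C == 0), weights = w_AC, id = id)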





Exercises

A template Rmd is available here.

Setup

library(readr)
library(dplyr)
library(ggplot2)
library(geepack)

nhefs <- read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv")

Create a censoring indicator variable that will be TRUE if the individual was censored and FALSE otherwise.

nhefs$cens <- is.na(nhefs$wt82_71)

Throughout we’ll use the full nhefs dataset, and we’ll work from the DAG below (\(C\) has been conditioned on):



Part 1: Modeling to obtain weights

  1. Based on the DAG above, identify the set of variables \(Z\) that d-separates qsmk and weight gain.

  2. Fit an appropriate treatment propensity score model, and call this ps_treat_mod. For simplicity, assume that a quadratic relationship for the quantitative variables gives a good fit. Make sure to wrap categorical variables inside factor() in your model formula.

  3. Add a new variable to your dataset called weights_treat that contains the IP weights for treatment.

  4. Is it believable that quitting smoking (treatment) is a cause of censoring? How do the tabulations/calculations below help answer this?

    # P(censored | quitters) and P(censored | nonquitter)
    table(qsmk = nhefs$qsmk, cens = nhefs$cens)
    table(qsmk = nhefs$qsmk, cens = nhefs$cens) %>% prop.table(margin = 1)
  5. How would you see if the data support the age --> C and education --> C arrows? Describe, but don’t actually perform this analysis.

  6. Fit an appropriate model for the probability of censoring, and call this prob_cens_mod. For simplicity, assume that a quadratic relationship for the quantitative variables gives a good fit. Make sure to wrap categorical variables inside factor() in your model formula.

  7. Add a new variable to your dataset called weights_cens that contains the IP weights for censoring. (Note that in R, the default for logistic regression is to model the probability of the outcome variable equaling 1.)

  8. Add a final weight variable to your dataset called weight_TC.




Part 2: Fitting MSMs

Fit the following two MSMs (\(A\) is qsmk, and for sex, 0 indicates males and 1 indicates females):

\[ E[Y^{a, c=0}] = \beta_0 + \beta_1 a \] \[ E[Y^{a, c=0}] = \beta_0 + \beta_1 a + \beta_2\hbox{sex} + \beta_3 a\times\hbox{sex} \]

Note: In our previous analysis where we worked with nhefs_subs, we had actually fit the following:

\[ E[Y^{a} \mid C = 0] = \beta_0 + \beta_1 a \] \[ E[Y^{a} \mid C = 0] = \beta_0 + \beta_1 a + \beta_2\hbox{sex} + \beta_3 a\times\hbox{sex} \]

Refit these models here using the relevant weights. How do your results compare?