Topic 13 Applied Analysis: IPW
Learning Goals
- Conduct and interpret results from an appropriate IPW analysis to estimate causal effects and effect modification of causal effects.
Slides from today are available here.
Analysis
You can download a template RMarkdown file to start from here.
Data context and research questions
Research questions: Does a policy that delays the tenure process (for new parents) actually help tenure outcomes? Does it help men and women equally?
The data and full codebook are available on Moodle (aer_primarysample.csv
and ReadMe.pdf
). An abbreviated codebook is below:
Treatment and primary outcome:
gncs
: Is a gender-neutral clock-stopping policy in place at this person’s school? (treatment)tenure_policy_school
: Did this person get tenure at the university with the clock-stopping policy? (primary outcome)
Additional outcomes (mediators):
top_pubsX
: cumulative number of peer-reviewed publications in top-5 journals by year X (3, 5, 7, and 9) since PhD completionPUBSX
: cumulative number of non-top-5 peer-reviewed publications by year X (3, 5 7, and 9) since PhD completion
Additional covariates:
female
: Is the faculty member a female?pol_u
: policy university identifier (values are randomized for privacy reasons)pol_job_start
: identifier for year the job at the policy university started (values are randomized for privacy reasons)phd_rank
: 1-5 categorical variable of PhD program tier based on placements into the top-50 departments in our samplephd_rank_miss
: indicator for missing PhD informationpost_doc
: indicator for doing a post-doc before the first tenure-track positionug_students
: number of undergraduate students at the university, in thousandsgrad_students
: number of graduate students at the university, in thousandsfaculty
: number of faculty at the university (all disciplines), in hundredsfull_av_salary
: average salary of full professors at the university, in thousandsassist_av_salary
: average salary of assistant professors at the university, in thousands,revenue
: annual revenue of the university, in 10,000,000sfemale_ratio
: fraction of the faculty at the university who are femalefull_ratio
: fraction of the faculty at the university who are full professorsRANK
: equal to 1 for top-10 departments and 2 for all other departmentsmax_csstops
: number of children born within five years of PhD completion
Loading data and packages
You’ll need to install the survey
package first.
library(survey) # install.packages("survey")
library(tidyverse)
library(splines) # Make sure that the splines package is installed before running
read_csv("aer_primarysample.csv")
tenure <-
tenure %>%
tenure <- mutate(
faculty = ifelse(faculty_miss==1, NA, faculty),
revenue = ifelse(revenue_miss==1, NA, revenue),
female_ratio = ifelse(female_ratio_miss==1, NA, female_ratio),
full_ratio = ifelse(full_ratio_miss==1, NA, full_ratio),
phd_rank = ifelse(phd_rank_miss==1, NA, phd_rank),
max_csstops = ifelse(max_csstops_miss==1, NA, max_csstops)
)
Part 1: Review outcome regression results
Previously we used outcome regression to estimate the overall effect of tenure clock stopping policies and the effects for males and females. Models from one of our analyses are displayed here.
Review the results of this analysis by interpreting the relevant coefficients and confidence intervals on the odds ratio scale.
# Overall ACE
glm(tenure_policy_school ~ gncs + ns(female_ratio, 3) + ns(full_ratio, 3) + factor(max_csstops)*female, data = tenure, family = "binomial")
mod_overall <-summary(mod_overall)
confint(mod_overall)
# Subgroup ACEs
glm(tenure_policy_school ~ gncs*female + ns(female_ratio, 3) + ns(full_ratio, 3) + factor(max_csstops)*female, data = tenure, family = "binomial")
mod_subgroup <-summary(mod_subgroup)
confint(mod_subgroup)
Part 2: Propensity score modeling
We will assume that the revenue
, female_ratio
, full_ratio
, and max_csstops
are sufficient (proxies for) variables needed to achieve conditional exchangeability. (female
was included in the outcome regresion models for assessing subgroup effects but wasn’t determined to be part of a noncausal path in our causal graph analysis.)
Fit an appropriate propensity score model called
ps_mod
. Use visualizations to inform the construction of your model. (Just focus on appropriately handling nonlinearity.)Use your model to compute appropriate weights, and add these weights to the dataset.
Use “before and after” visualizations to compare balance of key variables before and after weighting. You’ll need to use
factor(gncs)
to make your visualizations–this forces R to viewgncs
as a categorical variable.- Note: If any distributions look dissimilar after weighting, this is an indication that the propensity score model may be misspecified. We won’t address this in our analysis today for time reasons, but in practice, more flexible modeling techniques would be explored.
Coding notes:
The code below will be useful for exploring nonlinearity in covariate-treatment relationships. The blue smooth reflects observed data trends, and the red smooth shows the predictions from a logistic regression model with a specific model formula.
- Formula
y ~ x
: Covariate is included as a linear term - Formula
y ~ ns(x,3)
: Covariate is modeled with a flexible nonlinear function (a “spline” with 3 degrees of freedom–don’t worry about the details of this)
The model predictions should generally align with observed data trends. If they don’t line up well using formula y ~ x
, update the formula to y ~ ns(x, 3)
to see if there is an improvement.
ggplot(DATA, aes(x = COVARIATE, y = OUTCOME)) +
geom_point() +
geom_smooth(se = FALSE, color = "blue", method = "loess") +
geom_smooth(formula = INSERT_MODEL_FORMULA, method = "glm",
method.args = list(family="binomial"),
se = FALSE, color = "red"
)
The (incomplete) code below is useful for computing appropriate weights:
DATA %>%
DATA <- mutate(
ps = predict(ps_mod, newdata = DATA, type = "response"),
ipw = case_when(
==1 ~ ???,
TREAT_VAR==0 ~ ???
TREAT_VAR
) )
You can incorporate weights into most ggplot2
figures by adding weight
to the aes
thetics:
aes(..., weight = ip_weight)
Part 3: Modeling with IP weights
We can use the IP weights constructed above in a weighted regression model to estimate causal effects.
Note: The
svydesign()
andsvyglm()
functions from thesurvey
package are used to ensure that the weights are treated appropriately. (An updated (“robust”) standard error calculation method is used to appropriately account for the fact that the IP weights are estimated and have uncertainty.)Note that we are removing cases with missing values for the weights. (These cases had missing values in the covariates used in the propensity score model.) You’ll think about whether this biases the analysis in Part 4.
The code for estimating the overall ACE is complete. Adapt this code to estimate subgroup ACEs in males and females.
Using both the confidence intervals and effect magnitudes, discuss the results of your analysis in a contextually meaningful way. (Tie these results back to the research questions.)
How do confidence interval widths compare to those from outcome regression?
# Remove cases with missing weights
tenure %>%
tenure_subs <- filter(!is.na(ipw))
# Set up information about weights
svydesign(ids = ~0, weights = tenure_subs$ipw, data = tenure_subs)
design <-
# Fit a marginal structural model to estimate overall ACE
svyglm(
overall_fit <-~ gncs,
tenure_policy_school data = tenure_subs,
design = design,
family = "quasibinomial"
)summary(overall_fit)
confint(overall_fit)
# Adapt the svyglm() code above to include an interaction between treatment and female
subgroup_fit <-