A visual primer to instrumental variables

By Kangho Suh

When assessing the possible efficacy or effectiveness of an intervention, the main objective is to attribute changes you see in the outcome to that intervention alone. That is why clinical trials have strict inclusion and exclusion criteria, and frequently use randomization to create “clean” populations with comparable disease severity and comorbidities. Randomization ensures that the treatment and control populations match not only on observable (e.g., demographic) characteristics, but also on unobservable or unknown confounders. As such, the difference in results between the groups can be interpreted as the effect of the intervention alone and not of some other factor. This avoids the problem of selection bias, which occurs when the exposure is related to observable and unobservable confounders, and which is endemic to observational studies.

In an ideal research setting (ethics aside), we could clone individuals, give one clone the new treatment and the other a placebo or standard of care, and assess the change in health outcomes. Or we could give an individual the new treatment, study the effect, go back in time through a DeLorean, and repeat the process with the same individual, this time with a placebo or other control intervention. Obviously, neither of these is a practical option. Currently, the best strategy is the randomized controlled trial (RCT), but financial, ethical, and time constraints limit the number of interventions that can be studied this way. Also, the exclusion criteria necessary to arrive at these “clean” study populations sometimes mean that they do not represent the real-world patients who will use these new interventions.

For these reasons, observational studies present an attractive alternative to RCTs by using electronic health records, registries, or administrative claims databases. Observational studies have their own drawbacks, such as the selection bias detailed above. We try to address some of these issues by controlling for covariates in statistical models or by using propensity scores to create comparable study groups with similar distributions of observable covariates (check out the blog entry on using propensity scores by my colleague Lauren Strand). Another method that has been gaining popularity in health services research is an econometric technique called instrumental variables (IV) estimation. In fact, two of my colleagues and the director of our program (Mark Bounthavong, Blythe Adamson, and Anirban Basu, respectively) wrote a primer on the use of IV here.

In their article, Mark, Blythe, and Anirban explain the endogeneity issue that arises when the treatment variable is associated with the error term in a regression model. For those of you who might still be confused (I certainly was for a long time!), I’ll use a simple figure1 that I found in a textbook to explain how IVs work.

[Figure: Instrumental variables]

1 p. 147 from Kennedy, Peter. A Guide to Econometrics, 6th Edition. Oxford: Blackwell Publishing, 2008. Print.

The figure uses circles to represent the variation in the variables we are interested in: the treatment variable (X), the outcome variable (Y), and the instrumental variable (Z). First, focus on the treatment and outcome circles. Some of the variability in the outcome is explained by the treatment variable (i.e., the treatment effect); this is indicated by the overlap between the two circles (red, blue, and purple). The remaining green section of the outcome circle represents the error term (ϵ) in a statistical model. However, if treatment and ϵ are not independent, due to, for example, selection bias, some of the green spills over into the treatment circle, creating the red section. Our results are now biased, because a portion (red) of the variation in our outcome is attributed to both treatment and ϵ.
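To make the picture concrete, here is a minimal simulation sketch in R (hypothetical numbers of my own choosing, not taken from the Kennedy text): an unobserved confounder u drives both the treatment x and the outcome y, so x overlaps with the error term and ordinary least squares overstates the true effect.

```r
# A minimal simulation sketch (hypothetical numbers, not from the textbook).
# An unobserved confounder u drives both treatment x and outcome y, so x is
# correlated with the error term and OLS overstates the true effect of 1.0.
set.seed(42)
n <- 10000
u <- rnorm(n)                       # unobserved confounder (the "spillover")
z <- rnorm(n)                       # candidate instrument, independent of u
x <- 0.8 * z + 0.6 * u + rnorm(n)   # treatment: driven by both z and u
y <- 1.0 * x + 1.5 * u + rnorm(n)   # outcome: true treatment effect is 1.0

ols_fit <- lm(y ~ x)
coef(ols_fit)["x"]                  # noticeably above 1.0 (biased upward)
```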

Enter the instrumental variable Z. It must meet two criteria: 1) it must be strongly correlated with treatment (a large overlap between the instrument and treatment circles), and 2) it must not be correlated with the error term (no overlap with the red or green sections). In the first stage, we regress treatment on the instrument and obtain the predicted values of treatment (orange and purple). We then regress the outcome on the predicted values of treatment to get the treatment effect (purple). Because we have used only the exogenous part of our treatment X to explain Y, our estimate is unbiased.
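Continuing the sketch above, the two stages can be run by hand with lm(), or in one step with ivreg() from the AER package (one of several 2SLS routines that would work here). The second-stage coefficient on the fitted values recovers the true effect; note that only the one-step fit reports correct second-stage standard errors.

```r
# Two-stage least squares on the simulated data above (a sketch, not the
# article's code). Stage 1: regress treatment on the instrument.
stage1 <- lm(x ~ z)
x_hat  <- fitted(stage1)            # the exogenous ("orange + purple") part of x

# Stage 2: regress the outcome on the stage-1 fitted values.
stage2 <- lm(y ~ x_hat)
coef(stage2)["x_hat"]               # close to the true effect of 1.0

# Equivalent one-step fit; ivreg() from the AER package also reports the
# correct second-stage standard errors, which the manual approach does not.
library(AER)                        # install.packages("AER") if needed
iv_fit <- ivreg(y ~ x | z)
summary(iv_fit)
```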

Now that you can see the benefit of IV estimators visually, some of the drawbacks may be visible as well. The amount of information used to estimate the treatment effect has shrunk considerably: it went from the full overlap between treatment and outcome (red, blue, and purple) to just the purple area. As a result, while the IV estimator may be unbiased, it has more variance than a simple OLS estimator. One way to mitigate this limitation is to find an instrument that is highly correlated with treatment, making the purple area as large as possible.
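The variance cost is easy to see by rerunning the simulation above with an instrument that barely moves the treatment (a tiny purple area). The IV estimate stays centered on the truth, but its standard error balloons relative to both OLS and the strong-instrument fit.

```r
# Same simulated setup as above, but with a much weaker instrument.
x_weak  <- 0.1 * z + 0.6 * u + rnorm(n)          # z now explains little of x
y_weak  <- 1.0 * x_weak + 1.5 * u + rnorm(n)
iv_weak <- ivreg(y_weak ~ x_weak | z)

coef(summary(ols_fit))["x", "Std. Error"]        # OLS: smallest SE, but biased
coef(summary(iv_fit))["x", "Std. Error"]         # IV, strong instrument
coef(summary(iv_weak))["x_weak", "Std. Error"]   # IV, weak instrument: much larger
```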

A more concerning limitation of IV estimation is the interpretability of results, especially in the context of treatment effect heterogeneity. I will write another blog post about this issue and how it can be addressed if you have a continuous IV, using a method called person-centered treatment (PeT) effects that Anirban created. Stay tuned!

Reminders About Propensity Scores

Propensity score (PS)-based models are everywhere these days. While these methods are useful for controlling for observed confounders in observational data and for reducing dimensionality in big datasets, it is imperative that analysts use good judgment when applying and interpreting PS analyses. This is the topic of my recent methods article in ISPOR’s Value and Outcomes Spotlight.

I became interested in PS methods during my Master’s thesis work on statin drug use and heart structure and function, which has just been published in Pharmacoepidemiology and Drug Safety. To estimate long-term associations between these two variables, I used the Multi-Ethnic Study of Atherosclerosis (MESA), an observational cohort of approximately 6,000 individuals with rich covariates, subclinical measures of cardiovascular disease, and clinical outcomes over 10+ years of follow-up. We initially used traditional multivariable linear regression to estimate the association between statin initiation and progression of left ventricular mass over time, but found that PS methods allowed for better control of measured confounding. After we generated PS for the probability of starting a statin, we used matching procedures to pair initiators with non-initiators and estimated the average treatment effect in the treated (ATT). Estimates from both traditional regressions and PS-matching procedures found a small, dose-dependent protective effect of statins against left ventricular structural dysfunction. This very modest association contrasts with findings from much smaller, short-term studies.

I did my original analyses in Stata, which has a few commands for PS analysis, including the user-written psmatch2 and the built-in teffects. My analysis used psmatch2, which is generally considered inferior to teffects because it does not provide proper standard errors. I got around this limitation, however, by bootstrapping confidence intervals, which were all conservative compared with the teffects confidence intervals.

Figure 1: Propensity score overlap among 835 statin initiators and 1559 non-initiators in the Multi-Ethnic Study of Atherosclerosis (MESA)

Recently, I gathered the gumption to redo some of the aforementioned analysis in R. Coding in R is a newly acquired skill of mine, and I wanted to harness some of R’s functionality to build nicer figures. I found this R tutorial from Simon Ejdemyr on propensity score methods in R to be particularly useful. Rebuilding my propensity scores with a logistic model that included approximately 30 covariates and 2,389 participant observations, I first wanted to check the region of common support. The region of common support is the overlap between the distributions of PS for the exposed versus the unexposed, which indicates the comparability of the two groups. Sometimes, despite fitting the model with every variable you can, PS overlap is quite bad and matching can’t be done. But I was able to get acceptable overlap on values of PS for statin initiators and non-initiators (see Figure 1).

Using the R package MatchIt to do nearest-neighbor matching with replacement, my matched dataset was reduced to 1,670 observations, in which all statin initiators found a match. I also checked covariate balance conditional on PS in the statin initiator and non-initiator groups; examples are in Figure 2, and a sketch of the full workflow follows below. In these plots, the LOWESS smoother is effectively calculating a mean of the covariate level at each value of the propensity score. I expect the means for statin initiators and non-initiators to be similar, so the smooths should be close. At the ends of the age distribution I see some separation, which is likely to be normal tail behavior. Formal statistical tests can also be used to check covariate balance in the newly matched groups.
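For readers who want to try this workflow themselves, here is a minimal sketch of the steps described above. The data frame mesa and the variable names (statin_init, sbp, and so on) are hypothetical placeholders rather than actual MESA variable names, and the covariate list is abbreviated from the roughly 30 used in the analysis.

```r
# A minimal sketch of the matching workflow using MatchIt. The data frame
# `mesa` and the variable names are hypothetical placeholders.
library(MatchIt)                    # install.packages("MatchIt") if needed

# 1. Fit the propensity score model and do nearest-neighbor matching
#    with replacement on the estimated scores.
m_out <- matchit(statin_init ~ age + sex + sbp + bmi + ldl + smoker,
                 data = mesa, method = "nearest", replace = TRUE)
summary(m_out)                      # covariate balance before/after matching
matched <- match.data(m_out)        # matched sample, with matching weights

# 2. Region of common support: overlap of the PS distributions by group.
ps <- m_out$distance
hist(ps[mesa$statin_init == 1], freq = FALSE, col = rgb(1, 0, 0, 0.4),
     main = "Propensity score overlap", xlab = "Propensity score")
hist(ps[mesa$statin_init == 0], freq = FALSE, col = rgb(0, 0, 1, 0.4), add = TRUE)

# 3. Covariate balance conditional on the PS, via LOWESS smooths
#    (initiator and non-initiator smooths should track each other closely).
plot(ps, mesa$age, col = ifelse(mesa$statin_init == 1, "red", "blue"),
     pch = 16, cex = 0.4, xlab = "Propensity score", ylab = "Age")
lines(lowess(ps[mesa$statin_init == 1], mesa$age[mesa$statin_init == 1]), col = "red")
lines(lowess(ps[mesa$statin_init == 0], mesa$age[mesa$statin_init == 0]), col = "blue")
```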

Figure 2: LOWESS smooth of covariate balance for systolic blood pressure (left) and age (right) across statin initiators and non-initiator groups (matched data)

Please see my website for additional info about my work.

Book Review

Difficult Choices Between What is Best for Science or Best for Our Career


RIGOR MORTIS: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions
By Richard Harris
288 pages, Perseus Books Group, List Price $28

In “Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions,” Richard Harris provides compelling evidence, through a series of stories and statistics, that medical research is plagued by unnecessary errors despite our technology, effort, money, and passion to make a positive impact. This review takes the perspective of a graduate student in health sciences, with the aim of assessing the value of Rigor Mortis for the next generation of scientists. While the book focuses more on sloppy biological science, Harris’s concerns are equally valid in the areas of data science and disease modeling.

Richard Harris, a journalist at NPR who writes about science, has started an important conversation about the broad impact of our current scientific culture: we are publishing too many scientific studies that may have false or unreproducible results. Graduate students in health science research or related fields should not be surprised by Harris’s premise. The pressure to produce a large quantity of publications, instead of fewer, higher-quality papers, weighs on every grad student in the world.

In 2017, the CHOICE Institute asked its members to read Rigor Mortis and discuss its implications for our field. One emerging theme was that trainees need to be able to report unethical behaviors without fearing adverse consequences. While required annual courses from the University of Washington Biomedical Research Integrity Series challenge students to reconsider their own personal conflicts of interest in publishing research, this remains a difficult ideal to implement in the face of other pressures. Around the lunch table in our grad student lounge, the book sparked an uncomfortable conversation about multiple testing during regression model fitting, and the long stretch of grey area between a dusty, pre-specified analysis plan and our shiny, new hypothesis-generating exploratory findings.

Harris’s storytelling reminded me of a book I love by David Quammen called “Spillover.” Both Rigor Mortis and Spillover are written by distinguished journalists about very complicated and technical problems. Using language friendly to a New York Times reader, both authors include conversations with scientists from around the world, sharing their stories so that the layperson can understand.

Both books highlight a common dilemma in academia: Should I do what is best for science or what is best for my career? Further, is this an incentives problem or a system problem? The current structure and business of research guide us to make choices that will enhance our career, while science is still often perceived as an altruistic pursuit for the greater good. The book offers a challenge to academic researchers: who among us can claim “no conflict of interest”?

“Attending the panel that rejected his paper proposal, the grad student inwardly trashes each presenter’s research.” — Lego Grad Student

Applying the book’s messages to health economics and outcomes research

I experienced this dilemma when deciding whether to share my HIV disease model. Scientific knowledge and methodology should be completely transparent, yet the software code that implements these techniques is intellectual property that we should not necessarily give away for free. My dilemma isn’t unique. Disease modelers everywhere struggle with this question: should we post our Excel spreadsheet or R code online for others to review and validate, and risk having our discovery poached?

This is just one example of tension Harris highlights in his book, and why it is so complex to change our current scientific culture. Scientific advancement is ideally a collective good, but individuals will always need personal incentives to innovate.

Key book takeaways for young scientists

  1. Use valid ingredients
  2. Show your work
  3. No HARKing (Hypothesizing After the Results of the study are Known)
  4. Don’t jump to conclusions (and discourage others from doing this with your results)
  5. Be tough. People may try to discredit you if your hypothesis goes against their life’s work, or for any number of reasons.
  6. Be confident in your science.
  7. Recognize the tension between your own achievement and communal scientific advancement

Further discussions for fixing a broken system

  1. If money is being wasted in biomedical science and research, how do we fix the system to save money without sacrificing incentives to produce valuable innovations? One of our CHOICE Institute graduates, Carrie Bennette, asked this very question in cancer research and you can read about her findings here.
  2. Incentives need to be changed. Academic promotions should depend not on the number of our publications but on the quality and impact of our contributions. Can we change a culture obsessed with impact factor and promote alternatives such as the h-index or Google Scholar metrics?
  3. Academic tenure systems are antiquated. How do we balance the trade-offs between hiring a post-doc and hiring a permanent staff scientist? Post-doc positions train the next generation and are cheap; however, they result in workflow discontinuity from frequent turnover. Permanent staff scientists stay for a longer period of time, but hiring them would disrupt an ingrained academic pipeline.

Conclusion

I think all students in any science-related field would benefit from reading this book. Cultural and systemic change will happen faster when we have uncomfortable conversations at the table with our colleagues and mentors. Additionally, we need to take the baton Richard Harris has passed us and start running with our generation of colleagues toward finding and implementing solutions. As our influence in our respective fields grows, so too does our responsibility.

Rigor Mortis is available in hardcover on Amazon for $18.65 and Audible for $19.95.
