Economic Evaluation Methods Part I: Interpreting Cost-Effectiveness Acceptability Curves and Estimating Costs

By Erik Landaas, Elizabeth Brouwer, and Lotte Steuten

One of the main training activities at the CHOICE Institute at the University of Washington is teaching graduate students how to perform economic evaluations of medical technologies. In this blog post series, we give a brief overview of two important economic evaluation concepts; each concept is independent and meant to stand alone. The first of this two-part series describes how to interpret a cost-effectiveness acceptability curve (CEAC) and then delves into ways of costing a health intervention. The second part of the series will describe two additional concepts: how to develop and interpret cost-effectiveness frontiers and how multi-criteria decision analysis (MCDA) can be used in Health Technology Assessment (HTA).

 

Cost-Effectiveness Acceptability Curve (CEAC)

The CEAC is a way to graphically present decision uncertainty around the expected incremental cost-effectiveness of healthcare technologies. A CEAC is created using the results of a probabilistic analysis (PA).[1] PA involves simultaneously drawing a set of input parameter values by randomly sampling from each parameter distribution, and then storing the model results. This is repeated many times (typically 1,000 to 10,000), resulting in a distribution of outputs that can be graphed on the cost-effectiveness plane. The CEAC reflects the proportion of results that are considered ‘favorable’ (i.e., cost-effective) in relation to a given cost-effectiveness threshold.

The primary goal of a CEAC graph is to inform coverage decisions among payers that are considering a new technology compared with one or more established technologies, which may include the standard of care. The CEAC enables a payer to determine, over a range of willingness-to-pay (WTP) thresholds, the probability that a medical technology is considered cost-effective in comparison to its appropriate comparator (e.g., usual care), given the information available at the time of the analysis. A WTP threshold is generally expressed in terms of societal willingness to pay for an additional life year or quality-adjusted life year (QALY) gained. In the US, WTP thresholds typically range between $50,000 and $150,000 per QALY.

The X-axis of a CEAC represents the range of WTP thresholds. The Y-axis represents the probability of each comparator being cost-effective at a given WTP threshold, and ranges between 0% and 100%. Thus, it simply reflects the proportion of simulated ICERs from the PA that fall below the corresponding thresholds on the X-axis.
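
To make the construction concrete, here is a minimal R sketch of how a CEAC can be computed from stored PA output. The incremental costs and QALYs below are simulated placeholders standing in for real model results, and the net-monetary-benefit rule used here matches the "ICER below the threshold" criterion when the new technology is both more costly and more effective.

```r
# Hypothetical PA output: incremental costs and QALYs of a new technology
# versus its comparator, one row per simulation.
set.seed(2024)
n_sim    <- 5000
inc_cost <- rnorm(n_sim, mean = 12000, sd = 4000)
inc_qaly <- rnorm(n_sim, mean = 0.15,  sd = 0.10)

wtp <- seq(0, 200000, by = 5000)   # range of WTP thresholds on the X-axis

# CEAC: for each threshold, the proportion of simulations in which the new
# technology is cost-effective (incremental net monetary benefit > 0).
prob_ce <- sapply(wtp, function(k) mean(k * inc_qaly - inc_cost > 0))

plot(wtp, prob_ce, type = "l", ylim = c(0, 1),
     xlab = "Willingness to pay per QALY ($)",
     ylab = "Probability cost-effective")
```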

Figure 1. The Cost-Effectiveness Acceptability Curve


Coyle, Doug, et al. “Cost-effectiveness of new oral anticoagulants compared with warfarin in preventing stroke and other cardiovascular events in patients with atrial fibrillation.” Value in Health 16.4 (2013): 498-506.

Figure 1 shows CEACs for five different drugs, making it easy for the reader to see that at the lower end of the WTP threshold range (i.e., $0 – $20,000 per QALY), warfarin has the highest probability of being cost-effective (or in this case “optimal”). At WTP values above $20,000 per QALY, dabigatran has the highest probability of being cost-effective. All the other drugs have a lower probability of being cost-effective than warfarin and dabigatran at every WTP threshold. The cost-effectiveness acceptability frontier in Figure 1 follows along the top of all the curves and shows directly which of the five technologies has the highest probability of being cost-effective at various levels of the WTP threshold.

To the extent that the unit price of the technology influences the decision uncertainty, a CEAC can offer insights to payers as well as manufacturers as they consider a value-based price. For example, a lower unit price for the drug may lower the ICER and, all else equal, this increases the probability that the new technology is considered cost-effective at a given WTP threshold. Note that when new technologies are priced such that the ICER falls just below the WTP for a QALY (e.g., an ICER of $99,999 when the WTP is $100,000), the decision uncertainty tends to be substantial, often around 50%. If decision uncertainty is perceived to be ‘unacceptably high’, it can be recommended to collect further information to reduce decision uncertainty. Depending on the drivers of decision uncertainty, for example in the case of stochastic uncertainty in the efficacy parameters, performance-based risk agreements (PBRAs) or managed entry schemes may be appropriate tools to manage the risk.

Cost estimates

The numerator of most economic evaluations for health is the cost of a technology or intervention. There are several ways to arrive at that cost, and choice of method depends on the context of the intervention and the available data.

Two broadly categorized methods for costing are the bottom-up method and the top-down method. These methods, described below, are not mutually exclusive and may complement each other, although they often do not produce the same results.

Table: Bottom-up versus top-down costing approaches

Source of Table: Mogyorosy Z, Smith P. The main methodological issues in costing health care services: a literature review. 2005.

The bottom-up method is also known as the ingredients approach or micro-costing. In this method, the analyst identifies all the items necessary to complete an intervention, such as medical supplies and clinician time, and adds them up to estimate the total cost. The main categories to consider when calculating costs via the bottom-up method are medical costs and non-medical costs. Medical costs can be direct, such as the supplies used to perform a surgery, or indirect, such as the food and bed used for inpatient care. Non-medical costs often include costs to the patient, such as transportation to the clinic or caregiver costs. The categories used when estimating the total cost of an intervention will depend on the perspective the analyst takes (perspectives include patient, health system, or societal).

The bottom-up approach can be completed prospectively or retrospectively, and can be helpful for planning and budgeting. Because the method identifies and values each input, it allows for a clear breakdown as to where dollars are being spent. To be accurate, however, one must be able to identify all the necessary inputs for an intervention and know how to value capital inputs like MRI machines or hospital buildings. The calculations may also become unwieldy on a very large scale. The bottom-up approach is often used in global health research, where medical programs or governmental agencies supply specific items to implement an intervention, or in simple interventions where there are only a few necessary ingredients.

The top-down estimation approach takes the total cost of a project and divides it by the number of service units generated. In some cases, this is done by simply looking at the budget for a program or an intervention and then dividing that total by the number of patients. The top-down approach is useful because it is a simple, intuitive measure that captures the actual amount of money spent on a project and the number of units produced, particularly for large projects or organizations. Compared to the bottom-up approach, the top-down approach can be much faster and cheaper. The top-down approach can only be used retrospectively, however, and may not allow for a breakdown of how the money was spent or be able to identify variations between patients.
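
As a toy illustration of the two approaches, the sketch below values a short list of hypothetical ingredients and contrasts the result with a simple budget-divided-by-patients calculation; all quantities, unit costs, and totals are made up.

```r
# Bottom-up (micro-costing): identify each ingredient, value it, and sum.
ingredients <- data.frame(
  item      = c("clinician time (hours)", "test kit", "clinic space (hours)"),
  quantity  = c(0.5, 1, 0.5),
  unit_cost = c(120, 15, 40)
)
cost_bottom_up <- sum(ingredients$quantity * ingredients$unit_cost)  # per patient

# Top-down (gross costing): divide total program spending by service units.
program_budget  <- 250000
patients_served <- 3100
cost_top_down   <- program_budget / patients_served                  # per patient
```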

While the final choice will depend on several factors, it makes sense to try and think through (or model) which of the cost inputs are likely to be most impactful on the model results. For example, the costs of lab tests may most accurately be estimated by a bottom-up costing approach. However, if these lab costs are likely to be a fraction of the cost of treatment, say, a million-dollar cure for cancer, then going through the motions of a bottom-up approach may not be the most efficient way to get your PhD project done on time. In other cases, however, a bottom-up approach may provide crucial insights that move the needle on the estimated cost-effectiveness of medical technologies, particularly in settings where a lack of existing datasets is limiting the potential of cost-effectiveness studies to inform decisions on the allocation of scarce healthcare resources.

[1] Fenwick, Elisabeth, Bernie J. O’Brien, and Andrew Briggs. “Cost-effectiveness acceptability curves – facts, fallacies and frequently asked questions.” Health Economics 13.5 (2004): 405-415.

Commonly Misunderstood Concepts in Pharmacoepidemiology

By Erik J. Landaas, MPH, PhD Student and Naomi Schwartz, MPH, PhD Student

 

Epidemiologic methods are central to the academic and research endeavors at the CHOICE Institute. The field of epidemiology fosters the critical thinking required for high-quality medical research. Pharmacoepidemiology is a sub-field of epidemiology and has been around since the 1970s. One of the driving forces behind the establishment of pharmacoepidemiology was the thalidomide disaster. In response to this tragedy, laws were enacted that gave the FDA authority to evaluate the efficacy of drugs. In addition, drug manufacturers were required to conduct clinical trials to provide evidence of a drug’s efficacy. This spawned a new and important body of work surrounding drug safety, efficacy, and post-marketing surveillance.[i]

In this article, we break down three of the more complex and often misunderstood concepts in pharmacoepidemiology: immortal time bias, protopathic bias, and drug exposure definition and measurement.

 

Immortal Time Bias

In pharmacoepidemiology studies, immortal time bias typically arises when the determination of an individual’s treatment status involves a delay or waiting period during which follow-up time is accrued. Immortal time is a period of follow-up during which, by design, the outcome of interest cannot occur. For example, the finding that Oscar winners live longer than non-winners is a result of immortal time bias: in order for an individual to win an Oscar, he/she must live long enough to receive the award. A pharmacoepidemiology example of this is depicted in Figure 1. A patient who receives a prescription appears to survive longer because he/she must live long enough to receive the prescription, while a patient who does not receive a prescription has no such survival requirement. The most common way to avoid immortal time bias is to use a time-varying exposure variable, which allows subjects to contribute both unexposed person-time (during the waiting period) and exposed person-time.
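
As an illustration of that fix, here is a minimal R sketch using the survival package on simulated data; the cohort, follow-up times, and dispensing dates are all made up, and there is no true treatment effect, so the naive analysis should look spuriously protective while the time-varying analysis should sit near the null.

```r
library(survival)
set.seed(42)

# Hypothetical cohort: days from cohort entry to first dispensing (NA = never
# dispensed), total follow-up time in days, and an event indicator.
n        <- 200
rx_time  <- ifelse(runif(n) < 0.5, round(runif(n, 1, 180)), NA)
fup_time <- round(runif(n, 30, 730))
died     <- rbinom(n, 1, 0.3)
cohort   <- data.frame(id = 1:n, rx_time, fup_time, died)
# Only dispensings that occur during follow-up count as exposure.
cohort$rx_time[!is.na(cohort$rx_time) & cohort$rx_time >= cohort$fup_time] <- NA

# Naive (biased) analysis: anyone who ever filled a prescription is treated as
# exposed from day 0, so the wait for the prescription becomes immortal time.
cohort$ever_exposed <- !is.na(cohort$rx_time)
fit_naive <- coxph(Surv(fup_time, died) ~ ever_exposed, data = cohort)

# Time-varying exposure: split follow-up at the dispensing date so the waiting
# period is counted as unexposed person-time (start/stop format).
tv <- tmerge(cohort, cohort, id = id, outcome = event(fup_time, died))
tv <- tmerge(tv, cohort, id = id, exposed = tdc(rx_time))
fit_tv <- coxph(Surv(tstart, tstop, outcome) ~ exposed, data = tv)
```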

 

Figure 1. Immortal Time Bias


Lévesque, Linda E., et al. “Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes.” BMJ 340 (2010): b5087.

Protopathic Bias or Reverse Causation

Protopathic bias occurs when a drug of interest is initiated to treat symptoms of the disease under study before it is diagnosed. For example, early symptoms of inflammatory bowel disease (IBD) are often consistent with the indications for prescribing proton pump inhibitors (PPIs). Thus, many individuals who develop IBD have a history of PPI use. A study to investigate the association between PPIs and subsequent IBD would likely conclude that taking PPIs causes IBD when, in fact, the IBD was present (but undiagnosed) before the PPIs were prescribed.  This scenario is illustrated by the following steps:

  • Patient has early symptoms of an underlying disease (e.g. acid reflux)
  • Patient goes to his/her doctor and gets a drug to address symptoms (e.g. PPI)
  • Patient goes on to develop a diagnosis of having IBD (months or even years later)

It is easy to conclude from the above scenario that PPIs cause IBD; however, the acid reflux was actually a manifestation of underlying IBD that had not yet been diagnosed. Protopathic bias occurs in this case because of the lag time between first symptoms and diagnosis. One effective way to address protopathic bias is to exclude exposures that occur during the prodromal period of the disease of interest.
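
One simple way to implement such an exclusion is to ignore any dispensings that fall within a lag window immediately before diagnosis. The sketch below uses hypothetical records and an arbitrary 180-day window.

```r
# Hypothetical dispensing and diagnosis records; the 180-day lag window is an
# arbitrary choice for illustration.
lag_days <- 180

dispensings <- data.frame(
  id        = c(1, 1, 2, 3),
  disp_date = as.Date(c("2015-01-10", "2016-06-01", "2016-03-15", "2014-11-20"))
)
diagnoses <- data.frame(
  id      = c(1, 2, 3),
  dx_date = as.Date(c("2016-09-01", "2016-05-01", "2017-02-01"))
)

d <- merge(dispensings, diagnoses, by = "id")

# A dispensing counts toward exposure only if it occurs before the lag window,
# i.e., more than 180 days before the diagnosis date.
d$counts_as_exposure <- d$disp_date <= (d$dx_date - lag_days)
exposed_ids <- unique(d$id[d$counts_as_exposure])
```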

 

Drug Exposure Definition and Measurement 

Defining and classifying exposure to a drug is critical to the validity of pharmacoepidemiology studies. Most pharmacoepidemiology studies use proxies for drug exposure, because it is often impractical or impossible to measure directly (e.g. observing a patient take a drug, monitoring blood levels). In lieu of actual exposure data, exposure ascertainment is typically based on medication dispensing records. These records can be ascertained from electronic health records, pharmacies, pharmacy benefit managers (PBMs), and other available healthcare data repositories. Some of the most comprehensive drug exposure data are available among Northern European countries and large integrated health systems such as Kaiser Permanente in the United States. Some strengths of using dispensing records to gather exposure data are:

  • Easy to ascertain and relatively inexpensive
  • No primary data collection
  • Often available for large sample sizes
  • Can be population based
  • No recall or interviewer bias
  • Linkable to other types of data such as diagnostic codes and labs

Limitations of dispensing records as a data source include:

  • Completeness can be an issue
  • Usually does not capture over-the-counter (OTC) drugs
  • Dispensing does not guarantee ingestion
  • Often lacks indication for use
  • Must make some assumptions to calculate dose and duration of use

Some studies collect drug exposure data using self-report methods (e.g., interviews or surveys). These methods are useful when the drug of interest is OTC and thus not captured by dispensing records. However, self-reported data are subject to recall bias and require additional considerations when interpreting results. Alternatively, some large epidemiologic studies require patients to bring in all their medications when they do their study interviews (e.g., the “brown bag” approach). This can provide a more reliable method of collecting medication information than self-report.

It is also important to consider the risk of misclassification of exposure. When interpreting results, remember that differential misclassification (misclassification that differs between those with and without disease) can bias the measure of association either away from or toward the null. In contrast, non-differential misclassification (unrelated to the occurrence or presence of disease) tends to shift the measure of association toward the null. For further guidance on defining drug exposure, see Figure 2.

 

Figure 2. Checklist: Key considerations for defining drug exposure

Velentgas, Priscilla, et al., eds. Developing a protocol for observational comparative effectiveness research: a user’s guide. Government Printing Office, 2013.
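
Returning to the misclassification point above, a quick simulation can illustrate why non-differential misclassification tends to pull estimates toward the null; the prevalence, true odds ratio, sensitivity, and specificity below are all hypothetical.

```r
set.seed(7)
n <- 100000
exposed <- rbinom(n, 1, 0.3)
disease <- rbinom(n, 1, plogis(-2 + log(2) * exposed))   # true OR = 2

# Misclassify exposure with 80% sensitivity and 95% specificity, the same for
# cases and non-cases (non-differential).
measured <- ifelse(exposed == 1, rbinom(n, 1, 0.80), rbinom(n, 1, 0.05))

odds_ratio <- function(e, d) {
  (sum(e == 1 & d == 1) * sum(e == 0 & d == 0)) /
  (sum(e == 1 & d == 0) * sum(e == 0 & d == 1))
}
c(true     = odds_ratio(exposed, disease),
  observed = odds_ratio(measured, disease))   # observed OR attenuated toward 1
```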

As alluded to above, pharmacoepidemiology is a field with complex research methods. We hope this article clarifies these three challenging concepts.

 

 

[i] Balcik, Pinar, and Gulcan Kahraman. “Pharmacoepidemiology.” IOSR Journal of Pharmacy 6.2 (February 2016): 57-62.

Is there still value in the p-value?

Doing science is expensive, so a study that reveals significant results yet cannot be replicated by other investigators represents a lost opportunity to invest those resources elsewhere. At the same time, the pressure on researchers to publish is immense.

These are the tensions that underlie the current debate about how to resolve issues surrounding the use of the p-value and the infamous significance threshold of 0.05. The p-value was adopted in the early 20th century to indicate the probability of obtaining results at least as extreme as those observed through chance variation alone, and the 0.05 threshold has been with it since the beginning, allowing researchers to declare as significant any effect that crosses that threshold.

This threshold was selected for convenience at a time when p-values were difficult to compute. Our modern scientific tools have made the calculation so easy, however, that it is hard to defend a 0.05 threshold as anything but arbitrary. A group of statisticians and researchers is trying to rehabilitate the p-value, at least for the time being, so that we can improve the reliability of results with minimal disruption to the scientific production system. They hope to do this by changing the threshold for statistical significance to 0.005.

In a new editorial in JAMA, Stanford researcher John Ioannidis, a famous critic of bias and irreproducibility in research, has come out in favor of this approach. His argument is pragmatic. In it, he acknowledges that misunderstandings of the p-value are common: many people believe that a result is worth acting on if it is supported by a significant p-value, without regard for the size of the effect or the uncertainty surrounding it.

Rather than reeducating everyone who ever needs to interpret scientific research, then, it is preferable to change our treatment of the threshold signaling statistical significance. Ioannidis also points to the success of genome-wide association studies, which improved in reproducibility after moving to a statistical significance threshold of p < 5 x 10^-8.

As Ioannidis admits, this is an imperfect solution. The proposal has set off substantial debate within the American Statistical Association. Bayesians, for example, see it as perpetuating the same flawed practices that got us into the reproducibility crisis in the first place. In an unpublished but widely circulated article from 2017 entitled Abandon Statistical Significance [pdf warning], Blakely McShane, Andrew Gelman, and others point to several problems with lowering the significance threshold that make it unsuitable for medical research.

First, they point out that the whole idea of the null hypothesis is poorly suited to medical research. Virtually anything ingested by or done to the body has downstream effects on other processes, almost certainly including the ones that any given trial hopes to measure. Therefore, using the null hypothesis as a straw man takes away the focus on what a meaningful effect size might be and how certain we are about the effect size we calculate for a given treatment.

They also argue that the reporting of a single p-value hides important decisions made in the analytic process itself, including all the different ways that the data could have been analyzed. They propose reporting all analyses attempted, in an attempt to capture the “researcher degrees of freedom” – the choices made by the analyst that affect how the results are calculated and interpreted.

Beyond these methodological issues, lowering the significance threshold could increase the costs of clinical trials. If our allowance for Type I error is reduced by an order of magnitude, the required sample size increases by roughly 70% at 80% power, holding all other parameters equal. In a regulatory environment where it costs over a billion dollars to bring a drug to market, this need for increased recruitment could drive up costs (which would need to be passed on to the consumer) and delay the health benefits of market release for good drugs. It is unclear whether these potential cost increases will be offset by the savings of researchers producing more reliable, reproducible studies earlier in the development process.
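
A quick back-of-the-envelope check with base R's power.t.test, under standard two-arm, two-sided assumptions and an arbitrary standardized effect size of 0.5:

```r
# Per-arm sample size for a two-sided, two-sample t-test at 80% power.
n_05  <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,  power = 0.80)$n
n_005 <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.005, power = 0.80)$n
c(alpha_0.05 = n_05, alpha_0.005 = n_005, ratio = n_005 / n_05)
# The ratio is roughly 1.7, i.e., about a 70% increase per arm.
```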

It also remains to be seen whether the lower threshold’s increased sample size requirement might dissuade pharmaceutical companies from bringing products to market that have a low marginal benefit. After all, you need a larger sample size to detect smaller effects, and that requirement would only be amplified under the new significance threshold. Overall, the newly proposed significance threshold interacts with value considerations in ways that are hard to predict but potentially worth watching.

Generating Survival Curves from Study Data: An Application for Markov Models

By Mark Bounthavong

CHOICE Student Mark Bounthavong

In cost-effectiveness analysis (CEA), a lifetime horizon is commonly used to simulate the overall costs and health effects of a chronic disease. Data on mortality comparing therapeutic treatments are normally derived from survival curves or Kaplan-Meier curves published in clinical trials. However, these Kaplan-Meier curves may only provide survival data for a few months to a few years, reflecting the length of the trial.

In order to adapt these clinical trial data to a lifetime horizon for use in cost-effectiveness modeling, modelers must make assumptions about the curve and extrapolate beyond what was seen empirically. Luckily, extrapolation to a lifetime horizon is possible using a series of methods based on parametric survival models (e.g., Weibull, exponential). Performing these projections can be challenging without the appropriate data and software, which is why I wrote a tutorial that provides a practical, step-by-step guide to estimating a parametric survival model (Weibull) from a survival function for use in CEA models.

I split my tutorial into two parts, as described below.

Part 1 begins by providing a guide to:

  • Capture the coordinates of a published Kaplan-Meier curve and export the results into a *.CSV file
  • Estimate the survival function based on the coordinates from the previous step using a pre-built template
  • Generate a Weibull curve that closely resembles the survival function and whose parameters can be easily incorporated into a simple three-state Markov model

Part 2 concludes with a step-by-step guide to:

  • Incorporate the Weibull parameters into a Markov model (a minimal R sketch of this step appears after this list)
  • Compare the survival probability of the Markov model to the reference Kaplan-Meier curve to validate the method and catch any errors
  • Extrapolate the survival curve across a lifetime horizon
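
For orientation, here is a minimal R sketch (not part of the tutorial itself) of how Weibull shape and scale parameters, once estimated, can be turned into per-cycle death probabilities and run through a toy three-state Markov trace; all parameter values, including the constant progression probability, are hypothetical.

```r
# Hypothetical Weibull parameters; in practice these come from the fitted model.
shape <- 1.2      # Weibull shape
scale <- 8.0      # Weibull scale, in years
cycle <- 1        # cycle length, in years
n_cyc <- 40       # lifetime horizon, in cycles

S <- function(t) exp(-(t / scale)^shape)      # Weibull survival function

# Time-dependent probability of dying during each cycle: 1 - S(t) / S(t - u).
t_end   <- (1:n_cyc) * cycle
p_death <- 1 - S(t_end) / S(t_end - cycle)

# Toy three-state model (progression-free, progressed, dead) that applies the
# same Weibull mortality to both alive states and a constant, made-up
# progression probability.
p_prog <- 0.05
trace  <- matrix(0, nrow = n_cyc + 1, ncol = 3,
                 dimnames = list(NULL, c("pf", "prog", "dead")))
trace[1, ] <- c(1, 0, 0)
for (i in 1:n_cyc) {
  pd <- p_death[i]
  trace[i + 1, "pf"]   <- trace[i, "pf"] * (1 - pd) * (1 - p_prog)
  trace[i + 1, "prog"] <- trace[i, "prog"] * (1 - pd) +
                          trace[i, "pf"] * (1 - pd) * p_prog
  trace[i + 1, "dead"] <- trace[i, "dead"] +
                          (trace[i, "pf"] + trace[i, "prog"]) * pd
}

# Model-based overall survival, to compare against the digitized Kaplan-Meier
# curve as a validation step.
os_model <- 1 - trace[, "dead"]
```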

The tutorial requires using and transferring data across a couple of different software programs. You will need to have some familiarity with Excel to perform these parametric simulations. You should download and install the open-source software “Engauge Digitizer” developed by Mark Mitchell, which can be found here. You should also download and install the latest versions of R and RStudio to generate the parametric survival curve parameters.

Hoyle and Henley wrote a great paper on using data from a Kaplan-Meier curve to generate parameters for a parametric survival model, which can be found here. The tutorial makes use of their methods and supplemental file. Specifically, you will need to download their Excel Template to generate the parametric survival curve parameters.

I have created a public folder with the relevant files used in the tutorial here.

If you have any comments or notice any errors, please contact me at mbounth@uw.edu.

A visual primer to instrumental variables

By Kangho Suh

When assessing the possible efficacy or effectiveness of an intervention, the main objective is to attribute changes you see in the outcome to that intervention alone. That is why clinical trials have strict inclusion and exclusion criteria, and frequently use randomization to create “clean” populations with comparable disease severity and comorbidities. With randomization, the treatment and control populations should match not only on observable (e.g., demographic) characteristics, but also on unobservable or unknown confounders. As such, the difference in results between the groups can be interpreted as the effect of the intervention alone and not of some other factors. This avoids the problem of selection bias, which occurs when the exposure is related to observable and unobservable confounders, and which is endemic to observational studies.

In an ideal research setting (ethics aside), we could clone individuals and give one clone the new treatment and the other one a placebo or standard of care and assess the change in health outcomes. Or we could give an individual the new treatment, study the effect the treatment has, go back in time through a DeLorean and repeat the process with the same individual, only this time with a placebo or other control intervention. Obviously, neither of these are practical options. Currently, the best strategy is randomized controlled trials (RCTs), but these have their own limitations (e.g. financial, ethical, and time considerations) that limit the number of interventions that can be studied this way. Also, the exclusion criteria necessary to arrive at these “clean” study populations sometimes mean that they do not represent the real-world patients who will use these new interventions.

For these reasons, observational studies present an attractive alternative to RCTs by using electronic health records, registries, or administrative claims databases. Observational studies have their own drawbacks, such as the selection bias detailed above. We try to address some of these issues by controlling for covariates in statistical models or by using propensity scores to create comparable study groups that have similar distributions of observable covariates (check out the blog entry on using propensity scores by my colleague Lauren Strand). Another method that has been gaining popularity in health services research is an econometric technique called instrumental variables (IV) estimation. In fact, two of my colleagues and the director of our program (Mark Bounthavong, Blythe Adamson, and Anirban Basu, respectively) wrote a primer on the use of IV here.

In their article, Mark, Blythe, and Anirban explain the endogeneity issue that arises when the treatment variable is associated with the error term in a regression model. For those of you who might still be confused (I certainly was for a long time!), I’ll use a simple figure [1] that I found in a textbook to explain how IVs work.

Figure: Instrumental variables

[1] p. 147 from Kennedy, Peter. A Guide to Econometrics, 6th Edition. Oxford: Blackwell Publishing, 2008. Print.

The figure uses circles to represent the variation within variables that we are interested in: each circle represents the treatment variable (X), outcome variable (Y), or the instrumental variable (Z). First, focus on the treatment and outcome circles. We know that some amount of the variability in the outcome is explained by the treatment variable (i.e. treatment effect); this is indicated by the overlap between the two circles (red, blue, and purple). The remaining green section of the outcome variable represents the error (ϵ) obtained with a statistical model. However, if treatment and ϵ are not independent due to, for example, selection bias, some of the green spills over to the treatment circle, creating the red section. Our results are now biased, because a portion (red) of the variation in our outcome is attributed to both treatment and ϵ.

Enter the instrumental variable, Z. It must meet two criteria: 1) be strongly correlated with treatment (large overlap of instrument and treatment) and 2) not be correlated with the error term (no overlap with red or green). In the first stage, we regress treatment on the instrument and obtain the predicted values of treatment (orange and purple). We then regress the outcome on the predicted values of treatment to get the treatment effect (purple). Because we have only used the exogenous part of our treatment X to explain Y, our estimates are unbiased.
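
The two stages can be seen in a small simulation; the data-generating values below are arbitrary, and in practice a packaged estimator such as AER::ivreg(y ~ x | z) would be used because the manual second stage does not produce correct standard errors.

```r
set.seed(123)
n <- 5000
z <- rnorm(n)                        # instrument
u <- rnorm(n)                        # unobserved confounder
x <- 0.8 * z + 0.7 * u + rnorm(n)    # treatment, driven by both z and u
y <- 1.0 * x + 0.7 * u + rnorm(n)    # outcome; true treatment effect = 1

coef(lm(y ~ x))["x"]                 # naive OLS: biased upward by u

# Stage 1: regress treatment on the instrument and keep the predicted values.
x_hat <- fitted(lm(x ~ z))
# Stage 2: regress the outcome on the predicted treatment.
coef(lm(y ~ x_hat))["x_hat"]         # close to the true effect of 1
```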

Now that you understand the benefit of IV estimators visually, you may also see some of the drawbacks. The amount of information used to estimate the treatment effect becomes much smaller: it shrinks from the overlap between treatment and outcome (red, blue, and purple) to just the purple area. As a result, while the IV estimator may be unbiased, it has more variance than a simple OLS estimator. One way to mitigate this limitation is to use an instrument that is highly correlated with treatment, making the purple area as large as possible.

A more concerning limitation with IV estimation is the interpretability of results, especially in the context of treatment effect heterogeneity. I will write another blog post about this issue and how it can be addressed if you have a continuous IV, using a method called person-centered treatment (PeT) effects that Anirban created.  Stay tuned!

Reminders About Propensity Scores

Propensity score (PS)-based models are everywhere these days. While these methods are useful for controlling for observed confounders in observational data and for reducing dimensionality in big datasets, it is imperative that analysts use good judgment when applying and interpreting PS analyses. This is the topic of my recent methods article in ISPOR’s Value and Outcomes Spotlight.

I became interested in PS methods during my Master’s thesis work on statin drug use and heart structure and function, which has just been published in Pharmacoepidemiology and Drug Safety. To estimate long-term associations between these two variables, I used the Multi-Ethnic Study of Atherosclerosis (MESA), an observational cohort of approximately 6,000 individuals with rich covariates, subclinical measures of cardiovascular disease, and clinical outcomes over 10+ years of follow-up. We initially used traditional multivariable linear regression to estimate the association between statin initiation and progression of left ventricular mass over time but found that using PS methods allowed for better control of confounding. After we generated PS for the probability of starting a statin, we used matching procedures to match initiators and non-initiators, and estimated the average treatment effect in the treated (ATT). Estimates from both traditional regression and PS-matching procedures found a small, dose-dependent protective effect of statins against left ventricular structural dysfunction. This finding of very modest association contrasts with findings from much smaller, short-term studies.

I did my original analyses using Stata, where there are a few packages for PS including psmatch2 and teffects. My analysis used psmatch2, which is generally considered inferior to teffects because it does not provide proper standard errors. I got around this limitation, however, by bootstrapping confidence intervals, which were all conservative compared with teffects confidence intervals.

Figure 1: Propensity score overlap among 835 statin initiators and 1559 non-initiators in the Multi-Ethnic Study of Atherosclerosis (MESA)

Recently, I gathered the gumption to redo some of the aforementioned analysis in R. Coding in R is a newly acquired skill of mine, and I wanted to harness some of R’s functionality to build nicer figures. I found this R tutorial from Simon Ejdemyr on propensity score methods in R to be particularly useful. Rebuilding my propensity scores with a logistic model that included approximately 30 covariates and 2,389 participant observations, I first wanted to check the region of common support. The region of common support is the overlap between the distributions of PS for the exposed versus unexposed, which indicates the comparability of the two groups. Sometimes, despite fitting the model with every variable you can, PS overlap can be quite bad and matching can’t be done. But I was able to get acceptable overlap on values of PS for statin initiators and non-initiators (see Figure 1). Using the R package MatchIt to do nearest-neighbor matching with replacement, my matched dataset was reduced to 1,670 observations, in which all statin initiators were matched. I also checked covariate balance conditional on PS in the statin initiator and non-initiator groups; examples are in Figure 2. In these plots, the LOWESS smoother is effectively calculating a mean of the covariate level at each value of the propensity score. I expect the means for statin initiators and non-initiators to be similar, so the smooths should be close. At the ends of the age distribution, I see some separation, which is likely to be normal tail behavior. Formal statistical tests can also be used to test covariate balance in the newly matched groups.
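
For readers who want to try this in R, here is a minimal sketch of the MatchIt workflow on simulated data; the covariates and coefficients are made up and stand in for the MESA variables.

```r
library(MatchIt)
set.seed(1)
n      <- 2000
age    <- rnorm(n, 62, 10)
sbp    <- rnorm(n, 125, 15)
statin <- rbinom(n, 1, plogis(-6 + 0.06 * age + 0.02 * sbp))
dat    <- data.frame(statin, age, sbp)

# Nearest-neighbor matching on the propensity score, with replacement,
# targeting the average treatment effect in the treated.
m_out <- matchit(statin ~ age + sbp, data = dat,
                 method = "nearest", replace = TRUE)
summary(m_out)                 # covariate balance before and after matching
matched <- match.data(m_out)   # matched data, including weights

# Region of common support: overlap of the propensity score distributions
# for initiators and non-initiators.
boxplot(m_out$distance ~ dat$statin,
        xlab = "Statin initiation (0 = no, 1 = yes)",
        ylab = "Propensity score")
```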

Figure 2: LOWESS smooth of covariate balance for systolic blood pressure (left) and age (right) across statin initiators and non-initiator groups (matched data)

Please see my website for additional info about my work.

Book Review

Difficult Choices Between What is Best for Science or Best for Our Career


RIGOR MORTIS: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions
By Richard Harris
288 pages, Perseus Books Group, List Price $28

In “Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions,” Richard Harris provides compelling evidence through a series of stories and statistics that medical research is plagued by unnecessary errors despite our technology, effort, money, and passion to make a positive impact. This review takes the perspective of a graduate student in health sciences with the aim of assessing the value of Rigor Mortis for the next generation of scientists. While the book focuses more on sloppy biological science, Harris’s concerns are equally valid in the areas of data science and disease modeling.

Richard Harris, a journalist at NPR who writes about science, has started an important conversation about the broad impact of our current scientific culture: we are publishing too many scientific studies which may have false or unreproducible results. Graduate students in health science research or related fields should not be surprised by Harris’s premise. The pressure to produce a large quantity of publications, instead of fewer and higher quality papers, weighs on every grad student in the world.

In 2017, the CHOICE Institute asked its members to read Rigor Mortis and discuss its implications for our field. One emerging theme was that trainees need to be able to report unethical behaviors without fearing adverse consequences. While required annual courses from the University of Washington Biomedical Research Integrity Series challenge students to reconsider their own personal conflicts of interest in publishing research, this remains a difficult ideal to implement in the face of other pressures. Around the lunch table in our grad student lounge, the book sparked an uncomfortable conversation about multiple testing during regression model fitting, and the long stretch of grey area between a dusty, pre-specified analysis plan and our shiny, new hypothesis-generating exploratory findings.

Harris’s storytelling reminded me of a book I love by David Quammen called “Spillover.” Both Rigor Mortis and Spillover are written by distinguished journalists about very complicated and technical problems. Using New York Times reader-friendly language, both authors weave in conversations with scientists from around the world so that the layperson can understand.

Both books highlight a common dilemma in academia: Should I do what is best for science or what is best for my career? Further, is this an incentives problem or a system problem? The current structure and business of research guide us to make choices that will enhance our career, while science is still often perceived as an altruistic pursuit for the greater good. The book offers a challenge to academic researchers: who among us can claim “no conflict of interest”?

rigor_mortis2
“Attending the panel that rejected his paper proposal, the grad student inwardly trashes each presenter’s research.” — Lego Grad Student

Applying the book’s messages to health economics and outcomes research

I experienced this dilemma when deciding whether to share my HIV disease model. Scientific knowledge and methodology should be completely transparent, yet the software code that implements these techniques is intellectual property that we should not necessarily give away for free. My dilemma isn’t unique. Disease modelers everywhere struggle with this question: should we post our Excel spreadsheet or R code online for others to review and validate and risk having our discovery poached?

This is just one example of tension Harris highlights in his book, and why it is so complex to change our current scientific culture. Scientific advancement is ideally a collective good, but individuals will always need personal incentives to innovate.

Key book takeaways for young scientists

  1. Use valid ingredients
  2. Show your work
  3. No HARKing (Hypothesizing After the Results of the study are Known)
  4. Don’t jump to conclusions (and discourage others from doing this with your results)
  5. Be tough. People may try to discredit you if your hypothesis goes against their life’s work, or for any number of reasons.
  6. Be confident in your science.
  7. Recognize the tension between your own achievement and communal scientific advancement

Further discussions for fixing a broken system

  1. If money is being wasted in biomedical science and research, how do we fix the system to save money without sacrificing incentives to produce valuable innovations? One of our CHOICE Institute graduates, Carrie Bennette, asked this very question in cancer research and you can read about her findings here.
  2. Incentives need to be changed. Academic promotions should not be dependent on the number of our publications but the quality and impact of our contributions. Can we change the culture obsessed with impact factor and promote alternatives such as the H-index or Google Scholar metrics?
  3. Academic tenure systems are antiquated. How do we balance the trade-offs between hiring a post-doc or hiring a permanent staff scientist? Post-doc positions train the next generation and are cheap; however, they result in workflow discontinuity from frequent turnover. Permanent staff scientists stay for longer periods of time, but hiring them would disrupt an ingrained academic pipeline.

Conclusion

I think all students in any science-related field would benefit from reading this book. Cultural and systematic change will happen faster when we have uncomfortable conversations at the table with our colleagues and mentors. Additionally, we need to take the baton Richard Harris has passed us and start running with our generation of colleagues toward finding and implementing solutions. As our influence in our respective fields grows, so too does our responsibility.

Rigor Mortis is available in hardcover on Amazon for $18.65 and Audible for $19.95.
