Is there still value in the p-value?

Doing science is expensive, so a study that reveals significant results yet cannot be replicated by other investigators represents a lost opportunity to invest those resources elsewhere. At the same time, the pressure on researchers to publish is immense.

These are the tensions that underlie the current debate about how to resolve issues surrounding the use of the p-value and the infamous significance threshold of 0.05. The p-value was adopted in the early 20th century to indicate the probability of obtaining results at least as extreme as those observed through chance variation alone, and the 0.05 threshold has been with it since the beginning, allowing researchers to declare as significant any effect that crosses it.

This threshold was selected for convenience at a time when p-values were difficult to calculate. Our modern scientific tools have made the calculation so easy, however, that it is hard to defend a 0.05 threshold as anything but arbitrary. A group of statisticians and researchers is trying to rehabilitate the p-value, at least for the time being, so that we can improve the reliability of results with minimal disruption to the scientific production system. They hope to do this by changing the threshold for statistical significance to 0.005.
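To make the stakes of the threshold concrete, here is a quick simulation of my own (not from the proposal): when the null hypothesis is true, the p-value is uniformly distributed, so a 0.05 threshold lets through roughly 1 in 20 null effects while 0.005 lets through roughly 1 in 200.

```python
import random
from statistics import NormalDist

# Simulate two-sample z-tests in which the null hypothesis is TRUE:
# both groups are drawn from the same standard normal distribution.
random.seed(42)
norm = NormalDist()

def null_p_value(n=30):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # The difference in means has variance 1/n + 1/n when each group's variance is 1.
    z = (sum(a) / n - sum(b) / n) / (2 / n) ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

p_values = [null_p_value() for _ in range(20000)]
rate_05 = sum(p < 0.05 for p in p_values) / len(p_values)
rate_005 = sum(p < 0.005 for p in p_values) / len(p_values)
print(f"false positives at 0.05:  {rate_05:.3f}")   # close to 1 in 20
print(f"false positives at 0.005: {rate_005:.4f}")  # close to 1 in 200
```

Neither threshold is a statement about whether an effect is real or large; both just cap how often pure noise gets labeled "significant."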

In a new editorial in JAMA, Stanford researcher John Ioannidis, a famous critic of bias and irreproducibility in research, has come out in favor of this approach. His argument is pragmatic. In it, he acknowledges that misunderstandings of the p-value are common: many people believe that a result is worth acting on if it is supported by a significant p-value, without regard for the size of the effect or the uncertainty surrounding it.

Rather than reeducating everyone who ever needs to interpret scientific research, then, it is preferable to change our treatment of the threshold signaling statistical significance. Ioannidis also points to the success of genome-wide association studies, which improved in reproducibility after moving to a statistical significance threshold of p < 5 × 10⁻⁸.

As Ioannidis admits, this is an imperfect solution. The proposal has set off substantial debate within the American Statistical Association. Bayesians, for example, see it as perpetuating the same flawed practices that got us into the reproducibility crisis in the first place. In an unpublished but widely circulated 2017 article entitled "Abandon Statistical Significance," Blakely McShane, Andrew Gelman, and others point to several problems with lowering the significance threshold that make it unsuitable for medical research.

First, they point out that the whole idea of the null hypothesis is poorly suited to medical research. Virtually anything ingested by or done to the body has downstream effects on other processes, almost certainly including the ones that any given trial hopes to measure. Therefore, using the null hypothesis as a straw man takes away the focus on what a meaningful effect size might be and how certain we are about the effect size we calculate for a given treatment.

They also argue that the reporting of a single p-value hides important decisions made in the analytic process itself, including all the different ways that the data could have been analyzed. They propose reporting all analyses attempted, in an attempt to capture the “researcher degrees of freedom” – the choices made by the analyst that affect how the results are calculated and interpreted.

Beyond these methodological issues, lowering the significance threshold could increase the costs of clinical trials. If our allowance for Type I error is reduced from 0.05 to 0.005, the required sample size increases by roughly 70 percent, holding power and all other parameters equal. In a regulatory environment where it costs over a billion dollars to bring a drug to market, this need for increased recruitment could drive up costs (which would need to be passed on to the consumer) and delay the health benefits of market release for good drugs. It is unclear whether these potential cost increases would be offset by the savings from researchers producing more reliable, reproducible studies earlier in the development process.
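The sample-size arithmetic can be sketched with the standard normal-approximation formula: required n is proportional to (z₁₋α/₂ + z₁₋β)², so the ratio between two alpha levels does not depend on the effect size or outcome variance. A minimal check, assuming a conventional 80% power:

```python
from statistics import NormalDist

# Required sample size for a two-sided test is proportional to
# (z_{1-alpha/2} + z_{1-beta})^2, so the ratio between two alpha
# levels is independent of effect size and outcome variance.
inv = NormalDist().inv_cdf
power = 0.80  # a conventional choice

def n_factor(alpha):
    return (inv(1 - alpha / 2) + inv(power)) ** 2

ratio = n_factor(0.005) / n_factor(0.05)
print(f"sample size multiplier: {ratio:.2f}")  # about 1.7x at 80% power
```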

It also remains to be seen whether the lower threshold's increased sample size requirement might dissuade pharmaceutical companies from bringing products to market that have a low marginal benefit. After all, you need a larger sample size to detect smaller effects, and that requirement would only be amplified under the new significance threshold. Overall, the newly proposed significance threshold interacts with value considerations in ways that are hard to predict but potentially worth watching.

Generating Survival Curves from Study Data: An Application for Markov Models

By Mark Bounthavong


In cost-effectiveness analysis (CEA), a life-time horizon is commonly used to simulate the overall costs and health effects of a chronic disease. Data for mortality comparing therapeutic treatments are normally derived from survival curves or Kaplan-Meier curves published in clinical trials. However, these Kaplan-Meier curves may only provide survival data up to a few months to a few years, reflecting the length of the trial.

In order to adapt these clinical trial data to a lifetime horizon for use in cost-effectiveness modeling, modelers must make assumptions about the curve and extrapolate beyond what was observed empirically. Luckily, extrapolation to a lifetime horizon is possible using a series of methods based on parametric survival models (e.g., Weibull, exponential). Performing these projections can be challenging without the appropriate data and software, which is why I wrote a tutorial that provides a practical, step-by-step guide to estimating the parameters of a survival model (here, a Weibull) from a published survival function for use in CEA models.

I split my tutorial into two parts, as described below.

Part 1 begins by providing a guide to:

  • Capture the coordinates of a published Kaplan-Meier curve and export the results into a *.CSV file
  • Estimate the survival function based on the coordinates from the previous step using a pre-built template
  • Generate a Weibull curve that closely resembles the survival function and whose parameters can be easily incorporated into a simple three-state Markov model
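As a rough illustration of the idea behind this step (the tutorial itself uses Hoyle and Henley's Excel template and R), a Weibull survival function S(t) = exp(−λt^γ) linearizes under a log(−log) transform, so a least-squares line through transformed Kaplan-Meier coordinates recovers both parameters. The coordinates below are made up for the sketch:

```python
import math

# Hypothetical digitized Kaplan-Meier coordinates (time in months, survival),
# standing in for the CSV exported from Engauge Digitizer.
km = [(3, 0.93), (6, 0.85), (12, 0.72), (18, 0.61), (24, 0.52), (36, 0.38)]

# Weibull survival S(t) = exp(-lam * t**gamma) linearizes as
#   ln(-ln S(t)) = ln(lam) + gamma * ln(t),
# so ordinary least squares on the transformed coordinates recovers
# the shape (gamma) and scale (lam) parameters.
xs = [math.log(t) for t, s in km]
ys = [math.log(-math.log(s)) for t, s in km]
n = len(km)
xbar, ybar = sum(xs) / n, sum(ys) / n
gamma = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
lam = math.exp(ybar - gamma * xbar)

print(f"shape gamma = {gamma:.3f}, scale lam = {lam:.4f}")
# The fitted curve can then be extrapolated beyond the trial, e.g. to 10 years:
print(f"S(120 months) = {math.exp(-lam * 120 ** gamma):.3f}")
```

Hoyle and Henley's method is more sophisticated (it accounts for numbers at risk and censoring), but the transform above is the core intuition.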

Part 2 concludes with a step-by-step guide to:

  • Incorporate the Weibull parameters into a Markov model
  • Compare the survival probability of the Markov model to the reference Kaplan-Meier curve to validate the method and catch any errors
  • Extrapolate the survival curve across a lifetime horizon
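A minimal sketch of Part 2's logic, with hypothetical Weibull parameters and an illustrative progression probability (not the tutorial's actual numbers): convert the fitted curve into a time-dependent death probability per cycle, run the Markov trace over a lifetime horizon, and check that modeled survival reproduces the curve.

```python
import math

# Hypothetical Weibull parameters, standing in for those estimated in Part 1.
gamma, lam = 1.04, 0.024                   # shape, scale
S = lambda t: math.exp(-lam * t ** gamma)  # Weibull survival function

cycles = 480        # monthly cycles over a 40-year (lifetime) horizon
p_progress = 0.02   # illustrative monthly probability of progression

stable, progressed, dead = 1.0, 0.0, 0.0
trace = [(stable, progressed, dead)]
for t in range(1, cycles + 1):
    # Time-dependent death probability for the cycle from t-1 to t:
    p_die = 1 - S(t) / S(t - 1)
    new_dead = dead + (stable + progressed) * p_die
    new_progressed = progressed * (1 - p_die) + stable * (1 - p_die) * p_progress
    new_stable = stable * (1 - p_die) * (1 - p_progress)
    stable, progressed, dead = new_stable, new_progressed, new_dead
    trace.append((stable, progressed, dead))

# Validation: overall survival in the model should reproduce the Weibull curve.
for t in (12, 24, 120):
    model_alive = trace[t][0] + trace[t][1]
    print(f"t={t:3d}  model alive: {model_alive:.4f}  Weibull S(t): {S(t):.4f}")
```

Because mortality here applies equally to both alive states, the match is exact; in a real model where progressed patients die faster, the comparison against the reference Kaplan-Meier curve is the error check the tutorial describes.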

The tutorial requires using and transferring data across a couple of different software programs. You will need some familiarity with Excel to perform these parametric estimations. You should download and install the open source software “Engauge Digitizer” developed by Mark Mitchell, which can be found here. You should also download and install the latest versions of R and RStudio to generate the parametric survival curve parameters.

Hoyle and Henley wrote a great paper on using data from a Kaplan-Meier curve to generate parameters for a parametric survival model, which can be found here. The tutorial makes use of their methods and supplemental file. Specifically, you will need to download their Excel Template to generate the parametric survival curve parameters.

I have created a public folder with the relevant files used in the tutorial here.

If you have any comments or notice any errors, please contact me at

Trends for Performance-based Risk-sharing Arrangements

Author: Shuxian Chen


When considering the approval of new drugs, devices, and diagnostic products, there’s always a tension between making the product’s benefits available to more people and collecting more information in trials. The restrictive design of randomized controlled trials (RCTs) means that their indications of effectiveness don’t always hold in the real world. They’re also unlikely to detect long-term adverse events. This uncertainty and risk make it hard for payers to make coverage decisions for new interventions.

Performance-based risk-sharing arrangements (PBRSAs), also known as patient access schemes (PAS), managed entry arrangements, and coverage with evidence development (CED), help to reduce such risk. These are arrangements between a payer and a pharmaceutical, device, or diagnostic manufacturer where the price level and/or nature of reimbursement is related to the actual future performance of the product in either the research or ‘real world’ environment rather than the expected future performance [1].

I recently developed a review paper with CHOICE faculty Josh Carlson and Lou Garrison that gives an update on trends in PBRSAs both in the US and globally. Using the University of Washington Performance-Based Risk-Sharing Database, which contains information obtained by searching Google, PubMed, and government websites, we identified 437 eligible cases between 1993 and 2016. Eighteen cases have been added to the database in 2017 and 2018. Seventy-two cases are from the US.

Figure 1. Eligible cases between 1993-2016 by country


Australia, Italy, the US, Sweden, and the UK are the five countries with the largest number of PBRSAs. (The distribution of cases by country can be seen in Figure 1.) Except for the US, cases from the other four countries are identified from their government programs: the Pharmaceutical Benefits Scheme (PBS) in Australia, the Italian Medicines Agency (AIFA) in Italy, the Swedish Dental and Pharmaceutical Benefits Agency (TLV) in Sweden, and the National Institute for Health and Care Excellence (NICE) in the UK. These single-payer systems have more power in negotiating drug prices with manufacturers than we do in the US.

Cases in the US are more heterogeneous, with both public (federal/state-level) and private payers involved. The US Centers for Medicare and Medicaid Services (CMS) contributes 25 (37%) of the 72 US cases. Among these, most arrangements involve medical devices and diagnostic products and originate in the CED program at CMS [2]. This program is used to generate additional data to support national coverage decisions for potentially innovative medical technologies and procedures, as coverage for patients is provided only in the context of approved clinical studies [3]. For pharmaceuticals, there have been few PBRSAs between CMS and manufacturers – no cases were established between 2006 and 2016. However, in August 2017, Novartis announced a first-of-its-kind collaboration with CMS: a PBRSA for Kymriah™ (tisagenlecleucel), their novel cancer treatment for B-cell acute lymphoblastic leukemia that uses the body’s own T cells to fight cancer [4]. The arrangement allows for payment only when participants respond to Kymriah™ by the end of the first month. It can be categorized as performance-linked reimbursement (PLR), as reimbursement is provided to the manufacturer only if the patient meets the pre-specified measure of clinical outcomes. This recent collaboration may lead to a larger number and greater variety of PBRSAs between pharmaceutical manufacturers and CMS.

Please refer to our article for more detailed analyses regarding the trends in PBRSAs.


[1] Carlson JJ, Sullivan SD, Garrison LP, Neumann PJ, Veenstra DL. Linking payment to health outcomes: a taxonomy and examination of performance-based reimbursement schemes between healthcare payers and manufacturers. Health Policy. 2010;96(3): 179–90. doi:10.1016/j.healthpol.2010.02.005.

[2] CMS. Coverage with Evidence Development. Available at:

[3] Neumann PJ, Chambers J. Medicare’s reset on ‘coverage with evidence development’. Health Affairs Blog. 2013 Apr 1.

[4] Novartis. Novartis receives first ever FDA approval for a CAR-T cell therapy, Kymriah(TM) (CTL019), for children and young adults with B-cell ALL that is refractory or has relapsed at least twice. 2017. Available at:

A visual primer to instrumental variables

By Kangho Suh

When assessing the possible efficacy or effectiveness of an intervention, the main objective is to attribute changes you see in the outcome to that intervention alone. That is why clinical trials have strict inclusion and exclusion criteria, and frequently use randomization to create “clean” populations with comparable disease severity and comorbidities. By randomizing, the treatment and control populations should match not only on observable (e.g., demographic) characteristics, but also on unobservable or unknown confounders. As such, the difference in results between the groups can be interpreted as the effect of the intervention alone and not some other factors. This avoids the problem of selection bias, which occurs when the exposure is related to observable and unobservable confounders, and which is endemic to observational studies.

In an ideal research setting (ethics aside), we could clone individuals and give one clone the new treatment and the other one a placebo or standard of care and assess the change in health outcomes. Or we could give an individual the new treatment, study the effect the treatment has, go back in time through a DeLorean and repeat the process with the same individual, only this time with a placebo or other control intervention. Obviously, neither of these are practical options. Currently, the best strategy is randomized controlled trials (RCTs), but these have their own limitations (e.g. financial, ethical, and time considerations) that limit the number of interventions that can be studied this way. Also, the exclusion criteria necessary to arrive at these “clean” study populations sometimes mean that they do not represent the real-world patients who will use these new interventions.

For these reasons, observational studies present an attractive alternative to RCTs by using electronic health records, registries, or administrative claims databases. Observational studies have their own drawbacks, such as the selection bias detailed above. We try to address some of these issues by controlling for covariates in statistical models or by using propensity scores to create comparable study groups that have similar distributions of observable covariates (check out the blog entry on using propensity scores by my colleague Lauren Strand). Another method that has been gaining popularity in health services research is an econometric technique called instrumental variables (IV) estimation. In fact, two of my colleagues and the director of our program (Mark Bounthavong, Blythe Adamson, and Anirban Basu, respectively) wrote a primer on the use of IV here.

In their article, Mark, Blythe, and Anirban explain the endogeneity issue that arises when the treatment variable is associated with the error term in a regression model. For those of you who might still be confused (I certainly was for a long time!), I’ll use a simple figure¹ that I found in a textbook to explain how IVs work.

[Figure: Instrumental Variables – overlapping circles representing the variation in the treatment (X), outcome (Y), and instrument (Z)]

¹ p. 147 from Kennedy, Peter. A Guide to Econometrics, 6th Edition. Oxford: Blackwell Publishing, 2008. Print.

The figure uses circles to represent the variation within variables that we are interested in: each circle represents the treatment variable (X), outcome variable (Y), or the instrumental variable (Z). First, focus on the treatment and outcome circles. We know that some amount of the variability in the outcome is explained by the treatment variable (i.e. treatment effect); this is indicated by the overlap between the two circles (red, blue, and purple). The remaining green section of the outcome variable represents the error (ϵ) obtained with a statistical model. However, if treatment and ϵ are not independent due to, for example, selection bias, some of the green spills over to the treatment circle, creating the red section. Our results are now biased, because a portion (red) of the variation in our outcome is attributed to both treatment and ϵ.

Enter the instrumental variable Z. It must meet two criteria: 1) be strongly correlated with treatment (large overlap of instrument and treatment) and 2) not be correlated with the error term (no overlap with red or green). In the first stage, we regress treatment on the instrument and obtain the predicted values of treatment (orange and purple). We then regress the outcome on the predicted values of treatment to get the treatment effect (purple). Because we have used only the exogenous part of our treatment X to explain Y, our estimates are unbiased.
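A small simulation (my own toy numbers, not from the primer) makes the two stages concrete: an unobserved confounder u contaminates the OLS estimate, while the instrument z recovers the true effect.

```python
import random

# Toy two-stage least squares (2SLS) simulation; all numbers are made up.
# z is the instrument, u an unobserved confounder, x the treatment, y the outcome.
random.seed(1)
n = 200_000
data = []
for _ in range(n):
    z = random.gauss(0, 1)                # affects treatment but not the error term
    u = random.gauss(0, 1)                # confounder: the "red" overlap in the figure
    x = 0.8 * z + u + random.gauss(0, 1)  # treatment is endogenous through u
    y = 2.0 * x + u                       # true treatment effect is 2.0
    data.append((z, x, y))

def slope(pairs):
    """OLS slope of b on a: cov(a, b) / var(a)."""
    m = len(pairs)
    abar = sum(a for a, _ in pairs) / m
    bbar = sum(b for _, b in pairs) / m
    cov = sum((a - abar) * (b - bbar) for a, b in pairs)
    var = sum((a - abar) ** 2 for a, _ in pairs)
    return cov / var

ols = slope([(x, y) for z, x, y in data])      # contaminated by u, biased upward
pi = slope([(z, x) for z, x, y in data])       # first stage: treatment on instrument
iv = slope([(pi * z, y) for z, x, y in data])  # second stage: outcome on predicted treatment
print(f"OLS estimate:  {ols:.2f}  (true effect is 2.0)")
print(f"2SLS estimate: {iv:.2f}")
```

Note how the second stage uses only the predicted treatment values (the orange-plus-purple variation), which is exactly why the estimate escapes the confounded red region.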

Now that you understand the benefit of IV estimators visually, maybe you can see some of the drawbacks as well. The information used to estimate the treatment effect becomes much smaller: it shrinks from the overlap between treatment and outcome (red, blue, and purple) to just the purple area. As a result, while the IV estimator may be unbiased, it has more variance than a simple OLS estimator. One way to mitigate this limitation is to choose an instrument that is highly correlated with treatment, making the purple area as large as possible.

A more concerning limitation with IV estimation is the interpretability of results, especially in the context of treatment effect heterogeneity. I will write another blog post about this issue and how it can be addressed if you have a continuous IV, using a method called person-centered treatment (PeT) effects that Anirban created.  Stay tuned!

ISPOR’s Special Task Force on US Value Assessment Frameworks: A summary of dissenting opinions from four stakeholder groups

By Elizabeth Brouwer


The International Society for Pharmacoeconomics and Outcomes Research (ISPOR) recently published an issue of their Value in Health (VIH) journal featuring reports on Value Assessment Frameworks. This marks the culmination of a Spring 2016 initiative “to inform the shift toward a value-driven health care system by promoting the development and dissemination of high-quality, unbiased value assessment frameworks, by considering key methodological issues in defining and applying value frameworks to health care resource allocation decisions.” (VIH Editor’s note) The task force summarized and published their findings in a 7-part series, touching on the most important facets of value assessment. Several faculty of the CHOICE Institute at the University of Washington authored portions of the report, including Louis Garrison, Anirban Basu and Scott Ramsey.

In the spirit of open dialogue, the journal also published commentaries representing the perspectives of four stakeholder groups: payers (in this case, private insurance groups), patient advocates, academia, and the pharmaceutical industry. While supportive of value assessment in theory, each commentary critiqued aspects of the task force’s report, highlighting the contentious nature of value assessment in the US health care sector.

Three common themes emerged, however, among the dissenting opinions:

  1. Commenters saw CEA as a flawed tool, on which the task force placed too much emphasis

All commentaries except the academic perspective bemoaned the task force’s reliance on cost-effectiveness analysis. Payers, represented in an interview with two private insurance company CEOs, claimed that they do not have a choice on whether to cover most new drugs. If it’s useful at all, then, CEA informs the ways that payers distinguish between drugs of the same class. The insurers went on to claim that they are more interested in the way that CEA can highlight high-value uses for new drugs, as most are expected to be expensive regardless.

Patient advocates also saw CEA as a limited tool and were opposed to any value framework overly dependent on the cost per QALY paradigm.  The commentary equated CEAs to clinical trials—while informative, they imperfectly reflect how a drug will fare in the real world. Industry representatives, largely representing the PhRMA Foundation, agreed that the perspective provided by CEAs is too narrow and shouldn’t be the cornerstone for value assessment, at least in the context of coverage and reimbursement decisions.

  2. Commenters disagreed with how the task force measured benefits (the QALY)

All four commentaries noted the limitations of the quality-adjusted life-year (QALY). The patient advocates and the insurance CEOs both claimed that the QALY did not reflect their definition of health benefits. The insurance representatives reminded us that they don’t give weight to societal value because it is not part of their business model. Similarly, the patient advocates said the QALY did not reflect patient preferences, where value is more broadly defined. The QALY, for example, does not adequately capture the influence of health care on functionality, ability to work, or family life. The patient advocates noted that while the task force identified these flaws and their methodological difficulties, it stopped short of recommending or taking any action to address them.

Industry advocates wrote that what makes the QALY useful—its ability to make comparisons across most health care conditions and settings—is also what makes it ill-suited for use in a complex health care system. Individual parts of the care continuum cannot be considered in isolation. They also noted that the QALY is discriminatory toward vulnerable populations and not reflective of their customers’ preferences.

Mark Sculpher, Professor at the University of York representing health economic theory and academia, defended the QALY to an extent, noting that the measure is the most suitable available unit for measuring health. He acknowledged the QALY’s limitations in capturing all the benefits of health care, however, and noted that decision makers and not economists should be the ones defining benefit.


  3. Commenters noticed a disconnect between the reports and social/political realities

Commenters seemed disappointed that the task force did not go further in directing the practical application of value assessment frameworks within the US health care sector. The academic representative wrote that, while economic underpinnings are important, ultimately value frameworks need to be useful to, and reflect the values of, the decision makers. He argued that decision-makers’ buy-in is invaluable, as they hold the power to implement and execute resource allocation. Economics can provide a foundation for this but should not be the source of judgement relating to value if the US is going to take up value assessment frameworks to inform decisions.

Patient advocates and industry representatives went further in their criticism, saying the task force seemed disconnected from the existing health care climate. The patient advocate author felt the task force ignored the social and political realities in which health care decisions are made. Industry representatives pointed out that current policy, written in the Patient Protection and Affordable Care Act (PPACA), prohibited a QALY-based CEA because most decision makers in the US believe it inappropriate for use in health care decision making. Both groups wondered why the task force continued to rely on CEA methodology when it had been prohibited by the public sector.


The United States will continue to grapple with value assessment as it seeks to balance innovation with budgetary constraints. The ISPOR task force ultimately succeeded in its mission, which was never to specify a definitive and consensual value assessment framework, but instead to consider “key methodological issues in defining and applying value frameworks to health care resource allocation decisions.”

The commentaries also succeeded in their purpose: highlighting the ongoing tensions in creating value assessment frameworks that stakeholders can use. There is a need to improve tools that value health care to assure broader uptake, along with a need to accept flawed tools until we have better alternatives. The commentaries also underscore a chicken-and-egg phenomenon within health care policy. Value assessment frameworks need to align with the goals of decision-makers, but decision-makers also need value frameworks to help set goals.

Ultimately, Mark Sculpher may have summarized it best in his commentary. Value assessment frameworks ultimately seek to model the value of health care technology and services. But as Box’s adage reminds us: although all models are wrong, some are useful. How to make value assessment frameworks most useful moving forward remains a lively, complex conversation.