Lessons from a meta-analysis in the presence of statistical heterogeneity. A case study of SARS-CoV-2 detection window.

By Enrique M Saldarriaga and Beth Devine

The objective of this entry is to present the lessons we learned from a meta-analysis we conducted on the detection pattern of SARS-CoV-2. In the process, we found high statistical heterogeneity across the studies, which persisted even after stratification by demographic and study-design characteristics. Although these results did not allow us to increase our knowledge of the shedding patterns of COVID-19, they prompted us to review the concepts, assumptions, and methods available to measure heterogeneity and how it affects the estimation of quantitative summaries.

In this post, we present our analytic process and reflections on the methods. We discuss the use of mean versus median and the crosswalk between them, key differences between fixed and random effects models, measures of heterogeneity, and analytic tools implemented with R. Along the way, we provide a tutorial of the methods used to conduct a meta-analysis. 

SARS-CoV-2 window of detection

The window of SARS-CoV-2 detection provides key information for understanding the patterns of virus shedding and infectiousness, and therefore for better implementing testing and isolation strategies [1,2]. A diagnostic test conducted too soon or too late can lead to a false negative result, increasing the likelihood of virus transmission.

Dr. Kieran Walsh et al. [3] conducted a systematic review of studies that described the duration of virus detection. The authors included “any study that reports on the viral load or duration of viral detection or infectivity of COVID-19”, excluding studies without laboratory confirmation of COVID-19 from molecular testing (i.e., polymerase chain reaction or PCR). Thus, they included cohort studies, cross-sectional studies, non-randomized clinical trials, and case series, from various countries and age groups (adults and children). In addition, the viral samples came from the upper respiratory tract, the lower respiratory tract, and stool. From a narrative summary, the authors concluded that while the trajectory of SARS-CoV-2 viral load is relatively consistent over the course of the disease, the duration of infectivity is unclear.

We decided to meta-analyze the results of this well-conducted systematic review. To boost internal consistency across studies, we focused our meta-analysis solely on studies that reported upper respiratory tract samples. 

Mean v. Median

To combine results, it is necessary to have consistency in the reported metric. Most of the studies reported mean and standard deviation, but others reported the median with either the interquartile range or the minimum-maximum range. We followed the conclusions of Wan et al. [4] to estimate the sample mean and standard deviation based on the information reported by each study.

We employed one of two possible methods:

Method 1 (median reported with the minimum and maximum).

$$ SD \approx \frac{\max - \min}{2\,\Phi^{-1}(P)} $$

Where Φ⁻¹(·) is the inverse of the cumulative distribution function of the standard normal distribution (mean 0, standard deviation 1; the function `qnorm()` in R), and P is defined by P = (n − 0.375)/(n + 0.25), where n is the sample size.

Method 2 (median reported with the interquartile range).

$$ SD \approx \frac{IQ_3 - IQ_1}{2\,\Phi^{-1}(P)} $$

Where IQ1 is the lower bound of the interquartile range (the 25th percentile) and IQ3 is the upper bound (the 75th percentile). P is defined by P = (0.75n − 0.125)/(n + 0.25).

The underlying assumption for both methods is that the observations summarized by the median arise from a normal distribution, which can be a limitation. However, these methods improve upon commonly accepted conversion formulas (see Bland 2015 [5] and Hozo et al. 2005 [6] for more details) by relaxing non-negativity assumptions and using more stable and adaptive quantities to estimate the SD.
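The conversions described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used for the analysis; the function names are hypothetical, the constants are those published by Wan et al., and the mean estimators are the ones given in that paper for the same two reporting scenarios. Normality of the underlying data is assumed throughout.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1; inv_cdf() plays the role of R's qnorm()


def sd_from_range(minimum, maximum, n):
    """SD estimate when a study reports the minimum, maximum and sample size."""
    p = (n - 0.375) / (n + 0.25)
    return (maximum - minimum) / (2 * STD_NORMAL.inv_cdf(p))


def sd_from_iqr(q1, q3, n):
    """SD estimate when a study reports the quartiles and sample size."""
    p = (0.75 * n - 0.125) / (n + 0.25)
    return (q3 - q1) / (2 * STD_NORMAL.inv_cdf(p))


def mean_from_range(minimum, median, maximum):
    """Mean estimate from the minimum, median and maximum (Wan et al.)."""
    return (minimum + 2 * median + maximum) / 4


def mean_from_iqr(q1, median, q3):
    """Mean estimate from the three quartiles (Wan et al.)."""
    return (q1 + median + q3) / 3


# Hypothetical study: median 14 days, IQR 10 to 20 days, 50 patients.
example_mean = mean_from_iqr(10, 14, 20)
example_sd = sd_from_iqr(10, 20, 50)
```

As n grows, the denominator in `sd_from_iqr` approaches 1.349, the familiar factor linking the interquartile range to the SD of a normal distribution.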

Fixed vs random meta-analysis

The pooled mean is a weighted average, and the decision to use a fixed- or random-effects model directly affects how the study weights are generated. If we assume that the studies have a common effect, then it makes sense that the pooled mean places more importance (i.e., weight) on the studies with the lowest uncertainty. In other words, the assumption that the true value (in direction and magnitude) is the same across all studies implies that observed differences are due to chance. On the contrary, if there is no prior knowledge suggesting a common effect, and rather each study provides an estimate of its own, then the weighting process should reflect that. The first alternative calls for a fixed-effects model and the latter for a random-effects model. The random-effects assumption is less restrictive, as it acknowledges the variation in the true effects estimated in each study [7]. Thus, the precision (i.e., estimated uncertainty expressed in the standard deviation) of the studies plays an important role, but so does the assumption (and/or knowledge) about the relationship across studies. See Tufanaru et al. 2015 [8] and Borenstein et al. 2010 [9] for a complete discussion of these two statistical models.

Fixed- v. random-effects model: comparative table of key characteristics and rationale.

| Criterion | Fixed-effects model | Random-effects model |
| --- | --- | --- |
| Goal of statistical inference (statistical generalizability of results). | Results apply only to studies in meta-analysis. | Results apply beyond studies included in the analysis. |
| Statistical assumption regarding the parameter. | There is one common, fixed parameter and all studies estimate the same common parameter. | There is no common parameter and studies estimate different parameters. |
| Nonstatistical assumption regarding the comparability of studies from a clinical point of view (participants, interventions, comparators, and outcomes). | It is reasonable to consider that studies are similar enough and that there is a common effect. | Studies are different and it is not reasonable to consider that there is a common effect. |
| The nature of meta-analysis results. | The meta-analysis summary effect is an estimate of the effect that is common to all studies included in the analysis. | The meta-analysis summary effect is an estimate of the mean of a distribution of true effects; it is not the shared common estimate, because there is not one. |
Adapted from Tufanaru et al., JBI Evidence Implementation 2015; Table 4

“The fixed-effects meta-analysis model’s total effect is an estimator of the combined effect of all studies. In contrast, the random-effect meta-analysis’s full effect is an estimator of the mean value of the true effect distribution” (Hackenberger 2020 [10]).

In our analysis, we determined that there was no common effect across studies due to differing study designs and populations studied. 


Statistical heterogeneity is a consequence of clinical and/or methodological differences across studies, and it determines the extent to which it is possible to assume that the true value found by each study is the same. Clinical differences include participant characteristics and intervention design and implementation; methodological differences include the definition and measurement of outcomes, procedures for data collection, and any other characteristic associated with the design of the study.

There are two main metrics that we can use to summarize heterogeneity: the percentage of variance attributable to study heterogeneity (I²) and the true-effect variance (τ²). Before turning to them, it helps to recall the chi-squared test, usually referred to as Cochran’s Q in the literature, which is possibly the most widely used check for heterogeneity. It compares expected and observed information under the null hypothesis that differences observed across studies are due to chance alone. A limitation of this test is that it provides a binary assessment and ignores the degree of heterogeneity, which is more relevant, as variability in methods, procedures, and results across studies is expected.

Julian Higgins and colleagues [11] proposed I², a newer metric that describes the percentage of total variation across studies that is due to heterogeneity rather than chance (i.e., it uses the same rationale as the Q test). Formally, I² = 100% × (Q − df)/Q, where df is the degrees of freedom (the number of studies minus one). Negative values of Q − df are set to zero, so I² is bounded between 0% and 100%. This metric should be interpreted in the context of the analysis and other factors, such as the characteristics of the studies and the magnitude and direction of the individual estimates. As a rule of thumb, an I² higher than 50% might represent substantial heterogeneity and calls for caution in the interpretation of the pooled values.
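To make the definition concrete, here is a minimal Python sketch of Cochran's Q and I² computed from study means and standard errors with inverse-variance weights. The function names and the three-study input are hypothetical and serve only as an illustration.

```python
def cochran_q(means, standard_errors):
    """Return Cochran's Q and its degrees of freedom (number of studies - 1)."""
    weights = [1 / se**2 for se in standard_errors]  # inverse-variance weights
    pooled = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    q = sum(w * (m - pooled) ** 2 for w, m in zip(weights, means))
    return q, len(means) - 1


def i_squared(q, df):
    """I² in percent; negative values of Q - df are set to zero."""
    if q <= 0:
        return 0.0
    return max(0.0, 100 * (q - df) / q)


# Hypothetical example: three studies with equal precision.
q_value, dof = cochran_q([10, 12, 20], [1, 1, 1])
i2_value = i_squared(q_value, dof)
```

With identical study means Q is zero and I² is reported as 0%, which matches the truncation rule described above.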

Another measure of heterogeneity is the variance of the true effects, τ². This metric is consistent with the random-effects assumption that there could be more than one true effect, and that each study provides an estimate of one of them. There are several ways to estimate τ²; the most popular is the DerSimonian and Laird method, a method-of-moments estimator. The main limitation of this estimator is that, unless the sampling variances are homogeneous, it tends to underestimate τ² regardless of the number of studies included. Viechtbauer 2005 [12] provides a thorough assessment of the alternatives.
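As an illustration of the DerSimonian and Laird approach, the sketch below computes the moment-based estimate, τ² = max(0, (Q − df)/C), where C = Σw − Σw²/Σw and w are the fixed-effect inverse-variance weights. It is a hand-rolled Python example with a hypothetical function name and input, not the implementation used by any particular package.

```python
def dersimonian_laird_tau2(means, standard_errors):
    """Method-of-moments estimate of the between-study variance."""
    weights = [1 / se**2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * m for w, m in zip(weights, means)) / total
    q = sum(w * (m - pooled) ** 2 for w, m in zip(weights, means))
    df = len(means) - 1
    c = total - sum(w**2 for w in weights) / total
    return max(0.0, (q - df) / c)  # truncated at zero


# Hypothetical example: three studies with equal precision.
tau2_example = dersimonian_laird_tau2([10, 12, 20], [1, 1, 1])
```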

The interpretation of τ² is straightforward: it provides an estimate of the between-study variance of the true effects. It therefore helps to inform whether a quantitative summary makes sense; a large variance can make the mean meaningless. Further, the estimation of τ² has uncertainty, and it is possible to estimate its confidence interval for a deeper assessment of the variance across studies.

All these metrics can measure statistical heterogeneity. However, it is up to the researcher to determine whether, even in the absence of statistical heterogeneity, the results of two or more studies should be combined into a single value. That assessment depends upon the clinical characteristics of the studies under analysis, specifically the population, place, and time, the interventions evaluated, and the outcomes measured. This is the main reason we excluded studies whose samples did not arise from the upper respiratory tract: pooling the results of structurally different studies would have been a mistake.

See Chapter 9 of the Cochrane Handbook for an introduction to the topic and Ioannidis 2008 [13] for an informative discussion on how to assess heterogeneity and bias in meta-analysis.

Confidence intervals and prediction intervals

Both the pooled mean and its standard error in a meta-analysis are functions of the inverse-variance weights, which are estimated from each study's standard deviation and the assumption made about the heterogeneity of the true effect. Random-effects models tend to distribute weights more equally across studies than fixed-effects models, because the between-study variance is added to the variance of every study; the estimated standard error is therefore higher, leading to wider confidence intervals (CI). In the extreme, the presence of high between-study variance in a meta-analysis can yield a pooled mean that closely resembles an arithmetic mean (regardless of study sample size or precision in the estimates) due to the equal distribution of weights across studies [14].

If the pooled mean is denoted by μ and its standard error by SE(μ), then by the central limit theorem the 95% CI is estimated as μ ± 1.96 × SE(μ), where 1.96 is the 97.5th percentile of the standard normal distribution. The 95% CI provides information about the precision of the pooled mean, i.e., the uncertainty of the estimation. Borenstein et al. 2010 [9] present a helpful discussion of the rationale and steps involved in the estimation of the pooled mean and standard errors for both fixed- and random-effects models.
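The effect of the model choice on weights and confidence intervals can be seen in a small sketch. The Python function below (hypothetical name, hypothetical data) pools study means with inverse-variance weights 1/(SE² + τ²); setting τ² to zero gives the fixed-effects result, while a positive τ² gives the random-effects result. The value of τ² is taken as given here rather than estimated.

```python
from math import sqrt


def pool(means, standard_errors, tau2=0.0):
    """Inverse-variance pooled mean, its SE, 95% CI, and relative weights."""
    weights = [1 / (se**2 + tau2) for se in standard_errors]
    total = sum(weights)
    mu = sum(w * m for w, m in zip(weights, means)) / total
    se_mu = sqrt(1 / total)
    ci = (mu - 1.96 * se_mu, mu + 1.96 * se_mu)
    return mu, se_mu, ci, [w / total for w in weights]


# Hypothetical studies with very different precision.
study_means = [10, 12, 20]
study_ses = [0.5, 1, 2]

fixed = pool(study_means, study_ses)                 # tau2 = 0
random_fx = pool(study_means, study_ses, tau2=25.0)  # assumed tau2
```

In this example the fixed-effects model gives most of the weight to the most precise study, whereas under the assumed τ² the three weights are close to one third each and the pooled mean moves toward the arithmetic mean of 14.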

Under the random-effects assumption, we can estimate the prediction interval for a future study. Its derivation uses both the pooled standard error, SE(μ), and the between-study variance estimate, τ². The approximate 95% prediction interval for the estimate of a new study is given by [15]:

$$ \mu \pm t^{\alpha}_{k-2} \sqrt{\tau^{2} + SE(\mu)^{2}} $$

Where α is the level of significance, usually 5%, and t^α_{k−2} denotes the 100 × (1 − α/2)% percentile (97.5% when α = 0.05) of the t-distribution with k − 2 degrees of freedom, where k is the number of studies included in the meta-analysis. The t-distribution is used (instead of the normal distribution used for confidence intervals) to reflect the uncertainty surrounding τ²; hence the choice of a distribution with heavier tails.
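A short numerical sketch may help here. The Python snippet below evaluates the prediction interval as μ ± t × √(τ² + SE(μ)²), using the pooled values reported later in this post for `metamean` (mean 15.6 days, 95% CI 12.3 to 18.9, τ² = 81.5). The standard error is backed out from the confidence interval, and the t quantile of 2.042 (the 97.5th percentile with 30 degrees of freedom) is an assumption made purely for illustration, because the number of studies is not stated here.

```python
from math import sqrt


def prediction_interval(mu, se_mu, tau2, t_crit):
    """Approximate prediction interval for the mean of a new study."""
    half_width = t_crit * sqrt(tau2 + se_mu**2)
    return mu - half_width, mu + half_width


pooled_mean = 15.6
# SE recovered from the reported 95% CI (12.3, 18.9).
pooled_se = (18.9 - 12.3) / (2 * 1.96)
tau_squared = 81.5
T_CRIT_ASSUMED = 2.042  # assumption: t quantile for 30 degrees of freedom

pi_low, pi_high = prediction_interval(
    pooled_mean, pooled_se, tau_squared, T_CRIT_ASSUMED
)
```

Under these assumptions the interval runs from roughly −3 to 34 days, in line with the range quoted in the post, and it shows how a large τ² dominates the width of the interval.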


Our analysis of the findings from Walsh et al. [3] was implemented in R using the `meta` package [16]. This library offers several alternatives to conduct a meta-analysis; for our study two were particularly relevant: `metagen` and `metamean`. Both base the weight calculation on the inverse-variance method; the former treats each individual value as a treatment effect (i.e., the difference in performance between two competing alternatives), while the latter assumes each value is a single mean. As shown below, the mean and confidence intervals under both are similar: 13.9 days of detectability (95% CI 11.7, 16.7) using `metagen`, and 15.6 days (95% CI 12.3, 18.9) using `metamean`. However, the heterogeneity estimates are very different. We present the results of both analyses below.

Our results of the pooled mean using `metagen`

TE: Treatment Effect; se: Standard Error; MD: Mean Difference; 95%CI: 95% Confidence Intervals; MRAW: Raw or untransformed Mean

Our results of the pooled mean using `metamean`

TE: Treatment Effect; se: Standard Error; MD: Mean Difference; 95%CI: 95% Confidence Intervals; MRAW: Raw or untransformed Mean

The main difference between the two methods is the assumption made about the uncertainty metric included in the data. While `metagen` assumes that the metric is the standard error (SE, arising from a previous statistical analysis), `metamean` assumes that it is the standard deviation (SD) and hence converts it using the sample size (n) of each study, with SE = SD/√n. Therefore, the estimated confidence intervals for each study are wider under `metagen`, which gives the impression that all studies are more alike, reducing the estimated heterogeneity. The opposite happens under `metamean`: the confidence intervals are narrower in comparison, and therefore the heterogeneity is higher.
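The consequence of this assumption can be reproduced with a toy example. The Python sketch below does not call the `meta` package; it simply computes I² twice for the same hypothetical data, once treating the reported spread as a standard error and once treating it as a standard deviation that is divided by √n first. All names and numbers are illustrative.

```python
from math import sqrt


def i_squared(means, standard_errors):
    """I² (percent) from inverse-variance weights and Cochran's Q."""
    weights = [1 / se**2 for se in standard_errors]
    pooled = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    q = sum(w * (m - pooled) ** 2 for w, m in zip(weights, means))
    df = len(means) - 1
    return max(0.0, 100 * (q - df) / q) if q > 0 else 0.0


# Hypothetical studies: mean duration, reported spread, sample size.
durations = [10, 14, 20]
spreads = [6, 6, 6]
sample_sizes = [36, 36, 36]

# Spread used directly as the standard error.
i2_spread_as_se = i_squared(durations, spreads)

# Spread treated as an SD and converted with SE = SD / sqrt(n).
converted = [sd / sqrt(n) for sd, n in zip(spreads, sample_sizes)]
i2_spread_as_sd = i_squared(durations, converted)
```

With the spread read as an SE the studies look compatible (I² of 0%), while the same numbers read as SDs produce narrow study-level intervals and an I² above 90%.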

This difference in the uncertainty estimation under each method is also reflected in the weights. Under `metagen`, the fixed-effects model shows a more homogeneous distribution of weights compared to the high concentration displayed under `metamean`. This is because the narrower intervals under `metamean` suggest that studies like Lavezzo and Chen [3], for example, are very precise and therefore more important in the estimation. In contrast, under the random-effects models, the weights in `metamean` are almost the same for all studies, indicating that each arises from a different true-effect distribution, while under `metagen` the wider intervals suggest that some studies arise from similar true distributions and hence deserve higher weights. The homogeneous distribution of weights in the random-effects model under `metamean` is consistent with the notion that, under enough uncertainty, the pooled mean closely approximates the simple arithmetic mean [14].


The function `metamean` is the appropriate option for our analysis, because it is consistent with the information reported by the studies: a single mean and SD. We found that 99% of the variability is attributable to statistical heterogeneity (the I² estimate) and that the standard deviation of the true effect is around 9 days (τ = √81.5 ≈ 9.0). The mean duration of the detectable period is 15.6 days (95% CI 12.3, 18.9).

We believe that the level of heterogeneity found is likely a consequence of the marked difference in the types of studies included in the systematic review, which ranged from case series to non-randomized clinical trials. Further, Walsh et al. [3] collected data from March to May 2020, and the variability of the studies is a consequence of the scarce information about COVID at the time. From the prediction interval, we found that a new study would be expected to find a mean duration of the detectable period between -3 and 34 days, with 95% confidence. The impossibility of this result (i.e., a negative duration) is a consequence of the high level of study heterogeneity found in our analysis.

Given the level of heterogeneity, a quantitative approach is not a feasible option to summarize the results across studies. Even though these results did not allow us to expand our clinical knowledge of the shedding patterns of COVID, they provided an interesting exercise that helped us reflect on the underlying assumptions and methods of meta-analysis. The code for this analysis is available on GitHub at https://github.com/emsaldarriaga/COVID19_DurationDetection. It includes all steps presented in this entry, plus the data-gathering process using web scraping and stratified analyses by type of publication, population, and country.


  1. Bedford J, Enria D, Giesecke J, et al. COVID-19: towards controlling of a pandemic. The Lancet. 2020;395(10229):1015-1018. doi:10.1016/S0140-6736(20)30673-5
  2. Cohen K, Leshem A. Suppressing the impact of the COVID-19 pandemic using controlled testing and isolation. Sci Rep. 2021;11(1):6279. doi:10.1038/s41598-021-85458-1
  3. Walsh KA, Jordan K, Clyne B, et al. SARS-CoV-2 detection, viral load and infectivity over the course of an infection. J Infect. 2020;81(3):357-371. doi:10.1016/j.jinf.2020.06.067
  4. Wan X, Wang W, Liu J, Tong T. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol. 2014;14(1):135. doi:10.1186/1471-2288-14-135
  5. Bland M. Estimating Mean and Standard Deviation from the Sample Size, Three Quartiles, Minimum, and Maximum. Int J Stat Med Res. 2015;4(1):57-64. doi:10.6000/1929-6029.2015.04.01.6
  6. Hozo SP, Djulbegovic B, Hozo I. Estimating the mean and variance from the median, range, and the size of a sample. BMC Med Res Methodol. 2005;5(1):13. doi:10.1186/1471-2288-5-13
  7. Serghiou S, Goodman SN. Random-Effects Meta-analysis: Summarizing Evidence With Caveats. JAMA. 2019;321(3):301-302. doi:10.1001/jama.2018.19684
  8. Tufanaru C, Munn Z, Stephenson M, Aromataris E. Fixed or random effects meta-analysis? Common methodological issues in systematic reviews of effectiveness. JBI Evid Implement. 2015;13(3):196-207. doi:10.1097/XEB.0000000000000065
  9. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res Synth Methods. 2010;1(2):97-111. doi:10.1002/jrsm.12
  10. Hackenberger BK. Bayesian meta-analysis now – let’s do it. Croat Med J. 2020;61(6):564-568. doi:10.3325/cmj.2020.61.564
  11. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557-560. doi:10.1136/bmj.327.7414.557
  12. Viechtbauer W. Bias and Efficiency of Meta-Analytic Variance Estimators in the Random-Effects Model. J Educ Behav Stat. 2005;30(3):261-293. doi:10.3102/10769986030003261
  13. Ioannidis JPA. Interpretation of tests of heterogeneity and bias in meta-analysis. J Eval Clin Pract. 2008;14(5):951-957. doi:10.1111/j.1365-2753.2008.00986.x
  14. Imrey PB. Limitations of Meta-analyses of Studies With High Heterogeneity. JAMA Netw Open. 2020;3(1):e1919325. doi:10.1001/jamanetworkopen.2019.19325
  15. Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A Stat Soc. 2009;172(1):137-159. doi:10.1111/j.1467-985X.2008.00552.x
  16. Balduzzi S, Rücker G, Schwarzer G. How to perform a meta-analysis with R: a practical tutorial. Evid Based Ment Health. 2019;22(4):153-160. doi:10.1136/ebmental-2019-300117

The Utility Function, Indifference Curves, and Healthcare

By Brennan T. Beal

Impetus For The Post

When I first learned about utility functions and their associated indifference curves, I was shown an intimidating figure that looked a bit like the image below. If you were lucky, you were shown a computer generated image. The less fortunate had a professor furiously scribbling them onto a board.


A few things were immediately of concern: why are there multiple indifference curves for one function if it only represents one consumer? Why are the curves moving? And… who is Natasha? So, while answering my own questions, I thought sharing the knowledge would be helpful. This post will hopefully provide a better description than maybe most of us have heard and by the end you will understand:

  1. What indifference curves are and what they represent
  2. How a budget constraint relates to these indifference curves and the overall utility function
  3. How to optimize utility within these constraints (if you’re brave)

For the scope of this post, I’ll assume you have some fundamental understanding of utility theory.

Click here to link to my original post to continue reading.

Economic Evaluation Methods Part I: Interpreting Cost-Effectiveness Acceptability Curves and Estimating Costs

By Erik Landaas, Elizabeth Brouwer, and Lotte Steuten

One of the main training activities at the CHOICE Institute at the University of Washington is to teach graduate students how to perform economic evaluations of medical technologies. In this blog post series, we give a brief overview of important economic evaluation concepts. Each concept is independent of the others and is meant to stand alone. The first of this two-part series describes how to interpret a cost-effectiveness acceptability curve (CEAC) and then delves into ways of costing a health intervention. The second part of the series will describe two additional concepts: how to develop and interpret cost-effectiveness frontiers and how multi-criteria decision analysis (MCDA) can be used in Health Technology Assessment (HTA).


Cost-Effectiveness Acceptability Curve (CEAC)

The CEAC is a way to graphically present decision uncertainty around the expected incremental cost-effectiveness of healthcare technologies. A CEAC is created using the results of a probabilistic analysis (PA).[1] PA involves simultaneously drawing a set of input parameter values by randomly sampling from each parameter distribution, and then storing the model results. This is repeated many times (typically 1,000 to 10,000), resulting in a distribution of outputs that can be graphed on the cost-effectiveness plane. The CEAC reflects the proportion of results that are considered ‘favorable’ (i.e. cost-effective) in relation to a given cost-effectiveness threshold.

The primary goal of a CEAC graph is to inform coverage decisions among payers that are considering a new technology compared with one or more established technologies, which may include the standard of care. The CEAC enables a payer to determine, over a range of willingness to pay (WTP) thresholds, the probability that a medical technology is considered cost-effective in comparison to its appropriate comparator (e.g. usual care), given the information available at the time of the analysis. A WTP threshold is generally expressed in terms of societal willingness to pay for an additional life year or quality-adjusted life year (QALY) gained. In the US, WTP thresholds typically range between $50,000 and $150,000 per QALY.

The X-axis of a CEAC represents the range of WTP thresholds. The Y-axis represents the probability of each comparator being cost-effective at a given WTP threshold, and ranges between 0% and 100%. Thus, it simply reflects the proportion of simulated ICERs from the PA that fall below the corresponding thresholds on the X-axis.
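The construction of a CEAC can be sketched as follows for the simple case of one new technology versus one comparator. The Python function below (hypothetical name, made-up simulation draws) counts, at each WTP threshold, the proportion of PA iterations in which the incremental net monetary benefit, WTP × ΔQALY − ΔCost, is positive. Net monetary benefit is used here instead of comparing ICERs directly because it avoids the sign ambiguity of ratios; with positive incremental costs and QALYs, as in this example, the two give the same answer.

```python
def ceac(delta_costs, delta_qalys, thresholds):
    """Probability of cost-effectiveness at each willingness-to-pay value."""
    n_draws = len(delta_costs)
    curve = []
    for wtp in thresholds:
        favourable = sum(
            1
            for d_cost, d_qaly in zip(delta_costs, delta_qalys)
            if wtp * d_qaly - d_cost > 0  # positive net monetary benefit
        )
        curve.append(favourable / n_draws)
    return curve


# Four made-up PA iterations (a real analysis would use thousands).
incremental_costs = [20000, 30000, 10000, 40000]
incremental_qalys = [0.5, 0.4, 0.3, 0.2]
wtp_values = [0, 50000, 100000, 150000]

ceac_points = ceac(incremental_costs, incremental_qalys, wtp_values)
```

Plotting `ceac_points` against `wtp_values` gives the curve; with several comparators, one would instead record at each threshold which option has the highest net monetary benefit in each iteration.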

Figure 1. The Cost-Effectiveness Acceptability Curve


Coyle, Doug, et al. “Cost-effectiveness of new oral anticoagulants compared with warfarin in preventing stroke and other cardiovascular events in patients with atrial fibrillation.” Value in health 16.4 (2013): 498-506.

Figure 1 shows CEACs for five different drugs, making it easy for the reader to see that at the lower end of the WTP threshold range (i.e. $0 – $20,000 per QALY), warfarin has the highest probability to be cost-effective (or in this case “optimal”). At WTP values >$20,000 per QALY, dabigatran has the highest probability to be cost-effective. All the other drugs have a lower probability of being cost-effective compared to warfarin and dabigatran at every WTP threshold. The cost-effectiveness acceptability frontier in Figure 1 follows along the top of all the curves and shows directly which of the five technologies has the highest probability of being cost-effective at various levels of the WTP thresholds.

To the extent that the unit price of the technology influences the decision uncertainty, a CEAC can offer insights to payers as well as manufacturers as they consider a value-based price. For example, a lower unit price for the drug may lower the ICER and, all else equal, this increases the probability that the new technology is considered cost-effective at a given WTP threshold. Note that when new technologies are priced such that the ICER falls just below the WTP for a QALY (e.g. an ICER of $99,999 when the WTP is $100,000), the decision uncertainty tends to be substantial, often around 50%. If decision uncertainty is perceived to be ‘unacceptably high’, it can be recommended to collect further information to reduce it. Depending on the drivers of decision uncertainty, for example in the case of stochastic uncertainty in the efficacy parameters, performance-based risk agreements (PBRAs) or managed entry schemes may be appropriate tools to manage the risk.

Cost estimates

The numerator of most economic evaluations for health is the cost of a technology or intervention. There are several ways to arrive at that cost, and choice of method depends on the context of the intervention and the available data.

Two broadly categorized methods for costing are the bottom-up method and the top-down method. These methods, described below, are not mutually exclusive and may complement each other, although they often do not produce the same results.


Source of Table: Mogyorosy Z, Smith P. The main methodological issues in costing health care services: a literature review. 2005.

The bottom-up method is also known as the ingredients approach or micro-costing. In this method, the analyst identifies all the items necessary to complete an intervention, such as medical supplies and clinician time, and adds them up to estimate the total cost. The main categories to consider when calculating costs via the bottom-up method are medical costs and non-medical costs. Medical costs can be direct, such as the supplies used to perform a surgery, or indirect, such as the food and bed used for inpatient care. Non-medical costs often include costs to the patient, such as transportation to the clinic or caregiver costs. The categories used when estimating the total cost of an intervention will depend on the perspective the analyst takes (perspectives include patient, health system, or societal).

The bottom-up approach can be completed prospectively or retrospectively, and can be helpful for planning and budgeting. Because the method identifies and values each input, it allows for a clear breakdown as to where dollars are being spent. To be accurate, however, one must be able to identify all the necessary inputs for an intervention and know how to value capital inputs like MRI machines or hospital buildings. The calculations may also become unwieldy on a very large scale. The bottom-up approach is often used in global health research, where medical programs or governmental agencies supply specific items to implement an intervention, or in simple interventions where there are only a few necessary ingredients.

The top-down estimation approach takes the total cost of a project and divides it by the number of service units generated. In some cases, this is done by simply looking at the budget for a program or an intervention and then dividing that total by the number of patients. The top-down approach is useful because it is a simple, intuitive measurement that captures the actual amount of money spent on a project and the number of units produced, particularly for large projects or organizations. Compared to the bottom-up approach, the top-down approach can be much faster and cheaper. The top-down approach can only be used retrospectively, however, and may not allow for a breakdown of how the money was spent or identify variations between patients.
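The arithmetic behind the two approaches is simple enough to show side by side. In the Python sketch below, every item, quantity, and dollar amount is invented for illustration: the bottom-up estimate sums quantity × unit cost over the ingredients of one patient visit, and the top-down estimate divides a program budget by the number of patients served.

```python
def bottom_up_cost(ingredients):
    """Sum of quantity x unit cost over all identified inputs."""
    return sum(quantity * unit_cost for quantity, unit_cost in ingredients.values())


def top_down_cost(total_spending, units_of_service):
    """Average cost per unit of service from aggregate spending."""
    return total_spending / units_of_service


# Hypothetical ingredients for one clinic visit: (quantity, unit cost in $).
visit_ingredients = {
    "clinician_time_hours": (0.5, 80.0),
    "test_kit": (1, 12.5),
    "patient_transport_trips": (2, 3.0),
}

cost_per_visit_bottom_up = bottom_up_cost(visit_ingredients)
cost_per_visit_top_down = top_down_cost(120000.0, 2000)
```

The two figures differ slightly in this example, which mirrors the point above that the methods can complement each other without producing identical results.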

While the final choice will depend on several factors, it makes sense to try and think through (or model) which of the cost inputs are likely to be most impactful on the model results. For example, the costs of lab tests may most accurately be estimated by a bottom-up costing approach. However, if these lab costs are likely to be a fraction of the cost of treatment, say a million dollar cure for cancer, then going through the motions of a bottom-up approach may not be the most efficient way to get your PhD-project done in time. In other cases, however, a bottom-up approach may provide crucial insights that move the needle on the estimated cost-effectiveness of medical technologies, particularly in settings where a lack of existing datasets is limiting the potential of cost-effectiveness studies to inform decisions on the allocation of scarce healthcare resources.

[1]Fenwick, Elisabeth, Bernie J. O’Brien, and Andrew Briggs. “Cost‐effectiveness acceptability curves–facts, fallacies and frequently asked questions.” Health economics 13.5 (2004): 405-415.

Commonly Misunderstood Concepts in Pharmacoepidemiology

By Erik J. Landaas, MPH, PhD Student and Naomi Schwartz, MPH, PhD Student


Epidemiologic methods are central to the academic and research endeavors at the CHOICE Institute. The field of epidemiology fosters the critical thinking required for high-quality medical research. Pharmacoepidemiology is a sub-field of epidemiology and has been around since the 1970s. One of the driving forces behind the establishment of pharmacoepidemiology was the thalidomide disaster. In response to this tragedy, laws were enacted that gave the FDA authority to evaluate the efficacy of drugs. In addition, drug manufacturers were required to conduct clinical trials to provide evidence of a drug’s efficacy. This spawned a new and important body of work surrounding drug safety, efficacy, and post-marketing surveillance.[i]

In this article, we break down three of the more complex and often misunderstood concepts in pharmacoepidemiology: immortal time bias, protopathic bias, and drug exposure definition and measurement.


Immortal Time Bias

In pharmacoepidemiology studies, immortal time bias typically arises when the determination of an individual’s treatment status involves a delay or waiting period during which follow-up time is accrued. Immortal time is a period of follow-up during which, by design, the outcome of interest cannot occur. For example, the finding that Oscar winners live longer than non-winners is a result of immortal time bias. In order for an individual to win an Oscar, he/she must live long enough to receive the award. A pharmacoepidemiology example of this is depicted in Figure 1. A patient who receives a prescription may appear to survive longer because he/she must live long enough to receive the prescription, while a patient who does not receive a prescription has no survival requirement. The most common way to avoid immortal time bias is to use a time-varying exposure variable. This allows subjects to contribute both unexposed (during the waiting period) and exposed person-time.
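The effect of misclassifying the waiting period can be illustrated with a toy cohort. In the Python sketch below, each patient is a tuple of (day of first prescription or `None`, last day of follow-up, died). The naive analysis counts all follow-up of ever-treated patients as exposed, while the time-varying analysis assigns the days before the prescription to the unexposed group. The data and function name are hypothetical, and deaths among treated patients are assumed to occur after the prescription.

```python
def mortality_rate_ratio(patients, time_varying):
    """Exposed vs unexposed death-rate ratio under either exposure definition."""
    person_time = {"exposed": 0.0, "unexposed": 0.0}
    deaths = {"exposed": 0, "unexposed": 0}
    for rx_day, end_day, died in patients:
        if rx_day is None:
            person_time["unexposed"] += end_day
            deaths["unexposed"] += int(died)
        else:
            # The waiting period is immortal: the patient had to survive it
            # in order to be classified as treated.
            waiting = rx_day if time_varying else 0
            person_time["unexposed"] += waiting
            person_time["exposed"] += end_day - waiting
            deaths["exposed"] += int(died)
    exposed_rate = deaths["exposed"] / person_time["exposed"]
    unexposed_rate = deaths["unexposed"] / person_time["unexposed"]
    return exposed_rate / unexposed_rate


toy_cohort = [
    (100, 400, True),    # prescription on day 100, death on day 400
    (200, 500, False),   # prescription on day 200, alive at day 500
    (None, 50, True),    # never treated, death on day 50
    (None, 300, True),
    (None, 400, False),
]

naive_ratio = mortality_rate_ratio(toy_cohort, time_varying=False)
corrected_ratio = mortality_rate_ratio(toy_cohort, time_varying=True)
```

In this made-up cohort the naive rate ratio is about 0.42, suggesting a strong protective effect, while the time-varying definition gives 0.875; part of the apparent benefit was simply the immortal waiting time.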


Figure 1. Immortal Time Bias


Lévesque, Linda E., et al. “Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes.” Bmj 340 (2010): b5087.

Protopathic Bias or Reverse Causation

Protopathic bias occurs when a drug of interest is initiated to treat symptoms of the disease under study before it is diagnosed. For example, early symptoms of inflammatory bowel disease (IBD) are often consistent with the indications for prescribing proton pump inhibitors (PPIs). Thus, many individuals who develop IBD have a history of PPI use. A study to investigate the association between PPIs and subsequent IBD would likely conclude that taking PPIs causes IBD when, in fact, the IBD was present (but undiagnosed) before the PPIs were prescribed.  This scenario is illustrated by the following steps:

  • Patient has early symptoms of an underlying disease (e.g. acid reflux)
  • Patient goes to his/her doctor and gets a drug to address symptoms (e.g. PPI)
  • Patient is diagnosed with IBD (months or even years later)

It is easy to conclude from the above scenario that PPIs cause IBD; however, the acid reflux was actually a manifestation of underlying IBD that had not yet been diagnosed. Protopathic bias occurs in this case because of the lag time between first symptoms and diagnosis. One effective way to address protopathic bias is to exclude exposures that occur during the prodromal period of the disease of interest.
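Excluding prodromal exposures is often implemented as a lag window before the diagnosis date. The sketch below is one simple way to do it in Python. The function name and the 365-day lag are hypothetical choices for illustration; the appropriate lag depends on the disease under study.

```python
from datetime import date, timedelta


def exposed_before_lag(rx_dates, index_date, lag_days):
    """Return True if any prescription predates the lag window.

    Prescriptions dated within `lag_days` before `index_date` (the diagnosis
    date) are ignored, since they may have treated early symptoms of the
    still-undiagnosed disease. Prescriptions on or after the index date are
    ignored as well.
    """
    cutoff = index_date - timedelta(days=lag_days)
    return any(rx < cutoff for rx in rx_dates)


diagnosis = date(2020, 6, 1)

# A PPI first prescribed a few months before diagnosis: plausibly prodromal.
recent_user = [date(2020, 1, 15)]
# A PPI first prescribed more than two years before diagnosis.
long_term_user = [date(2018, 3, 1)]

recent_exposed = exposed_before_lag(recent_user, diagnosis, lag_days=365)
long_term_exposed = exposed_before_lag(long_term_user, diagnosis, lag_days=365)
```

With a one-year lag, the recent user is classified as unexposed and the long-term user as exposed. Repeating the analysis with several lag lengths is a common way to see how sensitive the result is to this assumption.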


Drug Exposure Definition and Measurement 

Defining and classifying exposure to a drug is critical to the validity of pharmacoepidemiology studies. Most pharmacoepidemiology studies use proxies for drug exposure, because exposure is often impractical or impossible to measure directly (e.g., by observing a patient take a drug or monitoring blood levels). In lieu of actual exposure data, exposure ascertainment is typically based on medication dispensing records. These records can be obtained from electronic health records, pharmacies, pharmacy benefit managers (PBMs), and other available healthcare data repositories. Some of the most comprehensive drug exposure data are available in Northern European countries and in large integrated health systems such as Kaiser Permanente in the United States. Some strengths of using dispensing records to gather exposure data are:

  • Easy to ascertain and relatively inexpensive
  • No primary data collection
  • Often available for large sample sizes
  • Can be population based
  • No recall or interviewer bias
  • Linkable to other types of data such as diagnostic codes and labs

Limitations of dispensing records as a data source include:

  • Completeness can be an issue
  • Usually does not capture over-the-counter (OTC) drugs
  • Dispensing does not guarantee ingestion
  • Often lacks indication for use
  • Must make some assumptions to calculate dose and duration of use
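As an illustration of the assumptions involved in estimating duration of use, the sketch below turns dispensing records into continuous exposure episodes in Python. The record layout (fill date, days supplied), the function name, and the 30-day grace period are assumptions made for this example and do not describe any particular database.

```python
from datetime import date, timedelta


def build_episodes(dispensings, grace_days=30):
    """Merge dispensings into continuous exposure episodes.

    `dispensings` is a list of (fill_date, days_supply) tuples. A refill that
    starts no later than `grace_days` after the previous supply runs out
    extends the current episode; otherwise a new episode begins. Each episode
    is (start, end), where `end` is the first day without supply.
    Simplification: leftover pills from early refills are not carried forward.
    """
    episodes = []
    for fill_date, days_supply in sorted(dispensings):
        supply_end = fill_date + timedelta(days=days_supply)
        if episodes and fill_date <= episodes[-1][1] + timedelta(days=grace_days):
            start, end = episodes[-1]
            episodes[-1] = (start, max(end, supply_end))
        else:
            episodes.append((fill_date, supply_end))
    return episodes


fills = [
    (date(2021, 1, 1), 30),   # supply runs out on January 31
    (date(2021, 2, 5), 30),   # refilled 5 days late, within the grace period
    (date(2021, 6, 1), 30),   # refilled months later, so a new episode starts
]

episodes = build_episodes(fills, grace_days=30)
strict_episodes = build_episodes(fills, grace_days=0)
```

With a 30-day grace period these three fills form two episodes (January 1 to March 7, and June 1 to July 1). With no grace period they form three. The choice of grace period therefore changes who counts as continuously exposed, which is one reason it is usually varied in sensitivity analyses.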

Some studies collect drug exposure data using self-report methods (e.g., interviews or surveys). These methods are useful when the drug of interest is OTC and thus not captured by dispensing records. However, self-reported data are subject to recall bias and require additional considerations when interpreting results. Alternatively, some large epidemiologic studies ask patients to bring all of their medications to their study interviews (e.g., in a “brown bag” of medications). This can provide a more reliable method of collecting medication information than self-report.

It is also important to consider the risk of misclassification of exposure. When interpreting results, remember that differential misclassification (different for those with and without disease) can result in either an inflated measure of association or a measure of association that is closer to the null. In contrast, non-differential misclassification of a binary exposure (unrelated to the occurrence or presence of disease) generally shifts the measure of association closer to the null. For further guidance on defining drug exposure, please look at Figure 2.
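A small worked example can show the direction of these effects. The Python sketch below uses a made-up case-control table and hypothetical sensitivity and specificity values for the exposure measurement. It computes the expected observed odds ratio when the measurement error is the same in cases and controls (non-differential) and when it differs between them (differential).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    return (a * d) / (b * c)


def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed (exposed, unexposed) counts under imperfect measurement."""
    obs_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    obs_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return obs_exposed, obs_unexposed


# Made-up true exposure status: 200 of 1,000 cases and 100 of 1,000 controls exposed.
cases, controls = (200, 800), (100, 900)
true_or = odds_ratio(*cases, *controls)

# Non-differential: the same error rates apply to cases and controls.
nondiff_or = odds_ratio(*misclassify(*cases, 0.8, 0.9),
                        *misclassify(*controls, 0.8, 0.9))

# Differential: cases report exposure more completely than controls.
diff_up_or = odds_ratio(*misclassify(*cases, 0.95, 1.0),
                        *misclassify(*controls, 0.60, 1.0))

# Differential in the other direction: controls report more completely.
diff_down_or = odds_ratio(*misclassify(*cases, 0.60, 1.0),
                          *misclassify(*controls, 0.95, 1.0))
```

Here the true odds ratio is 2.25. The non-differential scenario attenuates it to about 1.5, while the two differential scenarios move it to about 3.7 and about 1.3, respectively. These numbers depend entirely on the assumed error rates and are meant only to show the direction of the bias.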


Figure 2. Checklist: Key considerations for defining drug exposure

Velentgas, Priscilla, et al., eds. Developing a protocol for observational comparative effectiveness research: a user’s guide. Government Printing Office, 2013.

As alluded to above, pharmacoepidemiology is a field with complex research methods. We hope this article clarifies these three challenging concepts.



[i] Balcik, Pinar, and Gulcan Kahraman. “Pharmacoepidemiology.” IOSR Journal of Pharmacy 6, no. 2 (February 2016): 57-62. (e)-ISSN: 2250-3013, (p)-ISSN: 2319-4219.

Is there still value in the p-value?

Doing science is expensive, so a study that reveals significant results yet cannot be replicated by other investigators represents a lost opportunity to invest those resources elsewhere. At the same time, the pressure on researchers to publish is immense.

These are the tensions that underlie the current debate about how to resolve issues surrounding the use of the p-value and the infamous significance threshold of 0.05. The p-value was adopted in the early 20th century to indicate how likely it would be to obtain results at least as extreme as those observed if only chance variation were at work. The 0.05 threshold has accompanied it since the beginning, allowing researchers to declare as significant any effect whose p-value falls below that threshold.

This threshold was selected for convenience at a time when the p-value was difficult to compute. Modern tools have made the calculation so easy, however, that it is hard to defend a 0.05 threshold as anything but arbitrary. A group of statisticians and researchers is trying to rehabilitate the p-value, at least for the time being, so that we can improve the reliability of results with minimal disruption to the scientific production system. They hope to do this by changing the threshold for statistical significance to 0.005.

In a new editorial in JAMA, Stanford researcher John Ioannidis, a famous critic of bias and irreproducibility in research, has come out in favor of this approach. His argument is pragmatic. In it, he acknowledges that misunderstandings of the p-value are common: many people believe that a result is worth acting on if it is supported by a significant p-value, without regard for the size of the effect or the uncertainty surrounding it.

Rather than reeducating everyone who ever needs to interpret scientific research, then, it is preferable to change our treatment of the threshold signaling statistical significance. Ioannidis also points to the success of genome-wide association studies, which improved in reproducibility after moving to a statistical significance threshold of p < 5 × 10⁻⁸.

As Ioannidis admits, this is an imperfect solution. The proposal has set off substantial debate within the American Statistical Association. Bayesians, for example, see it as perpetuating the same flawed practices that got us into the reproducibility crisis in the first place. In an unpublished but widely circulated article from 2017 entitled Abandon Statistical Significance [pdf warning], Blakely McShane, Andrew Gelman, and others point to several problems with lowering the significance threshold that make it unsuitable for medical research.

First, they point out that the whole idea of the null hypothesis is poorly suited to medical research. Virtually anything ingested by or done to the body has downstream effects on other processes, almost certainly including the ones that any given trial hopes to measure. Therefore, using the null hypothesis as a straw man takes the focus away from what a meaningful effect size might be and how certain we are about the effect size we calculate for a given treatment.

They also argue that the reporting of a single p-value hides important decisions made in the analytic process itself, including all the different ways that the data could have been analyzed. They propose reporting all analyses attempted in order to capture the “researcher degrees of freedom”: the choices made by the analyst that affect how the results are calculated and interpreted.

Beyond these methodological issues, lowering the significance threshold could increase the costs of clinical trials. If our allowance for Type I error is reduced by an order of magnitude, from 0.05 to 0.005, the required sample size increases by roughly 70% (at 80% power), holding all other parameters equal. In a regulatory environment where it costs over a billion dollars to bring a drug to market, this need for increased recruitment could drive up costs (which would need to be passed on to the consumer) and delay the health benefits of market release for good drugs. It is unclear whether these potential cost increases will be offset by the savings of researchers producing more reliable, reproducible studies earlier in the development process.
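The sample size cost of a stricter threshold can be checked with the usual normal approximation, in which the required sample size for a two-sided test is proportional to (z_{1-α/2} + z_{power})². The Python sketch below is a rough illustration rather than a trial-planning tool. The function names are ours, and it assumes a simple comparison with the effect size and variance held fixed.

```python
from statistics import NormalDist


def sample_size_factor(alpha, power=0.80):
    """Return (z_{1-alpha/2} + z_{power})^2 for a two-sided test.

    Under the normal approximation, required sample size is proportional to
    this quantity when effect size and variance are held fixed.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (z(1 - alpha / 2) + z(power)) ** 2


def inflation(old_alpha=0.05, new_alpha=0.005, power=0.80):
    """Ratio of required sample sizes when moving from old_alpha to new_alpha."""
    return sample_size_factor(new_alpha, power) / sample_size_factor(old_alpha, power)


at_80_power = inflation(power=0.80)
at_90_power = inflation(power=0.90)
```

Under these assumptions the ratio is about 1.70 at 80% power and about 1.59 at 90% power, so the exact multiplier depends on the power the trial is designed for.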

It also remains to be seen whether the larger sample sizes required by a lower significance threshold might dissuade pharmaceutical companies from bringing products with a low marginal benefit to market. After all, you need a larger sample size to detect smaller effects, and that requirement would only be amplified under the new significance threshold. Overall, the newly proposed significance threshold interacts with value considerations in ways that are hard to predict but potentially worth watching.