Some challenges of working with claims databases

By Nathaniel Hendrix

Real-world evidence has become increasingly important as a data source for comparative effectiveness research, drug safety research, and adherence studies, among other types of research. In addition to sources such as electronic medical records, mobile data, and disease registries, much of the real-world evidence we use comes from large claims databases like Truven Health MarketScan or IQVIA, which record patients’ insurance claims for services and drugs. The enormous size of these databases means that researchers can detect subtle safety signals or study rare conditions where they may not have been able to previously.

Using these databases is not without its challenges, though. In this article, I’ll be discussing a few challenges that I’ve encountered as I’ve worked with faculty on a claims database project in the past year. It’s important for researchers to be aware of these limitations, as they necessarily inform our understanding of how claims-based studies should be designed and interpreted.

Challenge #1: Treatment selection bias

Treatment selection bias occurs when patients are assigned to treatment based on some characteristic that also affects the outcome of interest. If patients with more severe disease are assigned to Drug A rather than Drug B, patients using Drug A may have worse outcomes, and we might conclude that Drug B is more effective even if it is not. Alternatively, if patients with a certain comorbidity are preferentially prescribed a different drug than patients without the comorbidity – an example of channeling bias – we may conclude that the drug causes the comorbidity.

These conclusions would be too hasty, though. What we’d like to do is to simulate a randomized trial, where patients are assigned to treatment without regard for their personal characteristics. Methods such as propensity scores give us this option, but these methods are often unavailable to researchers working with claims data. This is because they can only balance characteristics that are measured, and many disease characteristics are not recorded in claims data.

An example might clarify this: imagine that you’re trying to assess the effect of HAART (highly active anti-retroviral therapy) on mortality in HIV patients. Disease characteristics such as CD4 count would be associated with both use of HAART and mortality, but are not recorded in claims data. We could adjust our analysis for other factors such as age and time since diagnosis, but our result would be biased. It’s important, therefore, to understand whether any covariates affect both treatment assignment and the outcome of interest, and to consider other data sources (such as disease registries) if they do.
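To make the direction of this bias concrete, here is a minimal sketch in Python. The counts are invented for illustration and are not drawn from any real database; the `Stratum` class and both function names are my own assumptions. It compares the risk difference you would compute if CD4 count were unrecorded (the crude estimate) with the one you would compute if you could stratify on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    """Counts for one level of a confounder that claims data would not record."""
    treated_n: int
    treated_deaths: int
    untreated_n: int
    untreated_deaths: int

    def risk_difference(self) -> float:
        """Mortality risk in treated patients minus risk in untreated patients."""
        return (self.treated_deaths / self.treated_n
                - self.untreated_deaths / self.untreated_n)


# Invented counts: patients with low CD4 counts are both more likely to
# receive treatment and more likely to die.
STRATA = {
    "low CD4": Stratum(treated_n=800, treated_deaths=160,
                       untreated_n=200, untreated_deaths=60),
    "high CD4": Stratum(treated_n=200, treated_deaths=10,
                        untreated_n=800, untreated_deaths=80),
}


def crude_risk_difference(strata) -> float:
    """Risk difference when the strata are pooled, as if CD4 were unrecorded."""
    strata = list(strata)
    pooled = Stratum(
        treated_n=sum(s.treated_n for s in strata),
        treated_deaths=sum(s.treated_deaths for s in strata),
        untreated_n=sum(s.untreated_n for s in strata),
        untreated_deaths=sum(s.untreated_deaths for s in strata),
    )
    return pooled.risk_difference()


def standardized_risk_difference(strata) -> float:
    """Stratum-specific risk differences averaged with stratum-size weights."""
    strata = list(strata)
    sizes = [s.treated_n + s.untreated_n for s in strata]
    weighted = sum(s.risk_difference() * n for s, n in zip(strata, sizes))
    return weighted / sum(sizes)


if __name__ == "__main__":
    for name, stratum in STRATA.items():
        print(f"{name}: {stratum.risk_difference():+.3f}")
    print(f"crude: {crude_risk_difference(STRATA.values()):+.3f}")
    print(f"standardized: {standardized_risk_difference(STRATA.values()):+.3f}")
```

With these made-up numbers, treatment lowers mortality within each CD4 stratum, yet the pooled estimate suggests it raises mortality. Adjusting for age or time since diagnosis would not repair this unless those variables happened to capture CD4 count.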

Challenge #2: Claims data don’t include how the prescription was written

The nature of pharmacy claims data is to record when patients pick up their medications. This creates excellent opportunities for studying resource use and adherence, but these data, unfortunately, lack information about when and how the prescription for these medications was written.

One effect of this is that we don’t know how much time passes between a drug’s being prescribed and when it’s first used. Clearly, if several months pass between the initial prescription and a patient finally picking up that drug from the pharmacy, that would be time spent in non-adherence. We’re not able to capture that time, though. In the case of primary non-adherence, where a prescription is written for a drug that is never picked up at all, this behavior cannot be detected, potentially interfering with our ability to understand the causes of adverse outcomes and to assess the need for interventions that can improve adherence.
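The sketch below illustrates how this gap can flatter an adherence estimate. It uses proportion of days covered, a common adherence measure that I am assuming here for illustration; the function name, its parameters, and the dates are all hypothetical. The prescription date in the example is exactly the piece of information that claims data would not give us.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, period_end, observation_start=None):
    """Share of days in the observation window with drug on hand.

    fills: list of (fill_date, days_supply) tuples from pharmacy claims.
    observation_start: defaults to the first fill date, which is the
    earliest point that claims data alone can identify.
    """
    if not fills:
        raise ValueError("at least one fill is required")
    start = observation_start or min(fill_date for fill_date, _ in fills)
    total_days = (period_end - start).days + 1
    if total_days <= 0:
        raise ValueError("period_end must not precede the observation start")

    covered = set()  # a set, so overlapping fills are not double-counted
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= period_end:
                covered.add(day)
    return len(covered) / total_days


if __name__ == "__main__":
    # Hypothetical patient: prescribed on 1 January, first fill on 2 March.
    fills = [(date(2019, 3, 2), 30), (date(2019, 4, 1), 30)]
    end = date(2019, 4, 30)
    print(proportion_of_days_covered(fills, end))
    print(proportion_of_days_covered(fills, end,
                                     observation_start=date(2019, 1, 1)))
```

Measured from the first fill, this patient looks fully adherent. Measured from the hypothetical prescription date, only half of the days are covered, and a patient who never filled the prescription would not appear in the pharmacy claims at all.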

Challenge #3: Errors in days’ supply

Days’ supply is essential for calculating adherence and resource use, but errors sometimes appear that can be difficult to work with. Sometimes these are clear entry errors, such as when a technician enters 310 days instead of 30 days. The payer usually rejects claims made with unusual days’ supply, but some such claims remain in the database.

Another issue is that certain unusual values for days’ supply can be impossible to interpret. For example, if a drug is usually dispensed with an 84-day supply (i.e., 12 weeks) and a claim appears that has a 48-day supply, it’s impossible to know whether the prescriber had escalated the dose or the pharmacy staff had accidentally entered the days’ supply incorrectly. This is one of several reasons why it’s important to carefully consider imposing restrictions on the days’ supply for claims if this parameter is relevant to your research.

Errors such as these can significantly impact analyses that work with days’ supply of prescriptions, so it’s essential to be proactive about looking for cases where the days’ supply is not realistic or interpretable. Consider setting a realistic range to truncate days’ supply before you undertake your analysis.
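One way to do this is sketched below in Python. The field name `days_supply`, the function name, and the 1 to 90 day range are assumptions for illustration; the right bounds depend on the drug and on how it is normally dispensed. Rather than silently dropping claims, this version sets aside out-of-range claims so they can be reviewed.

```python
def split_by_days_supply(claims, min_days=1, max_days=90):
    """Separate claims with a plausible days' supply from those needing review.

    claims: iterable of dicts with a "days_supply" key (hypothetical layout).
    Returns (plausible, flagged) as two lists, preserving the input order.
    """
    plausible, flagged = [], []
    for claim in claims:
        if min_days <= claim["days_supply"] <= max_days:
            plausible.append(claim)
        else:
            flagged.append(claim)
    return plausible, flagged


if __name__ == "__main__":
    claims = [
        {"claim_id": 1, "days_supply": 30},
        {"claim_id": 2, "days_supply": 310},  # likely entry error
        {"claim_id": 3, "days_supply": 84},
        {"claim_id": 4, "days_supply": 48},   # in range, but ambiguous
        {"claim_id": 5, "days_supply": 0},
    ]
    plausible, flagged = split_by_days_supply(claims)
    print([c["claim_id"] for c in plausible])
    print([c["claim_id"] for c in flagged])
```

Note that a range check only catches implausible values. The 48-day claim passes the filter even though it may be a transposed 84, so ambiguous values within the range still call for a judgment that suits your research question.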

Challenge #4: Generalizing results from claims studies can be difficult

Claims databases are usually grouped by insurance type. For example, a commercial claims database contains only encounters by commercially-insured patients and their dependents, excluding patients insured by Medicare and/or Medicaid. Databases that do cover Medicare may include only those patients with supplementary insurance. Separating these populations into different databases can make it difficult, and sometimes unaffordable, for researchers to produce generalizable results, and the need to merge databases introduces further complexity.

These populations are all quite different from each other: commercially-insured enrollees are generally healthier than Medicaid enrollees of the same age. And the “dual-eligibles” – enrollees in both Medicare and Medicaid – are different from individuals enrolled in just one of these programs. Since it’s costly and sometimes infeasible to capture all of these patients in a single analysis, you may need to hone your research question carefully so it can be answered by a single database instead of trying to access them all. Fortunately, sampling weights are now commonly provided; they help you generalize within your age and insurance grouping, even if they are somewhat cumbersome to work with.
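As a rough illustration of what applying sampling weights involves, here is a minimal Python sketch of a weighted proportion. The record layout and the weights are invented; the actual weight variables and their documentation depend on the database you are using. Proper variance estimation with weights is more involved and is not shown.

```python
def weighted_proportion(records):
    """Weighted share of enrollees with the outcome.

    records: iterable of (has_outcome, weight) pairs, where weight is the
    number of people in the target population the enrollee represents.
    """
    records = list(records)
    total_weight = sum(weight for _, weight in records)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    outcome_weight = sum(weight for has_outcome, weight in records
                         if has_outcome)
    return outcome_weight / total_weight


if __name__ == "__main__":
    # Invented sample: enrollees with the outcome carry smaller weights.
    sample = [(True, 10.0), (False, 30.0), (True, 20.0), (False, 40.0)]
    unweighted = sum(1 for has_outcome, _ in sample if has_outcome) / len(sample)
    print(f"unweighted: {unweighted:.2f}")
    print(f"weighted:   {weighted_proportion(sample):.2f}")
```

In this made-up sample, half of the enrollees have the outcome, but they represent less than a third of the weighted population, so the two estimates differ noticeably.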

In summary, claims databases have added immeasurable value to several fields of research by collecting information on the real-world behavior of clinicians and patients. Still, there are some significant challenges that need to be taken into account when considering using claims data. Finding a good scientific question that suits these data means understanding their limitations. These are a few of the most important ones, but anyone who works with these data long enough will be sure to discover challenges unique to their own research program.

Published by

Nathaniel Hendrix

Nathaniel Hendrix is a fourth year PhD candidate at the CHOICE Institute. His dissertation is on the application of machine learning to cancer screening.
