Refining adverse drug reaction signals by incorporating interaction variables identified using emergent pattern mining
Introduction
Negative side effects of medication, termed adverse drug reactions (ADRs), are a serious burden to healthcare [1], [2]. ADRs are estimated to cause 6.5% of UK hospitalisations [2], and a study investigating US deaths due to ADRs reported rates between 0.08 and 0.12 per 100,000 [3]. Studies have suggested that the rate of ADRs is increasing annually [4], motivating the improvement of methods for detecting them.
The process of detecting ADRs starts during clinical trials. However, clinical trials often lack sufficient power to detect all ADRs for numerous reasons, including time limitations, unrealistic conditions and the limited number of participants [5]. It then falls to post-marketing surveillance to identify the remaining undiscovered ADRs. This involves three stages: signal detection (identifying associations between drugs and outcomes), signal refinement (prioritising signals and filtering out spurious relationships) and signal evaluation (confirming causality using numerous sources of evidence). Much effort has gone into developing signal detection methods, involving various forms of data such as spontaneous reporting systems [6], online data [7], [8], chemical structures [9] and longitudinal observational data [10], [11]. Unfortunately, all of these data sources have their own limitations. Spontaneous reporting systems are historically the main source used for post-marketing analysis but often contain missing values, suffer from under- and over-reporting, and rely on people noticing ADRs [12]. Longitudinal observational data have recently been used to complement spontaneous reporting system data for extracting new drug safety information, and are an excellent potential source of information due to the quantity of observational data available and the number of variables recorded. If we could overcome the existing issues, mainly confounding, that limit the use of observational data for causal inference, then we may be able to aid the discovery of new ADRs.
We are often plagued with confounding when investigating potential causal relationships retrospectively in observational data [13] due to the data collection being non-random. When an association between an exposure and outcome is discovered in observational data, it may often be explained by the presence of confounding. A confounding variable is one that leads to distorted effect estimates between an exposure and outcome due to the confounder being associated with both the exposure and outcome. For a variable to be considered a confounder of an exposure and outcome relationship it must be a risk factor of the outcome, it must be associated with the exposure and it cannot lie within the causal pathway between the exposure and outcome.
Consider, for example, the situation where we wish to determine the relationship between a drug given to treat hypertension and myocardial infarction. If we naively look at the incidence of myocardial infarction within a year after treatment for patients given the drug and the incidence of myocardial infarction within a randomly chosen year for patients never given the drug, then we are likely to find that myocardial infarction is more common in those given the drug and conclude that the drug is associated with an increased incidence of myocardial infarction. However, our conclusion is likely explained by confounding, as patients given the drug (those with hypertension) are medically different from those who do not have hypertension. It is likely that some of the patients given the drug have a poor diet or are stressed. Poor diet and stress would have contributed to the hypertension but are also risk factors of myocardial infarction. Therefore poor diet and stress would be confounding factors. To correctly determine a relationship between an exposure and outcome it is important to account for confounding variables. Techniques such as risk adjustment, stratification, or equally distributing the confounding variables between the comparison groups are potential ways to reduce confounding [14].
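The hypertension example can be made concrete with a small numerical sketch. The counts below are invented purely for illustration (they are not taken from THIN or any study), and the Mantel-Haenszel estimator is used here only as one familiar form of stratified adjustment, not as the method of this paper. The drug has no effect within either stress stratum, yet the crude comparison suggests a more than doubled risk.

```python
# Hypothetical counts, invented for illustration only.
# Each stratum of the confounder (stress) holds:
# (MI events in exposed, n exposed, MI events in unexposed, n unexposed)
strata = {
    "stressed": (80, 800, 20, 200),
    "not stressed": (4, 200, 16, 800),
}


def crude_risk_ratio(strata):
    """Risk ratio that ignores the confounder by pooling all strata."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Stratum-adjusted risk ratio (Mantel-Haenszel weighting)."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
    return num / den


crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude RR    = {crude:.2f}")     # 2.33
print(f"adjusted RR = {adjusted:.2f}")  # 1.00
```

In this toy table the risk of MI is 10% for stressed patients and 2% for the rest, regardless of exposure; the crude ratio is inflated only because stressed patients are over-represented among the exposed.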
Adjusting for confounders in observational data requires identifying the confounders. Although existing methods aim to address confounding, various studies have shown that existing signal generation methods developed for longitudinal observational data have a high false positive rate [15], [16]. This is most likely due to difficulties identifying confounding variables in a data-driven way. Some studies have shown that including a large number of variables, such as drug indications, in drug safety methods can reduce confounding [17], [18], [19], but none of these methods included interaction terms. A medical illness is likely to be the result of multiple variables interacting. For example, cardiovascular disease is common in patients with a genetic predisposition such as familial hypercholesterolemia and is also influenced by lifestyle factors such as diet and exercise. Therefore, it is interaction terms between medical events or drugs that are most likely to correspond to confounding variables. However, when there are thousands of medical events and drugs, the number of possible interactions is very large. Existing data-driven methods for incorporating interaction terms into regression models include the hierarchical lasso, which adds the interactions along with an interaction regularisation term [20], and methods utilising matrix factorisation [21]. However, these methods are likely to be highly inefficient when there are thousands of variables to consider (which is often the case for observational data). Instead, methods such as emergent pattern mining [22] that can efficiently identify outcome-specific associations, even when large numbers of variables are being considered, may be more suitable. A similar idea was used to successfully detect survival association rules [23] based on Cox regression and association rule mining. This shows that it is possible to reduce confounding by combining Cox regression and association rule mining.
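To illustrate the kind of outcome-specific itemset search referred to above, here is a minimal brute-force sketch. It assumes the common growth-rate definition of an emerging pattern (support in one dataset divided by support in the other); the algorithm of [22] is not reproduced in this excerpt, and the function names, thresholds and toy transactions are all hypothetical. A real application with thousands of items would need an efficient frequent-itemset miner rather than exhaustive enumeration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def emerging_patterns(d1, d2, min_support=0.5, min_growth=2.0, max_size=2):
    """Itemsets frequent in d1 whose support grows by min_growth relative to d2."""
    items = sorted(set().union(*d1))
    found = []
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s1 = support(candidate, d1)
            if s1 < min_support:
                continue  # not frequent enough in the outcome dataset
            s2 = support(candidate, d2)
            growth = float("inf") if s2 == 0 else s1 / s2
            if growth >= min_growth:
                found.append((candidate, s1, s2, growth))
    return found


# Toy transactions (hypothetical): one set of recorded items per patient.
cases = [
    {"hypertension", "statin"},
    {"hypertension", "statin", "smoker"},
    {"hypertension", "smoker"},
    {"statin"},
]
controls = [
    {"hypertension"},
    {"statin"},
    {"smoker"},
    {"hypertension", "statin"},
]

patterns = emerging_patterns(cases, controls)
for itemset, s1, s2, growth in patterns:
    print(itemset, s1, s2, growth)
```

In this toy data neither "hypertension" nor "statin" alone reaches the growth threshold, but their combination does, which is the sense in which an interaction can carry information that the single items do not.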
A suitable post-marketing framework that extracts knowledge from longitudinal observational data could be of the form displayed in Fig. 1. The first stage of the proposed framework is to apply an efficient large-scale signal generation method to find associations between exposures and outcomes. In this first step the method would efficiently search through all the exposure and outcome possibilities to find associated pairs. An example of a suitable signal generation method is the high dimensionality propensity score (HDPS) [24]. The HDPS works by developing a predictive model for taking the drug, after which a matched cohort analysis is applied, where controls are selected based on having a high propensity for taking the drug (the predictive model predicts that they would have the drug). The HDPS can limit confounding by accounting for a large number of variables. Unfortunately, it is not without issues [25], [26] and still often signals many false positives [15]. This highlights the need for additional analysis that can reduce the false positive rate. The second step in the framework is signal refinement, where complex confounding relationships are discovered and incorporated into a more detailed analysis. The output of the signal refinement is a small set of exposure–outcome pairs that are prioritised for signal evaluation. The final step would be to formally evaluate the remaining signals using a number of different data sources, as establishing a causal relationship requires an accumulation of evidence.
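The matching step of a propensity score analysis can be sketched as follows. This is only an illustration of the general idea of selecting controls with a similar propensity for taking the drug: it assumes the propensity scores have already been estimated by some predictive model, uses greedy 1:1 nearest-neighbour matching with a caliper (a common choice, not necessarily the one used by the HDPS [24]), and all identifiers and scores are hypothetical.

```python
def match_controls(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity score.

    treated and controls map a patient id to an estimated propensity score.
    A treated patient is left unmatched when no remaining control lies
    within the caliper.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Match the highest-propensity treated patients first.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical propensity scores for illustration.
treated_scores = {"t1": 0.81, "t2": 0.64, "t3": 0.30}
control_scores = {"c1": 0.79, "c2": 0.62, "c3": 0.10, "c4": 0.70}

matched = match_controls(treated_scores, control_scores)
print(matched)  # [('t1', 'c1'), ('t2', 'c2')]
```

Here t3 stays unmatched because no control has a score within 0.05 of 0.30; in practice the caliper width and the matching ratio are analysis choices.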
In this paper we focus on the signal refinement stage, as there are no data-driven methods to refine signals, but numerous signal generation and evaluation methods exist. The objective of this research is to develop a data-driven signal refinement methodology that can be applied after ADR signal generation using longitudinal observational data to filter and re-rank the signals by addressing complex confounding. We will test the data-driven methodology by analysing the relationship between numerous drugs and the outcome myocardial infarction (MI). We are exploring three goals:
1. Whether emergent pattern mining can be used to identify candidate interaction confounding covariates in a data-driven way.
2. Whether the inclusion of interaction confounding covariates into a regression analysis can reduce confounding and be used for data-driven ADR signal refinement.
3. Whether lasso and ridge regularisation are suitable techniques to enable the inclusion of a large number of potential interaction covariates.
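As background to the third goal, the sketch below shows how lasso and ridge penalties relate as the two extremes of an elastic net penalty. It assumes the parameterisation used by the glmnet software, with penalty lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2); the symbols lam and alpha and the coefficient values are assumptions made for illustration, since the excerpt does not give the paper's own notation. In a regularised Cox model this penalty would be subtracted from the log partial likelihood.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty in the glmnet-style parameterisation (assumed).

    alpha = 1 gives the lasso (L1) penalty and alpha = 0 gives the
    ridge (squared L2) penalty.
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


# Hypothetical coefficients for three interaction covariates.
beta = [0.5, -2.0, 0.0]

lasso = elastic_net_penalty(beta, lam=0.1, alpha=1.0)
ridge = elastic_net_penalty(beta, lam=0.1, alpha=0.0)
mixed = elastic_net_penalty(beta, lam=0.1, alpha=0.5)
print(lasso, ridge, mixed)
```

The L1 term tends to set many coefficients exactly to zero, which is useful when most candidate interactions are irrelevant, while the squared L2 term shrinks correlated covariates together without removing them.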
Materials
The longitudinal observational database used in this study is The Health Improvement Network (THIN) database (http://www.thin-uk.com). THIN contains complete medical records for patients registered at a participating general practice within the UK. At present approximately 6% of UK general practices are participating, resulting in THIN containing data on over 4 million active patients. The validity of the THIN database for pharmacoepidemiology studies has been investigated [27] and it was
Emergent pattern mining results
We identified 77,246 eligible patients who had MI recorded, and these patients' transactions formed one transaction dataset. We matched 150,304 patients into a second dataset. Frequent pattern mining was applied to 23,808 items in D1 and found 3,886,408 frequent itemsets with a minimum support of 0.001 (9,920,792 with a minimum support of 0.0005). For the other dataset there were 26,705 items, with association rule mining identifying 2,092,949 frequent itemsets with a minimum support of 0.001 (5,502,600 with a minimum
Discussion
This is the first methodology proposed for incorporating candidate interaction confounder covariates into a Cox regression for drug safety. The standard Cox regression, which only considered indication of the various drug families on the day of or prior to index, age and sex, ranked bisphosphonates (BNF 06060200), a non-ADR, as the most likely to cause MI. However, incorporating the candidate interaction confounders into the elastic net regression with small values of its tuning parameter reduced the confounding in
Conclusions
In this paper we proposed a novel framework to efficiently enable the inclusion of high-order interaction terms, potentially representing confounders, into a Cox regression analysis to refine ADR signals. The framework combines emergent pattern mining, which searches billions of possible interactions to identify terms potentially corresponding to confounders, with regularised Cox regression. We investigated the framework by applying it to investigate how likely six different drug families are to
Conflict of interest statement
None declared.
References (37)
- Limitations and strengths of spontaneous reports data. Clin. Ther. (1998)
- et al. Adverse drug reactions in hospital in-patients: a prospective analysis of 3695 patient-episodes. PLoS One (2009)
- et al. Adverse drug reactions as cause of admission to hospital: prospective analysis of 18 820 patients. Br. Med. J. (2004)
- et al. Adverse drug reaction deaths reported in United States vital statistics, 1999–2006. Ann. Pharmacother. (2012)
- et al. Ten-year trends in hospital admissions for adverse drug reactions in England 1999–2009. J. R. Soc. Med. (2010)
- et al. Pharmacovigilance: methods, recent developments and future perspectives. Eur. J. Clin. Pharmacol. (2008)
- et al. A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions. Pharmacoepidemiol. Drug Saf. (2002)
- J. Bian, U. Topaloglu, F. Yu, Towards large-scale twitter mining for drug-related adverse events, in: Proceedings of...
- et al. A novel semisupervised algorithm for rare prescription side effect discovery. IEEE J. Biomed. Health Inform. (2014)
- et al. Mapping adverse drug reactions in chemical space. J. Med. Chem. (2009)