Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals

Wei, Jianxiang; Dai, Jimin; Zhao, Yingya; Han, Pu; Zhu, Yunxia; Huang, Weidong

doi:10.3390/app112210828

Open AccessArticle

Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals

¹

School of Management, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

²

Key Research Base of Philosophy and Social Sciences in Jiangsu-Information Industry Integration Innovation and Emergency Management Research Center, Nanjing 210003, China

³

School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2021, 11(22), 10828; https://doi.org/10.3390/app112210828

Submission received: 9 September 2021 / Revised: 4 November 2021 / Accepted: 10 November 2021 / Published: 16 November 2021

(This article belongs to the Special Issue Advanced Decision Making in Clinical Medicine)

Download

Browse Figure

Review Reports Versions Notes

Abstract

:

Adverse drug reactions (ADRs) are increasingly becoming a serious public health problem. Spontaneous reporting systems (SRSs) are an important way for many countries to monitor ADRs produced in the clinical use of drugs, and they are the main data source for ADR signal detection. The traditional signal detection methods are based on disproportionality analysis (DPA) and lack the application of data mining technology. In this paper, we selected the spontaneous reports from 2011 to 2018 in Jiangsu Province of China as the research data and used association rules analysis (ARA) to mine signals. We defined some important metrics of the ARA according to the two-dimensional contingency table of ADRs, such as Confidence and Lift, and constructed performance evaluation indicators such as Precision, Recall, and F1 as objective standards. We used experimental methods based on data to objectively determine the optimal thresholds of the corresponding metrics, which, in the best case, are Confidence = 0.007 and Lift = 1. We obtained the average performance of the method through 10-fold cross-validation. The experimental results showed that F1 increased from 31.43% in the MHRA method to 40.38% in the ARA method; this was a significant improvement. To reduce drug risk and provide decision making for drug safety, more data mining methods need to be introduced and applied to ADR signal detection.

Keywords:

association rule; data mining; adverse drug reaction; signal detection

1. Introduction

Adverse drug reaction (ADR) refers to an appreciably harmful or unpleasant reaction, resulting from an intervention related to the use of a medicinal product, which predicts hazards from future administration and warrants prevention, specific treatment, alteration of the dosage regimen, or withdrawal of the product [1]. ADR is a common clinical manifestation, and we must pay attention to the frequency of this potential injury. Because it is related to morbidity and mortality, it may cause unnecessary economic loss, and, to a certain extent, it harms the doctor–patient relationship [2]. Under normal circumstances, it is difficult to detect all possible ADRs before the drug is marketed. After the drug is marketed and put into use, we can detect more comprehensive ADRs through long-term observation. To ensure the safety of medication for each patient and reduce ADRs, all countries have established a spontaneous reporting system (SRS) to collect adverse drug events (ADEs) as an essential data source for ADR signal detection. SRS has been defined as an unsolicited communication by a healthcare professional or consumer to a company, regulatory authority, or other organizations (e.g., the WHO, Regional Centre, Poison Control Centre) that describe one or more ADRs in a patient who was given one or more medicinal products and that does not derive from a study or any organized data collection scheme.

The traditional ADR signal detection method is mainly based on disproportionality analysis (DPA). The proportional reporting ratio (PRR) method is the most basic signal detection algorithm in the early stage [3,4,5]. On this basis, the British Medicines and Healthcare Products Regulatory Agency (MHRA) combined the value of PRR with the target drug, the target ADR report number, and the Pearson Chi-square value as a more stable signal detection method, called the MHRA method [6]. At present, this method has been widely used by the Pharmacovigilance Center in the Netherlands, the Drug ADR Monitoring System in the United Kingdom, the Uppsala Monitoring Center of the World Health Organization (WHO, UMC), and the Drug ADR Spontaneous Reporting System in the United States [7]. However, the results of the MHRA method are easily affected by spontaneous reports. Although the MHRA method is determined by three metrics and the results are relatively stable and accurate, as the number of spontaneous reports increases, the value calculated by the formula of the index will inevitably decrease as the base increases due to the limitations of the metrics themselves. When the specified thresholds remain unchanged, the sensitivity of the MHRA method will decrease.

As an important data mining method, association rules analysis (ARA) has been introduced into signal detection to solve the problems of drug safety. Through research on relevant papers, we found that (1) most of the researchers applied the ARA method to specific drugs and performed personalized analysis of the ADRs of some drugs. That is to say, researchers rarely used SRS for many drugs and a larger range of studies to detect ADR signals. (2) Most researchers used ARA as a preliminary screening tool for the detection of ADR signals. By calculating the Support, Confidence, or other metrics of each ADR combination, they filtered out the combinations that were not in the ideal range and used other algorithms to further detect the ADR signals in the selected high-quality combinations. This approach failed to tap the maximum potential of ARA, increased the research cost of data mining, and complicated the experiment. (3) When using ARA to filter data, only Support, Confidence, or other metrics were used as the screening criteria, and the performance of the criteria were not evaluated. Therefore, it was difficult to confirm whether the experimental results were the optimum.

In response to the above situation, we put forward our research hypothesis: firstly, we tried to introduce SRS in ARA for a larger range of ADR signal detection. Secondly, we believed that ARA cannot only be used as a preliminary screening tool for data; it has the ability to do more work. Thus, we proposed using ARA to complete the whole process of ADR signal detection, using Confidence and lift to finalize the ADR signals, and using F1 and other indicators to objectively describe the detection performance of ARA to ensure the best results. In order to verify that our research does make the ability to mine ADR signals with ARA stronger than other methods, we also compared the ARA method with the MHRA method to prove the reliability of the ARA.

This paper aims to fully utilize ARA to mine ADR signals, improve the accuracy of ADR signal detection at a minimum cost, provide more reliable decision support for drug safety, and strive to minimize the side effects of drugs used to improve health services and health practices.

The remainder of this paper is organized as follows: related works are given in Section 2. The process of the experiment and the explanation of the relevant theory are formulated in Section 3. In Section 4, the experimental results are presented and compared with other methods. The discussion about the advantages and limitations of the ARA method proposed in this paper is in Section 5. Finally, the concluding remarks are provided in Section 6.

2. Related Works

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński, and Arun Swami introduced association rules [8]. They used them to mine the rules between products in large-scale transaction data recorded by supermarket point-of-sale systems. Today, many ARA algorithms have been proposed. Apriori uses a breadth-first search strategy to count the Support of item sets and uses a candidate generation function that exploits the downward closure property of Support [9]. Eclat (alt. ECLAT, stands for Equivalence Class Transformation) is a depth-first search algorithm based on set intersection. It is suitable for both sequential as well as parallel execution with locality-enhancing properties [10,11,12]. The ASSOC procedure is a GUHA method that mines for generalized association rules using fast bit string operations [13]. The association rules mined by this method are more general than those output by Apriori, for example “items” can be connected both with conjunction and disjunctions, and the relation between the antecedent and consequent of the rule is not restricted to setting minimum Support and Confidence as in Apriori: an arbitrary combination of supported interest measures can be used.

ARA has a wide range of applications in many aspects, such as Web usage mining, intrusion detection, continuous production, and bioinformatics. The research in this paper applies ARA to the field of biomedicine. In this field, many researchers have used ARA for research. Jenna M. Reps et al. proposed a proof-of-concept method that learned common associations and used this knowledge to automatically refine side effect signals (i.e., exposure–outcome associations) by removing instances of the exposure–outcome associations that are caused by confounding [14]. They then calculated a novel measure termed the confounding-adjusted risk value, a more accurate absolute risk value of a patient experiencing the outcome within 60 days of the exposure. Tentative results suggested that the method works. Sharma D extracted useful information from the quarterly tables produced, synthesized the information to obtain the rules using Apriori algorithm to vary the Confidence and other measure levels. Interactions of Patients’ demographic characteristics (such as age, gender, etc.), length of therapy, and dosages of the drugs taken were also explored to determine if such factors play a role in driving the reactions [15].

In this paper, we proposed an ARA method based on Confidence and Lift and used simulation experiments to objectively find their optimal thresholds. Under our performance evaluation system, this ARA method performed well.

3. Methods

3.1. Date Source

The data source was monitoring reports of ADRs in Jiangsu Province, China, from 2011 to 2018. The reference dataset contains known ADR combinations extracted from drug package inserts, and it is used as an objective standard for performance evaluation of signal detection.

3.2. MHRA

The MHRA is a signal detection method adopted by the British Medicines and Healthcare Products Regulatory Agency, also known as the comprehensive standard method or MHRA method. It bases on the PRR method and comprehensively considers the chi-square value x² and the absolute number of reports a. Only when the three conditions of a ≥ 3, PRR ≥ 2, and x² ≥ 4 are met simultaneously can the signal be considered to exist, indicating that there is a relationship between a specific drug and a specific ADR.

The PRR method was first applied to the ADR monitoring system in the United Kingdom. It is a method used to quantitatively analyze the data of ADR records collected by the SRS [16]. The three metrics mentioned above are all based on two-dimensional contingency tables of ADRs. In Table 1, a represents the number of the target ADR caused by the target drug, b represents the number of all other ADRs caused by the target drug, c represents the number of the target ADR caused by other drugs other than the target drug, and d represents the number of all other ADRs caused by other drugs other than the target drug. The MHRA method is more rigorous than the PRR method; it guarantees the minimum number of combination cases, and the result is relatively more stable.

P R R = \frac{a / (a + b)}{c / (c + d)},

(1)

χ^{2} = \frac{{(| a d - b c | - \frac{n}{2})}^{2} \cdot n}{(a + b) (c + d) (a + c) (b + d)}, n = a + b + c + d,

(2)

3.3. Association Rules Analysis

ARA is a commonly used data mining method used to discover the interrelationships between data in large datasets [17]. ARA is defined as the implicit expression of X→Y, where X and Y are sets of items that originate from the same dataset but do not intersect [8]. Furthermore, X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS). In this paper, X means “drug” in spontaneous reports, and Y means “adverse reaction”. In order to filter the signals that meet the requirements, researchers have designed many different functional metrics, of which the most commonly used are Support, Confidence, and Lift. According to Table 1, we defined the calculation formulas of these three metrics as follows:

S u p p o r t = \frac{a}{a + b + c + d},

(3)

C o n f i d e n c e = \frac{a}{a + b},

(4)

L i f t = \frac{c o n f i d e n c e}{a + c / a + b + c + d},

(5)

Support indicates the proportion of the data combination that contains both X and Y to the total data volume. From Formula (3), it can be seen that the denominator in the expression of Support is the number of records in the entire dataset, and the numerator only contains the number of records in which the target ADR caused by the target drug “a”. The base of the denominator is too large, and the numerator is relatively too small, which causes the value of Support to infinitely approach to 0. If Support were used as the metric of signal detection in this experiment, the accuracy and sensitivity of the detection result would be greatly reduced. Thus, we deprecated Support.

Confidence indicates the proportion of data containing X that also contain Y. It can also be interpreted as an estimate of the conditional probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS [18,19].

Lift indicates the ratio of “the proportion of data containing X that also contain Y” and “the proportion of data containing Y in the population”. Lift reflects the correlation between X and Y in the ARA. If Lift = 1, it means that X and Y are not correlated. If Lift > 1, the higher the Lift, and the higher the positive correlation between X and Y. If Lift < 1, the lower the Lift, and the higher the negative correlation between X and Y [18].

3.4. Performance Evaluation

When the metrics for the detection of ADR signals were determined to be Confidence and Lift, our core task was to determine the best thresholds for Confidence and Lift.

In order to better compare the performance of different methods for detecting ADR signals, we defined three indicators to describe the performance of detecting ADR signals, namely: Precision, Recall [20], and F1 [21,22]. The expressions of these three indicators depend on Table 2.

Our dataset contains known ADR combinations extracted from package inserts. We used them as an objective standard for performance evaluation. In Table 2, “Known as positive in the ADR dataset” means the ADR combination has been recorded in the dataset as a known ADR combination. “Known as negative in the ADR dataset” means the ADR combination was not been recorded in the dataset, so we temporarily denied that it was an objective ADR combination. “MHRA/ARA method tested positive” means the ADR combination was detected as an ADR signal by MHRA/ARA method. “MHRA/ARA method tested negative” means the ADR combination was not been detected as an ADR signal by MHRA/ARA method.

As shown in the Table 2, TP (True Positive) represents the number of known positive ADR combinations in the ADR dataset that can be detected as positive by the MHRA/ARA method. FN (False Negative) represents the number of known positive ADR combinations in the ADR dataset that can be detected as negative by the MHRA/ARA method. FP (False Positive) represents the number of known negative ADR combinations in the ADR dataset that can be detected as positive by the MHRA/ARA method. TN (True Negative) represents the number of known negative ADR combinations in the ADR dataset that can be detected as negative by the MHRA/ARA method [23].

P r e c i s i o n = \frac{T P}{T P + F P} \times 100 %,

(6)

R e c a l l = \frac{T P}{T P + F N} \times 100 %,

(7)

F 1 = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l} \times 100 %,

(8)

Precision represents the proportion of the number of true ADR signals predicted by a method to the number of all ADR signals predicted by the method, and it considers the accuracy of the detection. Recall represents the proportion of the number of true ADR signals to the number of actual ADR signals in the prediction of a certain method. It considers integrity20. The value of F1 is the harmonic average of Precision and Recall21,22. The higher the value of F1, the better the performance of the method.

3.5. Method of Determining the Thresholds

We imported the data in the dataset into Microsoft Visual FoxPro and used Microsoft Visual FoxPro for simulation experiments. Through the existing research data of Confidence and Lift, the approximate range of the thresholds of the two metrics was determined, and through continuous precision and refinement, the ideal thresholds were obtained. Due to the limitation of the definition of performance indicators, when the value of Precision becomes larger, the value of Recall will inevitably become smaller, so we used F1 (the harmonic average of Precision and Recall) as the final indicator to reflect the performance of the ARA.

3.6. Method Comparison

In order to analyze the advantages of the ARA method more objectively, we used 10-fold cross-validation on the ARA method to obtain the average performance level of the ARA method and compared it with the performance of the MHRA method. In addition, we compared the two methods from the perspective of formulation, and used specific drug event examples to concretely present the performance of the two methods.

4. Results

4.1. Study Dataset

A total of 751,606 reports were selected as the study dataset, and all experiments were conducted based on this dataset. We first processed the data through the above formulas to obtain the values of all indicators such as PRR, Confidence, and Lift, and further obtained our experimental results through these data.

4.2. Optimal Threshold

According to the method described above, we determined appropriate thresholds for the Confidence and Lift to maximize the effectiveness. The specific results are shown in Table 3 and Table 4.

Starting from the definition of Confidence and Lift, we first selected some reasonable thresholds for them and observed the value of F1 obtained from that place. As shown in Table 3, the effective threshold of the Lift should start from 1. We selected five values of 1, 1.5, 2, 2.5, and 3 and found that the value of F1 decreases with the increase in Lift. Additionally, we chose 0.005, 0.01, 0.015, 0.02, and 0.025 as the Confidence’s threshold and found that when the Confidence’s threshold is 0.01, the ARA performs best.

On this basis, we analyzed the adjacent values of the current optimal thresholds and reasonably suspected that the optimal thresholds would appear between Confidence ∊ [0.005,0.01] and Lift ∊ [1,1.5]. Thus, we used more exact values for the simulation experiments. It can be seen from Table 4 that when confidence = 0.007 and lift = 1, F1 achieves a maximum value of 40.51%. Thus, we finally determined that Confidence = 0.007 and Lift = 1 are the optimal thresholds.

4.3. Comparison of the MHRA Method and the ARA Method at the Performance Level

However, the best thresholds obtained from the overall sample and the performance they exhibited were the best results that the ARA method could achieve. In order to obtain the average performance level of the ARA method, we used 10-fold cross-validation to evaluate the performance of our method. We divided the dataset into 10 subsets on average, selected 9 of them as the training set and the other as the test set, and performed 10 experiments without repeating them. In the training set, we used the ARA method to obtain the optimal thresholds, then used them as the optimal thresholds in the test set, and evaluated the performance indicators of the thresholds which we obtained from training set in the test set. When we determined the thresholds of the metrics used in the training set, we could divide each piece of data in the test set into four categories: TP, TN, FP, and FN based on the threshold, and then we used Formulas (6)–(8) to obtain the performance of the ARA method to detect ADR signals through the simulation experiments and found the average performance of ten experiments. We used the three average performance indicators obtained from ten experiments as the final result of the threshold determination, respectively: Precision = 36.28%, Recall = 45.65%, and F1 = 40.38%. Figure 1 is a comparison of the performance of the ARA method and the MHRA method.

As shown in Figure 1, Precision was reduced from 40.41% in the MHRA method to 36.28% in the ARA method; Recall increased from 25.72% in the MHRA method to 45.65% in the ARA method; F1 increased from 31.43% in the MHRA method to 40.38% in the ARA method. As can be seen from Section 3.4, we used F1 as the main indicator to judge the performance of ADR signal detection, and the F1 of the ARA method was increased by 28.48% compared to the F1 of the MHRA method. These results showed that the performance of the ARA method was much better than that of the MHRA method.

We used the data of levofloxacin in the dataset to concretely characterize the superior performance of ARA. We screened the top ten potential ADRs that may be caused by levofloxacin from the dataset and reported the results in Table 5 (the frequency of adverse symptoms is in descending order). “Has been identified as a positive signal” means the ADR was determined as the drug’s ADR. “Detected as a positive signal by MHRA method” means that the ADR was detected as an ADR signal by the MHRA method. “Detected as a positive signal by ARA method” means that the ADR was detected as an ADR signal by the ARA method.

In this dataset, the number of rashes caused by levofloxacin was 7088, accounting for about 15.60% of the ADRs caused by the drug. Moreover, a rash was determined as levofloxacin’s ADR, but the MHRA method did not detect this as an ADR signal, while the ARA method detected it as an ADR signal. Similarly, 4226 people had nausea after using levofloxacin, 2967 people vomited after using levofloxacin, and 2644 people had allergies after using levofloxacin. These were all determined as levofloxacin’s ADRs, but were only detected as ADR signals by the ARA method. Moreover, pruritus and phlebitis were detected as ADR signals in the three situations.

In conclusion, the ARA mined six kinds of ADRs that were determined as ADRs before. This was enough to illustrate the effectiveness of the method. Furthermore, the anaphylactoid reaction was not previously determined as an ADR signal, but it was detected as an ADR signal by the ARA method, and relevant medical research has confirmed our conjecture [24]. This showed that the ARA method’s accuracy and performance are higher than those of the MHRA method.

4.4. Comparison of the MHRA Method and the ARA Method at the Formula Level

We used the controlled variable method to analyze and compared the formulas of the MHRA method and the ARA method to study the influence of the formulas on the detection of ADR signals. We took levonorgestrel ethinyl estradiol and etimicin as examples for analysis, and the specific results are shown in Table 6 and Table 7.

In Table 6, levonorgestrel ethinyl estradiol is a long-acting contraceptive that affects the endocrine. Only 17 cases of its ADRs have been reported in the entire dataset. The amount of data is minimal. Even when the PRR and x² met the standard, six ADR combinations appeared negative under the MHRA method and appeared positive under the ARA method because the value ‘a’ did not meet the standard. For example, the ADR of uterine bleeding caused by levonorgestrel ethinyl estradiol focused on a = 1, b = 16, c = 14, and d = 751,575; the ADR of menstrual disorders caused by levonorgestrel ethinyl estradiol focused on a = 1, b = 16, c = 227, and d = 751,362. The drug reported a total of 13 ADR combinations, of which 1 group was detected positive by the MHRA method, and 13 groups were detected positive by the ARA method. That is, because the number of cases of target ADRs caused by the target drug is not up to the standard of MHRA, the detection difference between the MHRA method and the ARA method reached 50%. If the ADRs recorded in the package inserts are used as the reference standard, then under the dataset, the detection accuracy rate of the drug by the MHRA method is 0, and the detection accuracy rate of the ARA method is 30.77%.

In Table 7, etimicin is an aminoglycoside antibiotic drug. There are 3710 ADRs that have been reported in this dataset, accounting for 0.49% of the data in this dataset. In the reported case of this drug, because it did not meet the criteria of the PRR formula, the number of ADR combinations that tested negative under the MHRA method and tested positive under the ARA method reached eight groups. For example, the ADR of palpitations caused by etimicin focused on a = 105, b = 3605, c = 11,104, and d = 736,792; the ADR of rash caused by etimicin focused on a = 641, b = 3069, c = 114,059, and d = 633,837. A total of 209 ADR combinations were reported for the drug, among which the detection difference between the MHRA method and the ARA method due to the PRR formula was 3.83%. If the ADRs recorded in the package inserts are used as the reference standard, then under the dataset, the detection accuracy rate of the drug by the MHRA method is 11.54%, and the detection accuracy rate of the ARA method is 37.5%.

In conclusion, the ARA method has better universality, and it can handle smaller or larger datasets more calmly while maintaining high accuracy. This also means that when we want to process data in other datasets, we only need to execute the methods provided in this paper step by step, and we can obtain each dataset’s own optimal threshold.

5. Discussion

5.1. Progress in the ARA Field

ARA has become a common method in the field of data mining. Kai Guo et al. used ARA as a data screening tool combined with embedded models to detect ADR signals and proved that this method could effectively detect potential ADRs through specific studies on rofecoxib and gadoversetamide [25]. Heba Ibrahim et al. used an optimized tailored mining algorithm called “hybrid Apriori”. The results showed that the proposed method could extract signals of serious ADRs, and various association patterns could be identified based on the relationships among the elements which composed a pattern [26]. Dan Zhang et al. used ARA to analyze the characteristics and regularities of cardiac ADRs induced by Chinese materia medica. They found that the cardiac ADRs had strong correlations with the ADRs of the nervous system and digestive system, and the aconitum species and other toxic Chinese materia medica were intimately associated with the ADRs of the nervous system and digestive system [27]. Through the critical research of the above papers and other papers, we proposed the research provided in this paper on the use of ARA to mine ADR signals and achieved good results. The specific manifestations are as follows: 1. We introduced SRS, which greatly expanded the scope of ARA to detect ADR signals and demonstrated the powerful potential of ARA. 2. We used ARA as the only tool to mine ADR signals, which fully demonstrated the powerful capabilities of ARA. 3. Because there was no need to use more tools, the economic cost was greatly saved, which has important practical significance. 4. We added a performance evaluation mechanism, which ensured the accuracy of the results to a greater extent. These are also the advantages of our research results.

5.2. Advantages over the Traditional MHRA

Compared with the traditional DPA method, (1) the advantage of ARA lies in the higher adaptability to datasets with different characteristics. Taking the MHRA method as an example, this method has fixed index combination thresholds. If the data for the dataset were increased, the sensitivity would decrease. While the ARA method’s index thresholds change dynamically, the thresholds can be determined by simulation experiments based on the specific characteristics of the dataset. Similarly, due to the limitation of the combination of metrics of the MHRA method, the amount of data in the dataset using this method cannot be too small. Otherwise, it will cause few positive signals to be detected, which leads to no research values. All metrics of the ARA method are expressed in the form of proportions, so the result is relatively more stable even if the amount of data is small. (2) The ARA method has better detection performance for rare ADR signals. We have verified this in the above experiment. At the theoretical level, due to the different data characteristics of each dataset and some unavoidable interference factors, such as individual selective reporting of ADRs and over-reporting of ADRs, these may cause some real and rare ADRs to be masked. If we used the MHRA method, the positive ADR signal might be ignored due to the fact that the number of records of specific ADRs is small, or the frequency of ADRs caused by specific drugs is relatively lower in the same ADR range. The ARA method is more accurate and stable for detecting ADR signals because of its metrics’ proportional representation and dynamics.

From the perspective of formulas, The MHRA method requires that the ADR combination completely meets a ≥ 3, PRR ≥ 2, and x² ≥ 4, then it is determined as an ADR signal. The ARA method requires that the ADR combination completely meets Confidence ≥ 0.007 and Lift ≥ 1, then it is determined as an ADR signal.

The three formulas of MHRA method can be explained as

(1): The number of target ADR caused by the target drug ≥3;
(2): The probability of target ADR caused by the target drug ≥ the probability of target ADR caused by all other drugs ×2;
(3): The Chi-square value of the ADR ≥4, which means the drug is related to the ADR.

The two formulas of the ARA method can be explained as:

(1): The number of cases of target drug producing target ADR accounted for the number of cases of target drug producing all ADRs ≥ 0.007 (the probability of target ADR occurring when the target drug is used ≥ 0.007);
(2): When using the target drug, the probability of the target ADR is greater than the probability of using all the drugs to produce the target ADR. At the same time, Lift also reflects the correlation between the drug and the ADR.

By analyzing the definitions of the formulas of the two methods and their meaning of expressions, we divided the variable ‘a’ in the MHRA method and Formula (4) in the ARA method into a group for comparison, at the same time, we divided Formulas (1) and (2) in the MHRA method and Formula (5) in the ARA method into a group for comparison.

Formula (5) in the ARA method is more comprehensive Formulas (1) and (2) in the MHRA method. The function of Lift’s formula expression is similar to that of the PRR’s formula expression in MHRA, but because it also has the implication of “correlation”, the Chi-square value is added to MHRA to compensate for the implication of “correlation”. For the PRR’s expression in the MHRA method and the Lift’s expression in the ARA method, the values of their denominators c/c + d and a + c/a + b + c + d tend to behave in the same way when the amount of data in the dataset is large enough, so the requirements needed to achieve the PRR’s formula are more stringent, but this also reduces the ability of the MHRA method to capture rare ADR signals. When the amount of data in the dataset is moderate or small, the requirements for reaching the PRR’s formula are similar to those for reaching the Lift’s formula.

In practical applications, if the amount of data reported for a drug is small or the drug produces rare ADRs, the value of ‘a’ in the MHRA method may not meet the standard, and the ratio is used in the ARA method to determine whether the number of cases that the target drug producing target ADR meets the standard; thus, the ARA method has better stability and can capture rare ADR signals more accurately, and it also makes the ARA method fairer and more accurate in processing different levels of data.

It can be seen from the formula that the MHRA method is more suitable for data analysis in a dataset with a moderate amount of data and the number of ADR reports for each independent drug in the dataset is sufficient, while the ARA method has better general applicability. Additionally, the ARA method is more capable of capturing rare ADR signals.

5.3. Limitations of the ARA

However, in the course of the experiment, we still found some limitations. In the dataset we used, we used the two variables of Confidence and Lift as metrics to mark the ADR signals according to the characteristics of this dataset and performed simulation experiments based on the data in this dataset to obtain the best thresholds of the two metrics. Nevertheless, if we need to update the dataset or apply it to a new dataset, the performance of these two metrics might fluctuate greatly. Because the data characteristics of different datasets are different, we may need to replace or add new metrics, such as the Support mentioned above. Moreover, because of the change of data, the metric threshold has to be reconfirmed through simulation experiments. That is to say, we have not been able to find a universal indicator combination and an objective method that can be used to confirm the metric threshold directly.

From the perspective of data, the unbalanced distribution of the ADR spontaneous report data means that the frequency of using different tablets may vary greatly, and the frequency of ADRs caused by drugs may also vary greatly. Therefore, when we use the same index to detect the dataset, it will produce unfairness, which will lead to inaccuracy of detection. In subsequent improvements, we will consider using data stratification methods to separate different characteristic data according to a certain method and, respectively, confirm more effective index thresholds to detect ADR signals. We are committed to further improving the universality and performance of ARA for mining ADR spontaneous report datasets.

6. Conclusions

In this paper, we analyzed the current research related to ARA and found some shortcomings in the process of using ARA. On this basis, we put forward our own research hypothesis: we introduced SRS and tried to use only ARA as a tool to mine ADR signals because this could better utilize the capabilities of ARA and save costs to a certain extent. According to the actual situation of the dataset used, we chose Confidence and Lift as metrics for detecting ADR signals. We used 10-fold cross-validation. Through the three indicators of Precision, Recall, and F1, we compared the results of the ARA method with the results of the MHRA method and proved that the ARA method is effective. Furthermore, at the performance level, we took the drug levofloxacin and its ADRs as an example to further prove the reliability of the ARA method. At the formula level, by analyzing the mathematical and physical meanings of the formulas, we have confirmed that the ARA method has better universal applicability to various datasets.

Finally, we summarized the progress of the ARA method proposed in this paper in ARA-related fields, and its advantages over other traditional data mining methods. At the same time, we also reflect on the limitations of the ARA and consider continuing to improve on this basis to make them more universal and reliable.

Author Contributions

J.W. conceived of the study. J.D. participated in data processing and experimental simulation. J.W. and J.D. participated in data analysis, and drafting of the manuscript. Y.Z. (Yingya Zhao), P.H., Y.Z. (Yunxia Zhu), and W.H. were involved in critically revising the manuscript and reviewing experiment results. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Social Science Foundation of China (No. 17CTQ022) and the Major Project of Philosophy and Social Science Research in Jiangsu Universities (2020SJZDA102).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are not publicly available due to the policy of confidentiality of the CFDA but are available from the corresponding author on reasonable request and with permission of the CFDA.

Conflicts of Interest

The authors declare no conflict of interest.

References

Edwards, I.R.; Aronson, J.K. Adverse drug reactions: Definitions, diagnosis, and management. Lancet 2000, 356, 1255–1259. [Google Scholar] [CrossRef]
Coleman, J.J.; Pontefract, S.K. Adverse drug reactions. Clin. Med. 2016, 16, 481–485. [Google Scholar] [CrossRef]
Banks, D.; Woo, E.J.; Burwen, D.R.; Perucci, P.; Braun, M.M.; Ball, R. Comparing data mining methods on the VAERS database. Pharmacoepidemiol. Drug Safe 2005, 14, 601–609. [Google Scholar] [CrossRef] [PubMed]
Matsushita, Y.; Kuroda, Y.; Niwa, S.; Sonehara, S.; Hamada, C.; Yoshimura, I. Criteria revision and performance comparison of three methods of signal detection applied to the spontaneous reporting database of a pharmaceutical manufacturer. Drug Saf. 2007, 30, 715–726. [Google Scholar] [CrossRef]
Hauben, M. Early postmarketing drug safety surveillance: Data mining points to consider. Ann. Pharm. 2004, 38, 1625–1630. [Google Scholar] [CrossRef]
Singh, S.; Loke, Y.K.; Furberg, C.D. Long-term Risk of Cardiovascular Events with Rosiglitazone: A Meta-analysis. JAMA J. Am. Med. Assoc. 2007, 298, 1189–1195. [Google Scholar] [CrossRef] [PubMed]
Hauben, M.; Zhou, X.F. Quantitative Methods in Pharmacovigilance: Focus on Signal Detection. Drug Saf. 2003, 26, 159–186. [Google Scholar] [CrossRef] [PubMed]
Agrawal, R.; Imieliński, T.; Swami, A. Mining association rules between sets of items in large databases. SIGMOD Rec. 1993, 22, 207–216. [Google Scholar] [CrossRef]
Agrawal, R.; Srikant, R. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de, Chile, Chile, 12–15 September 1994; pp. 487–499. [Google Scholar]
Zaki, M.J. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 2000, 12, 372–390. [Google Scholar] [CrossRef] [Green Version]
Zaki, M.J.; Parthasarathy, S.; Ogihara, M.; Li, W. New Algorithms for Fast Discovery of Association Rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, USA, 14–17 August 1997; pp. 283–286. [Google Scholar]
Zaki, M.J.; Parthasarathy, S.; Ogihara, M.; Li, W. Parallel Algorithms for Discovery of Association Rules. Data Min. Knowl. Discov. 1997, 1, 343–373. [Google Scholar] [CrossRef]
Hájek, P.; Havránek, T. Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory; Springer: Berlin/Heidelberg, Germany, 1978. [Google Scholar]
Reps, J.M.; Aickelin, U.; Ma, J.; Zhang, Y. Refining Adverse Drug Reactions Using Association Rule Mining for Electronic Healthcare Data. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshop, Shenzhen, China, 14 December 2014; pp. 763–770. [Google Scholar]
Sharma, D. Application of Association Rules in Clinical Data Mining: A Case Study for Identifying Adverse Drug Reactions. Value Health 2016, 19, 101. [Google Scholar] [CrossRef]
Evans, S.J.W.; Waller, P.C.; Davis, S. Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports. Pharm. Drug Saf. 2001, 10, 483–486. [Google Scholar] [CrossRef] [PubMed]
Frawley, W.J.; Piatetsky-Shapiro, G.; Matheus, C.J. Knowledge Discovery in Databases: An Overview. AI Mag. 1992, 13, 57–70. [Google Scholar]
Hahsler, M.; Grün, B.; Hornik, K. Arules—A Computational Environment for Mining Association Rules and Frequent Item Sets. J. Stat. Softw. 2005, 14, 1–25. [Google Scholar] [CrossRef]
Hipp, J.; Güntzer, U.; Nakhaeizadeh, G. Algorithms for association rule mining—A general survey and comparison. SIGKDD Explor. Newsl. 2000, 2, 58–64. [Google Scholar] [CrossRef]
Olson, D.L.; Delen, D. Advanced Data Mining Techniques, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2008; p. 138. [Google Scholar]
Sasaki, Y. The truth of the F-measure. Teach Tutor Mater. 2007, 1, 1–5. [Google Scholar]
Perruchet, P.; Peereman, R. The exploitation of distributional information in syllable processing. J. Neurolinguist. 2004, 17, 97–119. [Google Scholar] [CrossRef]
Powers, D.M. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
Lee, J.-H.; Lee, W.-Y.; Yong, S.J.; Shin, K.C.; Lee, M.K.; Kim, C.W.; Kim, S.-H. A case of levofloxacin-induced anaphylaxis with elevated serum tryptase levels. Allergy Asthma Immunol. Res. 2013, 5, 113–115. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Guo, K.; Lin, H.; Xu, B.; Yang, Z.; Wang, J.; Sun, Y.; Xu, K.; Cai, Z.; Daescu, O.; Li, M. Detecting Potential Adverse Drug Reactions Using Association Rules and Embedding Models. Lect. Notes Comput. Sci. 2017, 10330, 373–378. [Google Scholar]
Ibrahim, H.; Saad, A.; Abdo, A.; Eldin, A.S. Mining association patterns of drug-interactions using post marketing FDA’s spontaneous reporting data. J. Biomed. Inf. 2016, 60, 294–308. [Google Scholar] [CrossRef] [PubMed]
Zhang, D.; Lv, J.; Zhang, B.; Zhang, X.; Jiang, H.; Lin, Z. The characteristics and regularities of cardiac adverse drug reactions induced by Chinese materia medica: A bibliometric research and association rules analysis. J. Ethnopharmacol. 2020, 252, 112582. Available online: https://pubmed.ncbi.nlm.nih.gov/31972324/ (accessed on 21 October 2021). [CrossRef] [PubMed]

Figure 1. Comparison of the performance of the MHRA method and the ARA method.

Table 1. Two-dimensional contingency table of ADRs.

Signal	Suspected ADE	All Other ADEs
Suspected drug	a	b
All other drugs	c	d

Table 2. Performance evaluation metrics of MHRA/ARA method.

Signal	MHRA/ARA Method Tested Positive	MHRA/ARA Method Tested Negative
Known as positive in the ADR dataset	TP	FN
Known as negative in the ADR dataset	FP	TN

Table 3. The value of F1(Preliminary).

F1	Lift = 1	Lift = 1.5	Lift = 2	Lift = 2.5	Lift = 3
Confidence = 0.005	39.91%	36.85%	34.40%	32.04%	30.38%
Confidence = 0.010	40.30%	36.77%	33.74%	30.94%	29.03%
Confidence = 0.015	39.50%	35.73%	32.70%	29.70%	27.72%
Confidence = 0.02	38.29%	34.51%	31.36%	28.47%	26.56%
Confidence = 0.025	37.22%	33.27%	30.10%	27.07%	25.23%

Table 4. The value of F1(Finally).

F1	Lift = 1	Lift = 1.1	Lift = 1.2	Lift = 1.3	Lift = 1.4	Lift = 1.5
Confidence = 0.006	40.28%	39.50%	38.83%	38.17%	37.65%	37.07%
Confidence = 0.007	40.51%	39.71%	38.98%	38.27%	37.71%	37.15%
Confidence = 0.008	40.49%	39.66%	38.93%	38.21%	37.62%	37.01%
Confidence = 0.009	40.35%	39.52%	38.76%	38.07%	37.45%	36.82%

Table 5. ADRs of levofloxacin and related detection signals.

Adverse Drug Reaction	Has Been Identified as a Positive Signal	Detected as Positive Signal by MHRA Method	Detected as Positive Signal by ARA Method
Pruritus	Yes	Yes	Yes
Rash	Yes	No	Yes
Nausea	Yes	No	Yes
Vomiting	Yes	No	Yes
Allergy	Yes	No	Yes
Phlebitis	Yes	Yes	Yes
Dizziness	Yes	No	No
Chest distress	No	No	No
Anaphylactoid reaction	No	No	Yes
Abdominal pain	Yes	No	No

Table 6. Levonorgestrel ethinyl estradiol test results.

	MHRA Method Tested Positive	ARA Method Tested Positive
a Not up to standard, other formulas are up to standard	0	6
All formulas are up to standard	1	13
Recorded in the package inserts	0	4

Number of reports: 17; number of ADR combinations: 13; total amount of data: 751,606.

Table 7. Etimicin test results.

	MHRA Method Tested Positive	ARA Method Tested Positive
PRR Not up to standard, other formulas are up to standard	0	8
All formulas are up to standard	26	16
Recorded in the package inserts	3	6

Number of reports: 3710; number of ADR combinations: 209; total amount of data: 751,606.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wei, J.; Dai, J.; Zhao, Y.; Han, P.; Zhu, Y.; Huang, W. Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals. Appl. Sci. 2021, 11, 10828. https://doi.org/10.3390/app112210828

AMA Style

Wei J, Dai J, Zhao Y, Han P, Zhu Y, Huang W. Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals. Applied Sciences. 2021; 11(22):10828. https://doi.org/10.3390/app112210828

Chicago/Turabian Style

Wei, Jianxiang, Jimin Dai, Yingya Zhao, Pu Han, Yunxia Zhu, and Weidong Huang. 2021. "Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals" Applied Sciences 11, no. 22: 10828. https://doi.org/10.3390/app112210828

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals

Abstract

1. Introduction

2. Related Works

3. Methods

3.1. Date Source

3.2. MHRA

3.3. Association Rules Analysis

3.4. Performance Evaluation

3.5. Method of Determining the Thresholds

3.6. Method Comparison

4. Results

4.1. Study Dataset

4.2. Optimal Threshold

4.3. Comparison of the MHRA Method and the ARA Method at the Performance Level

4.4. Comparison of the MHRA Method and the ARA Method at the Formula Level

5. Discussion

5.1. Progress in the ARA Field

5.2. Advantages over the Traditional MHRA

5.3. Limitations of the ARA

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI