Comparison of Data Mining Methods for the Signal Detection of Adverse Drug Events with a Hierarchical Structure in Postmarketing Surveillance

There are several different proposed data mining methods for the postmarketing surveillance of drug safety. Adverse events are often classified into a hierarchical structure. Our objective was to compare the performance of several of these different data mining methods for adverse drug events data with a hierarchical structure. We generated datasets based on the World Health Organization’s Adverse Reaction Terminology (WHO-ART) hierarchical structure. We evaluated different data mining methods for signal detection, including several frequentist methods such as reporting odds ratio (ROR), proportional reporting ratio (PRR), information component (IC), the likelihood ratio test-based method (LRT), and Bayesian methods such as gamma Poisson shrinker (GPS), Bayesian confidence propagating neural network (BCPNN), the new IC method, and the simplified Bayesian method (sB), as well as the tree-based scan statistic through an extensive simulation study. We also applied the methods to real data on two diabetes drugs, voglibose and acarbose, from the Korea Adverse event reporting system. Only the tree-based scan statistic method maintained the type I error rate at the desired level. Likelihood ratio test-based methods and Bayesian methods tended to be more conservative than other methods in the simulation study and detected fewer signals in the real data example. No method was superior to the others in terms of the statistical power and sensitivity of detecting true signals. It is recommended that those conducting drug‒adverse event surveillance use not just one method, but make a decision based on several methods.


Introduction
It is critical to detect signals of adverse drug reactions from real-world data early enough to protect public health. From the real-world data, we could identify new effects of drugs that had not been identified during premarketing clinical trials. Adverse event (AE) information after drug marketing is often collected via a spontaneous reporting system to identify any long-term adverse drug reactions. In Korea, for example, the Korea Institute of Drug Safety and Risk Management (www.drugsafe.or.kr) collects the information through a spontaneous reporting system.
Through this system, anyone, for example, a patient who has taken the drug, a doctor, or the manufacturer, can report an AE. They report information such as the symptoms of the AE, the date of onset, the name of the drug, the frequency and duration of the dose, patient information, and causality assessment information. As the causality can only be reported by medical experts, the information Table 1. Adverse events count for the ith adverse event and the jth drug.

AE
jth Drug All Other Drugs Total ith adverse event n i j n i. − n i j n i. All other adverse events n . j − n i j n .. − n i. − n . j + n i j n .. − n i.

Reporting Odds Ratio (ROR)
The ROR is the odds ratio that a particular AE is reported in patients who take a specific drug compared to patients who take other drugs [2]. The ROR for the ith AE and the jth drug (ROR ij ) is estimated as n . j − n ij / n .. − n i. − n .j + n ij = n ij n .. − n i. − n .j + n ij n i. − n ij n .j − n ij and if either n i. − n ij or n .j − n ij is equal to 0, then RÔR ij is not defined. The log-transformed RÔR ij can be approximated to a normal distribution as follows: n .j − n ij + 1 n .. − n i. − n .j + n ij .
We can obtain an approximate 100(1 − α)% confidence interval (CI) for ROR ij as: where z 1−α/2 = Φ(1 − α/2) and Φ is the standard normal distribution's cumulative distribution function. The null and the alternative hypotheses to test whether the ith AE for the jth drug is a signal or not are expressed as: H 0 : ROR ij = 1 vs. H a : ROR ij > 1.
As Evans et al. [3] suggested, the lower bound of CI ROR ij ,100(1−α)% > 2, so we reject the null hypothesis and conclude that the ith AE can be interpreted as a signal of the disproportionate rate (SDR) for the jth drug.

Proportional Reporting Ratio (PRR)
The PRR is the ratio of the proportion of patients who reported a particular AE after taking a specific drug to the proportion of patients who have taken other drugs that reported the same AE [3]. We estimate the PRR for the ith AE and the jth drug (PRR ij ) as: n .j − n ij /(n .. − n i. ) = n ij (n .. − n i. ) n i. n .j − n ij .
If n i. − n ij ≈ n i. and n .. − n i. − n .j + n ij ≈ (n .. − n i. ), then RÔR ij ≈ PRR ij . We use the normal approximation for the distribution of log PRR ij for inference as follows: Therefore, an approximate 100(1 − α)% CI for PRR ij is expressed as follows: The null hypothesis of H 0 : PRR ij = 1 is rejected if the lower bound of CI PRR ij ,100(1−α)% > 2, as for PRR.

Information Component (IC)
The IC is based on the relative reporting rate RR ij , which indicates how many particular events were reported in excess for a specific drug over the expected number of reported counts under the null hypothesis that a drug and AE are independent. The relative reporting rate is estimated by , where E ij = n i· n ·j n ·· is the expected number of reports for the ith AE and the jth drug under the null hypothesis. The IC for the ith AE and the jth drug is defined as follows: The IC ij is estimated asÎ C ij = log 2 n ij E ij for n ij > 0, n i· > 0, and n ·j > 0. The estimated variance of IC ij is given byσ 2Î . An approximate 100(1 − α)% CI for IC ij is expressed as follows: If the lower bound of CI 100(1−α)% > 1, the ith AE can be interpreted as a signal.

Likelihood Ratio Test-Based Method (LRT)
Huang et al. [4] proposed the likelihood ratio test statistic, which controls the type I error and false discovery rates by using Monte Carlo hypothesis testing. The null and alternative hypotheses to test whether the ith AE for a specific drug ( j * ) is a signal or not are expressed as follows: where p i and q i are defined as the reporting rates of ith AE and other AEs for a specific drug, respectively. The maximum likelihood ratio (MLR) is expressed as follows: where I() is the indicator function,p i = n ij * /n i. , andq i = n .j * − n ij * /(n .. − n i. DuMouchel [5] suggested the GPS method, which is an empirical Bayes signal detection method. The GPS method uses the relative report rate, defined as follows: This indicates the actual frequency compared to the expected frequency. E ij is calculated under the null hypothesis that there is no association between the drug-AE pairs. The null and alternative hypotheses are expressed as follows: The GPS method assumes that the model and prior distributions are as follows: where the observed report count n ij follows the Poisson distribution with unknown mean µ ij = E ij × λ ij . The relative report rate follows the mixture gamma distribution where Gamma(α, β) is a gamma distribution with mean α/β and variance α/β 2 and 0 < w < 1 is the prior probability that λ ij came from the first gamma distribution of mixture. The hyperparameters (α 1 , β 1 , α 2 , β 2 , w) are estimated by the empirical Bayes method, which is also known as the maximum marginal likelihood.
As gamma distribution is a conjugate prior for Poisson distribution, the posterior distribution of λ ij can be obtained in a closed form as follows: ij is the posterior probability that λ ij came from the first gamma distribution of the mixture. This is expressed as follows: where f n ij α, β, E ij is the marginal distribution. This marginal distribution follows the negative binomial distribution as follows: The 5th percentile of the posterior distribution of λ ij (EB05) is used for decision making. EB05 can be obtained by solving the equation as follows: This integral can be solved easily using iterative techniques such as Newton's method. If EB05(λ ij ) is greater than 2, this drug-adverse effect pair is considered a signal of disproportionate rates (SDR).

Bayesian Confidence Propagation Neural Network (BCPNN)
Bate et al. [6] proposed the BCPNN method based on the IC measure. In the BCPNN method, the IC measure was defined as follows: The observed reporting counts and marginal counts are assumed to follow a binomial distribution with a beta distribution for priors as follows: Using the delta method, the posterior mean and variance of IC ij can be obtained as follows: E IC ij data = log 2 n ij + 1 (n .. + 1) 2 (n .. + γ)(n i. + 1) n .j + 1 , . The lower limit of the 95% credible interval for IC ij is calculated by: and if sB α,ij is greater than 2, this drug-AE pair is a possible signal with a higher reporting rate.

New IC Method
The new IC is an improved method for posterior inference in IC analysis, including an accurate estimate for the mode and significantly improved credibility interval estimates. This method also assumes the number of reports n ij ∼ iid Poisson λ ij E ij , where λ ij denotes the relative reporting rate.
The prior of parameters λ ij is given by λ ij ∼ iid Gamma(0.5, 0.5), and the posterior distribution of λ ij is given by λ ij data ∼ iid Gamma n ij + 0.5, E ij + 0.5 . Then, the New IC ij is the posterior mean of log 2 λ ij , which is E(log 2 λ ij data) log 2 n ij +0.5 E ij +0.5 . The 95% credible interval limits (λ 0.025 , λ 0.975 ) are obtained by: Gamma y n ij + 0.5, E ij + 0.5 dy = α for α = 0.025 and α = 0.975. If the lower limit λ 0.025 > 0, the ith AE can be interpreted as a signal.

Simplified Bayesian
For small datasets, the GPS method is usually not recommended because of instability in the estimation of the hyperparameters. Thus, Huang et al. [9] suggested the simplified Bayesian (sB) method, which assumes a weaker assumption on prior distribution than the GPS method. The sB method uses a single gamma distribution as a prior as follows: prior : λ ij ∼ Gamma(α, α), with mean 1 and variance 1/α. Huang et al. [9] proposed using three values (0.5, 0.01, and 0.0001) for α. They also called the prior distribution with α = 0.5 a less noninformative prior. The other prior distributions were called noninformative priors. The posterior distribution is also a single gamma distribution as follows: posterior : λ ij n ij ∼ Gamma α + n ij , α + E ij .
The lower bound of the 95% credible interval for λ ij (sB α, ij ) is used for detecting signals of SDR. sB α, ij is expressed as follows: If sB α,ij is greater than 2, this drug-AE pair is a possible signal with a higher reporting rate. With α = 0.5, the sB method is identical to the new IC method [10]. Hence, we only included the sB method in the simulation.

Tree-Based Scan Statistic
In a medical dictionary, all AEs are categorized into a hierarchical tree structure. Kulldorff et al. [13,14] proposed the tree-based scan statistic, which simultaneously searches for signals at any level (or layer) of AEs in a hierarchical structure. We call the last cell of the tree a leaf and the rest a node. That is, the higher level of leaves is the node. A higher-level node is defined as the parent node; the lower level node is defined as the child node. c i is the observed number of AEs for each leaf I and C = i c i = i n ij is the total observed number of AEs reported in patients who have taken a specific drug j and X = i x i = i n i. is the total number of AEs reported in patients who have taken any drugs.
When the branches of a tree are cut, the sum of the observed and total number of AEs in the leaves of each cut, G, c G = i∈G c i and x G = i∈G x i , respectively, are obtained. G includes both the child nodes and parent nodes as a unit of AE. For each cut G, we can calculate the log likelihood ratio and test statistic: where I() is the indicator function. The cut G that maximizes LR(G) is the most likely cut of related AEs. The null hypothesis implies that the group defined by cut G has the same ratio of observed to expected AEs as the rest of the tree. In inference, Monte Carlo hypothesis testing is used, calculating the most likely cut in each random dataset. Firstly, the likelihood of the most likely cut in a real dataset is calculated. Secondly, 9999 random datasets are generated under the null hypothesis and the test statistic for each random dataset calculated. Then, the p-value is calculated as R/(9999 + 1), where R is the rank of the test statistic of real dataset compared with random datasets. The LRT and TreeScan methods basically use the same test statistic. Because the TreeScan considers the hierarchical structure in nature, the distribution of the test statistic is also obtained by comparing all possible cuts in the hierarchical structure. Even if the two methods detected the same signal, p-values could be different.

Data Generation and Evaluation Measures
We generated datasets that reflect WHO-ART's hierarchical structure, which can be expressed as system-organ classes (SOC), preferred terms (PT), and included terms (IT) for AEs [16]. In the simulation study, we included only SOC and PT levels. To reduce the computation time, we only considered 500 drugs and 300 AEs, which were randomly selected from a total of 2161 PT levels. We followed the approach in the study by Huang et al. [4] to generate our simulation data.
First, we generated marginal counts of AEs n 1. , . . . , n I. (I = 300) and drugs n .1 , . . . , n .J (J = 500) as follows: (n 1. , . . . , n I. )|n .. ∼ Multinomial n .. , u 1 where u ∼ Uni f orm(0, 1) with n .. = I i=1 n i. . Next, we generated the number of cases reported for a specified drug ( j * ), n 1 j * , . . . , n I j * using n 1 j * , . . . , n I j * n .j * ∼ Multinomial n .j * , p rr , n .. is a vector of probabilities with rr 1 j * , . . . , rr I j * as the relative reporting rates. When r 0 is considered as the baseline risk, p rr has the constraints that Note that the number of reported cases was generated for a specific drug, and hence the true signals are signals for each drug. This means that the relative reporting rate for the AE with a true signal is higher than those for all the other AEs for one fixed drug. If an AE is a true signal, the relative reporting rate is greater than 1, while the relative reporting rate is equal to 1 when the AE is a false signal [11]. The cells for the true signals were randomly selected first depending on the assumed proportion of true signals. The relative reporting rate for each of the selected cells as true signals was generated from Uni f orm(1.2, 10) and Uni f orm(1.2, 4).
While the TreeScan method detected signals simultaneously for both SOC and PT levels, all the other methods detected signals from SOC and PT levels separately. To evaluate the performances of the methods considering the hierarchical data structure, we merged two separate results from each level for all methods except the TreeScan method.
We generated 1000 datasets for each of nine different settings with three different total sample sizes (300,000, 500,000, 1,000,000) and three different percentages of true signals (3%, 5%, 10%). We used five different cutoffs, which are the criteria for signal detection for each method. Different criteria have been used depending on the organization for different methods [17]. In practice, one may change the criteria based on experience. We used the same criterion of the lower bound of the 95% CI for fair comparison in our simulation.
To compare the performance, we calculated the type I error rate, sensitivity, positive predicted value (PPV), and power for specific drugs. Under the null hypothesis, the type I error is estimated as follows: Type I error = # of times detecting at least one false − positive signal total # of simulated datasets .
The sensitivity, PPV, and power are estimated as: where S is the total number of simulated datasets with at least one signal detected. We used R software 3.5.2 version (Vienna, Austria) for all simulations and data analyses.

Comparison of Type I Error Rate
To compare the type I error rate of each method and cutoff, all relative reporting rates were set to 1 for each total sample size ( Table 2). The type I error rates of the ROR, PRR, and IC methods were relatively high for the standard cutoff and for all total sample sizes, which means that spurious detection could frequently occur even when there are no actual signals. The type I error rates of the GPS and sB methods were close to 0 for the standard cutoff and all total sample sizes. The type I error rates of the ROR, PRR, IC, GPS, BCPNN, and sB methods varied depending on how the cutoff was set. On the other hand, the type I error rates of the LRT and TreeScan methods were close to the prespecified significance level in most cases, although the LRT method had slightly higher type I error rates.

Comparison of Sensitivity, PPV, and Power
Tables 3 and 4 present the results for sensitivity, PPV, and power of each method when the total sample size is equal to 300,000. The other results are presented in Appendix A. For all simulation settings and the standard cutoff for each method, the ROR, PRR, and IC methods had relatively higher sensitivity and power than the other methods. However, the LRT, GPS, BCPNN, sB, and TreeScan methods had relatively higher PPV than the other methods. This means that the ROR, PRR, and IC methods may detect too many signals regardless of whether they are actually true, so these methods could detect many false signals as well as true ones. On the contrary, the LRT, GPS, BCPNN, sB, and TreeScan methods detected much fewer signals, but more true signals than false ones. When the relative reporting rates were low (Table 4), all the methods had lower performance compared to when the relative reporting rates were high ( Table 3). The GPS, BCPNN, and sB methods had a significant decrease in power and sensitivity, especially the GPS method.
As the percentage of true signals increased for all settings of total sample size, the sensitivity decreased but the PPV and power increased for all methods. As the total sample size increased for all settings of the percentage of true signals, the sensitivity, PPV, and power increased for all methods. However, depending on the cutoff of each method, the sensitivity, PPV, and power varied. No single method was superior to the others overall for all settings.

Korea Adverse Event Reporting System (KAERS)
The KAERS is a spontaneous reporting system that receives and manages adverse drug events reported by patients, manufacturers, or medicine experts, provided by Korea Institute of Drug Safety and Risk Management. It consists of drugs, AEs, basic demographic, and causality assessment information. When reported, a drug and an AE should be reported together in a pair. These can be reported several times depending on the dose and time. If the same drugs and AEs were reported in duplicate, depending on dose or time, only the first report was counted. Therefore, drugs and AEs are paired only one time.
Causality was assessed at six levels: certain, probable, possible, unlikely, unclassified, and unassessable. The assessment criteria are shown in Table 5. We used all drug-AE pairs except for ones with an unassessable level. Not only the reported information on a possible causal relationship between an AE and a drug, but also previously unknown or incompletely documented relationships can be a signal. The causality assessment was performed by a reporter, such as a medical institution, expert, manufacturer, pharmacy, or public health center. Table 5. Causality assessment criteria.

Criterion Level
The context of administration and use of medicines is reasonable. Certain, Probable, Possible It is not described as another medication, chemical, or accompanying illness.
Certain, Probable In case of administration interruption, there is a clinically reasonable response.
Certain, Probable In case of readministration, there is a pharmacologically conclusive response. Certain It could be described as another medication, chemical, or accompanying illness.
Possible, Unlikely It is a temporary condition, not related to the administration and use of medicines. Unlikely It requires more information to assess or it is under examination.
Unclassified It is not assessable and cannot be supplemented. Unassessable In KAERS, AEs were organized under the WHO-ART's hierarchical structure [16]. This consists of four hierarchical levels: system-organ class (SOC), high-level terms (HLT), preferred terms (PT), and included terms (IT). SOC is the highest level. IT represents various expressions about the same AE in the PT level. HLT is a set of PTs related to each other or having some similar symptoms. HLT may or may not exist and therefore are excluded from the analysis. A small subset of the hierarchical structure is listed in Table 6. However, in the KAERS database, more than half of the reports were reported up to the PT level. Thus, we used the PT level as the lowest level of AEs. In the following illustration, we used the SOC and PT levels in the WHO-ART's hierarchical structure.

Data
We used drug-adverse effects pair data from KAERS between 2012 and 2016. Between 2012 and 2016, there were approximately 3.1 million drug-AE pairs with 1615 kinds of PT-level AEs and 1950 kinds of drugs. Restricting the causality assessment information to certain, probable, possible, unlikely, or unclassified levels, approximately 2.5 million drug-AE pairs with 1484 kinds of PT level AEs and 1716 kinds of drugs were left. These data contained 32 SOC levels, 1484 PT levels, and 3557 IT levels. Analyses were done with these drug-AE pairs.

Analysis
We selected two diabetes drugs, voglibose and acarbose, to compare specific results. Both are hypoglycemic agents that are used for type 2 diabetes, along with diet and exercise. These two drugs were selected because of their substantial exposure and comparable characteristics. Voglibose has a simple structure relative to acarbose. Moreover, it is known to be more economical and safer because its absolute administration dose is 1000 times lower than that of acarbose. However, some severe AEs tend to be more reported in voglibose [17,18]. Therefore, we found specific AEs in acarbose and voglibose using KAERS data by the signal detection methods previously described.
First, we compared the number of signals detected by each method from all drug-adverse effect pairs with 1484 kinds of PT level AEs and 1716 kinds of drugs. Second, the specific signals detected by each method were compared for the two diabetes drugs mentioned above. The detection criteria for each method are shown in Table 7 and the TreeScan method was performed with a simple cut.  Table 8 provides the overall signal detection results of all methods. We used the signal detection criteria presented in Table 7. We summarized the number of detected signals separately for PT and SOC levels. The GPS, BCPNN, and sB methods detected relatively fewer signals than the other methods. The ROR and PRR detected the most signals.

Results
The results of applying all methods to two drugs, voglibose and acarbose, are summarized in Table 9. We report only the AEs that were detected by more than two of the signal detection methods. Voglibose had a higher reported count of all AEs than acarbose. The number of AEs detected by at least one method was higher for voglibose (36 AEs) than for acarbose (31 AEs). For both drugs, the common AEs detected were diarrhea, flatulence, and hypoglycemia at the PT level, and metabolic and nutritional disorders at the SOC level. There was only one common AE detected by all methods in acarbose and voglibose: flatulence at the PT level. Both drugs signaled strongly for flatulence, which is an AE commonly observed in patients with type 2 diabetes [19,20]. In addition, the common AEs detected by all methods were dyspepsia and hypoglycemia at the PT level, and metabolic and nutritional disorders at the SOC level in voglibose.

Discussion
A number of disproportionality methods for data mining and the TreeScan method were compared for signal detection during drug surveillance for AEs data grouped into hierarchical structures. We included various frequentist methods such as ROR, PRR, IC, LRT, and TreeScan as well as Bayesian methods such as GPS, BCPNN, and sB. The LRT, GPS, BCPNN, sB, and TreeScan methods detected fewer signals than the ROR, PRR, and IC methods. The power and sensitivity of the GPS, sB, LRT, and TreeScan methods tended to be lower than those of others, which implies that these methods are more conservative. The higher power and sensitivity of the ROR, PRR, and IC methods seemed to be due to the higher type I error rates. The three methods had lower PPV. The TreeScan method controls the type I error rate at the desired level, while other methods cannot control this or find appropriate cutoffs for the desired type I error rate. However, no method was superior to the others in relation to all performance measures.
We observed similar patterns in the analysis results of the KAERS data. The GPS and sB methods detected much fewer signals than the others overall. For the two specific drugs, some common AEs were detected by all methods. The ROR, PRR, and IC methods detected additional signals that were not detected by the GPS, sB, LRT, or TreeScan methods. The ROR and PRR methods detected rather too many signals, even if the number reported was small. Thus, the restriction of three or more cases for the reported count to be a signal for the ROR and PRR methods, which is usually imposed in practice [3], might be sensible.
In terms of computation time, the GPS, LRT, and TreeScan methods are more intensive relative to the other methods. Other methods have a closed form for the confidence interval of each statistic, so only the cell count (n ij ) and marginal count (n i. or n .j ) of the matrix are required to calculate the confidence interval. On the other hand, the GPS method requires all cell counts in the matrix to estimate the parameters of prior distribution. For the LRT and TreeScan method, a Monte Carlo simulation is required to obtain p-values.
The methods considered in this paper are approaches that can be applied to an existing database. In some cases, one may want to continuously or sequentially monitor to detect a signal as early as possible. The sequential probability ratio test (SPRT) [21,22] can be used. The method has also been applied to a spontaneous adverse event reporting system [23,24]. However, the result of the SPRT method is highly dependent on the relative risk used to specify the alternative hypothesis [25]. Although we did not include the SPRT in this study for these reasons, it would be interesting to compare the method in appropriate situations in future research.
The drug safety databases such as KAERS are constructed by a spontaneous reporting system and very few AEs that occur were reported, so it has a large number of zero-count cells. In this situation, a zero-inflated Poisson model could be considered. Hu et al. [11] proposed ZIP-sB and ZIP-DP (Dirichlet process). Huang et al. [26] proposed a zero-inflated Poisson (ZIP) model based on the likelihood ratio test. According to these research findings, ZIP models detected fewer signals in data containing a large number of zero-counts. This means that they are more conservative by considering zero-counts. In a further study, we will evaluate the performance of ZIP models and apply them to real data to compare.
Huang et al. proposed extending the likelihood ratio test-based (LRT) methods [9] that can detect signals for including a single AE or several AEs within one AE group. The extended LRT method could be used for hierarchical structures of AEs for a fixed drug. The threshold for a signal for multiple-layer analysis should be higher than that for single-layer analysis. It will be very interesting to see the simulation results by comparing the Extended LRT vs. TreeScan with multiple layers (PT, SOC, or others). This is a future research topic.
Currently, some drug companies have different AE detection criteria. For example, AstraZeneca detects an AE when the EB05 is greater than 1.8, whereas GlaxoSmithKline detects AE when it is greater than 2 [12]. In our study, it was confirmed that the performance of each method could vary depending on the cutoff, which is the criteria for signal detection in simulation. Therefore, how to set the cutoff for signal detection is very important and worth noting.

Conclusions
In summary, the LRT, GPS, BCPNN, sB, and TreeScan methods are more conservative than the ROR, PRR, and IC methods. Only the TreeScan method controls the type I error rate at the desired level. No method is superior to the others in relation to all performance measures. It is recommended that those conducting drug-AE surveillance use not just one method, but make a decision based on several methods.