Predicting the Severity of Adverse Events on Osteoporosis Drugs Using Attribute Weighted Logistic Regression

Osteoporosis is a serious bone disease that affects many people worldwide. Various drugs have been used to treat osteoporosis. However, these drugs may cause severe adverse events in patients. Adverse drug events are harmful reactions caused by drug usage and remain one of the leading causes of death in many countries. Predicting serious adverse drug reactions in the early stages can help save patients’ lives and reduce healthcare costs. Classification methods are commonly used to predict the severity of adverse events. These methods usually assume independence among attributes, which may not be practical in real-world applications. In this paper, a new attribute weighted logistic regression is proposed to predict the severity of adverse drug events. Our method relaxes the assumption of independence among the attributes. An evaluation was performed on osteoporosis data obtained from the United States Food and Drug Administration databases. The results showed that our method achieved a higher recognition performance and outperformed baseline methods in predicting the severity of adverse drug events.


Introduction
Osteoporosis is a common and dangerous bone disease that can lead to serious pain, disability, hospitalization, or even death. According to the International Osteoporosis Foundation [1], older people and women over the age of 50 are at the greatest risk of developing osteoporosis due to physiological changes that come with aging. To date, this disease has affected 200 million people worldwide, and this number is expected to increase in the next 5 to 10 years. Although there is a range of drugs used to treat osteoporosis, they may cause various adverse events. An adverse drug event is defined as an injury to a patient resulting from medical intervention related to a drug. Some adverse events are life-threatening and require medical intervention.
Several studies have investigated adverse events caused by osteoporosis drugs [2,3]. Classification methods are commonly applied to predict adverse events, where data instances are mapped into one of the possible classes. The majority of these studies assume that all attributes are equally important and contribute equally to the classification decision [4][5][6]. Such an assumption, however, may not be practical in real-world applications. Methods based on attribute weights have been proposed to relax the independence assumption. This approach assigns a continuous value to each attribute, in which the more significant attribute has a higher weight. Some attribute weighting methods have been successfully implemented in a naïve Bayes classifier [7].
Logistic regression (LR) is one of the most widely used classifiers in the biomedicine domain. The maximum-likelihood estimation is used to determine the probability of class membership in LR [8]. However, there are limited studies that apply attribute weights in LR. Current studies applied LR directly on unweighted attributes, which may result in biased estimates and fall short in predicting adverse events [9,10]. In this paper, a new attribute weighted logistic regression is proposed to predict the severity of an adverse osteoporosis drug event. Our contribution is twofold. First, we propose a method to incorporate attribute weights into LR. Second, we present a method to calculate the attribute weights. Our method takes into account the relevance of each attribute in predicting the severity, which not only reduces the impact of irrelevant attributes but also improves the classification performance. We evaluated our method on an osteoporosis adverse events dataset obtained from the U.S. Food and Drug Administration. We have also compared our method with baseline methods.
The remainder of this paper is organized as follows: In the next section, we discuss the related work on attribute weighting methods. Section 3 presents our proposed method. Section 4 describes the osteoporosis dataset used in this study. Section 5 presents the experimental results and discussion. Section 6 concludes our findings.

Related Work
An attribute weight is a continuous value that represents the importance of an attribute in classification. In [11][12][13], the information gain (IG) measure was used to calculate the attribute weights. In [4], the IG-based attribute weights included some negative values. Ideally, the weight assigned to an attribute should not be negative.
There are works that used Kullback-Leibler divergence (KL) to calculate the attribute weights for a naïve Bayes classifier [4,5,14]. However, the KL-based attribute weighting method has a longer computational time as this method involves complex calculation steps, including the estimation of weight for each category, the average attribute weight, split information, split weight, and normalized weight, as described in [4].
Ouyed et al. [15] proposed an attribute weighting technique based on the Newton-Raphson method for multinomial kernel logistic regression. In this study, each attribute's relevance to classification is estimated using the Newton-Raphson method. Instead of estimating individual attribute weights for multinomial kernel logistic regression, ref. [16] extended the method to allow the estimation of group attribute weights by using gradient descent minimization. Such a method, which uses multiple kernel functions, increases the complexity of the optimization when the data size is large.
Although LR is a widely applied classification method, there are limited studies that incorporate attribute weights in LR. In some LR studies, the attributes were weighted to perform attribute selection by considering the most relevant attributes [17][18][19][20][21][22]. Krishnapuram et al. [17] introduced a sparse multinomial logistic regression to perform automatic attribute selection. In this study, irrelevant attributes with weights equal to zero were removed. Ryali et al. [18] developed a new whole-brain classification method based on sparse logistic regression. Their method combined L1 and L2 norm regularizations to reduce the weight of irrelevant attributes for better attribute selection. Liang et al. [19] investigated the L1/2 penalty with sparse logistic regression for gene selection in cancer prediction. In recent studies, Bertsimas et al. [20][21][22] reformulated the sparse regression problem on a larger dataset. Their proposed binary reformulation provides sparser classifiers with similar accuracy to the Lasso regularization technique [23]. However, it was indicated in the study that their method is not computationally efficient, especially on a smaller dataset.
Machine learning techniques have been implemented for drug discovery. Lin et al. [24] compared four machine learning models (logistic regression (LR), support vector machine (SVM), random forest (RF), and artificial neural network (ANN)) for personalized treatment of osteoporosis. To test the generalizability of the models, a main analysis (196 patients) and a subgroup analysis (154 patients) were conducted. A genetic algorithm was used to select informative attributes of osteoporosis patients treated in a Taiwan hospital. The grid search method was applied to tune the hyperparameters of SVM, RF, and ANN. In terms of accuracy and precision, there were no differences between the four methods. Neveen et al. [2] applied multi-label classification methods to detect adverse events on the Fosamax drug. Their results showed that decision trees (DT) with classifier chains have better recognition and computational performance compared to SVM and naïve Bayes. Jaganathan et al. [25] used the SVM to predict drug toxicity. Pearson correlation was applied to remove redundant and irrelevant attributes. Recursive feature elimination and cross-validation techniques were used to select the most significant attributes. They tuned their SVM using the grid search method. The hyperparameter-tuned SVM achieved better accuracy and F-score. In another study, Cano et al. [26] used RF in two ways: one for attribute ranking and selection and the other for detecting the activity of different drugs based on their chemical compounds. The optimal values of the RF parameters were selected based on the lowest prediction error. The results of the tuned RF on selected attributes outperformed the results of SVM and multi-layer perceptrons. Table 1 provides a summary of studies using the attribute weighting method and the attribute selection method.

Proposed Method
Our method is described in two parts: Section 3.1 describes our approach to incorporating attribute weights into LR, while Section 3.2 describes our approach to calculating attribute weights based on the chi-square statistic.

Weighted Logistic Regression
Logistic regression is a classification method that predicts the logit of a class $Y$ from one or more independent attributes as follows:
$$\operatorname{logit}(Y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \quad (1)$$
where $\alpha$ is the intercept, $x_i$ $(i = 1, \dots, n)$ are the attributes, and $\beta_i$ $(i = 1, \dots, n)$ are the log odds ratios. Both $\alpha$ and $\beta_i$ are estimated using the maximum-likelihood method, and the logit converts to a probability of belonging to class $Y$ as follows:
$$P(Y) = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}} \quad (2)$$
Our method incorporates the attribute weight $w_i$ of attribute $x_i$ as the term $a\ln(w_i)$ into the LR model as follows:
$$\operatorname{logit}(Y) = \alpha + (\beta_1 + a\ln(w_1))\,x_1 + (\beta_2 + a\ln(w_2))\,x_2 + \dots + (\beta_n + a\ln(w_n))\,x_n \quad (3)$$
where $a$ denotes a positive or negative sign and $\ln(w_i)$ is the natural logarithm of $w_i$.
The coefficient $\beta_i$ in logistic regression is the estimated log odds ratio obtained for a unit change in attribute $x_i$. The value of $\beta_i$ determines the type of relationship between $x_i$ and the logit of $Y$. If $\beta_i$ is positive, larger $x_i$ values are associated with a larger logit of $Y$. Conversely, if $\beta_i$ is negative, larger $x_i$ values are associated with a smaller logit of $Y$ [27]. Since $\ln(w_i)$ is negative when $w_i$ is less than 1, we propose to incorporate the weights differently for different combinations of $\beta_i$ and $w_i$, as shown in Table 2. For the cases where (1) $\beta_i$ is negative with $w_i < 1$ and (2) $\beta_i$ is positive with $w_i > 1$, the weight is incorporated by adding a positive $\ln(w_i)$ to $\beta_i$. For the cases where (3) $\beta_i$ is negative with $w_i > 1$ and (4) $\beta_i$ is positive with $w_i < 1$, the weight is incorporated by adding a negative $\ln(w_i)$ to $\beta_i$. By adding the attribute weights as proposed, the intrinsic relationship between $x_i$ and the logit of $Y$ is maintained.

Table 3 shows an example of the resulting parameter values for different combinations of $\beta_i$ and $w_i$. Referring to the example in Table 3, attribute $x_1$ has a negative $\beta$ value, while attribute $x_2$ has a positive $\beta$ value. For $x_1$, if the attribute weight is larger than 1, the weight is incorporated into the model by adding a negative $\ln(w)$. Conversely, if the weight of $x_1$ is less than 1, the weight is incorporated by adding a positive $\ln(w)$. For $x_2$, we add a positive $\ln(w)$ to $\beta$ if the weight of $x_2$ is larger than 1, and a negative $\ln(w)$ if the weight is less than 1. By incorporating weights as proposed, the sign of the resulting coefficient, which represents the log odds ratio of the attribute, remains unchanged, and the magnitude of the weight contribution is incorporated correctly.
Table 3. Example of adding attribute weight to different $\beta$ values.

Attribute Weight Based on Chi-Square
Chi-square ($\chi^2$) is a statistic used in various hypothesis tests. One of them is to test whether two categorical attributes are dependent. We propose measuring the weight of an attribute by calculating the $\chi^2$ value between this attribute and the target attribute. Given the target attribute $T$ with classes $t_k$ $(k = 1, \dots, z)$ and an attribute $x_i$ with values $b_j$ $(j = 1, \dots, s)$, the joint distribution of $T$ and $x_i$ is shown in Table 4.
Table 4. Joint distribution of target attribute $T$ and attribute $x_i$.
$O_{kj}$ is the observed number of records with attribute value $b_j$ that belong to class $t_k$, $M_{R_k}$ is the sum of row $k$, $M_{C_j}$ is the sum of column $j$, and $M$ is the total sample size.
The $\chi^2$ statistic of attribute $x_i$ is calculated as follows:
$$\chi^2_i = \sum_{k=1}^{z} \sum_{j=1}^{s} \frac{(O_{kj} - E_{kj})^2}{E_{kj}} \quad (4)$$
where $E_{kj}$ is the expected number of records with attribute value $b_j$ that belong to class $t_k$:
$$E_{kj} = \frac{M_{R_k} \times M_{C_j}}{M} \quad (5)$$
The final weight $w_i$ for attribute $x_i$ is then computed from $\chi^2_i$ following Equation (6), where $n$ is the total number of attributes.
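As an illustration of Equations (4) and (5), the following minimal Python sketch computes the chi-square statistic between one categorical attribute and the target attribute. The function name `chi_square` and the toy data are hypothetical and not taken from the study; no continuity correction is applied, and the conversion of the statistic into the final weight (Equation (6)) is intentionally not shown.

```python
from collections import Counter


def chi_square(attribute_values, target_classes):
    """Chi-square statistic between a categorical attribute and the target.

    Hypothetical helper: O_kj, M_Rk, M_Cj and M follow the notation of the
    joint distribution table; E_kj = M_Rk * M_Cj / M.
    """
    m = len(attribute_values)                                   # M
    observed = Counter(zip(target_classes, attribute_values))   # O_kj
    row_totals = Counter(target_classes)                        # M_Rk
    col_totals = Counter(attribute_values)                      # M_Cj

    chi2 = 0.0
    for t_k, m_r in row_totals.items():
        for b_j, m_c in col_totals.items():
            expected = m_r * m_c / m                            # E_kj
            # Counter returns 0 for (class, value) pairs never observed.
            chi2 += (observed[(t_k, b_j)] - expected) ** 2 / expected
    return chi2


# Toy data (illustrative only): an attribute that perfectly separates classes.
gender = ["F", "F", "M", "M"]
severity = ["severe", "severe", "non-severe", "non-severe"]
print(chi_square(gender, severity))
```

A larger statistic indicates a stronger dependence between the attribute and the target, which is the property the weighting scheme relies on.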
Algorithm 1 shows our proposed attribute weighted logistic regression, where the weights are calculated using the $\chi^2$ measure. First, the attribute weights are calculated from the training dataset using the $\chi^2$ measure. Then, these attribute weights are incorporated to train the LR model (Equation (3)).

Algorithm 1 Attribute weighted logistic regression
Input: training data
1: For each attribute $x_i$ in the training data
   - Compute $\chi^2_i$ following Equation (4)
   - Compute $w_i$ following Equation (6)
2: Incorporate attribute weights to train the weighted LR model (following Equation (3))
   If $w_i = 0$ then set $a\ln(w_i) = 1 \times 10^{-10}$
   Else if ($\beta_i > 0$ and $w_i > 1$) or ($\beta_i < 0$ and $w_i < 1$) then $a$ = positive
   Else if ($\beta_i > 0$ and $w_i < 1$) or ($\beta_i < 0$ and $w_i > 1$) then $a$ = negative
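The sign rule in step 2 of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the example coefficients, and the example weights are hypothetical, and the sketch assumes the rule is applied to coefficients that have already been estimated. The handling of $\beta_i = 0$ is not specified by the algorithm and is an arbitrary choice here.

```python
import math


def adjusted_coefficient(beta, weight, zero_term=1e-10):
    """Return beta + a*ln(weight) with the sign a chosen as in Algorithm 1."""
    if weight == 0:
        # ln(0) is undefined; Algorithm 1 sets the whole term to 1e-10.
        return beta + zero_term
    same_direction = (beta > 0 and weight > 1) or (beta < 0 and weight < 1)
    # Assumption: beta == 0 (not covered by the algorithm) takes the
    # negative branch here.
    a = 1.0 if same_direction else -1.0
    return beta + a * math.log(weight)


def weighted_logit(alpha, betas, weights, x):
    """Equation (3): alpha + sum_i (beta_i + a*ln(w_i)) * x_i."""
    return alpha + sum(
        adjusted_coefficient(b, w) * xi for b, w, xi in zip(betas, weights, x)
    )


def probability(logit):
    """Equation (2) written in its equivalent form 1 / (1 + e^(-logit))."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients and weights covering the four cases.
for beta, w in [(-0.5, 0.5), (0.5, 2.0), (-0.5, 2.0), (0.5, 0.5)]:
    print(beta, w, round(adjusted_coefficient(beta, w), 4))
```

In each of the four cases the adjusted coefficient keeps the sign of the original $\beta_i$, which matches the stated goal of preserving the relationship between $x_i$ and the logit of $Y$.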

Dataset and Evaluation Methods
This section describes the dataset, data preparation, and evaluation methods used in this study.

Description of the Data
The dataset used in this study was obtained from the online U.S. Food and Drug Administration database from 2004 to 2018 [28]. The data files included in our study are patients' demographics, drugs, indication (disease), outcome, and therapy. These files are linked via patient ID.
There are 228 drugs reported for adverse events in this dataset. The top ten drugs that were reported as the primary suspects with the most reported adverse events were included in this study. The resulting dataset has 20,576 records with 36 attributes. In this study, we included attributes that are directly related to patient characteristics (age and gender), the drug that caused the adverse event, drug regimens (dose amount, dose unit (microgram or milligram), and dose frequency), the therapy start date, the date of the adverse event, and the stage of osteoporosis disease. There are three stages of osteoporosis disease, which are measured by a Dual-energy X-ray Absorptiometry machine. Osteopenia (pre-osteoporosis) is the first stage and happens when bone density is between −1.5 and −2.5. The second stage is osteoporosis, in which bone density is −2.5. The third stage covers patients who used the related drugs for protection (osteoporosis prophylaxis).
According to the World Health Organization, a "severe adverse event" concerns the critical cases of patients who need immediate medical consultation, for instance, death, disability, hospitalization, or life-threatening conditions. Otherwise, the event is considered non-severe. Following this definition, we have divided the target attribute into two categories: severe and non-severe. The final dataset used in our study has 11,956 severe events and 8620 non-severe events.

Data Preparation
For each record, we calculated the number of days between the start of therapy and the occurrence of the adverse event and labelled this as "duration". Since the dose amounts were reported in milligrams or micrograms, we have converted those dose amounts from milligrams to micrograms. We have standardized the distribution of the three continuous attributes (i.e., age, duration, and dose amount) to have a mean of 0 and a unit standard deviation to avoid bias towards attributes with a large range. For attribute weights calculation, attributes with continuous values have to be discretized [4][5][6][13][29][30][31][32]. These three continuous attributes were discretized by applying the Minimum Description Length method [33]. The process of discretization starts by sorting continuous values in ascending order and then evaluating each candidate cut point, which is the midpoint between each successive pair of data. For cut-point evaluation, the data are divided into two partitions, and the resulting class information entropy is estimated. Finally, the cut-point that has the minimum entropy among all potential cut-points will be chosen to discretize the continuous attributes [33]. As a result of discretization, both age and duration attributes have been converted to categories. The dose amount is excluded from this study as there are no cut-points and all the records belong to the same interval after discretization. Table 5 shows the list of attributes used in this study. An overview of our method is shown in Figure 1.
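The cut-point search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it covers only a single pass of the search (sort, take midpoints between successive distinct values, and keep the midpoint with the minimum class information entropy) and does not implement the Minimum Description Length acceptance criterion of [33]. The function names and the toy ages are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def best_cut_point(values, labels):
    """Return (class information entropy, cut point) for the best midpoint.

    Returns None when all values are identical, i.e., no candidate exists.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no midpoint between equal values
        cut = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Entropy of the two partitions, weighted by partition size.
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or info < best[0]:
            best = (info, cut)
    return best


# Toy example (illustrative only): ages with a clean class boundary.
ages = [30, 40, 50, 60, 70, 80]
severity = ["non-severe"] * 3 + ["severe"] * 3
print(best_cut_point(ages, severity))
```

In a full discretizer this search would be applied recursively to each partition, with the stopping criterion deciding whether a candidate cut is accepted.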

Evaluation Methods
The classification performance was measured in terms of accuracy, precision, recall, and F-score. The severe class is considered the positive class. Following the definition in [34], accuracy is the ratio of correct predictions to all predictions, precision is the ratio of positive class predictions that actually belong to the positive class, recall is the ratio of positive records that are correctly predicted as positive, and F-score is the harmonic mean of the precision and the recall.
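For concreteness, the four measures can be computed from confusion-matrix counts as in the short Python sketch below. The function name and the example counts are hypothetical and are not results from this study; the zero-division fallbacks are an assumption made only to keep the sketch safe to run.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score with 'severe' as positive.

    tp/fp/fn/tn are the true positive, false positive, false negative and
    true negative counts.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=50, fp=50, fn=0, tn=100))
```

Because the F-score is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is reported alongside accuracy.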

Experiments and Results
The performance of our method was evaluated on the osteoporosis dataset (described in Section 4.1). First, the weights of the attributes were calculated from the training data. Table 6 shows the attribute weights (following Equation (6)) across the 10 folds. These weights are then incorporated into LR. We have conducted four experiments. The first experiment compared the classification performance of our method against the standard LR, i.e., without applying any attribute weighting method. The second experiment compared our proposed $\chi^2$ attribute weights with two baseline attribute weighting measures: the KL-based attribute weights [4] and the IG-based attribute weights [12]. The third experiment compared our method with three baseline classification algorithms, i.e., random forest, support vector machine, and decision tree. The fourth experiment compared the computational times of our method and all other baseline methods.
The training set was prepared using the balanced sampling technique, in which we randomly selected 7000 severe and 7000 non-severe records. The remainder (i.e., 6576 records) was used for testing, giving a training-test ratio of approximately 70:30. The severe class is treated as the positive class, and we have carried out 10-fold cross-validation for each experiment. The results are presented using comparative boxplots.
Figure 2 compares the performance of our method, the $\chi^2$ weighted logistic regression (LRCS), with the standard logistic regression (LR). LRCS outperformed LR in accuracy, recall, and F-score. The performance of LRCS is about 10% better than that of LR in accuracy and F-score and 20% better in recall. In terms of precision, LR performed slightly better than LRCS.
Figure 3 compares the performance of LRCS with two baseline attribute weighting measures, i.e., the weights calculated using IG (LRIG) and the weights calculated using KL (LRKL). These weights are incorporated into LR. Referring to Figure 3, LRCS performed equally to LRIG in all the measures. Compared with LRKL, our method performed better in accuracy, recall, and F-score, but not as well in precision.
Figure 4 compares the performance of LRCS with three baseline classification methods, i.e., decision tree (DT), random forest (RF), and support vector machine (SVM). Following the approach taken in [26,35], we tuned both the RF and SVM using the grid search method on our training data. For the tuned RF (TRF), the optimal number of trees was 1000, and the optimal number of splits was 2. For the tuned SVM (TSVM), the optimal values for cost and gamma were 0.5. LRCS performed better than all five baseline methods in terms of accuracy (8-15% higher), recall (20-30% higher), and F-score (8-15% higher), but has a slightly lower precision (about 3% lower).
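The balanced sampling used to build the training set at the start of this section can be sketched as follows. This is a minimal illustration using only the class counts reported in the paper; the function name `balanced_split`, the fixed seed, and the label strings are assumptions, and the actual sampling procedure used in the study may differ.

```python
import random


def balanced_split(labels, n_per_class, seed=0):
    """Draw n_per_class record indices from each class for training.

    All remaining indices form the test set. Hypothetical helper.
    """
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    train = []
    for class_indices in indices_by_class.values():
        # random.sample draws without replacement.
        train.extend(rng.sample(class_indices, n_per_class))

    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


# Class counts as reported: 11,956 severe and 8620 non-severe records.
labels = ["severe"] * 11956 + ["non-severe"] * 8620
train_idx, test_idx = balanced_split(labels, n_per_class=7000)
print(len(train_idx), len(test_idx))
```

Note that only the training set is balanced under this scheme; the test set keeps whatever class proportions remain after sampling.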

Conclusions
In this study, we have proposed: (1) an attribute weight measure based on the chi-square statistic; and (2) a method to incorporate attribute weights into logistic regression to predict the severity of adverse drug events. Experimental results showed that incorporating attribute weights improved the classification performance of logistic regression. Our $\chi^2$ attribute weights method performed better than the standard logistic regression and KL-based attribute weights, and equally well with the IG-based attribute weights. Our attribute weighted logistic regression performed better than the three baseline methods, i.e., decision tree, random forest, and support vector machine. Our method also outperformed the hyperparameter-tuned random forest and support vector machine. In terms of running time, our method does not affect the computational performance of logistic regression, and the running time is lower compared to random forest, support vector machine, and hyperparameter-tuned models. To the best of our knowledge, this is the first study to propose attribute weighted logistic regression to incorporate the significance of attributes for binary classification. Adverse drug events are sometimes unavoidable, but serious events should be reduced to safeguard patients' health. The experimental results showed that our method performed well in predicting serious adverse drug events in osteoporosis disease, as the recall of our method is the highest, with an increase of at least 15% compared to all other baseline methods. As for future work, we plan to extend our method to other medical datasets.

Conflicts of Interest:
The authors declare no conflict of interest.