Osteoporosis Pre-Screening Using Ensemble Machine Learning in Postmenopausal Korean Women

As osteoporosis is a degenerative disease related to postmenopausal aging, early diagnosis is vital. This study used data from the Korea National Health and Nutrition Examination Surveys to predict a patient’s risk of osteoporosis using machine learning algorithms. Data from 1431 postmenopausal women aged 40–69 years were used, including 20 features affecting osteoporosis, chosen by feature importance and recursive feature elimination. Random Forest (RF), AdaBoost, and Gradient Boosting (GBM) machine learning algorithms were each used to train three models: A, checkup features; B, survey features; and C, both checkup and survey features, respectively. Of the three models, Model C generated the best outcomes with an accuracy of 0.832 for RF, 0.849 for AdaBoost, and 0.829 for GBM. Its area under the receiver operating characteristic curve (AUROC) was 0.919 for RF, 0.921 for AdaBoost, and 0.908 for GBM. By utilizing multiple feature selection methods, the ensemble models of this study achieved excellent results with an AUROC score of 0.921 with AdaBoost, which is 0.1–0.2 higher than those of the best performing models from recent studies. Our model can be further improved as a practical medical tool for the early diagnosis of osteoporosis after menopause.


Introduction
Osteoporosis is a representative disease that accompanies aging and is closely related to skeletal fractures and deaths [1]. Therefore, a methodology for early diagnosis and prevention has been proposed. Osteoporosis is diagnosed by measuring bone mineral density (BMD) using dual-energy X-ray absorptiometry (DXA) equipment [2]. However, the associated costs are expensive [3]. Hence, with the accelerating growth of aged populations, the financial burdens of individuals and governments are increasing dramatically [4,5]. Notably, a pre-screening diagnosis method that leverages data from surveys and checkups to evaluate osteoporosis risk in advance would greatly benefit prevention and treatment while reducing economic and financial burdens on society. For these reasons, pre-screening diagnosis methods have been actively studied. Thus, many conventional methods of predicting osteoporosis risk are used, including the Osteoporosis Self-Assessment Tool for Asians [6], the osteoporosis risk assessment instrument [7], simple calculated osteoporosis risk estimation [8], and the osteoporosis index of risk [9,10], because these methods rely on only two or three features to predict osteoporosis simply. However, because enormous amounts of medical data are collected nowadays, it is necessary to apply complicated statistical methods to utilize data in advance for better results [3]. Machine learning is an artificial intelligence technique for learning patterns and predicting outcomes based on input data [11]. Machine learning is especially effective at identifying trends and making predictions from multi-dimensional data and has already been applied to osteoporosis diagnosis. For example, E et al. [12] attempted to improve the low accuracy of osteoporosis prevalence predictions using machine learning, and Kim et al. [13] applied machine learning techniques to pre-screen osteoporosis in postmenopausal women in Korea.
Feature selection is important for machine learning efficiency and accuracy [14]. In most machine learning osteoporosis diagnosis methods, selected features known to influence osteoporosis are used to train machine learning models from prepared datasets [12,13,15,16]. This study instead applies a method of selecting features optimized for machine learning from high-dimensional data, rather than by filtering features in advance based on expert knowledge. The performance of this method turns out to be better than those of extant feature selection methods.
The Korea National Health and Nutrition Examination Survey (KNHANES) is a nationwide survey of Korean residents that collects general health and nutrition data, including those of bone densitometry. Lee and Lee [15], Shim et al. [16], and Yoo et al. [17] studied machine learning models that predict osteoporosis based on the features related to osteoporosis, achieving area under the receiver operating characteristic (AUC) curve performances of 0.710, 0.743, and 0.827, respectively. Based on this, the current study trains and evaluates a machine learning model that predicts osteoporosis in postmenopausal women using raw data from the KNHANES (2008-2011).

Data Collection
The KNHANES database was established to identify the health and nutritional statuses of Korean citizens following the 1998 enactment of Article 16 of the National Health Promotion Act. Hence, the survey has been conducted yearly with raw data released online. KNHANES include data from common participant information, health behavior surveys, health examinations, and nutrition surveys [18]. The current study uses raw data from the V-4 (2008-2009) and V-5 (2010-2011) surveys, when osteoporosis tests were performed using DXA equipment [19]. The current study's use of KNHANES data received ethical approval from the Institutional Review Board of the Korea Centers for Disease Control and Prevention (IRB Num. IS19EISI0063). Data were downloaded from the KNHANES website (https://knhanes.kdca.go.kr/knhanes/main.do (accessed on 7 October 2020)).

Study Participants
Among the 21,303 participants, four exclusion criteria were applied to meet the purpose of this study. First, those who were not tested for osteoporosis were excluded. Second, osteopenia patients were excluded because the purpose of this study is to determine whether patients have osteoporosis or normal as binary classification. Third, men were excluded, as the focus of the study is postmenopausal women. Furthermore, patients who had experienced both menopause and a hysterectomy were included as they have a high chance of contracting osteoporosis, based on previous studies [20]. Fourth, only participants aged 40-69 years were included as most over 70 have already suffered osteoporosis. Considering all four criteria, 1431 participants remained.

Bone Mineral Density and T-Score
Osteoporosis is normally diagnosed using BMD tests via DXA, which measures the inorganic content in bone to determine the risk of fracture and identify the prevalence of osteoporosis. BMDs of the lumbar spine and femur are usually measured, but those of the wrist, finger, or heel may be substituted [21].
In KNHANES V-4 and V-5, BMD was measured in three areas: lumbar spine, femur neck, and the whole femur. The individually measured BMDs were used to diagnose Healthcare 2022, 10, 1107 3 of 11 osteoporosis after calculating T-scores and comparing them to the BMDs of healthy adults, according to a recommendation by the World Health Organization. The largest BMD dataset from Asia Japan (DISCOVERY-W; fan-beam densitometer, Hologic, Inc., USA) was used as the healthy adult group [22].
T-scores calculated using the above equation were classified: a T-score of −1 or higher was classified as normal, −1 to −2.5 was classified as osteopenia, and less than −2.5 was classified as osteoporosis [23]. Based on above criteria, T-scores were classified into three different classes, and only two classes (normal and osteoporosis) as a dependent variable were used to train binary classification model. Figure 1 displays a flowchart explaining the study design. Prior to analysis, participants were selected considering the criteria explained in Section 2.2, and data preprocessing was conducted afterward (see Section 2.5). Machine learning algorithms were applied to predict the occurrence of osteoporosis based on 20 features having high classification influence, chosen via discussions with medical specialists and feature importance scoring from trained machine learning algorithms. The features consisted of 10 biochemical screening results (Model A) and 10 survey results (Model B). When combined, Model C is obtained. In KNHANES V-4 and V-5, BMD was measured in three areas: lumbar spine, femur neck, and the whole femur. The individually measured BMDs were used to diagnose osteoporosis after calculating T-scores and comparing them to the BMDs of healthy adults, according to a recommendation by the World Health Organization. The largest BMD dataset from Asia Japan (DISCOVERY-W; fan-beam densitometer, Hologic, Inc., USA) was used as the healthy adult group [22].

Experimental Design
T-score = (BMD − group's BMD mean) * group's BMD stddev, T-scores calculated using the above equation were classified: a T-score of −1 or higher was classified as normal, −1 to −2.5 was classified as osteopenia, and less than −2.5 was classified as osteoporosis [23]. Based on above criteria, T-scores were classified into three different classes, and only two classes (normal and osteoporosis) as a dependent variable were used to train binary classification model. Figure 1 displays a flowchart explaining the study design. Prior to analysis, participants were selected considering the criteria explained in Section 2.2, and data preprocessing was conducted afterward (see Section 2.5). Machine learning algorithms were applied to predict the occurrence of osteoporosis based on 20 features having high classification influence, chosen via discussions with medical specialists and feature importance scoring from trained machine learning algorithms. The features consisted of 10 biochemical screening results (Model A) and 10 survey results (Model B). When combined, Model C is obtained.

Data Preprocessing
Data preprocessing was performed using Python V.3.8 using the pandas, numpy, and scikit-learn libraries. Outliers and non-responses were converted into missing values Healthcare 2022, 10, 1107 4 of 11 ("N/A") based on the KNHANES data guidelines, and multinominal data were analyzed using the one-hot encoding method. Feature engineering was used to integrate features with overlapping meanings by year, and all data were conditions with standard scaling prior to training.

Feature Selection
As shown in Figure 1, feature selection was performed based on feature importance and recursive feature elimination (RFE), following data processing. Feature importance refers a measure of the individual contribution of the corresponding feature for a particular classifier, regardless of the shape or direction of the feature effect [24]. The higher the feature importance, the greater the influence on algorithmic decision-making. RFE is a backward feature selection technique that removes features with low importance, considering the size of the input feature set. The machine learning model was trained on all features initially, and unimportant features were then eliminated from the set.

Machine Learning Algorithms
A total of eight different machine learning models (KNN, Decision Tree, LDA, QDA, SVC, Random Forest, AdaBoost, and Gradient Boosting Machine) were trained and evaluated based on the KNHANES data during study. Among them, three machine learning models (Random Forest, AdaBoost, and Gradient Boosting Machine) were selected with the highest performance. In this study, three ensemble machine learning algorithms were used to analyze KNHANES data: Random Forest (RF), AdaBoost, and Gradient Boosting Machine (GBM). Ensemble learning connects several weak learning algorithms to obtain stronger results, which is effective in solving classification and regression problems. RF generates a strong decision tree by combining the outputs of several randomly generated ones [25]. AdaBoost is a classification-based model that synthesizes a classifier strengthened through weight modification by combining many weak classifiers. GBM sequentially generates trees in a manner that mitigates the errors of previous trees using gradient boosting classifiers [26].

Model Training
The k-fold cross-validation method was used for machine learning training and verification k times by allocating verification data differently for each iteration after dividing the dataset into k folds [27]. In this study, the training and testing datasets were divided 80:20 for learning and performance measurement, and k was set to five. This study repeated this cross-validation method 10 times, followed by an accuracy comparative analysis of 50 total learned models. During training, hyperparameters were optimized using the grid-search approach, a tuning technique that computes the optimal combination of hyperparameters by verifying the performance of all possible combinations using cross-validation [28].

Model Evaluation
Two indicators are normally required to evaluate machine learning performance. The first is the area under the curve (AUC) score from the receiver operating characteristic curve, which is curve-plotting sensitivity vs. one minus specificity. In statistic fields, the accuracy of the machine learning model will improve as the AUC approaches one [29].
Principal component analysis (PCA) is a multivariate analysis method that finds the main components represented by a linear combination of variables by identifying the variation-covariant relationships between large quantitative variables. PCA was used in the present study to visualize the clusters of target patients using two-dimensional reduced principal component variables.

Statistical Analysis
As the dependent variable of this study is the T-score, point-biserial correlation and phi correlation analyses were performed to calculate correlations instead of using the Pearson coefficient. Point-biserial correlation measures the correlation when one variable is a binary variable and the other is continuous [30]. The phi correlation analysis determines the degree of correlation between two variables when both independent and dependent variables are binary [31].

Draft Model Building
Training with 1151 features (original data), the AdaBoost model showed the best performance in terms of the AUC (0.91), followed by the GBM (0.90) and RF (0.86). See Figure 2A. Additionally, the osteoporosis per se (dependent variable) was not clearly classified into two separate groups (normal and osteoporosis) based on only two main features (PC1 and PC2), whereas the PCA was performed on 1151 features ( Figure 2B).
iation-covariant relationships between large quantitative variables. PCA was used in the present study to visualize the clusters of target patients using two-dimensional reduced principal component variables.

Statistical Analysis
As the dependent variable of this study is the T-score, point-biserial correlation and phi correlation analyses were performed to calculate correlations instead of using the Pearson coefficient. Point-biserial correlation measures the correlation when one variable is a binary variable and the other is continuous [30]. The phi correlation analysis determines the degree of correlation between two variables when both independent and dependent variables are binary [31].

Draft Model Building
Training with 1151 features (original data), the AdaBoost model showed the best performance in terms of the AUC (0.91), followed by the GBM (0.90) and RF (0.86). See Figure  2A. Additionally, the osteoporosis per se (dependent variable) was not clearly classified into two separate groups (normal and osteoporosis) based on only two main features (PC1 and PC2), whereas the PCA was performed on 1151 features ( Figure 2B).

Feature Selection and Statistical Analysis
Survey data are questions that the patient can directly respond to and are related to people's life patterns. Checkup data are collected with the biochemical screening result of participants. Table 1 shows the descriptive statistics of the 20 features selected for importance, and Table 2 presents a list of 20 variables selected by referring to the feature importance as well as one-to-one correlation coefficients between each variable and DX_OST (dependent variable). As a result of the point-biserial correlation analysis, the age variable had the highest correlation at 0.540 in the positive direction, followed by age of menarche (0.24) and use of estrogen (0.17). Among the survey data, education level had

Feature Selection and Statistical Analysis
Survey data are questions that the patient can directly respond to and are related to people's life patterns. Checkup data are collected with the biochemical screening result of participants. Table 1 shows the descriptive statistics of the 20 features selected for importance, and Table 2 presents a list of 20 variables selected by referring to the feature importance as well as one-to-one correlation coefficients between each variable and DX_OST (dependent variable). As a result of the point-biserial correlation analysis, the age variable had the highest correlation at 0.540 in the positive direction, followed by age of menarche (0.24) and use of estrogen (0.17). Among the survey data, education level had the greatest negative correlation at −0.34. Serum alkaline phosphatase level was the highest at 0.233 for screening questions with a positive correlation. From the screening questions, weight (HE_wt) scored the highest negative correlation (−0.43) with the DX_OST, followed by height (HE_ht) at −0.37.

Models (A, B, and C) Performance
The three machine learning models were each trained using Models A, B, and C, and grid search and five-fold cross-validation techniques were used to determine the optimized hyperparameters for the best performance. The performance of Model C (Figure 3)

Models (A, B, and C) Performance
The three machine learning models were each trained using Models A, B, and C, and grid search and five-fold cross-validation techniques were used to determine the opti mized hyperparameters for the best performance. The performance of Model C (Figure 3 had a high average AUC of 0.88. Models A and B had AUCs exceeding 0.80 and 0.83 respectively.   Table S2 show the performance of Model C, and using the same process the results of the best model performances of Models A and B can be viewed in the   Table S2 show the performance of Model C, and using the same process, the results of the best model performances of Models A and B can be viewed in the supplementary section (Figures S2 and S3). Figure 4A shows the result of the ROC curves for RF, AdaBoost, and GBM. The AUCs of the RF and AdaBoost algorithms were both 0.92, and the GBM showed no significant difference at 0.91. Referring to Supplementary Table S2, the performance indicators of accuracy, precision, and recall resulted in low variations among algorithms and were stable. Figure 4B shows the results of the two-dimensional PCA for the 20 selected features. Osteoporosis and normal clusters were not completely separated, but two clusters of PC1 could be distinguished between zero and one along the x-axis.
supplementary section (Figures S2 and S3). Figure 4A shows the result of the ROC curves for RF, AdaBoost, and GBM. The AUCs of the RF and AdaBoost algorithms were both 0.92, and the GBM showed no significant difference at 0.91. Referring to Supplementary Table S2, the performance indicators of accuracy, precision, and recall resulted in low variations among algorithms and were stable. Figure 4B shows the results of the two-dimensional PCA for the 20 selected features. Osteoporosis and normal clusters were not completely separated, but two clusters of PC1 could be distinguished between zero and one along the x-axis.

Discussion
In this study, a PCA was performed prior to feature selection and afterward to confirm the relationships between osteoporosis and the selected features. According to the PCA plot ( Figure 2B) of the draft model, the normal and osteoporosis groups could not be clearly distinguished based on the two principal components. However, the results of Model C ( Figure 4B) implied that the clusters were distinguishable on the right side (normal) and the left side (osteoporosis) based on the specific value at the x-axis (between zero and one). There was no significant difference between the AUC of the draft model and Model C. For Model C, the variation in the AUCs among the three machine learning algorithms was small. Thus, training machine learning models with a small number of features is more effective than using all features in terms of model efficiency and stability.
Checkup data would be used to predict the occurrence of osteoporosis with an 80% accuracy when applying Model A and survey data would be used to predict the occurrence of osteoporosis with an 85% accuracy if Model B was applied. Finally, if both checkup and survey data were available, Model C would be appropriate to predict the occurrence of osteoporosis with an 88% accuracy. To sum up, each model would be used practically depending on the type of data collected.
Prior studies selected features that potentially affected osteoporosis based on knowledge of the medical domain. However, in this study, the feature selection step was performed using medical domain knowledge alongside feature importance and RFE techniques. Instead of collecting commonly known significant features, a large dataset was used to describe the participants in as much detail as possible. A manual data preprocessing step was also necessary to improve training and prediction accuracy. For example, the beginning age of drinking is meaningful only within the group that already had

Discussion
In this study, a PCA was performed prior to feature selection and afterward to confirm the relationships between osteoporosis and the selected features. According to the PCA plot ( Figure 2B) of the draft model, the normal and osteoporosis groups could not be clearly distinguished based on the two principal components. However, the results of Model C ( Figure 4B) implied that the clusters were distinguishable on the right side (normal) and the left side (osteoporosis) based on the specific value at the x-axis (between zero and one). There was no significant difference between the AUC of the draft model and Model C. For Model C, the variation in the AUCs among the three machine learning algorithms was small. Thus, training machine learning models with a small number of features is more effective than using all features in terms of model efficiency and stability.
Checkup data would be used to predict the occurrence of osteoporosis with an 80% accuracy when applying Model A and survey data would be used to predict the occurrence of osteoporosis with an 85% accuracy if Model B was applied. Finally, if both checkup and survey data were available, Model C would be appropriate to predict the occurrence of osteoporosis with an 88% accuracy. To sum up, each model would be used practically depending on the type of data collected.
Prior studies selected features that potentially affected osteoporosis based on knowledge of the medical domain. However, in this study, the feature selection step was performed using medical domain knowledge alongside feature importance and RFE techniques. Instead of collecting commonly known significant features, a large dataset was used to describe the participants in as much detail as possible. A manual data preprocessing step was also necessary to improve training and prediction accuracy. For example, the beginning age of drinking is meaningful only within the group that already had experience in drinking. Therefore, in this case, a feature engineering technique was used to convert the variable into a new one combining drinking experience and the beginning age of drinking. Using this feature selection method with preprocessing, the machine learning models of this study had better results, with AUCs of 0.919 (RF), 0.921 (AdaBoost), and 0.908 (GBM). These scores are approximately 0.1-0.2 higher than the scores of the best model performance from previous studies. Although this study did not fully consider the clinical knowledge, the unique feature selection method and data preprocessing step had a positive influence on model performance via the selection of more suitable features and the merging of various raw data into more meaningful data. In particular, the features selected in this study could be classified into two different groups (i.e., checkup and survey), each of which results in an AUC score of at least 0.80. Therefore, the trained machine learning model from this study may serve as an osteoporosis assistant diagnostic program that predicts the occurrence of osteoporosis and determines the necessity of more thorough examinations.
There were several limitations in this study. First, the raw data from the KNHANES were gathered from cross-sectional observational studies performed at limited points in time across a limited sample population. Second, as participant selection was restricted to women between 40 and 69 years old, it may be difficult to generalize the results of this study to all populations in Korea.
The results of this study can be used as an auxiliary diagnosis program for osteoporosis in the future. In a further study, the models will verify if clinical data with the same features collected from medical institutions can be generalized. Furthermore, as medical image data and deep learning technology can be used for osteoporosis diagnosis, combined with the results of this study, it might be used as a more objective and accurate osteoporosis auxiliary diagnostic tool [32,33].

Conclusions
This study generated a prediction model for classifying the osteoporosis using three machine learning algorithms based on 20 features obtained through the feature selection step. The model (Model C) including both checkup and survey features, had an AUROC value (0.92) based on 20 features. Additionally, the model (Model A) with only checkup features, scored an AUROC value (0.81), and the model (Model B) with only survey features, attained an AUROC value (0.85). The trained osteoporosis prediction models when each dataset is available are expected to be useful as an auxiliary diagnostic tool for women after menopause.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/healthcare10061107/s1, Figure S1: Machine Learning Curve for each best models using Grid search method; Figure S2: ROC Curve for three different best models based on 10 selected Checkup features; Figure S3: ROC Curve for three different best models based on 10 selected Survey features; Figure S4: Feature importance bar graph for 20 features measured by three different ensemble machine learning models; Table S1: The number of normal and osteoporosis patients distributed in women; Table S2: Summary of the Model C's Performances on the test set (n = 287).  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available to download at https: //knhanes.kdca.go.kr/knhanes/main.do (accessed on 7 October 2020) for research purposes. It is necessary to obtain IRB registration of the KNHANES dataset for publishing the study.