Next Article in Journal
Genetic Associations of Visfatin Polymorphisms with EGFR Status and Clinicopathologic Characteristics in Lung Adenocarcinoma
Next Article in Special Issue
Trade Flow Optimization Model for Plastic Pollution Reduction
Previous Article in Journal
Occurrence Location and Propagation Inconformity Characteristics of Vibration Events in a Heading Face ofa Coal Mine
Previous Article in Special Issue
Improving Water Quality Index Prediction Using Regression Learning Models
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Risk Association of Liver Cancer and Hepatitis B with Tree Ensemble and Lifestyle Features

1
School of Industrial and Management Engineering, Korea University, 145 Anamro, Seongbuk-gu, Seoul 02841, Republic of Korea
2
Department of Industrial and Management Systems Engineering, Kyung Hee University, 1732, Deogyeong-daero, Giheung-gu, Yongin-si 17104, Republic of Korea
*
Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2022, 19(22), 15171; https://doi.org/10.3390/ijerph192215171
Submission received: 13 September 2022 / Revised: 11 November 2022 / Accepted: 15 November 2022 / Published: 17 November 2022
(This article belongs to the Special Issue Data Science for Environment and Health Applications)

Abstract

:
The second-largest cause of death by cancer in Korea is liver cancer, which leads to acute morbidity and mortality. Hepatitis B is the most common cause of liver cancer. About 70% of liver cancer patients suffer from hepatitis B. Early risk association of liver cancer and hepatitis B can help prevent fatal conditions. We propose a risk association method for liver cancer and hepatitis B with only lifestyle features. The diagnostic features were excluded to reduce the cost of gathering medical data. The data source is the Korea National Health and Nutrition Examination Survey (KNHANES) from 2007 to 2019. We use 3872 and 4640 subjects for liver cancer and hepatitis B model, respectively. Random forest is employed to determine functional relationships between liver diseases and lifestyle features. The performance of our proposed method was compared with six machine learning methods. The results showed the proposed method outperformed the other methods in the area under the receiver operator characteristic curve of 0.8367. The promising results confirm the superior performance of the proposed method and show that the proposed method with only lifestyle features provides significant advantages, potentially reducing the cost of detecting patients who require liver health care in advance.

1. Introduction

The liver is the largest internal organ responsible for crucial body functions, such as blood coagulation, protein production, and glycogen synthesis [1,2,3,4]. Despite the importance of the organ, liver cancer is the second-largest cause of death in Korea [5]. The leading cause of liver cancer is a liver disease induced by obesity and alcohol, and hepatitis B [6]. Particularly, 70% of liver cancer patients are afflicted with hepatitis B. Hepatitis B is a disease caused by the immune response of the body to the hepatitis B virus (HBV) or contact with contaminated inanimate objects and blood [7]. It is challenging to treat effectively in the last stage of liver cancer, so that the best way to lower liver cancer mortality is to prevent the transition from hepatitis B or detect liver cancer in the early stage [8].
Disease risk association has taken a significant role in the medical research field [9]. The main objective of disease risk association is detecting diseases in the early stage to prevent leading to fatal conditions of patients [10]. We can obtain a variety of medical data and apply modern machine learning methods to predict disease risk accurately [11]. The machine learning methods have the potentials to find significant relationships between patients’ personalized data and medical diagnosis results [12].
Results of previous studies on the risk association for liver cancer have been introduced as follows: Wu et al. proposed a risk-classifier of liver cancer based on a support vector machine (SVM) with 22 biomarkers such as alpha-fetoprotein [13]. Chen et al. proposed to detect the risk of liver cancer by using clinic data such as alkaline phosphatase and aspartate transaminase and compared performances of logistic regression, decision tree, and SVM. Previous studies predict the risk of liver cancer by using clinic data and machine learning methods [14].
Other studies have attempted to predict risk for various diseases. Ward et al. proposed atherosclerotic cardiovascular disease risk association based on machine learning method with clinical data including cholesterol levels and blood pressure [15]. Perveen et al. attempted to predict the risk of non-alcoholic fatty liver disease with a decision tree and medical diagnosis results such as high-density lipoprotein and triglycerides [16]. Dimopoulos et al. proposed machine learning methods using both clinic and lifestyle features to predict cardiovascular risk [17].
The previous studies have shown promising results in risk association of liver and various diseases; however, they employ clinical diagnosis results. The clinical data enhances the risk association model’s performance with personalized information. The approaches have an explicit limitation to achieving the objective of early screening of patients. Leveraging the data from various treatments in hospitals takes considerable time, and costs [18,19]. In contrast, lifestyle features are relatively cost-effective and publicly available.
In this study, we propose a random forest-based method to predict the risk of liver cancer and hepatitis B only using lifestyle features. In consideration of related works which imply significant correlations between liver disease and lifestyle features [20,21], we assume it is possible to predict disease risk reliably with lifestyle features only. We introduce models that can predict liver cancer and hepatitis B by exploiting the same lifestyle features. The model predicts the risk cost-effectively for high-risk people with liver cancer and hepatitis B. Furthermore, it has a broader impact on managing liver health regardless of the accessibility to medical institutions.
The remainder of this paper is organized as follows. Section 2 details data and prediction methods, describing the input data, modeling, and sensitivity analysis. Section 3 presents a performance evaluation for risk association of hepatitis B and liver cancer. Section 4 discusses comparative study and sensitivity analysis results. Finally, Section 5 offers our conclusive comments.

2. Material and Methods

2.1. Data Source

We use data from the Korea National Health and Nutrition Examination Survey (KNHANES) [22] which is publicly available and consists of health status, examination, and nutrition survey. Data collection is conducted annually to evaluate Koreans’ health and nutritional status and monitor trends in health risk factors and the prevalence of major chronic diseases according to Korean Article 16 of the National Health Promotion Act [22]. Therefore, the target population of KNHANES comprises Korean citizens residing in Korea, and representative samples for the survey are newly extracted every year to obtain generalizable survey data according to the sampling plan. The sampling plan follows a multi-stage clustered probability design. We use KNHANES data gathered over 13 years, from 2007 to 2019, as cross-sectional data to develop risk association models. It has 510,747 cases, and adult data were extracted. We selected 13 features closely related to the lifestyle. The list of desired features is presented in Table 1. We use the KNHANES dataset to develop models which detect potential patients who require liver health care in advance.

2.2. Data Preprocessing

We extracted liver cancer and hepatitis B samples to train risk association models. Figure 1 shows the data preprocessing flow configuring the liver cancer and hepatitis B dataset. The hepatitis B dataset consists of 856 subjects. The 58 subjects in the liver cancer dataset have experienced liver cancer. All cases with hepatitis B and liver cancer are provided by the national health and nutrition survey. Each dataset was divided into train and test sets. Train sets account for 70% of entire data sets, and test sets for 30%. Additionally, we construct train sets by randomly sampling subjects with and without a liver disease diagnosis in equal proportion to handle the class imbalance. By sampling subjects with and without a liver disease diagnosis to construct train sets in equal proportion, we improve the models to learn both majority class and minority class in equal weight. The train set consists of 82 and 1198 subjects sampled from liver cancer and hepatitis B datasets, respectively. Half of the train set is the liver disease cases, and the other half is cases without liver disease.
We imputed missing values in the dataset with mice [23], simple, and K-nearest neighbor (K-NN) imputation [24,25]. The final imputation method is determined by a comparative study. We present the imputation study result in Section 3.

2.3. Risk Association Model

We adopt the random forest for the risk association method. Random forest is representative ensemble method based on decision tree and have shown promising results in disease risk association [26,27]. We validate the random forest with grid search-based cross-validation (GridSearchCV) [28] to determine optimal hyperparameters. GridSearchCV validates all combinations of hyperparameters in terms of validation accuracy. Table 2 and Table 3 show the result of hyperparameter tuning. After tuning the hyperparameters, we compared the performances in terms of accuracy and an area under receiver operator characteristic curve (AUC) [29].

2.4. Performance Evaluation and Model Interpretation

We examine the validity of proposed method by comparing with logistic regression [30], decision tree [31], artificial neural networks(ANN) [32], extreme gradient boosting(XGboost) [33], light gradient boosting machine(lightGBM) [34], and SVM [35]. The metrics for evaluation are AUC and accuracy. The AUC is the representative value-based metric for risk association models by quantitatively measuring the entire two-dimensional area under the receiver operating characteristic (ROC) curve [36,37,38]. It measures how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is in distinguishing between classes. In the medical field, previous studies generally used AUC as an evaluation index for disease risk association [39,40,41]. Accuracy is an index of the proportion of correctly predicted cases among all cases. We averaged the validation results repeated 30 times and calculated the standard deviation.
We conducted a sensitivity analysis after training prediction models to interpret the training results [42]. Random forest is usually more accurate than other models, but it has the disadvantage of being difficult to interpret. Even the decision tree, the base model for random forest, is interpretable, ensembe makes the random forest lose the interpretability. To address the issue, we performed a sensitivity analysis in two methods.
As the first method, we performed a sensitivity analysis by randomly permuting one feature in the dataset and then retraining with the same model. We can analyze the result whether the corresponding feature significantly affects the performance of disease risk association by observing significant drop in the model’s performance after the re-fitting. We identified the performance drop with AUC. The second method is a Shapley additive explanation (SHAP) [43]. SHAP captures the influence of each feature on the model’s performance through the Shapley value, the average marginal contribution of a feature over all possible coalitions of features. Therefore, we can identify significant features having a large Shapley value.

3. Case Study Results

3.1. Results of Imputations

We compare the three imputation methods: simple imputation, K-NN imputation, and mice imputation. Table 4 and Table 5 illustrate the comparison results over imputation methods by AUC. In the risk association model for liver cancer, simple imputation achieves the highest AUC of 0.8317. The overall results imply that the simple imputation is the best for imputing missing values in our dataset in all models as well as random forest. In the risk association model for hepatitis B, mice imputation achieves the highest AUC of 0.8632. Mice imputation shows better performance than simple imputation. We subsequently use the simple imputation and mice imputation for liver cancer and hepatitis B risk association, respectively.

3.2. Performance Evaluation

Table 6 shows evaluation results over comparative methods. Both risk association models with random forest for hepatitis B and liver cancer performed well with AUC values of 0.8367 and 0.88083. Values of accuracy in the test set were 0.9245 for the hepatitis B risk association model and 0.8190 for the risk association model for liver cancer. The overall results demonstrate that the proposed method outperformed other comparative methods.
In addition, to demonstrate the superiority of the proposed method over comparative methods, we conducted Wilcoxon signed-rank test on the random forest and comparative methods with rank statistics. In each validation step, we calculate the rank of the methods, and after validation, the number of rank results for each method is equal to 30. The 30 rank results are statistically compared with Wilcoxon signed-rank test. Table 7 shows the Wilcoxon signed-rank test results between random forest and other comparative methods. The result demonstrates the statistical significance between the random forest and others. In Table 7, all p-values are less than 0.05. The result demonstrates that the performance gap between random forest and others is statistically significant.

3.3. Sensitivity Analysis

We conducted a sensitivity analysis to identify which features significantly improve random forest’s performance using two approaches. The first method randomly permutates a specific feature and then retrains the random forest model with the randomly permuted feature. Sensitivity analysis also used AUC to evaluate the performance. We average the AUC values obtained from 30 retrained models for accurate comparison and calculate the standard deviation. The second method uses SHAP to evaluate the influence of each feature on random forest model learning. Similarly, we average the each feature’s SHAP values obtained from test dataset for an accurate comparison.
We established the hypothesis based on well-known prior studies [44,45,46] on the correlation between liver health and lifestyle features. The established hypotheses are: first, high body mass index (BMI) is closely related to the risk of hepatitis B and liver cancer because it can cause various adult diseases. BMI is the weight (kg) divided by the height (m) square, commonly used to measure a person’s health condition [44]. Second, hepatitis B is generally known as a high-risk factor for liver cancer [45]. It is assumed that hepatitis B is a noteworthy feature in the risk model of liver cancer. Finally, drinking is closely related to liver health [46]. Therefore, we assume that the drinking age will significantly impact hepatitis B and liver cancer risk.
We interpret the sensitivity analysis result according to the above hypothesis. Table 8 shows no significant difference between the sensitivity analysis results of the BMI, and the existing model’s performance. Interestingly, average weekend sleep time per day is the noteworthy feature that causes the most considerable performance degradation in the model. The difference of mean AUC between original model and permutated model by the average weekend sleep time per day is 0.0311. Considering the confidence intervals, the difference is significant.
In Table 9, we present the sensitivity analysis results of the liver cancer risk association. The drinking age, average weekend sleep time per day, hepatitis B, and weight affect the performance of the prediction model. As a result of hypothesis testing, the hepatitis B and BMI have relatively large effects compared to other features, but the drinking age has no significant impact on performance. The hepatitis B causes the most critical performance degradation of mean AUC 0.0432.
As shown in Table 10, we present the SHAP results of the hepatitis B risk association. The average weekend sleep time per day, weekday sleep time per day, and total monthly household income have relatively larger influence on the prediction model than other features. As a result of hypothesis testing, the drinking age has no significant impact on performance. The SHAP value of the drinking age is 0.0190. Table 11 shows the age and hepatitis B have relatively significant influence on prediction model’s performance. However, the drinking age is the feature that causes the no considerable influence on performance.
We interpret the models in sensitivity analysis and SHAP. The overall results are meaningful because it can explain features that have a great effect on the prediction models. Table 12 shows the top 5 features of sensitivity analysis and SHAP about hepatitis B risk association models. There are four in common features (average weekend sleep time per day, average weekday sleep time per day, average weekly working hours, and age) that have relatively significant effects on model’s performance. Table 13 demonstrates that hepatitis B, age, and total monthly household income have relatively large impact on the liver cancer risk association model.
Table 14 and Table 15 present the example of SHAP analysis for each observation and show how to interpret significant features involved in each observation. The SHAP results of each observation with liver and non-liver cancer are shown in Table 14. Age and hepatitis B are common features that significantly affect model performance. However, depending on the personal history of liver cancer for each observation, there is a difference in how lifestyle features affect the risk association of liver cancer with the model. For instance, the SHAP value of hepatitis B in the observation with liver cancer is 0.1450. It shows that hepatitis B leads the model to predict the observation having a high risk of liver cancer. However, the SHAP value of hepatitis B in the observation without liver cancer is −0.0639. It presents that non-hepatitis B forces the model to predict that the observation has a low risk of liver cancer.
Table 15 shows the SHAP results of each observation with hepatitis B or non-hepatitis B. Three common features (average weekday sleep time per day, average weekend sleep time per day, and average weekly working hours) significantly affect the model’s performance. Observation with hepatitis B has shorter weekday and weekend sleep time per day than the observation without hepatitis B. The SHAP values of average weekday sleep time per day and average weekend sleep time per day in observation with hepatitis B are 0.1811 and 0.2252, respectively. Conversely, in the observation without hepatitis B, the SHAP value of average weekday sleep time per day is −0.1255, and the average weekend sleep time per day is −0.1111. Therefore, the results imply a positive correlation between average sleep time per day and the risk of hepatitis B.

3.4. Ablation Studies

We conducted ablation studies to evaluate models’ performance in various experiment settings. We present the changes of model performance according to the number of training datasets in Figure 2 and Figure 3. As can be seen in Figure 2 and Figure 3, random forest outperformed other comparative methods in the all training dataset ratio settings. Note that in Figure 2, the result for random forest and the lightGBM is relatively close. To objectively examine the performance, we refer Table 16 and Table 17, where the AUC results are averaged over 30 validation runs.
The results shown in Table 16 and Table 17 illustrate random forest results over comparative methods. Both risk association models with random forest for hepatitis B and liver cancer performed better than other methods in the all training dataset ratio. AUC values in the test set were 0.8288 for the hepatitis B risk association model and 0.7963 for the risk association model for liver cancer when they used 10% of the training dataset. The results confirm that the proposed approach is robust to real-world small data problems.
In addition, we conducted a Wilcoxon signed-rank test by setting the Type I error α to 0.05 to evaluate the robustness of random forest to small training data. Table 18 shows the Wilcoxon signed-rank test results between random forest and other comparative methods when they used 10% of the training dataset. In Table 18, there was a clear performance difference between the random forest and the other comparative methods in liver cancer risk association. Moreover, the P-values of the comparative methods are less than 0.05 in the hepatitis B risk association. We verify that random forest had robustness to real-world small data problems.
We evaluate the performance of multi-disease risk association models and comparative models. Multi-disease risk association models predict both liver cancer and hepatitis B risk simultaneously. Table 19 illustrates the comparison results over seven models by AUC. Note that the test results of hepatitis B do not exceed 0.61. Even though we optimize hyperparameters of each model, they have no remarkable performance improvements. An AUC value of hepatitis B from a random forest is only about 0.5350. The test results of hepatitis B shown in Table 19 validate that it is not comparable to our method to predict hepatitis B risk. Consequently, we believe that building new models for each liver disease is desirable.

4. Discussion

Risk association models generally use clinical data requiring costly and time-consuming procedures [39,40,47]. However, the lifestyle features are much more accessible and cost-efficient [48]. We selected lifestyle features related to liver disease by referring to previous studies. All citizens in Korea subscribe to health insurance through the national insurance system [49]. The types of national insurance include local subscribers, work subscribers, and medical benefits. According to a preliminary study, the status of medical service usage, such as vaccination, is highly correlated with the type of national insurance subscription [50]. Based on the result, we assume that individuals’ medical life, such as vaccination, can be identified through the type of national insurance subscription and private insurance subscription. Therefore, we included a feature for identifying the subscription status of health insurance.
Additionally, referring to the previous research [51], the transmission of the hepatitis B virus occurs in contact with body fluids or contaminated inanimate organisms. Sufficient protein intake and rest can help recover from hepatitis B [52]. Thus, we included dietary life, sleep time, and working environment features.
The comparison study results verify that the random forest is superior to comparative methods (logistic regression, decision tree, ANN, XGboost, lightGBM, and SVM) in the risk association of liver disease. The logistic regression and decision tree lag behind the random forest because the methods are appropriate for linear or simple data distribution. The logistic regression is based on the linear transformation of input features and nonlinear activation. The capability of representing complex data distribution is generally weaker than the random forest. The decision tree divides data space with a vertical line and has limited capability in representing nonlinear data distributions [53]. Although the ANN can model the nonlinear relationships between input and outputs, it is not enough to perform better than a random forest. Variance reduction of ensemble significantly enhances the prediction performance compared to single classifiers. XGboost and LightGBM are boosting models showing comparable performance against the random forest. However, in the liver disease data, the variance reduction effect of the bagging procedure dominates the boosting effects.
Various studies have used the random forest for disease risk association in the medical field. Xu et al. [26] built a model to predict type II diabetes risk using clinical data, and it had an accuracy of 0.8413. Kabiraj et al. [54] established a risk association model for breast cancer. They trained the random forest and the XGboost classifier. The random forest achieves the best performance with an accuracy of 74.73%. The model was trained with clinical data such as start tumor size, end tumor size, irradiate, and menopause. Hashem et al. [55] developed a model with the accuracy of 0.7381 to predict hepatocellular carcinoma, which is malignant liver cancer. They adopted the random forest and calibrated the model with clinical characteristics, including hemochromatosis, arterial hypertension, and chronic renal insufficiency. Although our proposed method is trained with lifestyle features, it shows comparable or superior performance in terms of AUC and accuracy among related works.

5. Conclusions

We proposed a risk association method for liver cancer and hepatitis B with random forest. The proposed method showed superior performance over comparative methods. In particular, the results of the risk association of hepatitis B verified that the proposed method is more robust than different approaches. The sensitivity analysis demonstrated that some lifestyle features strongly correlate with hepatitis B. The results can potentially reduce the cost of detecting potential patients who require liver health care in advance.

Author Contributions

Conceptualization, E.K. and Y.K.; methodology, E.K. and Y.K.; investigation, E.K. and Y.K.; writing-original draft preparation, E.K.; writing-review and editing, Y.K.; supervision, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155911, Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University)) and the Ministry of Trade, Industry and Energy, Korea, under the “Regional Innovation Cluster Development Program (R&D, P0015346)” supervised by the Korea Institute for Advancement of Technology (KIAT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study is publicly available (https://knhanes.kdca.go.kr/knhanes/sub03/sub03_01.do, accessed on 1 February 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Singh, A.S.; Irfan, M.; Chowdhury, A. Prediction of liver disease using classification algorithms. In Proceedings of the 2018 4th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 14–15 December 2018; pp. 1–3. [Google Scholar]
  2. Cornelius, C.E. Liver function. In Clinical Biochemistry of Domestic Animals; Elsevier: Amsterdam, The Netherlands, 1980; pp. 201–257. [Google Scholar]
  3. Schiff, L.; Schiff, E.R. Diseases of the Liver; Lippincott: Philadelphia, PA, USA, 1993; Volume 2. [Google Scholar]
  4. Parker, G.A.; Picut, C.A. Liver immunobiology. Toxicol. Pathol. 2005, 33, 52–62. [Google Scholar] [CrossRef] [PubMed]
  5. Kim, B.H.; Park, J.W. Epidemiology of liver cancer in South Korea. Clin. Mol. Hepatol. 2018, 24, 1–9. [Google Scholar] [CrossRef] [PubMed]
  6. Yeo, Y.; Gwack, J.; Kang, S.; Koo, B.; Jung, S.J.; Dhamala, P.; Ko, K.P.; Lim, Y.K.; Yoo, K.Y. Viral hepatitis and liver cancer in Korea: An epidemiological perspective. Asian Pac. J. Cancer Prev. 2013, 14, 6227–6231. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Tiollais, P.; Pourcel, C.; Dejean, A. The hepatitis B virus. Nature 1985, 317, 489–495. [Google Scholar] [CrossRef] [PubMed]
  8. Alter, M.J. Epidemiology and prevention of hepatitis B. In Seminars in Liver Disease; Thieme Medical Publishers, Inc.: New York, NY, USA, 2003; Volume 23, pp. 39–46. [Google Scholar]
  9. Venkatesh, R.; Balasubramanian, C.; Kaliappan, M. Development of Big Data Predictive Analytics Model for Disease Prediction using Machine learning Technique. J. Med. Syst. 2019, 43, 272. [Google Scholar] [CrossRef] [PubMed]
  10. Kourou, K.; Exarchos, T.P.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17. [Google Scholar] [CrossRef] [Green Version]
  11. Dahiwade, D.; Patle, G.; Meshram, E. Designing disease prediction model using machine learning approach. In Proceedings of the 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 27–29 March 2019; pp. 1211–1215. [Google Scholar]
  12. Shukla, S.; Gupta, D.L.; Prasad, B.R. Comparative study of recent trends on cancer disease prediction using data mining techniques. Int. J. Database Theory Appl. 2016, 9, 107–118. [Google Scholar] [CrossRef]
  13. Wu, D.; Cao, J.; Li, W.; Wang, X. Development of a Prediction Classifier for the Early Diagnosis of Liver Cancer. 2018. Available online: https://easychair.org/publications/preprint/sDf7 (accessed on 9 October 2018).
  14. Chen, K.H.; Wang, H.W.; Liu, C.M. Applying Artificial Intelligence to Survival Prediction of Hepatocellular Carcinoma Patients. In Proceedings of the 2020 4th International Conference on Deep Learning Technologies (ICDLT), Beijing, China, 10–12 July 2020; pp. 135–139. [Google Scholar]
  15. Ward, A.; Sarraju, A.; Chung, S.; Li, J.; Harrington, R.; Heidenreich, P.; Palaniappan, L.; Scheinker, D.; Rodriguez, F. Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digit. Med. 2020, 3, 125. [Google Scholar] [CrossRef]
  16. Perveen, S.; Shahbaz, M.; Keshavjee, K.; Guergachi, A. A systematic machine learning based approach for the diagnosis of non-alcoholic fatty liver disease risk and progression. Sci. Rep. 2018, 8, 2112. [Google Scholar] [CrossRef] [Green Version]
  17. Dimopoulos, A.C.; Nikolaidou, M.; Caballero, F.F.; Engchuan, W.; Sanchez-Niubo, A.; Arndt, H.; Ayuso-Mateos, J.L.; Haro, J.M.; Chatterji, S.; Georgousopoulou, E.N.; et al. Machine learning methodologies versus cardiovascular risk scores, in predicting disease risk. BMC Med. Res. Methodol. 2018, 18, 179. [Google Scholar] [CrossRef]
  18. Sharma, M.; Singh, G.; Singh, R. Stark assessment of lifestyle based human disorders using data mining based learning techniques. IRBM 2017, 38, 305–324. [Google Scholar] [CrossRef]
  19. Ajana, S.; Cougnard-Grégoire, A.; Colijn, J.M.; Merle, B.M.; Verzijden, T.; de Jong, P.T.; Hofman, A.; Vingerling, J.R.; Hejblum, B.P.; Korobelnik, J.F.; et al. Predicting progression to advanced age-related macular degeneration from clinical, genetic, and lifestyle factors using machine learning. Ophthalmology 2021, 128, 587–597. [Google Scholar] [CrossRef] [PubMed]
  20. Liaw, Y.F.; Sollano, J.D. Factors influencing liver disease progression in chronic hepatitis B. Liver Int. 2006, 26, 23–29. [Google Scholar] [CrossRef]
  21. Chen, C.J.; Liang, K.Y.; Chang, A.S.; Chang, Y.C.; Lu, S.N.; Liaw, Y.F.; Chang, W.Y.; Sheen, M.C.; Lin, T.M. Effects of hepatitis B virus, alcohol drinking, cigarette smoking and familial tendency on hepatocellular carcinoma. Hepatology 1991, 13, 398–406. [Google Scholar] [CrossRef] [PubMed]
  22. Korea Centers for Disease Control and Prevention. The Seventh Korea National Health and Nutrition Examination Survey (KNHANES V-3); Korea Disease Control and Prevention Agency: Cheongju, Republic of Korea, 2018.
  23. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 81. [Google Scholar]
  24. Waljee, A.K.; Mukherjee, A.; Singal, A.G.; Zhang, Y.; Warren, J.; Balis, U.; Marrero, J.; Zhu, J.; Higgins, P.D. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 2013, 3, e002847. [Google Scholar] [CrossRef]
  25. Batista, G.E.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533. [Google Scholar] [CrossRef]
  26. Xu, W.; Zhang, J.; Zhang, Q.; Wei, X. Risk prediction of type II diabetes based on random forest model. In Proceedings of the 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, India, 27–28 February 2017; pp. 382–386. [Google Scholar]
  27. Li, R.; Shen, S.; Zhang, X.; Li, R.; Wang, S.; Zhou, B.; Wang, Z. Cardiovascular disease risk prediction based on random forest. In Proceedings of the The International Conference on Healthcare Science and Engineering, Guilin, China, 10–12 September 2018; pp. 31–43. [Google Scholar]
  28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  29. Kumar, R.; Indrayan, A. Receiver operating characteristic (ROC) curve for medical researchers. Indian Pediatr. 2011, 48, 277–287. [Google Scholar] [CrossRef]
  30. Menard, S. Applied Logistic Regression Analysis; SAGE: Thousand Oaks, CA, USA, 2002; Volume 106. [Google Scholar]
  31. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674. [Google Scholar] [CrossRef] [Green Version]
  32. Rumelhart, D.E.; McClelland, J.L.; PDP Research Group. Parallel Distributed Processing; IEEE: New York, NY, USA, 1988; Volume 1. [Google Scholar]
  33. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  34. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates: Montreal, QC, Canada, 2017. [Google Scholar]
  35. Suthaharan, S. Support vector machine. In Machine Learning Models and Algorithms for Big Data Classification; Springer: Berlin/Heidelberg, Germany, 2016; pp. 207–235. [Google Scholar]
  36. Yala, A.; Lehman, C.; Schuster, T.; Portnoi, T.; Barzilay, R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 2019, 292, 60–66. [Google Scholar] [CrossRef] [Green Version]
  37. Dite, G.S.; MacInnis, R.J.; Bickerstaffe, A.; Dowty, J.G.; Allman, R.; Apicella, C.; Milne, R.L.; Tsimiklis, H.; Phillips, K.A.; Giles, G.G.; et al. Breast cancer risk prediction using clinical models and 77 independent risk-associated SNPs for women aged under 50 years: Australian Breast Cancer Family Registry. Cancer Epidemiol. Prev. Biomarkers 2016, 25, 359–365. [Google Scholar] [CrossRef]
  38. Sipeky, C.; Talala, K.M.; Tammela, T.L.; Taari, K.; Auvinen, A.; Schleutker, J. Prostate cancer risk prediction using a polygenic risk score. Sci. Rep. 2020, 10, 17075. [Google Scholar] [CrossRef]
  39. Weng, S.F.; Reps, J.; Kai, J.; Garibaldi, J.M.; Qureshi, N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE 2017, 12, e0174944. [Google Scholar] [CrossRef] [Green Version]
  40. Hou, Y.; Zhang, Q.; Gao, F.; Mao, D.; Li, J.; Gong, Z.; Luo, X.; Chen, G.; Li, Y.; Yang, Z.; et al. Artificial neural network-based models used for predicting 28-and 90-day mortality of patients with hepatitis B-associated acute-on-chronic liver failure. BMC Gastroenterol. 2020, 20, 75. [Google Scholar] [CrossRef]
  41. Adler, E.D.; Voors, A.A.; Klein, L.; Macheret, F.; Braun, O.O.; Urey, M.A.; Zhu, W.; Sama, I.; Tadel, M.; Campagnari, C.; et al. Improving risk prediction in heart failure using machine learning. Eur. J. Heart Fail. 2020, 22, 139–147. [Google Scholar] [CrossRef]
  42. Saltelli, A. Sensitivity analysis for importance assessment. Risk Anal. 2002, 22, 579–590. [Google Scholar] [CrossRef]
  43. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates: Montreal, QC, Canada, 2017. [Google Scholar]
  44. Litwak, S.A.; Pang, L.; Galic, S.; Igoillo-Esteve, M.; Stanley, W.J.; Turatsinze, J.V.; Loh, K.; Thomas, H.E.; Sharma, A.; Trepo, E.; et al. JNK activation of BIM promotes hepatic oxidative stress, steatosis, and insulin resistance in obesity. Diabetes 2017, 66, 2973–2986. [Google Scholar] [CrossRef] [Green Version]
  45. Choi, J.; Han, S.; Kim, N.; Lim, Y.S. Increasing burden of liver cancer despite extensive use of antiviral agents in a hepatitis B virus-endemic population. Hepatology 2017, 66, 1454–1463. [Google Scholar] [CrossRef] [Green Version]
  46. Åberg, F.; Helenius-Hietala, J.; Puukka, P.; Jula, A. Binge drinking and the risk of liver events: A population-based cohort study. Liver Int. 2017, 37, 1373–1381. [Google Scholar] [CrossRef]
  47. Karpagavalli, S.; Jamuna, K.; Vijaya, M. Machine learning approach for preoperative anaesthetic risk prediction. Int. J. Recent Trends Eng. 2009, 1, 19. [Google Scholar]
  48. Aleksandrova, K.; Reichmann, R.; Kaaks, R.; Jenab, M.; Bueno-de Mesquita, H.B.; Dahm, C.C.; Eriksen, A.K.; Tjønneland, A.; Artaud, F.; Boutron-Ruault, M.C.; et al. Development and validation of a lifestyle-based model for colorectal cancer risk prediction: The LiFeCRC score. BMC Med. 2021, 19, 1. [Google Scholar] [CrossRef]
  49. Song, Y.J. The South Korean health care system. Jpn. Med. Assoc. J. 2009, 52, 206–209. [Google Scholar]
  50. Walker, T.Y.; Elam-Evans, L.D.; Yankey, D.; Markowitz, L.E.; Williams, C.L.; Fredua, B.; Singleton, J.A.; Stokley, S. National, regional, state, and selected local area vaccination coverage among adolescents aged 13–17 years—United States, 2018. Morb. Mortal. Wkly. Rep. 2019, 68, 718. [Google Scholar] [CrossRef] [Green Version]
  51. Sinn, D.H.; Cho, E.J.; Kim, J.H.; Kim, D.Y.; Kim, Y.J.; Choi, M.S. Current status and strategies for viral hepatitis control in Korea. Clin. Mol. Hepatol. 2017, 23, 189. [Google Scholar] [CrossRef]
  52. Wilkins, T.; Zimmerman, D.; Schade, R.R. Hepatitis B: Diagnosis and treatment. Am. Fam. Physician 2010, 81, 965–972. [Google Scholar]
  53. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112. [Google Scholar]
  54. Kabiraj, S.; Raihan, M.; Alvi, N.; Afrin, M.; Akter, L.; Sohagi, S.A.; Podder, E. Breast cancer risk prediction using XGBoost and random forest algorithm. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–4. [Google Scholar]
  55. Hashem, E.M.; Aboel-fotouh, M.R. A Random Forest-Genetic algorithm integration approach for Hepatocellular Carcinoma Early prediction. Ann. Rom. Soc. Cell Biol. 2021, 25, 13500–13512. [Google Scholar]
Figure 1. Workflow of producing data sets for the prediction modeling. The liver cancer and hepatitis B datasets consist of 3872 and 4640 subjects, respectively. The datasets are divided into 70% training data and 30% testing data.
Figure 1. Workflow of producing data sets for the prediction modeling. The liver cancer and hepatitis B datasets consist of 3872 and 4640 subjects, respectively. The datasets are divided into 70% training data and 30% testing data.
Ijerph 19 15171 g001
Figure 2. Line plot of hepatitis B risk association model’s performance according to the number of train dataset. x and y axes represent the proportion of training data and AUC, respectively.
Figure 2. Line plot of hepatitis B risk association model’s performance according to the number of train dataset. x and y axes represent the proportion of training data and AUC, respectively.
Ijerph 19 15171 g002
Figure 3. Line plot of liver cancer risk association model’s performance according to the number of train dataset. x and y axes represent the proportion of training data and AUC, respectively.
Figure 3. Line plot of liver cancer risk association model’s performance according to the number of train dataset. x and y axes represent the proportion of training data and AUC, respectively.
Ijerph 19 15171 g003
Table 1. Input features for the proposed methods. There are nine continuous features and three categorical features.
Table 1. Input features for the proposed methods. There are nine continuous features and three categorical features.
TypeFeature
ContinuousAge
Height
Weight
BMI
Drinking age
Average weekday sleep time per day
Average weekend sleep time per day
Total monthly household income
Average weekly working hours
CategoricalNational health insurance type
Private health insurance status
Hepatitis B
Sex
Table 2. List of optimized hyperparameters for the random forest.
Table 2. List of optimized hyperparameters for the random forest.
ModelParameterLiver CancerHepatitis B
Random forestN estimators300100
Max depth3010
Criterionentropyentropy
Min samples split510
Min sample leaf11
BootstrapTrueTrue
Warm stratTrueFalse
Max featuresautoauto
Table 3. List of optimized hyperparameters for the comparative models.
Table 3. List of optimized hyperparameters for the comparative models.
ModelParameterLiver CancerHepatitis B
Decision treeMax depth1010
Criterionentropyentropy
Min samples split25
Min sample leaf11
Logistic regressionClass weightnonebalanced
Max iter5050
ANNSloversgdadam
Activationtanhtanh
Alpha0.00010001
Learning rateconstantconstant
Hidden layer size(100, 50)(50, 25)
XGboostEta0.10.3
Min child weight05
Max depth66
Gamma05
Colsample bytree0.50.6
N estimators100100
LightGBMNum leaves315
Max depth−1−1
N estimator1100
Learning rate0.050.1
Min data in leaf1020
Num iteration100100
Colsample bytree0.61.0
Boostingdartgbdt
SVMKernelsigmoidrbf
Degree11
Gammaautoscale
Table 4. Average AUC values for each combination of the liver cancer risk association model and missing value imputer. Standard deviation is calculated in the parentheses.
Table 4. Average AUC values for each combination of the liver cancer risk association model and missing value imputer. Standard deviation is calculated in the parentheses.
Risk Association Model for Liver Cancer
Simple ImputationK-NN ImputationMice Imputation
Random forest0.8317 (0.0345)0.8062 (0.0501)0.8248 (0.0276)
Decision tree0.7671 (0.0531)0.7546 (0.0569)0.7710 (0.0543)
Logistic regression0.8211 (0.0447)0.8186 (0.0469)0.8199 (0.0539)
ANN0.8129 (0.0431)0.8066 (0.0463)0.8129 (0.0393)
XGboost0.8153 (0.0433)0.7788 (0.0587)0.8154 (0.0411)
lightGBM0.7992 (0.0464)0.7636 (0.0419)0.7925 (0.0447)
SVM0.8098 (0.0555)0.8025 (0.0581)0.8055 (0.0470)
Table 5. Average AUC values for each combination of the hepatitis B prediction model and missing value imputer. Standard deviation is calculated in the parentheses.
Table 5. Average AUC values for each combination of the hepatitis B prediction model and missing value imputer. Standard deviation is calculated in the parentheses.
Risk Association Model for Hepatitis B
Simple ImputationK-NN ImputationMice Imputation
Random forest0.8567 (0.0124)0.7050 (0.0147)0.8632 (0.0113)
Decision tree0.7818 (0.0132)0.6295 (0.0178)0.7947 (0.0157)
Logistic regression0.6384 (0.0144)0.6313 (0.0147)0.6338 (0.0137)
ANN0.8015 (0.0121)0.6548 (0.0120)0.7990 (0.0150)
XGboost0.8469 (0.0108)0.7505 (0.0119)0.8611 (0.0094)
lightGBM0.8534 (0.0115)0.7520 (0.0126)0.8636 (0.0098)
SVM0.7401 (0.0149)0.6560 (0.0167)0.7235 (0.0127)
Table 6. Performances of risk association models and comparative models. Standard deviation is calculated in the parentheses.
Table 6. Performances of risk association models and comparative models. Standard deviation is calculated in the parentheses.
Liver Cancer Risk AssociationHepatitis B Risk Association
AUCAccuracyAUCAccuracy
Random forest0.8367
(0.0357)
0.8190
(0.0384)
0.8803
(0.0030)
0.9245
(0.0060)
Decision tree0.7921
(0.0568)
0.7807
(0.0481)
0.8224
(0.0206)
0.8423
(0.0305)
Logistic regression0.8211
(0.0447)
0.8302
(0.0289)
0.6341
(0.0136)
0.6541
(0.0159)
ANN0.8206
(0.0526)
0.8312
(0.0311)
0.8306
(0.0142)
0.8466
(0.0111)
XGboost0.8260
(0.0392)
0.8075
(0.0409)
0.8778
(0.0103)
0.9309
(0.0057)
LightGBM0.8234
(0.0474)
0.8118
(0.0485)
0.8764
(0.0117)
0.9299
(0.0068)
SVM0.8130
(0.0507)
0.8408
(0.0351)
0.7235
(0.0127)
0.6919
(0.0137)
Table 7. Results of Wilcoxon signed-rank test between random forest and comparative models.
Table 7. Results of Wilcoxon signed-rank test between random forest and comparative models.
p-Value
Liver Cancer Risk AssociationHepatitis B Risk Association
Decision tree0.00000.0000
Logistic regression0.02100.0000
ANN0.03360.0000
XGboost0.00740.0000
LightGBM0.04080.0022
SVM0.00420.0000
Table 8. Sensitivity analysis results of hepatitis B risk association model. Standard deviation is presented in parentheses.
Table 8. Sensitivity analysis results of hepatitis B risk association model. Standard deviation is presented in parentheses.
Permutated FeatureMean of AUC
-0.8803 (0.0030)
Age0.8750 (0.0069)
Sex0.8799 (0.0050)
Height0.8795 (0.0051)
Weight0.8803 (0.0051)
BMI0.8788 (0.0062)
Total monthly household income0.8773 (0.0070)
Average weekly working hours0.8664 (0.0060)
Average weekday sleep time per day0.8516 (0.0090)
Average weekend sleep time per day0.8492 (0.0087)
Drinking age0.8710 (0.0154)
Private health insurance status0.8767 (0.0064)
National health insurance type0.8790 (0.0054)
Table 9. Sensitivity analysis result of the risk association model for liver cancer. Standard deviation is presented in parentheses.
Table 9. Sensitivity analysis result of the risk association model for liver cancer. Standard deviation is presented in parentheses.
FeatureMean of AUC
-0.8367 (0.0357)
Age0.8126 (0.0483)
Sex0.8216 (0.0368)
Height0.8327 (0.0376)
Weight0.8255 (0.0429)
BMI0.8253 (0.0424)
Total monthly household income0.8229 (0.0444)
Average weekly working hours0.8312 (0.0352)
Average weekday sleep time per day0.8288 (0.0429)
Average weekend sleep time per day0.8301 (0.0428)
Drinking age0.8307 (0.0378)
Private health insurance status0.8327 (0.0374)
Hepatitis B0.7935 (0.0379)
National health insurance type0.8344 (0.0374)
Table 10. SHAP values of hepatitis B risk association model.
Table 10. SHAP values of hepatitis B risk association model.
FeatureSHAP Value
Age0.0476
Sex0.0093
Height0.0089
Weight0.0122
BMI0.0116
Total monthly household income0.0482
Average weekly working hours0.0349
Average weekday sleep time per day0.1240
Average weekend sleep time per day0.1175
Drinking age0.0190
Private health insurance status0.0290
National health insurance type0.0019
Table 11. SHAP values of the risk association model for liver cancer.
Table 11. SHAP values of the risk association model for liver cancer.
FeatureSHAP Value
Age0.0940
Sex0.0090
Height0.0159
Weight0.0219
BMI0.0147
Total monthly household income0.0421
Average weekly working hours0.0499
Average weekday sleep time per day0.0327
Average weekend sleep time per day0.0748
Drinking age0.0200
Private health insurance status0.0229
Hepatitis B0.0776
National health insurance type0.0021
Table 12. The top 5 significant features of the sensitivity analysis for hepatitis B risk association model.
Table 12. The top 5 significant features of the sensitivity analysis for hepatitis B risk association model.
Sensitivity AnalysisSHAP
Average weekend sleep time per dayAverage weekday sleep time per day
Average weekday sleep time per dayAverage weekend sleep time per day
Average weekly working hoursTotal monthly household income
Drinking ageAge
AgeAverage weekly working hours
Table 13. The top 5 significant features of the sensitivity analysis for liver cancer risk association model.
Table 13. The top 5 significant features of the sensitivity analysis for liver cancer risk association model.
Sensitivity AnalysisSHAP
Hepatitis BAge
AgeHepatitis B
SexAverage weekend sleep time per day
Total monthly household incomeAverage weekly working hours
BMITotal monthly household income
Table 14. Example of SHAP analysis for observations of the liver cancer risk association model.
Table 14. Example of SHAP analysis for observations of the liver cancer risk association model.
Liver CancerNon-Liver Cancer
FeatureValueSHAP ValueValueSHAP Value
Age720.145153−0.0269
SexMan0.0087Man0.0094
Height169.70.0180168.3−0.0088
Weight710.019370.40.0072
BMI24.70.004224.9−0.0104
Total monthly
household income
2500.0172390−0.0281
Average weekly
working hours
210.0206400.0286
Average weekday
sleep time per day
4200.03194400.0134
Average weekend
sleep time per day
4800.0590440−0.0268
Drinking age260.0217210.0153
Private health
insurance status
Subscriber−0.0096Subscriber−0.0326
Hepatitis BYes0.1450No−0.0639
National health
insurance type
Employee−0.0015Employee−0.0073
Table 15. Example of SHAP analysis for observations of the hepatitis B risk association model.
Table 15. Example of SHAP analysis for observations of the hepatitis B risk association model.
Hepatitis BNon-Hepatitis B
FeatureValueSHAP ValueValueSHAP Value
Age520.0295570.0093
SexMan0.0063Woman−0.0121
Height164.10.0063155.9−0.0204
Weight56.70.013858.6−0.0038
BMI21.10.003624.1−0.0027
Total monthly
household income
126.70.0846166.70.0159
Average weekly
working hours
20−0.077060−0.0315
Average weekday
sleep time per day
426.50.1811540−0.1255
Average weekend
sleep time per day
460.70.2252540−0.1111
Drinking age16−0.0034450.0186
Private health
insurance status
Subscriber−0.0070Subscriber−0.0246
National health
insurance type
Non-employee0.0011Non-employee−0.0058
Table 16. Changes of hepatitis B risk association model’s performance according to the number of train dataset.
Table 16. Changes of hepatitis B risk association model’s performance according to the number of train dataset.
Train Dataset
Ratio
10%20%30%40%50%60%70%
(Default)
Random
forest
0.8288
(0.0025)
0.8513
(0.0043)
0.8614
(0.0023)
0.8691
(0.0038)
0.8732
(0.0044)
0.8771
(0.0022)
0.8803
(0.0030)
Decision
tree
0.7723
(0.0269)
0.7854
(0.0241)
0.8039
(0.0133)
0.8066
(0.0130)
0.8161
(0.0147)
0.8172
(0.0135)
0.8224
(0.0206)
Logistic
regression
0.6040
(0.0173)
0.6223
(0.0101)
0.6274
(0.0119)
0.6309
(0.0086)
0.6351
(0.0100)
0.6343
(0.0121)
0.6341
(0.0136)
ANN0.6633
(0.0273)
0.7390
(0.0253)
0.7808
(0.0161)
0.8055
(0.0105)
0.8148
(0.0084)
0.8250
(0.0091)
0.8306
(0.0143)
XGboost0.8157
(0.0108)
0.8415
(0.0075)
0.8515
(0.0066)
0.8567
(0.0066)
0.8610
(0.0074)
0.8647
(0.0099)
0.8778
(0.0103)
Light
GBM
0.8253
(0.0167)
0.8497
(0.0098)
0.8597
(0.0086)
0.8659
(0.0079)
0.8678
(0.0071)
0.8712
(0.0107)
0.8764
(0.0117)
SVM0.6576
(0.0165)
0.6855
(0.0119)
0.7001
(0.0092)
0.7083
(0.0106)
0.7156
(0.0129)
0.7202
(0.0119)
0.7235
(0.0127)
Table 17. Changes of liver cancer risk association model’s performance according to the number of train dataset.
Table 17. Changes of liver cancer risk association model’s performance according to the number of train dataset.
Train Dataset
Ratio
10%20%30%40%50%60%70%
(Default)
Random
forest
0.7963
(0.0102)
0.8011
(0.0371)
0.8151
(0.0276)
0.8191
(0.0321)
0.8199
(0.0326)
0.8296
(0.0418)
0.8367
(0.0357)
Decision
tree
0.6855
(0.0941)
0.7345
(0.0628)
0.7523
(0.0531)
0.7541
(0.0536)
0.7557
(0.0493)
0.7574
(0.0582)
0.7921
(0.0568)
Logistic
regression
0.7343
(0.0575)
0.7847
(0.0312)
0.7981
(0.0356)
0.7984
(0.0371)
0.7951
(0.0529)
0.7986
(0.0440)
0.8211
(0.0447)
ANN0.7362
(0.0625)
0.7770
(0.0423)
0.7970
(0.0326)
0.8023
(0.0308)
0.8136
(0.0435)
0.8141
(0.0436)
0.8206
(0.0526)
XGboost0.7428
(0.0545)
0.7884
(0.0312)
0.8045
(0.0301)
0.8083
(0.0304)
0.8074
(0.0375)
0.8207
(0.0283)
0.8026
(0.0392)
Light
GBM
0.5000
(0.0000)
0.7194
(0.0683)
0.7564
(0.0410)
0.7923
(0.0396)
0.8104
(0.0410)
0.8267
(0.0338)
0.8234
(0.0474)
SVM0.7319
(0.0482)
0.7745
(0.0490)
0.7891
(0.0304)
0.7943
(0.0408)
0.7965
(0.0452)
0.8005
(0.0569)
0.8130
(0.0507)
Table 18. Results of Wilcoxon signed-rank test between random forest and comparative models using 10% of the training dataset.
Table 18. Results of Wilcoxon signed-rank test between random forest and comparative models using 10% of the training dataset.
p-Value
Liver Cancer Risk AssociationHepatitis B Risk Association
Decision tree0.00000.0000
Logistic regression0.00000.0000
ANN0.00000.0000
XGboost0.00000.0000
LightGBM0.00000.0000
SVM0.00000.0000
Table 19. AUC values of multi-disease risk association models and comparative models.
Table 19. AUC values of multi-disease risk association models and comparative models.
Liver CancerHepatitis B
Random forest0.7680 (0.0535)0.5350 (0.0579)
Decision tree0.7430 (0.0577)0.5437 (0.0669)
Logistic regression0.7518 (0.0580)0.6019 (0.0632)
ANN0.7106 (0.0625)0.5480 (0.0591)
XGboost0.7944 (0.0531)0.5472 (0.0444)
LightGBM0.8037 (0.0482)0.5564 (0.0450)
SVM0.6949 (0.0696)0.5183 (0.0344)
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Koh, E.; Kim, Y. Risk Association of Liver Cancer and Hepatitis B with Tree Ensemble and Lifestyle Features. Int. J. Environ. Res. Public Health 2022, 19, 15171. https://doi.org/10.3390/ijerph192215171

AMA Style

Koh E, Kim Y. Risk Association of Liver Cancer and Hepatitis B with Tree Ensemble and Lifestyle Features. International Journal of Environmental Research and Public Health. 2022; 19(22):15171. https://doi.org/10.3390/ijerph192215171

Chicago/Turabian Style

Koh, Eunji, and Younghoon Kim. 2022. "Risk Association of Liver Cancer and Hepatitis B with Tree Ensemble and Lifestyle Features" International Journal of Environmental Research and Public Health 19, no. 22: 15171. https://doi.org/10.3390/ijerph192215171

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop