1. Introduction
As quality of life has increased, interest in health and illnesses has also increased [1]. Moreover, as the healthcare industry pursues fundamental changes in medical services through the technologies of the Fourth Industrial Revolution, such as big data, artificial intelligence (AI), and the internet of things (IoT), its paradigm is shifting from a diagnosis-and-treatment-centered model to a prevention-and-management-centered model [2]. This is a shift from the traditional healthcare paradigm in which people go to hospitals when they feel sick to one in which people predict and prevent illnesses using information technology (IT) [3].
Recently, the utilization of IT has become essential in various fields, and services that create new value by using data collected with IT are emerging all over the world. Big data is therefore regarded as an innovative factor that can create new national value and help solve countries’ fundamental problems [4,5]. In Korea, public data from various fields are provided to private companies and researchers regardless of the original purpose of the data collection, and they are typically available through public data portal sites [6]. The provided data are thus accessible for anyone to collect and analyze. As part of a pilot project for the usage of medical big data, the Korean government has utilized the Korean Meteorological Administration’s (KMA) health meteorological index and developed an early warning service for drug safety [5,7]. The utilization of healthcare data has been increasing since it was made public. If this rapidly growing collection of healthcare-related data is utilized well, innovation in medical services is possible [2]. The following examples of innovation in medical services are expected: (1) Complete diagnosis customized for each patient [8]; (2) Real-time illness prevention service: real-time digital disease surveillance data and public data sharing could shape predictions about future epidemics [9]; (3) Disease prediction using wearable devices: sensor data can help analyze trends in seasonal respiratory infections, such as influenza [10].
This study aimed to analyze healthcare data using data mining. Data mining refers to a technique that reveals the underlying useful knowledge in big data [11] and is effective at deriving hidden patterns and discovering descriptive, predictive, and understandable models [12,13]. A key example is machine learning. Machine learning has advantages in such tasks as diagnosis, classification, and survival prediction. It has a particular advantage in the healthcare field, as it can analyze various data types and provide predictions regarding daily illness risks, diagnosis, prognosis, and appropriate treatments [14]. Moreover, machine learning may be able to support treatment by helping people make better decisions in consideration of their healthcare needs and diagnosis [15]. Therefore, this study expects that analyzing healthcare data with machine learning can help increase healthy lifespan and reduce medical expenses.
The purpose of this study was to analyze the illnesses that most commonly occur in children and older adults, two socially vulnerable groups that need more protection than other age groups [16], and to provide the degree of daily risk for illnesses based on the number of patients suffering from them in each region. In this process, the optimal machine-learning model was selected after comparison of candidate models, the optimal hyperparameters were derived through GridSearchCV, and both were verified based on their R² value. This study’s strengths are that the degree of daily risk for each illness can be provided for each age group, and the most common illnesses for each age group can be determined. Although the result is not a substitute for an actual diagnosis by a doctor, it is expected to raise awareness of illnesses and help prevent them based on climate data.
2. Related Work
Several prior studies have been conducted on minimizing and preventing the occurrence of epidemics by predicting trends in illnesses based on web data. A representative example is a study by Iso et al. [17] in which the authors attempted to monitor influenza trends based on the online social network platform Twitter; another is a study by Achrekar et al. [18,19] in which the authors attempted to predict the trend of the flu using Twitter data. However, considering the characteristics of web data, there are limitations to this type of data, as advertisements may be included, and some data may be duplicated. To address these limitations, Lee et al. utilized actual chicken pox occurrence data provided by the Korea Disease Control and Prevention Agency, as well as data from Naver News and Twitter, to predict the occurrence of chicken pox [20].
Numerous factors are involved in health, and the World Health Organization noted that climate-related factors, such as air pollution, are a significant health threat that leads to various illnesses and premature deaths [21]. Subsequently, many studies have been conducted to predict the possibility of illnesses according to the meteorological and atmospheric environments. Kim et al. analyzed precipitation, lowest and highest temperature, relative humidity, and daily temperature range with Naive Bayes to provide the degree of risk for the common cold [22]. Choi et al. proposed an influenza-like disease monitoring system through statistical analysis [23], and Vidotto et al. studied the effect of air pollution on juvenile rheumatism using statistical analysis [24]. Jeong et al. applied autoregressive integrated moving average models to the common cold, eye disease, and middle ear inflammation to propose a model that calculates the expected number of patients for each illness by utilizing drug prescription data [5]. Jang et al. developed a cold index by predicting the number of patients suffering from the common cold [25]. There are also studies that have confirmed the correlation between air pollution and cardiovascular disease [26,27,28]. However, these studies conducted analyses with data from individuals of all ages and did not consider differences among the age groups.
Currently, the KMA provides the health meteorological index. This index expresses the likelihood of occurrence of illnesses such as food poisoning, asthma, cerebral apoplexy, the common cold, and pollen allergies. However, this service fails to reflect various climatic environments, as it excludes atmospheric environment information and only considers weather conditions, such as lowest temperature and air pressure, in its analysis. Additional limitations of this service are that it does not provide illness information optimized for each age group and that it only covers the illnesses most common among people in general.
3. Materials and Methods
To derive the degree of daily risk for illnesses, this study obtained past weather information, past atmospheric information, and medical treatment data from the KMA, a public data portal, and the National Health Insurance Corporation through application programming interfaces (APIs).
The medical treatment data provided by the National Health Insurance Corporation are sample data from one million people every year. Medical data for five years, the most recent of the publicly available data, were used for analysis.
3.1. Data Description
This study defined children as those aged 0–9 years, similar to the definition of a child in a previous study [29]. Because people aged 65 years or older are currently regarded as older adults in many countries [30], older adults in this study were also defined as those aged 65 years or above.
Selecting illnesses for analysis based solely on the 2016 medical data raised a concern that the selection could be distorted by illnesses that happened to occur frequently that year. Accordingly, this study selected illnesses for analysis based on medical treatment data over the last five years. The illnesses in the medical data are listed only in the form of code numbers.
Table 1 matches the illness code to the illness name as provided by the Korean Informative Classification of Diseases. The top 10 illnesses that are most common in older adults were found to be illnesses unrelated to seasonal factors, such as Alzheimer’s and orthopedic illnesses. Thus, of the top 20 illnesses, those unrelated to seasonal factors were excluded, and the final 8 illnesses were selected as the subjects of analysis.
The regions included in the analysis of this study were Seoul, Busan, Daegu, Incheon, Gwangju, Daejeon, Ulsan, Gyeonggi-do, Gangwon-do, Chungcheongbuk-do, Chungcheongnam-do, Jeollabuk-do, Jeollanam-do, Gyeongsangbuk-do, Gyeongsangnam-do, and Jeju. These areas are defined by the area identifier city/province code used by the public data portal and the KMA.
3.2. Data Preprocessing
The meteorological data (average temperature, lowest temperature, highest temperature, average cloudiness, and precipitation level), atmospheric information data (sulfurous acid gas, carbon monoxide, ozone, nitrogen dioxide, and micro dust levels), and regional identifiers were selected as candidates for independent variables. Data such as atmospheric information and meteorological information have high uncertainty, so it is necessary to select variables through weighting and correlation analysis [31,32]. Therefore, for the variables other than the regional identifier, which serves as a simple identifier rather than a numeric quantity, multicollinearity was evaluated using the variance inflation factor (VIF), and the variables that did not violate the criterion were defined as the final independent variables. The lowest temperature and the average temperature had a VIF of 10 or above and were correlated with the highest temperature; thus, they were excluded. The final independent variables selected included regional identifier, highest temperature, average cloudiness, precipitation level, sulfurous acid gas level, carbon monoxide level, ozone level, nitrogen dioxide level, and micro dust level. Day and season were also included in the final list of independent variables. The independent and dependent variables for each age group are listed in Table 2.
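The VIF screening described above can be illustrated with a minimal sketch. It assumes the common definition VIF = 1 / (1 − R²), where R² comes from regressing one candidate variable on all the others, and it uses synthetic values rather than the study's data. The variable names and the generated numbers are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def vif(columns):
    """Return one VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    n = len(columns[0])
    result = []
    for j, y in enumerate(columns):
        # Design matrix (stored column-wise): intercept plus every other variable.
        xs = [[1.0] * n] + [c for k, c in enumerate(columns) if k != j]
        # Normal equations (X^T X) beta = X^T y.
        xtx = [[sum(p * q for p, q in zip(u, v)) for v in xs] for u in xs]
        xty = [sum(p * q for p, q in zip(u, y)) for u in xs]
        beta = solve(xtx, xty)
        fitted = [sum(b * col[i] for b, col in zip(beta, xs)) for i in range(n)]
        y_bar = sum(y) / n
        ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
        ss_tot = sum((a - y_bar) ** 2 for a in y)
        r2 = 1.0 - ss_res / ss_tot
        result.append(1.0 / (1.0 - r2))
    return result

# Hypothetical synthetic data: average temperature closely tracks highest temperature,
# while cloudiness is unrelated to both.
rng = random.Random(0)
highest = [rng.gauss(20, 8) for _ in range(300)]
average = [t - 5 + rng.gauss(0, 0.5) for t in highest]
cloud = [rng.uniform(0, 10) for _ in range(300)]

scores = vif([highest, average, cloud])
for name, score in zip(["highest", "average", "cloud"], scores):
    print(f"{name}: VIF = {score:.1f}, exceeds 10: {score >= 10}")
```

In this toy setting the two temperature variables inflate each other's VIF, so dropping one of them, as the study did with the lowest and average temperatures, removes the redundancy.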
3.3. Methods
In this study, three different models were comparatively analyzed: linear regression [33], random forest [34], and XGBoost [35].
In linear regression, the relationship between the independent variables and the dependent variable is modeled using a linear prediction function whose parameters are estimated from the data; the resulting function is referred to as a regression equation. Therefore, if the ultimate goal is to predict a value, a predictive model can be constructed using linearity.
Random forest is a non-linear regression model that is an ensemble technique based on decision trees. Multiple models are trained, each on a sample of the training data extracted with bootstrap sampling, a method of repeatedly drawing samples with replacement. Prediction values are then derived for each model, and the final prediction value is determined as the average of all derived prediction values. Random forest is easy to use and was selected as a candidate model, as it is proficient in dealing with multiple independent variables.
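The bootstrap-and-average idea can be sketched in a few lines. This is a simplified illustration, not the random forest implementation used in the study: each "tree" is a depth-1 regression stump on a single hypothetical feature, and the per-split feature subsampling that a full random forest performs is omitted. The data are synthetic.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to the overall mean
        overall = sum(ys) / len(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged(xs, ys, n_models, rng):
    """Train one stump per bootstrap sample and predict with the average of all stumps."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(model(x) for model in models) / len(models)

# Hypothetical data: patient counts jump from about 10 to about 30 once x reaches 20.
rng = random.Random(42)
xs = list(range(40))
ys = [(10.0 if x < 20 else 30.0) + rng.gauss(0, 1) for x in xs]

predict = fit_bagged(xs, ys, n_models=25, rng=rng)
print(round(predict(5), 1), round(predict(35), 1))
```

Averaging over models trained on different bootstrap samples reduces the variance of any single tree, which is the main reason the ensemble tends to be more stable than one tree.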
XGBoost is one of the machine-learning techniques used for predicting various diseases in recent years [36] and has the advantage of being able to perform both non-linear and linear regression. XGBoost is a model that improves the performance and speed of gradient boosting, which applies the boosting technique to the decision tree model. The boosting technique combines weak predictive models to create a strong predictive model. After the data are predicted with a weak predictive model, another weak predictive model is trained on the error of that prediction. This process is applied sequentially and repeated to reduce the error. Hence, model t is determined by the error from model t-1. Ultimately, the model is trained by finding the weights that minimize the loss between the actual value and the predicted value.
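The sequential error-fitting step can be sketched as plain gradient boosting with a squared-error loss, in which each new weak model is fitted to the residuals of the current ensemble. This is a simplified illustration and not XGBoost itself, which adds regularization and other refinements. The stump learner, the `learning_rate` value, and the data are assumptions made for the example.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to the overall mean
        overall = sum(ys) / len(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate):
    """Model t is fitted to the residuals left by the ensemble after round t-1."""
    base = sum(ys) / len(ys)          # round 0: predict the mean
    stumps = []
    preds = [base] * len(xs)
    history = []                      # training mean squared error after each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

# Hypothetical non-linear target that a single stump cannot capture.
xs = [i / 10 for i in range(60)]
ys = [(x - 3) ** 2 for x in xs]

predict, history = boost(xs, ys, n_rounds=30, learning_rate=0.3)
print(f"training MSE after round 1: {history[0]:.3f}, after round 30: {history[-1]:.3f}")
```

Each stump alone is a weak predictor, but because every round targets what the previous rounds got wrong, the training error shrinks as rounds accumulate.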
Hyperparameters refer to values set by the user rather than learned from the data. Manually changing these values and retesting can only be repeated a limited number of times, and it is difficult to identify the best value intuitively. GridSearchCV helps the user determine the optimal hyperparameters: the user supplies candidate values for each hyperparameter, and GridSearchCV evaluates every combination of those values and returns the best-performing one.
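The exhaustive search that GridSearchCV performs can be sketched without any external library: every combination of candidate values is scored with k-fold cross-validation and the best combination is kept. The sketch below is an illustration of the idea, not scikit-learn's implementation. A small nearest-neighbor regressor stands in for XGBoost, the hyperparameter names `k` and `weighted` are hypothetical, and the score is the residual-based R² (1 − SS_res / SS_tot).

```python
from itertools import product

def r2_score(actual, predicted):
    """Residual-based coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def fit_knn(xs, ys, k, weighted):
    """Toy stand-in model: average the targets of the k nearest training points."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        if weighted:
            weights = [1.0 / (abs(nx - x) + 1e-9) for nx, _ in nearest]
        else:
            weights = [1.0] * len(nearest)
        return sum(w * ny for w, (_, ny) in zip(weights, nearest)) / sum(weights)
    return predict

def grid_search(fit, param_grid, xs, ys, n_folds=5):
    """Score every parameter combination with k-fold cross-validation; keep the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for fold in range(n_folds):
            test_idx = set(range(fold, len(xs), n_folds))  # every n_folds-th row
            train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
            test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
            model = fit([x for x, _ in train], [y for _, y in train], **params)
            fold_scores.append(
                r2_score([y for _, y in test], [model(x) for x, _ in test]))
        score = sum(fold_scores) / n_folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical data: a linear trend with a deterministic wiggle on top.
xs = list(range(40))
ys = [0.5 * x + 3 * ((x * 7) % 5 - 2) for x in xs]

param_grid = {"k": [1, 3, 5, 9], "weighted": [False, True]}
best_params, best_score = grid_search(fit_knn, param_grid, xs, ys)
print(best_params, round(best_score, 3))
```

The cost grows with the product of the candidate list lengths, which is why the candidate values for each hyperparameter are usually kept short.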
4. Results and Discussion
4.1. Model Selection
After the data were divided into the children group and the older adults group, the system was trained with linear regression, random forest, and XGBoost. For selection of the optimal model for each group, the R² value in Equation (1), which calculates the ratio of the variance of the predicted value to the variance of the actual value as an index, was used. The closer the value of this index is to 1, the more accurate the model is. The ultimately derived R² is shown in Table 3. XGBoost was selected as the final model, as it had the largest R² value for most illnesses.

$$R^2 = \frac{\sum_{i}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2} \qquad (1)$$

$y_i$: actual value, $\hat{y}_i$: expected value, $\bar{y}$: mean of the actual values.
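As a small worked illustration of the index described in the text, the sketch below computes R² as the ratio of the variation of the predicted values to the variation of the actual values. For comparison it also computes the residual-based form, 1 − SS_res / SS_tot, which is widely used in practice. For a least-squares line with an intercept, evaluated on its own training data, the two forms coincide. The sample values are hypothetical.

```python
def r2_explained(actual, predicted):
    """Variation of the predictions around the actual mean, divided by total variation."""
    mean_actual = sum(actual) / len(actual)
    explained = sum((p - mean_actual) ** 2 for p in predicted)
    total = sum((a - mean_actual) ** 2 for a in actual)
    return explained / total

def r2_residual(actual, predicted):
    """Residual-based form: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - residual / total

def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical daily patient counts against a single predictor.
xs = [1, 2, 3, 4, 5, 6]
ys = [12, 15, 14, 19, 22, 21]

slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]

ratio_form = r2_explained(ys, fitted)
residual_form = r2_residual(ys, fitted)
print(f"ratio form: {ratio_form:.4f}, residual form: {residual_form:.4f}")
```

For non-linear models or held-out data the two forms can differ, so it is worth stating which one is reported when comparing models.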
Table 3.
Comparison of R2 value for each machine-learning model.
a. Children |
Illness | Linear Regression | Random Forest | XGBoost |
Acute bronchitis | 0.010 | 0.501 | 0.504 |
Common cold | 0.242 | 0.610 | 0.632 |
Rhinitis, tonsillitis | 0.013 | 0.447 | 0.498 |
Inner ear inflammation | 0.147 | 0.239 | 0.338 |
Chronic bronchitis | 0.452 | 0.221 | 0.271 |
Intestinal inflammation | 0.117 | 0.138 | 0.162 |
Influenza, pneumonia | 0.075 | 0.109 | 0.153 |
Eczema, dermatitis | 0.147 | 0.237 | 0.282 |
Chicken pox | 0.085 | 0.012 | 0.089 |
Conjunctivitis | 0.090 | 0.142 | 0.199 |
b. Older adults |
High blood pressure, heart disease | 0.655 | 0.842 | 0.842 |
Common cold | 0.343 | 0.545 | 0.558 |
Esophageal inflammation, gastritis | 0.552 | 0.728 | 0.739 |
Acute bronchitis | 0.262 | 0.475 | 0.477 |
Eczema, dermatitis | 0.465 | 0.606 | 0.635 |
Chronic bronchitis | 0.439 | 0.564 | 0.589 |
Rhinitis, tonsillitis | 0.342 | 0.473 | 0.505 |
Brain disease | 0.284 | 0.340 | 0.341 |
4.2. Deriving the Optimal Hyperparameter
GridSearchCV was used to tune the hyperparameters of XGBoost and derive their optimal values for each illness.
Table 4 shows the types of hyperparameters that were adjusted, along with their descriptions.
Table 5 shows the R² for the final hyperparameters. Comparison with Table 3 shows the improved R². This study regarded a prediction as significant only when R² was above 0.65, and illnesses with an R² below 0.65 were excluded. Thus, a total of four illnesses for children (acute bronchitis, the common cold, rhinitis and tonsillitis, and inner ear inflammation) and a total of six illnesses for older adults (high blood pressure and heart disease, the common cold, esophageal inflammation and gastritis, acute bronchitis, eczema and dermatitis, and chronic bronchitis) are regarded as having been significantly analyzed.
4.3. Designing the Final System
The number of treatments, which is the dependent variable of the predictive models, was standardized in the preprocessing stage to a mean of 0 and a variance of 1. Assuming a normal distribution, cumulative probability values (p-values) of 0.2, 0.4, 0.6, and 0.8 were set as the cutoff standard, and the possibility of illness was categorized as very low, low, normal, high, or very high, as seen in Table 6.
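The mapping from a predicted number of treatments to one of the five risk levels can be sketched as follows. This is an illustrative reading of the procedure, assuming that the predicted count is standardized against historical counts and then converted to a cumulative probability under the standard normal distribution. The function name, the use of the population standard deviation, the treatment of values exactly on a cutoff, and the sample counts are all assumptions.

```python
from statistics import NormalDist, mean, pstdev

LEVELS = ["very low", "low", "normal", "high", "very high"]
CUTOFFS = [0.2, 0.4, 0.6, 0.8]  # cumulative-probability cutoffs from the text

def risk_level(predicted_count, history):
    """Standardize the prediction, take the normal CDF, and bin it into a risk level."""
    mu = mean(history)
    sigma = pstdev(history)               # assumption: population standard deviation
    z = (predicted_count - mu) / sigma    # scaled to mean 0, variance 1
    p = NormalDist().cdf(z)               # cumulative probability in (0, 1)
    # Assumption: a value exactly on a cutoff falls into the higher category.
    return LEVELS[sum(p >= cutoff for cutoff in CUTOFFS)]

# Hypothetical historical daily treatment counts for one illness in one region.
history = [80, 95, 100, 105, 120]

for count in (60, 100, 140):
    print(count, "->", risk_level(count, history))
```

Because the cutoffs are equally spaced in probability, each level covers one fifth of the assumed normal distribution of counts.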
Figure 1 shows the system user interface proposed in this paper.
4.4. Discussion
In this study, a system for providing disease risk for children and older adults was proposed using XGBoost. Initially, the goal was to provide disease risk levels for 10 disease categories customized for children and 8 disease categories customized for older adults. However, diseases for which R² did not exceed 0.65 were judged to be insignificant, and finally, four diseases for children and six diseases for older adults were defined as diseases for which disease risk could be provided.
For chronic bronchitis in children, linear regression yielded a higher R² according to Table 3, but XGBoost, which performed well for most diseases, was selected as the final model. After hyperparameter tuning with GridSearchCV, XGBoost obtained better results than linear regression, which had previously performed best for this illness. However, R² still did not exceed 0.65, so disease risk could not be provided for it.
An examination of the illnesses for which risk is ultimately provided confirms that illnesses with a larger amount of data yielded better results. If more data are collected on the diseases that were not selected, more meaningful results are expected to be possible.
5. Conclusions
This study proposed an approach to provide customized daily disease risk for children and older adults using sample medical data of one million patients provided by the National Health Insurance and weather and atmospheric information provided by the Korea Meteorological Administration (KMA) and the public data portal API. Existing studies and systems have performed comprehensive analyses for all age groups without dividing the age groups. This study overcomes that limitation. By comparing the predictive R² of the three models, XGBoost was selected as the final disease risk prediction model. An analysis was defined as significant when R² was greater than 0.65. For children, the approach can provide risk levels for four illnesses: acute bronchitis, the common cold, rhinitis and tonsillitis, and inner ear inflammation. For older adults, it can provide risk levels for six illnesses: high blood pressure and heart disease, the common cold, esophageal inflammation and gastritis, acute bronchitis, eczema and dermatitis, and chronic bronchitis.
This study has the following limitations. First, whereas the existing system of the KMA provides a health meteorological index for each city and state, the medical data utilized in this study only provide regional information at the level of province and metropolitan city subdivisions, so the analysis is limited to those subdivisions. Second, this study could not analyze which factors have a great influence on specific diseases in specific seasons and on specific days of the week. Lastly, the data used in this study extend only to 2016, the most recent among currently available data, so there may be differences from the current situation. If more recent data are released later, predictions that better reflect the current situation can be expected.
In future research, the model is expected to become interpretable if a method of identifying which data affect disease prediction is added. For this, the importance of features should be analyzed using techniques such as Shapley additive explanations (SHAP) [37] and permutation importance [38].
Although treatment can only be provided by a professional, illnesses may be prevented by heightened awareness of illnesses through the daily illness risks for each age group presented by this study based on collectible meteorological and atmospheric environment-related information.
Author Contributions
Conceptualization, M.K.; methodology, M.K.; formal analysis, M.K., J.J. and S.J.; data curation, J.J. and S.J.; writing—original draft preparation, M.K.; writing—review and editing, J.J., S.J. and S.Y.; supervision, S.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF-2020R1A2C2010471).
Data Availability Statement
For weather data, we used the neighborhood forecast inquiry service API, and for atmospheric data, the Korea Environment Corporation Air Korea air pollution information API was used. Weather data and atmospheric data can be obtained through the API after applying for data utilization on the public data portal and obtaining approval (https://www.data.go.kr/index.do, accessed on 24 May 2022). As for the medical treatment data used as the dependent variable, the National Health Insurance Corporation published medical treatment history sample data for 1 million people per year until 2016. Therefore, medical treatment data up to 2016 can be downloaded from the National Health Insurance website in the form of a CSV file (https://nhiss.nhis.or.kr/op/up/opup300.do?data_pttn_cd=01, accessed on 24 May 2022).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Jang, M.; Cho, G.S.; Lee, Y.S.; Kim, M.K.; Oh, S.N. A Study on Predicting Local Cold Patients Using Meteorological Elements. In Proceedings of the Korean Meteorological Society Conference, Seoul, Korea, 23–24 April 2011; pp. 292–293.
- Choi, J.H. The Era of Fourth Industrial Revolution: Healthcare Industry and ICT Technology. Telco J. 2017, 5, 75–96.
- Lee, M.H. Case Studies of Advanced Countries in the Fourth Industrial Revolution and Korea’s Response Strategy. Adv. Policy Ser. 2017, 41, 14–107.
- Kang, H.J. Policy Status and Tasks for Healthcare Big Data. Health Welf. Policy Forum 2016, 2016, 55–71.
- Chang, J.H.; Kim, Y.J.; Choi, J.H.; Kim, C.S.; Aziz, N. A Prediction of Number of Patients and Risk of Disease in Each Region Based on Pharmaceutical Prescription Data. J. Korea Multimed. Soc. 2018, 21, 271–280.
- Park, M.J. A Study on Measures to Improve Laws and Policies on the Use of Healthcare Big Data. Korean Med. Law Soc. J. 2018, 26, 163–192.
- Song, T. Big Data Trend and Utilization Plan for Korean Health and Welfare. Sci. Technol. Policy 2013, 192, 56–73.
- Ahamed, F.; Farid, F. Applying internet of things and machine-learning for personalized healthcare: Issues and challenges. In Proceedings of the 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), Sydney, Australia, 3–7 December 2018; pp. 19–21.
- Desai, A.N.; Kraemer, M.U.; Bhatia, S.; Cori, A.; Nouvellet, P.; Herringer, M.; Cohn, E.L.; Carrion, M.; Brownstein, J.S.; Madoff, L.C. Real-time epidemic forecasting: Challenges and opportunities. Health Secur. 2019, 17, 268–275.
- Radin, J.M.; Wineinger, N.E.; Topol, E.J.; Steinhubl, S.R. Harnessing wearable device data to improve state-level real-time surveillance of influenza-like illness in the USA: A population-based study. Lancet Digit. Health 2020, 2, e85–e93.
- Kim, W.-S.; Lee, S.-W. An in-depth survey analysis applying data mining techniques. J. Eng. Educ. Res. 2006, 9, 71–82.
- Dehkordi, S.K.; Sajedi, H. Prediction of disease based on prescription using data mining methods. Health Technol. 2019, 9, 37–44.
- Zaki, M.J.; Meira, W., Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms; Cambridge University Press: Cambridge, UK, 2014.
- Ngiam, K.Y.; Khor, W. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019, 20, e262–e273.
- Jamgade, A.C.; Zade, S. Disease Prediction Using Machine Learning. Int. Res. J. Eng. Technol. 2019, 6, 6937–6938.
- Hongladarom, S. Buddhist Perspective on Four Vulnerable Groups: Children, Women, the Elderly and the Disabled. In Religious Perspectives on Human Vulnerability in Bioethics; Springer: Berlin/Heidelberg, Germany, 2014; pp. 117–133.
- Wakamiya, S.; Kawai, Y.; Aramaki, E. Twitter-based influenza detection after flu peak via tweets with indirect information: Text mining study. JMIR Public Health Surveill. 2018, 4, e8627.
- Achrekar, H.; Gandhe, A.; Lazarus, R.; Yu, S.-H.; Liu, B. Predicting flu trends using twitter data. In Proceedings of the 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Shanghai, China, 10–15 April 2011; pp. 702–707.
- Gao, Y.; Wang, S.; Padmanabhan, A.; Yin, J.; Cao, G. Mapping spatiotemporal patterns of events using social media: A case study of influenza trends. Int. J. Geogr. Inf. Sci. 2018, 32, 425–449.
- Lee, M.R.; Kim, J.W.; Jang, B.C. Predicting Chicken Pox Based on Deep Learning. J. Electr. Soc. 2020, 69, 127–137.
- Chae, S. Beginning Climate Health Impact Assessment and Related Tasks. Health Welf. Policy Forum 2019, 2019, 43–54.
- Kim, H.J.; Jung, J.S.; Lee, G.H.; Lee, W.C.; Lee, H.C.; Lee, S.W. Tendency Predictive Analysis of the Risk of the Common Cold According to the Weather. Korea Inf. Sci. Soc. J. 2017, 44, 1947–1949.
- Choi, H.; Choi, W.S.; Han, E. Suggestion of a simpler and faster influenza-like illness surveillance system using 2014–2018 claims data in Korea. Sci. Rep. 2021, 11, 11243.
- Vidotto, J.; Pereira, L.; Braga, A.; Silva, C.; Sallum, A.; Campos, L.; Martins, L.; Farhat, S. Atmospheric pollution: Influence on hospital admissions in paediatric rheumatic diseases. Lupus 2012, 21, 526–533.
- Jang, M.; Tak, S.J.; Park, J.M.; Wi, J.B.; Park, D.H.; Seo, S.; Choi, J.H. A Study on the Development of the Cold Index. In Proceedings of the Korean Meteorological Society Conference, Seoul, Korea, 23–24 April 2011; pp. 294–295.
- Brauer, M.; Casadei, B.; Harrington, R.A.; Kovacs, R.; Sliwa, K.; Group, W.A.P.E. Taking a stand against air pollution—The impact on cardiovascular disease: A joint opinion from the world heart federation, american college of cardiology, american heart association, and the european society of cardiology. Circulation 2021, 143, e800–e804. [Google Scholar] [CrossRef]
- Downward, G.S.; van Nunen, E.J.; Kerckhoffs, J.; Vineis, P.; Brunekreef, B.; Boer, J.M.; Messier, K.P.; Roy, A.; Verschuren, W.M.M.; van der Schouw, Y.T. Long-term exposure to ultrafine particles and incidence of cardiovascular and cerebrovascular disease in a prospective study of a Dutch cohort. Environ. Health Perspect. 2018, 126, 127007.
- Wettstein, Z.S.; Hoshiko, S.; Fahimi, J.; Harrison, R.J.; Cascio, W.E.; Rappold, A.G. Cardiovascular and cerebrovascular emergency department visits associated with wildfire smoke exposure in California in 2015. J. Am. Heart Assoc. 2018, 7, e007492.
- Fomon, S.J.; Haschke, F.; Ziegler, E.E.; Nelson, S.E. Body composition of reference children from birth to age 10 years. Am. J. Clin. Nutr. 1982, 35, 1169–1175.
- Sanderson, W.; Scherbov, S. Rethinking Age and Aging; Population Reference Bureau: Washington, DC, USA, 2008.
- Qu, S.; Li, Y.; Ji, Y. The mixed integer robust maximum expert consensus models for large-scale GDM under uncertainty circumstances. Appl. Soft Comput. 2021, 107, 107369.
- Ji, Y.; Li, H.; Zhang, H. Risk-averse two-stage stochastic minimum cost consensus models with asymmetric adjustment cost. Group Decis. Negot. 2022, 31, 261–291.
- Weisberg, S. Applied Linear Regression; John Wiley & Sons: Hoboken, NJ, USA, 2005; Volume 528.
- Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. Xgboost: Extreme Gradient Boosting; R Package, Version 0.4-2; 2015. Available online: https://scholar.google.com/scholar_lookup?hl=en&publication_year=2015&author=T.+Chen&author=T.+He&author=M.+Benesty&author=V.+Khotilovich&author=Y.+Tang&title=Xgboost%3A+Extreme+Gradient+Boosting (accessed on 24 May 2022).
- Alim, M.; Ye, G.-H.; Guan, P.; Huang, D.-S.; Zhou, B.-S.; Wu, W. Comparison of ARIMA model and XGBoost model for prediction of human brucellosis in mainland China: A time-series study. BMJ Open 2020, 10, e039676.
- Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.
- Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).