A Study on Customized Prediction of Daily Illness Risk Using Medical and Meteorological Data

Kim, Minji; Jang, Jiwon; Jeon, Seungjin; Youm, Sekyoung

doi:10.3390/app12126060

Department of Industrial and Systems Engineering, Dongguk University, Seoul 04620, Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(12), 6060; https://doi.org/10.3390/app12126060

Submission received: 25 May 2022 / Revised: 10 June 2022 / Accepted: 12 June 2022 / Published: 15 June 2022

(This article belongs to the Special Issue Artificial Intelligence for Sustainable Services, Applications and Education)

Abstract

This study selected the most common illnesses in children and older adults and aimed to provide a customized degree of daily risk for each illness based on patient data for specific regions and illnesses. Sample medical data of one million people provided by the National Health Insurance Corporation were collected, together with meteorological and atmospheric information obtained from the Korea Meteorological Administration and a public data portal through an application programming interface (API). Learning and prediction were carried out with machine learning. Models with a high R² were selected and tuned to determine the optimal hyperparameters for predicting the degree of daily risk of an illness. Illnesses with an R² value greater than 0.65 were considered significant. For children, these consisted of acute bronchitis, the common cold, rhinitis and tonsillitis, and middle ear inflammation. For older adults, they consisted of high blood pressure and heart disease, the common cold, esophageal inflammation and gastritis, acute bronchitis, eczema and dermatitis, and chronic bronchitis. This study provides the degree of daily risk for the most common illnesses in each age group. Furthermore, the results of this study are expected to raise awareness of illnesses that occur in certain climates and to help prevent them.

Keywords:

daily illness risk; medical information; meteorological data; prediction

1. Introduction

As quality of life has increased, interest in health and illnesses has also increased [1]. Moreover, as the healthcare industry pursues fundamental changes in medical services through the technologies of the Fourth Industrial Revolution, such as big data, artificial intelligence (AI), and the internet of things (IoT), its paradigm is shifting from a diagnosis-and-treatment-centered model to a prevention-and-management-centered model [2]. This is a shift from the traditional healthcare paradigm in which people go to hospitals when they feel sick to one in which people predict and prevent illnesses using information technology (IT) [3].

Recently, the utilization of IT has become essential in various fields, and services that create new value by using data collected with IT are emerging all over the world. Therefore, big data is believed to be the innovative factor that will create new national values and solve various countries’ fundamental problems [4,5]. In Korea, public data from various fields are provided to private companies and researchers regardless of the original purpose of the data collection, and they are typically available through public data portal sites [6]. The provided data are thus accessible for anyone to collect and analyze. As part of a pilot project for the usage of medical big data, the Korean government has utilized the Korean Meteorological Administration’s (KMA) health meteorological index and developed an early warning service for drug safety [5,7]. The utilization of healthcare data has been increasing since it was made public. If this rapidly growing collection of healthcare-related data is utilized well, innovation in medical services is possible, such as complete diagnosis customized for each patient and illness prevention services [2]. The following examples of innovation in medical services are expected: (1) Complete diagnosis customized for each patient [8]; (2) Illness prevention real-time service: Real-time digital disease surveillance data and public data sharing could shape predictions about future epidemics [9]; (3) Disease prediction using wearable devices: Sensor data can help analyze trends in seasonal respiratory infections, such as influenza [10].

This study aimed to analyze healthcare data using data mining. Data mining refers to a technique that reveals the underlying useful knowledge in big data [11] and is effective at deriving hidden patterns and discovering descriptive, predictive, and understandable models [12,13]. A key example is machine learning. Machine learning has advantages in such tasks as diagnosis, classification, and survival prediction. It has a particular advantage in the healthcare field, as it can analyze various data types and provide predictions regarding daily illness risks, diagnosis, prognosis, and appropriate treatments [14]. Moreover, machine learning may help people make better decisions about treatment options in light of their healthcare needs and diagnosis [15]. Therefore, this study expects that analyzing healthcare data with machine learning can help extend healthy lifespans and reduce medical expenses.

The purpose of this study was to analyze the illnesses that most commonly occur in children and older adults, two socially vulnerable groups that need more protection than other age groups [16], and to provide the degree of daily risk for illnesses based on the number of patients suffering from them in each region. In this process, the optimal machine-learning model was selected after extensive comparison, the optimal hyperparameter was derived through GridSearchCV, and both were verified based on their R² value. This study’s strengths are that the degree of daily risk for each illness for each age group can be provided, and the most common illness for each age group can be determined. As a matter unrelated to actual diagnosis by a doctor, this is expected to raise awareness of and prevent illnesses based on climate data.

2. Related Work

Several prior studies have been conducted on minimizing and preventing the occurrence of epidemics by predicting trends in illnesses based on web data. A representative example is a study by Iso et al. [17] in which the authors attempted to monitor influenza trends based on the online social network platform Twitter; another is a study by Achrekar et al. [18,19] in which the authors attempted to predict the trend of the flu using Twitter data. However, considering the characteristics of web data, there are limitations to this type of data, as advertisements may be included, and some data may be duplicated. To supplement these limitations, Lee et al. utilized actual chicken pox occurrence data provided by the Korea Disease Control and Prevention Agency, as well as data from Naver News and Twitter, to predict the occurrence of chicken pox [20].

Numerous factors are involved in health, and the World Health Organization noted that climate change, such as air pollution, is a significant health threat that leads to various illnesses and premature deaths [21]. Subsequently, many studies have been conducted to predict the possibility of illnesses according to the meteorological and atmospheric environments. Kim et al. analyzed precipitation, lowest and highest temperature, relative humidity, and daily temperature range with Naive Bayes to provide the degree of risk for the common cold [22]. Choi et al. proposed an influenza-like disease monitoring system through statistical analysis [23], and Vidotto et al. studied the effect of air pollution on juvenile rheumatism using statistical analysis [24]. Jeong et al. applied autoregressive integrated moving average (ARIMA) models to the common cold, eye disease, and middle ear inflammation and proposed a model that calculates the expected number of patients for each illness from drug prescription data [5]. Jang et al. developed a cold index by predicting the number of patients suffering from the common cold [25]. There are also studies that have confirmed the correlation between air pollution and cardiovascular disease [26,27,28]. However, these studies conducted analyses with data from individuals of all ages and did not consider differences among the age groups.

Currently, the KMA provides the health meteorological index. This is a collection of data that has indexed the level of possibility of occurrence of illnesses such as food poisoning, asthma, cerebral apoplexy, the common cold, and pollen allergies. However, this service fails to reflect various climatic environments, as it excludes atmospheric environment information and only considers weather conditions, such as lowest temperature and air pressure, in its analysis. Additional limitations of this service are that it does not provide optimized illness information for each age group and that it only provides information on the illnesses most commonly suffered from by people in general.

3. Materials and Methods

To derive the degree of daily risk for illnesses, this study obtained past weather information, past atmospheric information, and medical treatment data from the KMA, a public data portal, and the National Health Insurance Corporation using an application programming interface (API).

The medical treatment data provided by the National Health Insurance Corporation are sample data from one million people per year. Medical data for five years, the most recent of the publicly available data, were used for analysis.

3.1. Data Description

This study defined children as those aged 0–9 years, following the definition of a child in a previous study [29]. Because many countries currently regard people aged 65 years or older as older adults [30], older adults in this study were likewise defined as those aged 65 years or above.

Selecting illnesses for analysis based solely on the 2016 medical data raised the concern that illnesses unusually prevalent in that year would distort the selection. Accordingly, this study selected illnesses for analysis based on medical treatment data over the last five years. The illnesses in the medical data are listed only in the form of code numbers. Table 1 matches each illness code to the illness name as provided by the Korean Informative Classification of Diseases. The 10 most common illnesses in older adults were found to be illnesses unrelated to seasonal factors, such as Alzheimer’s and orthopedic illnesses. Thus, of the top 20 illnesses, those unrelated to seasonal factors were excluded, and the final 8 illnesses were selected as the subjects of analysis.

The regions included in the analysis of this study were Seoul, Busan, Daegu, Incheon, Gwangju, Daejeon, Ulsan, Gyeonggi-do, Gangwon-do, Chungcheongbuk-do, Chungcheongnam-do, Jeollabuk-do, Jeollanam-do, Gyeongsangbuk-do, Gyeongsangnam-do, and Jeju. These areas are defined by the area identifier city/province code used by the public data portal and the KMA.

3.2. Data Preprocessing

The meteorological data (average temperature, lowest temperature, highest temperature, average cloudiness, and precipitation level), atmospheric information data (sulfurous acid gas, carbon monoxide, ozone, nitrogen dioxide, and micro dust levels), and regional identifiers were selected as candidate independent variables. Because atmospheric and meteorological data have high uncertainty, variables must be selected through weighting and correlation analysis [31,32]. Therefore, for all variables other than the regional identifier, which is a categorical label rather than a numeric quantity, multicollinearity was evaluated using the variance inflation factor (VIF), and the variables that did not violate the VIF criterion were defined as the final independent variables. The lowest temperature and the average temperature had a VIF of 10 or above and were correlated with the highest temperature; thus, they were excluded. The final independent variables were the regional identifier, highest temperature, average cloudiness, precipitation level, sulfurous acid gas level, carbon monoxide level, ozone level, nitrogen dioxide level, and micro dust level. Day and season were also included in the final list of independent variables. The independent and dependent variables for each age group are listed in Table 2.
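The VIF screening described above can be illustrated with a minimal sketch. It assumes the standard definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing variable j on the remaining variables, and it uses synthetic stand-in data; the column names, the random data, and the helper `variance_inflation_factors` are hypothetical and are not taken from the study.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs


# Synthetic stand-in data (hypothetical, not the study's measurements):
# three strongly related temperature columns and one unrelated column.
rng = np.random.default_rng(0)
highest = rng.normal(20.0, 8.0, size=500)
average = highest - 5.0 + rng.normal(0.0, 0.5, size=500)
lowest = highest - 10.0 + rng.normal(0.0, 0.5, size=500)
cloudiness = rng.uniform(0.0, 10.0, size=500)

names = ["highest_temp", "average_temp", "lowest_temp", "cloudiness"]
vifs = variance_inflation_factors(
    np.column_stack([highest, average, lowest, cloudiness])
)

# After dropping average and lowest temperature, recompute the VIFs.
reduced_names = ["highest_temp", "cloudiness"]
reduced_vifs = variance_inflation_factors(np.column_stack([highest, cloudiness]))
```

In this toy setting all three temperature columns exceed the threshold of 10 together, so dropping two of them and rechecking the remaining columns mirrors the decision to keep only the highest temperature.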

3.3. Methods

In this study, three different models—linear regression [33], random forest [34], and XGBoost [35]—were comparatively analyzed.

In linear regression, the relationship between the independent variables and the dependent variable is modeled with a linear prediction function whose parameters are estimated from data; the fitted function is referred to as the regression equation. Therefore, if the ultimate goal is to predict a value, a predictive model can be constructed using linearity.

Random forest is a non-linear regression model and an ensemble technique based on decision trees. Multiple models are trained, each on a sample of the training data drawn by bootstrap sampling, that is, repeated sampling with replacement. A prediction is then derived from each model, and the final prediction is the average of all derived predictions. Random forest is easy to use and was selected as a candidate model because it handles multiple independent variables well.
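The bootstrap-and-average idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: one-split regression stumps on a single feature stand in for full decision trees, the random feature selection that a real random forest performs at each split is omitted, and the data and all function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: predict the mean everywhere
        overall = sum(ys) / len(ys)
        return (xs[0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Final prediction is the average of the individual predictions."""
    return sum(predict_stump(m, x) for m in models) / len(models)


# Hypothetical toy data: a step from 1.0 to 5.0 at x = 10.
xs = [float(x) for x in range(20)]
ys = [1.0 if x < 10 else 5.0 for x in xs]
models = fit_bagged_stumps(xs, ys)
low_prediction = predict_bagged(models, 2.0)
high_prediction = predict_bagged(models, 17.0)
```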

XGBoost is a machine-learning technique that has been used to predict various diseases in recent years [36] and has the advantage of being able to perform both non-linear and linear regression. XGBoost improves the performance and speed of gradient boosting, which applies the boosting technique to the decision tree model. The boosting technique combines weak predictive models to create a strong predictive model. After data are predicted with a weak predictive model, another weak predictive model is trained on the error of that prediction. This process is applied sequentially and repeated to reduce the error. Hence, model t is determined by the error from model t-1. Ultimately, the model is trained by finding the weights that minimize the loss between the actual value and the predicted value.
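The sequential error-fitting step can be sketched as follows. This is plain gradient boosting with squared loss and one-split stumps, not XGBoost itself (which adds regularization and other refinements); the toy data, the learning rate, and all function names are assumptions made for illustration.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        overall = sum(ys) / len(ys)
        return (xs[0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Model t is fitted to the residual error left by models 1..t-1."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(actual, predicted):
    return sum((t - y) ** 2 for t, y in zip(actual, predicted)) / len(actual)


# Hypothetical toy data: a linear trend with a jump at x = 10.
xs = [float(x) for x in range(20)]
ys = [0.2 * x + (3.0 if x >= 10 else 0.0) for x in xs]

model = fit_boosted_stumps(xs, ys)
mse_baseline = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mean_squared_error(ys, [predict_boosted(model, x) for x in xs])
```

Each round shrinks the training error a little, which is why a small learning rate is usually paired with a large number of weak predictors.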

Hyperparameters are values set by the user rather than learned from the data. Repeatedly changing these values by hand is limited in practice, and the best value is difficult to identify intuitively. GridSearchCV allows the user to determine the optimal hyperparameters: given a list of candidate values for each hyperparameter, it evaluates every combination of those values and returns the best-performing one.
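To make the combinatorial nature of this search concrete, the sketch below enumerates the candidate values listed in Table 4 with the standard library only. It is not scikit-learn's GridSearchCV: the scoring function `toy_score` is a hypothetical stand-in for the cross-validated R² that a real search would compute, and the parameter names are copied from Table 4.

```python
from itertools import product

# Candidate values as listed in Table 4. The table writes "n_estimator";
# the scikit-learn wrapper of xgboost spells this parameter "n_estimators".
param_grid = {
    "booster": ["gbtree", "gblinear"],
    "colsample_bytree": [0.5, 0.7, 0.9, 1],
    "gamma": [0, 0.1, 0.2, 0.3],
    "learning_rate": [0.001, 0.01],
    "n_estimator": [900, 950, 1000, 1050],
    "subsample": [0.5, 0.7, 0.9, 1],
}


def iter_grid(grid):
    """Yield one dict per combination of candidate values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return the combination with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a cross-validated R²; it simply prefers
    # subsample = 0.9 and gamma = 0.2 so that the search has a clear optimum.
    return -abs(params["subsample"] - 0.9) - abs(params["gamma"] - 0.2)


n_candidates = sum(1 for _ in iter_grid(param_grid))
best_params, best_score = grid_search(param_grid, toy_score)
```

The grid in Table 4 yields 2 × 4 × 4 × 2 × 4 × 4 = 1024 combinations per illness, each of which a real search would have to train and score, so the cost of exhaustive search grows quickly with every added candidate value.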

4. Results and Discussion

4.1. Model Selection

After the data were divided into the children group and the older adults group, models were trained with linear regression, random forest, and XGBoost. To select the optimal model for each group, the R² value in Equation (1) was used as an index; it measures the proportion of the variance in the actual values that is explained by the predicted values. The closer this index is to 1, the more accurate the model. The resulting R² values are shown in Table 3. XGBoost was selected as the final model, as it had the largest R² value for most illnesses.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (t_{i} - y_{i})^{2}}{\sum_{i = 1}^{n} (t_{i} - \bar{t})^{2}}    (1)

where t_{i} is the actual value, y_{i} is the expected (predicted) value, and \bar{t} is the mean of the actual values.
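As a small worked illustration of Equation (1), the following sketch computes R² directly from its definition. The function name and the sample numbers are hypothetical and serve only to show how the index behaves.

```python
def r_squared(actual, predicted):
    """R² = 1 - sum((t_i - y_i)^2) / sum((t_i - mean(t))^2), as in Equation (1)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((t - y) ** 2 for t, y in zip(actual, predicted))
    ss_tot = sum((t - mean_actual) ** 2 for t in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]             # hypothetical daily patient counts
perfect = r_squared(actual, actual)        # predictions equal the actual values
mean_only = r_squared(actual, [6.0] * 4)   # always predicting the mean
close = r_squared(actual, [4.0, 5.0, 7.0, 8.0])
```

A perfect prediction gives 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative value, so R² is not bounded below by 0.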

Table 3. Comparison of R² value for each machine-learning model.

a. Children
Illness	Linear Regression	Random Forest	XGBoost
Acute bronchitis	0.010	0.501	0.504
Common cold	0.242	0.610	0.632
Rhinitis, tonsillitis	0.013	0.447	0.498
Inner ear inflammation	0.147	0.239	0.338
Chronic bronchitis	0.452	0.221	0.271
Intestinal inflammation	0.117	0.138	0.162
Influenza, pneumonia	0.075	0.109	0.153
Eczema, dermatitis	0.147	0.237	0.282
Chicken pox	0.085	0.012	0.089
Conjunctivitis	0.090	0.142	0.199
b. Older adults
High blood pressure, heart disease	0.655	0.842	0.842
Common cold	0.343	0.545	0.558
Esophageal inflammation, gastritis	0.552	0.728	0.739
Acute bronchitis	0.262	0.475	0.477
Eczema, dermatitis	0.465	0.606	0.635
Chronic bronchitis	0.439	0.564	0.589
Rhinitis, tonsillitis	0.342	0.473	0.505
Brain disease	0.284	0.340	0.341

Bold text indicates the best performance among the three models.

4.2. Deriving the Optimal Hyperparameter

GridSearchCV was used to tune the hyperparameters of the XGBoost model. Table 4 lists the hyperparameters that were adjusted, along with their descriptions and candidate values.

The optimal hyperparameters for each illness were derived using GridSearchCV. Table 5 shows the R² obtained with the final hyperparameters. Comparison with Table 3 shows the improvement in R². This study treated a prediction as significant only when R² was above 0.65, and illnesses with an R² below 0.65 were excluded. Thus, four illnesses for children (acute bronchitis, the common cold, rhinitis and tonsillitis, and inner ear inflammation) and six illnesses for older adults (high blood pressure and heart disease, the common cold, esophageal inflammation and gastritis, acute bronchitis, eczema and dermatitis, and chronic bronchitis) are regarded as having been significantly analyzed.

4.3. Designing the Final System

The number of treatments, which is the dependent variable of the predictive models from the preprocessing stage, was standardized to a mean of 0 and a variance of 1. Assuming a normal distribution, cumulative probability points (p-values) of 0.2, 0.4, 0.6, and 0.8 were set as the cutoffs, and the possibility of illness was categorized as very low, low, normal, high, or very high, as seen in Table 6. Figure 1 shows the system user interface proposed in this paper.
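One way to read this categorization is sketched below: a predicted count is standardized, converted to a cumulative probability under the standard normal distribution, and compared against the four cutoffs. The handling of values that fall exactly on a cutoff, the use of the population standard deviation, the sample history, and the function name `risk_label` are assumptions, not details given in the paper.

```python
from statistics import NormalDist, mean, pstdev

LABELS = ["very low", "low", "normal", "high", "very high"]
CUTOFFS = [0.2, 0.4, 0.6, 0.8]


def risk_label(count, history):
    """Map a predicted patient count to one of five risk categories."""
    mu = mean(history)
    sigma = pstdev(history)
    z = (count - mu) / sigma        # standardize: mean 0, variance 1
    p = NormalDist().cdf(z)         # cumulative probability under N(0, 1)
    index = sum(p >= cutoff for cutoff in CUTOFFS)
    return LABELS[index]


# Hypothetical history of daily patient counts (mean 20).
history = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
labels = [risk_label(c, history) for c in (10, 20, 23, 30)]
```

With this history, a count at the mean falls in the middle band, while counts far below or above the mean fall in the outer bands.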

4.4. Discussion

In this study, a system for providing disease risk for children and older adults was proposed using XGBoost. Initially, the goal was to provide disease risk levels for 10 disease categories customized for children and 8 disease categories customized for older adults. However, diseases for which R² did not exceed 0.65 were judged to be insignificant, and finally, four diseases for children and six diseases for older adults were defined as diseases for which disease risk could be provided.

For chronic bronchitis in children, linear regression yielded a higher R² than the other models according to Table 3, but XGBoost, which performed best for most diseases, was selected as the final model. After hyperparameter tuning with GridSearchCV, XGBoost obtained better results than linear regression, which had previously performed best for this illness. However, R² still did not exceed 0.65, so the disease risk could not be provided for this illness.

An examination of the illnesses for which risk is ultimately provided confirms that illnesses with a larger amount of data yielded better results. If more data are collected on the diseases that are not currently selected, more meaningful results are expected to be possible.

5. Conclusions

This study proposed an approach to provide customized daily disease risk for children and older adults using sample medical data of one million patients provided by the National Health Insurance Corporation and weather and atmospheric information provided by the Korea Meteorological Administration (KMA) and a public data portal API. Existing studies and systems have performed comprehensive analyses for all age groups without dividing them. This study overcomes that limitation. By comparing the predictive R² of three models, XGBoost was selected as the final disease risk prediction model. An analysis was defined as significant when R² was greater than 0.65. For children, the system can provide risk levels for four illnesses: acute bronchitis, the common cold, rhinitis and tonsillitis, and inner ear inflammation. For older adults, it can provide risk levels for six illnesses: high blood pressure and heart disease, the common cold, esophageal inflammation and gastritis, acute bronchitis, eczema and dermatitis, and chronic bronchitis.

This study has the following limitations. First, whereas the existing KMA system provides a health meteorological index for each city and state, the medical data utilized in this study only provide information at the level of province and metropolitan city subdivisions. Second, this study could not analyze which factors have a great influence on specific diseases in specific seasons and on specific days of the week. Lastly, the data used in this study extend only to 2016, the most recent among currently available data, so there may be differences from the current situation. If more recent data become available, predictions that reflect the current situation can be expected.

In future research, the model is expected to become interpretable if a method of identifying which data affect disease prediction is added. For this, the importance of features should be analyzed using techniques such as Shapley additive explanations (SHAP) [37] and permutation importance [38].

Although treatment can only be provided by a professional, illnesses may be prevented by heightened awareness of illnesses through the daily illness risks for each age group presented by this study based on collectible meteorological and atmospheric environment-related information.

Author Contributions

Conceptualization, M.K.; methodology, M.K.; formal analysis, M.K., J.J. and S.J.; data curation, J.J. and S.J.; writing—original draft preparation, M.K.; writing—review and editing, J.J., S.J. and S.Y.; supervision, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF-2020R1A2C2010471).

Data Availability Statement

For weather data, we used the neighborhood forecast inquiry service API, and for atmospheric data, we used the Korea Environment Corporation Air Korea air pollution information API. Weather and atmospheric data can be obtained through these APIs after applying for data utilization on the public data portal and obtaining approval (https://www.data.go.kr/index.do, accessed on 24 May 2022). As for the medical treatment data used as the dependent variable, the National Health Insurance Corporation published medical treatment history sample data for 1 million people per year until 2016. Therefore, medical treatment data up to 2016 can be downloaded from the National Health Insurance website in the form of a csv file (https://nhiss.nhis.or.kr/op/up/opup300.do?data_pttn_cd=01, accessed on 24 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Jang, M.; Cho, G.S.; Lee, Y.S.; Kim, M.K.; Oh, S.N. A Study on Predicting Local Cold Patients Using Meteorological Elements. In Proceedings of the Korean Meteorological Society Conference, Seoul, Korea, 23–24 April 2011; pp. 292–293.
2. Choi, J.H. The Era of Fourth Industrial Revolution: Healthcare Industry and ICT Technology. Telco J. 2017, 5, 75–96.
3. Lee, M.H. Case Studies of Advanced Countries in the Fourth Industrial Revolution and Korea’s Response Strategy. Adv. Policy Ser. 2017, 41, 14–107.
4. Kang, H.J. Policy Status and Tasks for Healthcare Big Data. Health Welf. Policy Forum 2016, 2016, 55–71.
5. Chang, J.H.; Kim, Y.J.; Choi, J.H.; Kim, C.S.; Aziz, N. A Prediction of Number of Patients and Risk of Disease in Each Region Based on Pharmaceutical Prescription Data. J. Korea Multimed. Soc. 2018, 21, 271–280.
6. Park, M.J. A Study on Measures to Improve Laws and Policies on the Use of Healthcare Big Data. Korean Med. Law Soc. J. 2018, 26, 163–192.
7. Song, T. Big Data Trend and Utilization Plan for Korean Health and Welfare. Sci. Technol. Policy 2013, 192, 56–73.
8. Ahamed, F.; Farid, F. Applying internet of things and machine-learning for personalized healthcare: Issues and challenges. In Proceedings of the 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), Sydney, Australia, 3–7 December 2018; pp. 19–21.
9. Desai, A.N.; Kraemer, M.U.; Bhatia, S.; Cori, A.; Nouvellet, P.; Herringer, M.; Cohn, E.L.; Carrion, M.; Brownstein, J.S.; Madoff, L.C. Real-time epidemic forecasting: Challenges and opportunities. Health Secur. 2019, 17, 268–275.
10. Radin, J.M.; Wineinger, N.E.; Topol, E.J.; Steinhubl, S.R. Harnessing wearable device data to improve state-level real-time surveillance of influenza-like illness in the USA: A population-based study. Lancet Digit. Health 2020, 2, e85–e93.
11. Kim, W.-S.; Lee, S.-W. An in-depth survey analysis applying data mining techniques. J. Eng. Educ. Res. 2006, 9, 71–82.
12. Dehkordi, S.K.; Sajedi, H. Prediction of disease based on prescription using data mining methods. Health Technol. 2019, 9, 37–44.
13. Zaki, M.J.; Meira, W., Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms; Cambridge University Press: Cambridge, UK, 2014.
14. Ngiam, K.Y.; Khor, W. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019, 20, e262–e273.
15. Jamgade, A.C.; Zade, S. Disease Prediction Using Machine Learning. Int. Res. J. Eng. Technol. 2019, 6, 6937–6938.
16. Hongladarom, S. Buddhist Perspective on Four Vulnerable Groups: Children, Women, the Elderly and the Disabled. In Religious Perspectives on Human Vulnerability in Bioethics; Springer: Berlin/Heidelberg, Germany, 2014; pp. 117–133.
17. Wakamiya, S.; Kawai, Y.; Aramaki, E. Twitter-based influenza detection after flu peak via tweets with indirect information: Text mining study. JMIR Public Health Surveill. 2018, 4, e8627.
18. Achrekar, H.; Gandhe, A.; Lazarus, R.; Yu, S.-H.; Liu, B. Predicting flu trends using twitter data. In Proceedings of the 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Shanghai, China, 10–15 April 2011; pp. 702–707.
19. Gao, Y.; Wang, S.; Padmanabhan, A.; Yin, J.; Cao, G. Mapping spatiotemporal patterns of events using social media: A case study of influenza trends. Int. J. Geogr. Inf. Sci. 2018, 32, 425–449.
20. Lee, M.R.; Kim, J.W.; Jang, B.C. Predicting Chicken Pox Based on Deep Learning. J. Electr. Soc. 2020, 69, 127–137.
21. Chae, S. Beginning Climate Health Impact Assessment and Related Tasks. Health Welf. Policy Forum 2019, 2019, 43–54.
22. Kim, H.J.; Jung, J.S.; Lee, G.H.; Lee, W.C.; Lee, H.C.; Lee, S.W. Tendency Predictive Analysis of the Risk of the Common Cold According to the Weather. Korea Inf. Sci. Soc. J. 2017, 44, 1947–1949.
23. Choi, H.; Choi, W.S.; Han, E. Suggestion of a simpler and faster influenza-like illness surveillance system using 2014–2018 claims data in Korea. Sci. Rep. 2021, 11, 11243.
24. Vidotto, J.; Pereira, L.; Braga, A.; Silva, C.; Sallum, A.; Campos, L.; Martins, L.; Farhat, S. Atmospheric pollution: Influence on hospital admissions in paediatric rheumatic diseases. Lupus 2012, 21, 526–533.
25. Jang, M.; Tak, S.J.; Park, J.M.; Wi, J.B.; Park, D.H.; Seo, S.; Choi, J.H. A Study on the Development of the Cold Index. In Proceedings of the Korean Meteorological Society Conference, Seoul, Korea, 23–24 April 2011; pp. 294–295.
26. Brauer, M.; Casadei, B.; Harrington, R.A.; Kovacs, R.; Sliwa, K.; Group, W.A.P.E. Taking a stand against air pollution—The impact on cardiovascular disease: A joint opinion from the world heart federation, american college of cardiology, american heart association, and the european society of cardiology. Circulation 2021, 143, e800–e804.
27. Downward, G.S.; van Nunen, E.J.; Kerckhoffs, J.; Vineis, P.; Brunekreef, B.; Boer, J.M.; Messier, K.P.; Roy, A.; Verschuren, W.M.M.; van der Schouw, Y.T. Long-term exposure to ultrafine particles and incidence of cardiovascular and cerebrovascular disease in a prospective study of a Dutch cohort. Environ. Health Perspect. 2018, 126, 127007.
28. Wettstein, Z.S.; Hoshiko, S.; Fahimi, J.; Harrison, R.J.; Cascio, W.E.; Rappold, A.G. Cardiovascular and cerebrovascular emergency department visits associated with wildfire smoke exposure in California in 2015. J. Am. Heart Assoc. 2018, 7, e007492.
29. Fomon, S.J.; Haschke, F.; Ziegler, E.E.; Nelson, S.E. Body composition of reference children from birth to age 10 years. Am. J. Clin. Nutr. 1982, 35, 1169–1175.
30. Sanderson, W.; Scherbov, S. Rethinking Age and Aging; Population Reference Bureau: Washington, DC, USA, 2008.
31. Qu, S.; Li, Y.; Ji, Y. The mixed integer robust maximum expert consensus models for large-scale GDM under uncertainty circumstances. Appl. Soft Comput. 2021, 107, 107369.
32. Ji, Y.; Li, H.; Zhang, H. Risk-averse two-stage stochastic minimum cost consensus models with asymmetric adjustment cost. Group Decis. Negot. 2022, 31, 261–291.
33. Weisberg, S. Applied Linear Regression; John Wiley & Sons: Hoboken, NJ, USA, 2005; Volume 528.
34. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
35. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. R Package, Version 0.4-2; Xgboost: Extreme Gradient Boosting. 2015. Available online: https://scholar.google.com/scholar_lookup?hl=en&publication_year=2015&author=T.+Chen&author=T.+He&author=M.+Benesty&author=V.+Khotilovich&author=Y.+Tang&title=Xgboost%3A+Extreme+Gradient+Boosting (accessed on 24 May 2022).
36. Alim, M.; Ye, G.-H.; Guan, P.; Huang, D.-S.; Zhou, B.-S.; Wu, W. Comparison of ARIMA model and XGBoost model for prediction of human brucellosis in mainland China: A time-series study. BMJ Open 2020, 10, e039676.
37. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.
38. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347.

Figure 1. Application.

Table 1. Illnesses in children and older adults selected for analysis.

Illnesses for Children	Code	Illnesses for Older Adults	Code
Acute bronchitis	J20–J22	High blood pressure, heart disease	I10–I15
Common cold	J00–J06	Common cold	J00–J06
Rhinitis, tonsillitis	J30–J39	Esophageal inflammation, gastritis	K20–K31
Inner ear inflammation	H65–H75	Acute bronchitis	J20–J22
Chronic bronchitis	J40–J47	Eczema, dermatitis	L20–L30
Intestinal inflammation	A00–A09	Chronic bronchitis	J40–J47
Influenza, pneumonia	J10–J18	Rhinitis, tonsillitis	J30–J39
Eczema, dermatitis	L20–L30	Brain disease	I60–I69
Chicken pox	B00–B09
Conjunctivitis	H10–H13

Table 2. Independent and dependent variables for children and older adults.

Independent variables		Day, season, city/province code, highest temperature, average cloudiness, precipitation level, sulfurous acid gas level, carbon monoxide level, ozone level, nitrogen dioxide level, and micro dust level
Dependent variables	Children (aged 0–9 years)	Number of acute bronchitis patients
		Number of common cold patients
		Number of rhinitis and tonsillitis patients
		Number of middle ear inflammation patients
		Number of chronic bronchitis patients
		Number of intestinal inflammation patients
		Number of influenza and pneumonia patients
		Number of eczema and dermatitis patients
		Number of chicken pox patients
		Number of conjunctivitis patients
	Older adults (aged 65 years and over)	Number of high blood pressure and heart disease patients
		Number of common cold patients
		Number of esophageal inflammation and gastritis patients
		Number of acute bronchitis patients
		Number of eczema and dermatitis patients
		Number of chronic bronchitis patients
		Number of rhinitis and tonsillitis patients
		Number of brain disease patients

Table 4. Description of hyperparameters.

Hyperparameter Type	Parameter Name	Description	Candidate Values
General	booster	Type of base learner: tree-based (gbtree) or linear (gblinear)	[gbtree, gblinear]
Booster	colsample_bytree	Fraction of columns (features) randomly sampled when building each tree	[0.5, 0.7, 0.9, 1]
Booster	gamma	Minimum loss reduction required to split a leaf node further	[0, 0.1, 0.2, 0.3]
Booster	learning_rate	Shrinkage factor applied to each weak predictor's correction of the residual error	[0.001, 0.01]
Booster	n_estimator	Number of weak predictors (boosting rounds)	[900, 950, 1000, 1050]
Booster	subsample	Fraction of training samples randomly drawn for each tree	[0.5, 0.7, 0.9, 1]
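
To make the size of the search space in Table 4 concrete, the following minimal sketch enumerates every combination of the listed values. It assumes an exhaustive grid search, which is not confirmed here; the names PARAM_GRID and iter_candidates are hypothetical and serve only as illustration. No model is trained: each dictionary is simply one setting that a tuning procedure would have to evaluate.

```python
from itertools import product

# Candidate values copied from Table 4. The key "n_estimator" follows the
# paper's label; in the xgboost scikit-learn wrapper the corresponding
# argument is spelled "n_estimators".
PARAM_GRID = {
    "booster": ["gbtree", "gblinear"],
    "colsample_bytree": [0.5, 0.7, 0.9, 1],
    "gamma": [0, 0.1, 0.2, 0.3],
    "learning_rate": [0.001, 0.01],
    "n_estimator": [900, 950, 1000, 1050],
    "subsample": [0.5, 0.7, 0.9, 1],
}


def iter_candidates(grid):
    """Yield one dict per combination of the candidate values in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Assumption: every combination is evaluated (exhaustive grid search).
candidates = list(iter_candidates(PARAM_GRID))
print(len(candidates))  # 2 * 4 * 4 * 2 * 4 * 4 = 1024
```

Under this assumption, 1024 settings would be evaluated per illness, so with 18 illness models the tuning cost grows quickly; a randomized search over the same values would be one way to reduce it.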

Table 5. Final hyperparameters for each illness.

a. Children
Illness	booster	colsample_bytree	gamma	learning_rate	n_estimator	subsample	R²
Acute bronchitis	gbtree	1	0.2	0.001	1050	0.9	0.7768
Common cold	gbtree	1	0.1	0.001	1050	0.7	0.7308
Rhinitis, tonsillitis	gbtree	0.9	0.2	0.001	1050	0.9	0.6679
Middle ear inflammation	gbtree	1	0.3	0.001	1050	0.7	0.6605
Chronic bronchitis	gbtree	0.9	0.2	0.001	1050	1	0.4612
Intestinal inflammation	gbtree	1	0.2	0.001	1050	0.9	0.3141
Influenza, pneumonia	gbtree	1	0.1	0.001	1050	0.9	0.4092
Eczema, dermatitis	gbtree	0.7	0.1	0.001	1050	1	0.3641
Chicken pox	gbtree	0.5	0.3	0.001	1050	0.7	0.2768
Conjunctivitis	gbtree	0.9	0	0.001	1050	0.9	0.278
b. Older adults
High blood pressure, heart disease	gbtree	1	0.1	0.001	1050	0.9	0.8613
Common cold	gbtree	0.7	0	0.001	1050	0.9	0.7383
Esophageal inflammation, gastritis	gbtree	0.9	0	0.001	1050	0.7	0.7491
Acute bronchitis	gbtree	1	0.2	0.001	1050	0.9	0.7328
Eczema, dermatitis	gbtree	1	0	0.001	1050	0.9	0.6579
Chronic bronchitis	gbtree	0.9	0.2	0.001	1050	0.7	0.6706
Rhinitis, tonsillitis	gbtree	0.9	0.2	0.001	1050	0.7	0.6395
Brain disease	gbtree	1	0.3	0.001	1050	0.9	0.4716

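Bold text in the original table marks illnesses whose prediction is considered significant, defined as R² greater than 0.65. These are the first four illnesses listed for children and the first six listed for older adults.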

Table 6. Cutoff standard and definition of risk of illness.

p-Value	Below 0.2	0.2 to 0.4	0.4 to 0.6	0.6 to 0.8	Above 0.8
Z	Below −0.83	−0.83 to −0.25	−0.25 to 0.25	0.25 to 0.83	Above 0.83
Level	Very low	Low	Normal	High	Very high
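
The cutoffs in Table 6 can be applied mechanically once a Z value is available. The sketch below is illustrative only: it assumes that Z is obtained by standardizing a predicted daily patient count against a reference mean and standard deviation, and that a value lying exactly on a cutoff falls into the higher band. Neither assumption is stated in the table, and the names risk_level, Z_CUTOFFS, and LEVELS are hypothetical.

```python
from bisect import bisect_right

# Z cutoffs and level names copied from Table 6.
Z_CUTOFFS = [-0.83, -0.25, 0.25, 0.83]
LEVELS = ["Very low", "Low", "Normal", "High", "Very high"]


def risk_level(predicted, mean, std):
    """Map a predicted patient count to one of the five risk levels.

    Assumption: Z = (predicted - mean) / std, where mean and std describe
    a reference distribution of daily patient counts.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    z = (predicted - mean) / std
    # bisect_right counts the cutoffs that are <= z, which is the band index;
    # a value exactly on a cutoff is therefore placed in the higher band.
    return LEVELS[bisect_right(Z_CUTOFFS, z)]


print(risk_level(predicted=105, mean=100, std=10))  # Z = 0.5
```

With a hypothetical reference mean of 100 patients and standard deviation of 10, a prediction of 105 gives Z = 0.5, which lies in the 0.25 to 0.83 band and is reported as "High".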

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

