# Identification of Risk Factors Associated with Obesity and Overweight—A Machine Learning Overview


## Abstract


## 1. Introduction

The body mass index (BMI) is defined as “a person’s weight in kilograms divided by the square of their height in meters (kg/m^2)” and is used to assess body composition [3,4,5,6,7]. However, BMI is a rather poor indicator of the percentage of body fat, as BMI does not capture information on the mass of fat in different body sites and is highly dependent on age [8]. In 2012, as identified by the Institute of Medicine, population-based obesity prevention strategies, such as physical activity, a healthy diet, models of healthy social rules, and context-based and tailored recommendations by setting, have the potential to combat “obesity and overweight” [9]. Thus, health behavior change should be given precedence to circumvent severe health damage.

- (1)
- (2) Understanding how the identified risk factors are correlated to weight change with regression analysis and data visualization techniques;
- (3) Reviewing various machine learning (ML) models for the classification and regression of the same selected datasets.

## 2. Methodology for Study Selection

## 3. Related Work

## 4. Methods

#### 4.1. Data Collection

#### 4.2. Data Processing

- Data preprocessing includes data integration, the removal of noisy data that are incomplete and inconsistent, data normalization and feature scaling, encoding of the categorical data, feature selection after correlation analysis, and splitting the data for training and testing a machine learning model.
- Training of a machine learning model and testing its accuracy with a k-fold cross validation.
- Data postprocessing includes pattern evaluation, pattern selection, pattern interpretation, and pattern visualization.
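A minimal sketch of the preprocessing steps above with pandas and scikit-learn (the tiny DataFrame and column names `age`, `sex`, `bmi` are illustrative, not taken from the surveyed datasets):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data standing in for one of the surveyed datasets.
df = pd.DataFrame({
    "age": [25, 32, None, 47, 51, 38],
    "sex": ["m", "f", "f", "m", "f", "m"],
    "bmi": [22.1, 27.4, 31.0, None, 24.3, 29.8],
})

# Remove incomplete (noisy/inconsistent) rows.
df = df.dropna()

# Encode the categorical feature.
df["sex"] = df["sex"].map({"m": 0, "f": 1})

# Feature scaling.
scaler = StandardScaler()
X = scaler.fit_transform(df[["age", "sex"]])
y = df["bmi"].values

# Split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```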

#### 4.3. Statistical Analysis

Hypothesis testing begins with a null hypothesis (**H₀**) that tells us there is nothing different or significant about the data. On the contrary, the alternative hypothesis (**Hₐ**) directly contradicts **H₀**. The level of significance (α) is used to decide whether to accept or reject **H₀**. The value of α is usually kept at 0.05, i.e., 5%, as 100% accuracy is impossible to achieve whether accepting or rejecting **H₀**. Popular, widely used hypothesis testing methods, a short description of each, and the required sample sizes are given in Table 7. A hypothesis test can be either a one-tailed test or a two-tailed test. For each of the testing methods, the resulting probability value (p-value) is compared with α to accept or reject the null hypothesis. The decision may nevertheless carry a type-I error (false positive) or a type-II error (false negative) [31,32,39].

The association between two continuous variables is quantified with the Pearson correlation coefficient (r_xy) that measures the strength of the linear relationship between the two variables.
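As an illustration, a one-sample t-test and the Pearson correlation coefficient are one-liners in `scipy.stats` (the data below are synthetic; α = 0.05):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weight = rng.normal(70, 10, 100)                       # kg, synthetic sample
bmi = weight / (1.75 ** 2) + rng.normal(0, 0.5, 100)   # correlated with weight

# One-sample t-test of H0: mean weight == 72 kg.
t_stat, p_value = stats.ttest_1samp(weight, 72)
reject_h0 = p_value < 0.05  # compare the p-value with alpha

# Pearson correlation coefficient r_xy between weight and BMI.
r_xy, p_corr = stats.pearsonr(weight, bmi)
print(round(r_xy, 2), reject_h0)
```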

#### 4.4. Model Training and Testing

- Load data.
- Data pre-processing:
  - remove missing values from the loaded data;
  - encode categorical features;
  - check the distribution of data and features;
  - remove outliers from the data;
  - remove data redundancy;
  - correlation analysis among features, and feature scaling if required (we compared the correlation between features and removed one of any two features with a correlation higher than 0.9);
  - column/feature selection based on the p-value with the help of “regressor_OLS”;
  - visualize the distribution of the selected features;
  - shuffle the data.

- Split data for training and testing (80:20) with some random state.
- Machine learning model selection, as described in Table 6, based on regression or classification problem statement and building the model.
- K-fold cross validation on data (in our study, K = 5).
- Perform a prediction.
- Evaluate the model performance with metrics, as described in Section 4.5, following discrimination and calibration-based performance measures.
- Perform model tuning with a “grid search” parameter optimization technique.
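The training and testing pipeline above can be sketched with scikit-learn on a synthetic classification problem (the parameter grid is illustrative, not the one used in the study):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for a cleaned, encoded dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 80:20 split with a fixed random state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# K-fold cross validation (K = 5, as in the study).
svm = SVC(kernel="linear")
cv_scores = cross_val_score(svm, X_train, y_train, cv=5)

# "Grid search" parameter optimization.
grid = GridSearchCV(
    SVC(), {"C": [0.01, 0.1, 1], "kernel": ["linear", "rbf"]}, cv=5
)
grid.fit(X_train, y_train)
print(round(cv_scores.mean(), 2), grid.best_params_)
```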

**Note:**

**a.** Selection of the learning rate (α): if it is too small, it slows convergence in gradient descent (GD); if it is too large, GD may converge slowly or even diverge.

**b.** Let “m” training samples have “n” features. If there are too many features (m ≤ n), then delete some features or use regularization with the regularization factor λ.

**c.** If λ is too large, the model may underfit and GD may fail to converge. Increasing λ (toward ∞) leads to a high bias, and decreasing it leads to a high variance.

**d.** Underfitting results in a high bias, and overfitting leads to a high variance.

**e.** If a learning algorithm suffers from high bias, more training data will not help much. If a learning algorithm suffers from high variance, more training data is likely to help.

**f.** C = 1/λ controls the separation margin in an SVM: a large “C” leads to a lower bias and high variance, and a small “C” leads to a higher bias and low variance.

**g.** Gradient descent follows the convex optimization technique with an upper bound (L) and a lower bound (µ) on the curvature of f.
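The effect of λ described in notes (c) and (d) can be demonstrated directly with scikit-learn's ridge regression, where λ is called `alpha` (the data below are synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + rng.normal(0, 0.1, 50)

small = Ridge(alpha=1e-4).fit(X, y)  # weak regularization: flexible fit
large = Ridge(alpha=1e4).fit(X, y)   # strong regularization: coefficients shrink
                                     # toward zero (higher bias)
print(np.abs(large.coef_).max() < np.abs(small.coef_).max())  # True
```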

#### 4.5. Model Evaluation

- Classification metrics: accuracy score, classification report, and confusion matrix.
- Regression metrics: mean absolute error (MAE), mean squared error (MSE), and the R²-score.
- Calibration: goodness-of-fit statistics with the Brier score metric for binary classification. The Brier score is a combination of the calibration loss and the refinement loss. Calibration loss is the mean squared deviation from the empirical probabilities derived from the slope of the receiver operating characteristic (ROC) segments. Refinement loss is the expected optimal loss as measured by the area under the optimal cost curve [41].
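A minimal sketch of the calibration workflow with scikit-learn (the data are synthetic and the choice of a decision tree is illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Calibrate a decision tree's probabilities with isotonic regression.
calibrated = CalibratedClassifierCV(
    DecisionTreeClassifier(random_state=0), method="isotonic", cv=5
)
calibrated.fit(X_tr, y_tr)

# Brier score of the calibrated probabilities (lower is better-calibrated).
prob_pos = calibrated.predict_proba(X_te)[:, 1]
print(round(brier_score_loss(y_te, prob_pos), 3))
```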

- TP—both actual class and predicted class of data point is 1.
- TN—both actual class and predicted class of data point is 0.
- FP—actual class of data point is 0 and predicted class of data point is 1.
- FN—actual class of data point is 1 and predicted class of data point is 0.

The R² regression metric has been used for explanatory purposes, to indicate how well the predicted output values fit the actual output values. It is calculated as one minus the ratio of the MSE to the variance of the actual Y values.
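This definition can be checked numerically against scikit-learn's `r2_score` (the values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

# R^2 = 1 - MSE / Var(Y), with Var(Y) the variance of the actual values.
mse = np.mean((y_true - y_pred) ** 2)
r2_manual = 1 - mse / np.var(y_true)
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True
```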

#### 4.6. Model Store and Reuse

#### 4.7. Assessment of Body Composition

- Underweight: BMI < 18.5.
- Normal weight: BMI is 18.5 to 24.9.
- Overweight: BMI is 25.0 to 29.9.
- Obesity class I: BMI is 30.0 to 34.9.
- Obesity class II: BMI is 35.0 to 39.9.
- Obesity class III: BMI ≥ 40.0.
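These cut-offs translate directly into a small helper function (our own illustration, not code from the paper):

```python
def bmi_category(weight_kg: float, height_m: float) -> str:
    """Classify body composition from BMI = weight / height^2 (kg/m^2)."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    if bmi < 35.0:
        return "obesity class I"
    if bmi < 40.0:
        return "obesity class II"
    return "obesity class III"

print(bmi_category(70, 1.75))  # 70 / 1.75^2 ≈ 22.9 → "normal weight"
```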

## 5. Results and Discussions

For the “Person_Gender_Height_Weight_Index” dataset, the SVM classifier achieved an accuracy of 95%, R² = 0.96, and MAE = 0.06 with the 5-fold cross-validation technique. The best parameters of the SVM are {'C': 0.01, 'gamma': 0.001, 'kernel': 'linear'}, with a score of 95% following the grid search method. The resultant performance metrics of the SVM are depicted in Figure 4.

## 6. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Willett, W.C.; Hu, F.B.; Thun, M. Overweight, obesity, and all-cause mortality. JAMA **2013**, 309, 1681–1682.
- GBD 2015 Obesity Collaborators. Health effects of overweight and obesity in 195 countries over 25 years. N. Engl. J. Med. **2017**, 377, 13–27.
- Ward, Z.J.; Bleich, S.N.; Cradock, A.L.; Barrett, J.L.; Giles, C.M.; Flax, C.; Long, M.W.; Gortmaker, S.L. Projected US State-Level Prevalence of Adult Obesity and Severe Obesity. N. Engl. J. Med. **2019**, 381, 2440–2450.
- WHO Page. Available online: https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight; https://www.who.int/nmh/publications/ncd_report_chapter1.pdf (accessed on 18 March 2020).
- CDC Page. Available online: https://www.cdc.gov/obesity/adult/index.html (accessed on 18 March 2020).
- NICE Page. Available online: https://www.nice.org.uk/guidance/cg189 (accessed on 18 March 2020).
- Csige, I.; Ujvárosy, D.; Szabó, Z.; Lőrincz, I.; Paragh, G.; Harangi, M.; Somodi, S. The impact of obesity on the cardiovascular system. J. Diabetes Res. **2018**, 2018, 3407306.
- Nuttall, F.Q. Body Mass Index: Obesity, BMI, and Health: A Critical Review. Nutr. Today **2015**, 50, 117–128.
- Yang, L.; Colditz, G.A. Prevalence of overweight and obesity in the United States, 2007–2012. JAMA Intern. Med. **2015**, 175, 1412–1413.
- Gerdes, M.; Martinez, S.; Tjondronegoro, D. Conceptualization of a personalized ecoach for wellness promotion. In Proceedings of the 11th EAI International Conference on Pervasive Computing Technologies for Healthcare, Barcelona, Spain, 23–26 May 2017; pp. 365–374.
- Chatterjee, A.; Gerdes, M.W.; Martinez, S. eHealth Initiatives for The Promotion of Healthy Lifestyle and Allied Implementation Difficulties. In Proceedings of the 2019 IEEE International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), Barcelona, Spain, 21–23 October 2019; pp. 1–8.
- Kaggle Data Page. Available online: https://www.kaggle.com/data (accessed on 18 March 2020).
- Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019.
- Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; The PRISMA Group. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med. **2009**, 6, e1000097.
- PRISMA Page. Available online: www.prisma-statement.org (accessed on 18 March 2020).
- Woodward, M. Epidemiology: Study Design and Data Analysis; CRC Press: Boca Raton, FL, USA, 2013.
- Epidemiology Page. Available online: https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/1-what-epidemiology (accessed on 18 March 2020).
- Grabner, M. BMI trends, socioeconomic status, and the choice of dataset. In Obesity Facts; Karger Publishers: Basel, Switzerland, 2012; pp. 112–126.
- Singh, B.; Tawfik, H. A Machine Learning Approach for Predicting Weight Gain Risks in Young Adults. In Proceedings of the 10th IEEE International Conference on Dependable Systems, Services and Technologies (DESSERT), Leeds, UK, 5–7 June 2019; pp. 231–234.
- Farran, B.; AlWotayan, R.; Alkandari, H.; Al-Abdulrazzaq, D.; Channanath, A.; Thangavel, A.T. Use of Non-invasive Parameters and Machine-Learning Algorithms for Predicting Future Risk of Type 2 Diabetes: A Retrospective Cohort Study of Health Data From Kuwait. Front. Endocrinol. **2019**, 10, 624.
- Padmanabhan, M.; Yuan, P.; Chada, G.; van Nguyen, H. Physician-friendly machine learning: A case study with cardiovascular disease risk prediction. J. Clin. Med. **2019**, 8, 1050.
- Selya, A.S.; Anshutz, D. Machine Learning for the Classification of Obesity from Dietary and Physical Activity Patterns. In Advanced Data Analytics in Health; Springer: Cham, Switzerland, 2018; pp. 77–97.
- Jindal, K.; Baliyan, N.; Rana, P.S. Obesity Prediction Using Ensemble Machine Learning Approaches. In Recent Findings in Intelligent Computing Techniques; Springer: Singapore, 2018; pp. 355–362.
- Zheng, Z.; Ruggiero, K. Using machine learning to predict obesity in high school students. In Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 13–16 November 2017; pp. 2132–2138.
- Dunstan, J.; Aguirre, M.; Bastías, M.; Nau, C.; Glass, T.A.; Tobar, F. Predicting nationwide obesity from food sales using machine learning. Health Inform. J. **2019**. Available online: https://journals.sagepub.com/doi/full/10.1177/1460458219845959 (accessed on 18 March 2020).
- DeGregory, K.W.; Kuiper, P.; DeSilvio, T.; Pleuss, J.D.; Miller, R.; Roginski, J.W.; Fisher, C.B.; Harness, D.; Viswanath, S.; Heymsfield, S.B.; et al. A review of machine learning in obesity. Obes. Rev. **2018**, 19, 668–685.
- Golino, H.F.; Amaral, L.S.D.B.; Duarte, S.F.P.; Gomes, C.M.A.; Soares, T.D.J.; Reis, L.A.D.; Santos, J. Predicting increased blood pressure using machine learning. J. Obes. **2014**, 2014, 637635.
- Pleuss, J.D.; Talty, K.; Morse, S.; Kuiper, P.; Scioletti, M.; Heymsfield, S.B.; Thomas, D.M. A machine learning approach relating 3D body scans to body composition in humans. Eur. J. Clin. Nutr. **2019**, 73, 200–208.
- Maharana, A.; Nsoesie, E.O. Use of deep learning to examine the association of the built environment with prevalence of neighborhood adult obesity. JAMA Netw. Open **2018**, 1, e181535.
- Pouladzadeh, P.; Kuhad, P.; Peddi, S.V.B.; Yassine, A.; Shirmohammadi, S. Food calorie measurement using deep learning neural network. In Proceedings of the 2016 IEEE International Instrumentation and Measurement Technology Conference, Taipei, Taiwan, 23–26 May 2016; pp. 1–6.
- Schapire, R.E.; Freund, Y. Boosting: Foundations and algorithms. In Kybernetes; Emerald Insight: Bingley, UK, 2013.
- Brandt, S. Statistical and Computational Methods in Data Analysis; North-Holland Publishing Company: Amsterdam, The Netherlands, 1976.
- Gulis, G.; Fujino, Y. Epidemiology, population health, and health impact assessment. J. Epidemiol. **2015**, 25, 179–180.
- Physio Net Page. Available online: https://physionet.org/about/database/ (accessed on 18 March 2020).
- BMI data GitHub page. Available online: https://github.com/chriswmann/datasets/blob/master/500_Person_Gender_Height_Weight_Index.csv (accessed on 18 March 2020).
- Insurance dataset page. Available online: http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6 (accessed on 18 March 2020).
- Eating-Health-Module-Dataset Description. Available online: https://www.bls.gov/tus/ehmintcodebk1416.pdf (accessed on 18 March 2020).
- Python Page. Available online: https://docs.python.org/ (accessed on 18 March 2020).
- Sklearn Page. Available online: https://scikit-learn.org/stable/supervised_learning.html (accessed on 18 March 2020).
- Steyerberg, E.W.; Vickers, A.J.; Cook, N.R.; Gerds, T.; Gonen, M.; Obuchowski, N.; Pencina, M.J.; Kattan, M.W. Assessing the performance of prediction models: A framework for some traditional and novel measures. Epidemiology **2010**, 21, 128.
- Sklearn Probability Calibration Page. Available online: https://scikit-learn.org/stable/modules/calibration.html (accessed on 18 March 2020).
- Vidaurre, C.; Sannelli, C.; Müller, K.R.; Blankertz, B. Machine-learning-based coadaptive calibration for brain-computer interfaces. Neural Comput. **2011**, 23, 791–816.
- Zimmerman, N.; Presto, A.A.; Kumar, S.P.; Gu, J.; Hauryliuk, A.; Robinson, E.S.; Robinson, A.L.; Subramanian, R. A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring. Atmos. Meas. Tech. **2018**, 11, 291–313.
- Bella, A.; Ferri, C.; Hernández-Orallo, J.; Ramírez-Quintana, M.J. Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010; pp. 128–146.

**Figure 1.** PRISMA flowchart for the article selection process [16].

**Figure 2.** The focused epidemiological study triangle [33].

**Figure 5.** (**a**) Reliability curve to classify the “BMI” data with different ML classifiers. (**b**) Reliability curve to classify the “BMI” data with the “Calibrated Decision Tree”.

**Figure 6.** Correlation heatmap and classification accuracy of ML models to classify the “Insurance” data.

**Figure 11.** (**a**) Obese condition by smoking status; (**b**) distribution of the obese patient group by smoking status.

**Figure 12.** (**a**) Reliability curve to classify the “Insurance” data with different ML classifiers. (**b**) Reliability curve to classify the “Insurance” data with the “Calibrated Decision Tree”.

**Figure 14.** (**a**) Reliability curve to classify the “Eating-health-module” data with different ML classifiers. (**b**) Reliability curve to classify the “Eating-health-module” data with the “Calibrated Decision Tree”.

**Figure 15.** (**a**) Relationship of the outcome (obesity) with blood glucose; (**b**) relationship of the outcome (obesity) with blood pressure; (**c**) relationship of the outcome (obesity) with age.

**Figure 16.** (**a**) Reliability curve to classify the “Diabetes” data with different ML classifiers. (**b**) Reliability curve to classify the “Diabetes” data with the “Calibrated LR”.

**Figure 17.** Correlation heatmap and classification accuracy of ML models to classify the “Cardiovascular-disease” data.

**Figure 18.** (**a**) Performance metrics of the “SVM” classification with a 5-fold cross validation. (**b**) Performance metrics of the “Logistic Regression” classification with a 5-fold cross validation.

**Figure 19.** (**a**) Reliability curve to classify the “Cardiovascular disease” data with different ML classifiers. (**b**) Reliability curve to classify the “Cardiovascular disease” data with the “Calibrated LR”.

**Table 1.** Epidemiological study design [16].

Study Design | Type of Information Collected | Usage of the Information |
---|---|---|
Meta-analysis and systematic reviews | Summary of the evidence of: the worldwide predominance of obesity/overweight; the physiological risks associated with obesity/overweight; the risk factors associated with obesity/overweight; the effectiveness of obesity/overweight prevention plans | Strategy and guideline planning |
Qualitative and quantitative studies | Burden of obesity/overweight in society; correlation of risk factors with body energy imbalance; distribution of obesity prevalence among different age and socio-economic groups; identification of key risk factors, high-risk groups of people, and related datasets; identification of used artificial intelligence (AI) models with their accuracy for classification and regression | Policy, algorithm selection, data selection, controlled trial selection, feasibility study, goal setting, planning, resource allocation, priority setting, impact analysis, and evaluation |

Researcher | Model Use | Risk Factors |
---|---|---|
DeGregory et al. | Linear and logistic regression, artificial neural networks, deep learning, decision tree analysis, cluster analysis, principal component analysis (PCA), network science, and topological data analysis | Inactivity, improper diet |
Singh et al. | Multivariate regression methods and multilayer perceptron (MLP) feed-forward neural network models | BMI |
Bassam et al. | Logistic regression, k-nearest neighbor (KNN), support vector machine (SVM) | Age, sex, body mass index (BMI), pre-existing hypertension, family history of hypertension, and diabetes (type II) |
Meghana et al. | Automatic machine learning (AutoML) | Cardiovascular diseases (CVDs) |
Seyla et al. | SVM | Activity, nutrition |
Jindal et al. | Random Forest | Age, height, weight, BMI |
Zheng et al. | Improved decision tree (IDT), KNN, artificial neural network (ANN) | Inactivity, improper diet |
Dunstan et al. | SVM, Random Forest (RF), Extreme Gradient Boosting (XGB) | Unhealthy diet |
Golino et al. | Classification tree, logistic regression | Blood pressure (BP), BMI, waist circumference (WC), hip circumference (HC), waist–hip ratio (WHR) |
Pleuss et al. | Machine learning (ML) and 3D image processing | BMI, WC, HC |
Maharana et al. | Convolutional neural network (CNN) | Environment, context |
Pouladzadeh et al. | CNN | Nutrition |

Repository | Name | Source | Category |
---|---|---|---|
Kaggle | 500_Person_Gender_Height_Weight_Index | www.github.com [35] | Obesity |
Kaggle | Insurance | www.csueastbay.edu [36] | Obesity |
Kaggle | Eating-health-module-dataset [37] | US Bureau of Labor Statistics | Obesity |
Kaggle/UCI | Pima-Indians-diabetes-database | UCI Machine Learning | Diabetes |
Kaggle | Cardiovascular-disease-dataset | Ryerson University | CVDs |

Type | Sample Size | Key Features |
---|---|---|
Person_Gender_Height_Weight_Index | 500 | Gender, height, weight |
Insurance | 1338 | Age, sex, BMI, smoking, charge, location |
Eating-health-module-dataset | 11212 | Sweet beverages, economic condition, fast food, sleeping, meat and milk consumption, drinking habit, exercise |
Pima-Indians-diabetes-database | 768 | Blood glucose, blood pressure, insulin intake, and age |
Cardiovascular-disease-dataset | 462 | Blood pressure, tobacco consumption, lipid profile, adiposity, family history, obesity, drinking habit, and age |

**Table 5.** Python libraries for data processing [38].

No. | Libraries | Purpose |
---|---|---|
1 | Pandas | Data importing, structuring, and analysis |
2 | NumPy | Computing with multidimensional array objects |
3 | Matplotlib | Python 2-D plotting |
4 | SciPy | Statistical analysis |
5 | Seaborn, plotly | Plotting of high-level statistical graphs |
6 | Scikit-learn (Sklearn) | Machine learning, preprocessing, cross-validation, and evaluating the model’s performance |
7 | Graphviz | Plotting of decision trees |

No. | Methods | Purpose |
---|---|---|
1 | Mean, standard deviation, skewness | Distribution test |
2 | t-test, z-test, F-test, Chi-square | Hypothesis test |
3 | Shapiro–Wilk, D’Agostino’s K^2, and Anderson–Darling test | Normality test |
4 | Covariance, correlation | Association test |
5 | Histogram, Swarm, Violin, Bee Swarm, Joint, Box, Scatter | Distribution plot |
6 | Quantile analysis | Outlier detection |
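For example, the three normality tests listed above are one-liners in `scipy.stats` (the Gaussian sample below is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(0, 1, 200)  # synthetic, genuinely normal data

w_stat, p_shapiro = stats.shapiro(sample)        # Shapiro–Wilk
k2_stat, p_dagostino = stats.normaltest(sample)  # D'Agostino's K^2
anderson_result = stats.anderson(sample)         # Anderson–Darling

# Compare each p-value with alpha = 0.05 to decide on normality.
print(round(p_shapiro, 3), round(p_dagostino, 3))
```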

Method | Description | Samples |
---|---|---|
t-test | Tests whether the mean of a normally distributed variable differs from a specified value (µ0) | Sample size < 30 |
z-test | Tests whether two sample means are equal | Sample size > 30 |
ANOVA or F-test | Tests multiple groups at the same time | More than 2 samples |
Chi-square test | Checks whether the observed pattern (O) of the data fits a given distribution (E) | Two categorical variables from a sample |

|r| Value | Meaning |
---|---|
0.00–0.2 | Very weak |
0.2–0.4 | Weak to moderate |
0.4–0.6 | Medium to substantial |
0.6–0.8 | Very strong |
0.8–1.0 | Extremely strong |

Type | Name | Optimization Method |
---|---|---|
Classification | SVM (kernel = linear or rbf) | Gradient descent |
Classification | Naïve Bayes | Gradient descent |
Classification | Decision Tree (entropy or gini) | Information Gain, Gini |
Classification | Logistic | Gradient descent |
Classification | KNN | Gradient descent |
Classification | Random Forest (RF) | Ensemble |
Calibration Classification | Calibrated Classifier (CV) | Probability (sigmoid, isotonic) |
Regression | Linear Regression | Gradient descent |
Regression | KNeighbors Regressor | Gradient descent |
Regression | Support Vector Regressor | Gradient descent |
Regression | Decision Tree Regressor | Gain, Gini |
Regression | Random Forest Regressor | Ensemble |
Regression | Bayesian Regressor | Gradient descent |
Regularization | Lasso (L1), Ridge (L2) | Gradient descent |

**Table 10.** Machine learning model store [27].

Method | Implementation |
---|---|
Pickle string | Import the pickle library |
Pickled model | Import joblib from the sklearn.externals library |
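Both storage methods from Table 10 can be sketched as follows (note that in recent scikit-learn versions `joblib` is imported directly rather than from `sklearn.externals`; the trained model here is illustrative):

```python
import pickle

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Method 1: serialize the model to a pickle byte string.
s = pickle.dumps(model)
restored = pickle.loads(s)

# Method 2: persist the pickled model to disk via joblib.
joblib.dump(model, "model.joblib")
reloaded = joblib.load("model.joblib")

print((restored.predict(X) == reloaded.predict(X)).all())  # True
```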

Name of the Dataset | Data Processing Reason | Best ML Model with Performance Metrics | Identified Risk Factors |
---|---|---|---|
Person_Gender_Height_Weight_Index | To check the correlation between BMI and weight change; comparing the performance of multiclass classifiers | SVM classifier: accuracy 95%, mean squared error (MSE) 0.08, mean absolute error (MAE) 0.06, R² 0.96 | BMI |
Insurance | To check the impact of identified health risk factors on weight change using regression and correlation; comparing the performance of multiclass classifiers; comparing the performance of regression algorithms; to check whether BMI has any relation with age | Decision tree (DTree) classifier: accuracy 99.64%, MSE 1.0, MAE 0.0, R² 1.0. RF regressor: accuracy 82%, MSE 27240902.29, MAE 3129.09, R² 0.809 | Age, sex, BMI, smoking habit, economic condition |
Eating-health-module-dataset | To check the impact of the identified health risk factors on weight change using regression and correlation; comparing the performance of multiclass classifiers | DTree classifier: accuracy 99.7%, MSE 1.0, MAE 0.0, R² 1.0 | Sweet beverages, economic condition, fast food, sleeping, meat and milk consumption, drinking habit, exercise |
Pima-Indians-diabetes-database | To check the impact of the identified health risk factors on weight change using regression and correlation; comparing the performance of multiclass classifiers; to check the relationship between diabetes type II and obesity | SVM, Naïve Bayes, Logistic Regression (LR): accuracy 78%, MSE 0.209, MAE 0.209, R² 0.080 | Blood glucose, blood pressure, and age |
Cardiovascular-disease-dataset | To check the impact of the identified health risk factors on weight change using regression and correlation; comparing the performance of multiclass classifiers; to check the relationship between heart disease and obesity | SVM and Logistic Regression: accuracy 72%, MSE 0.275, MAE 0.275, R² −0.239 | Blood pressure, tobacco consumption, lipid profile, adiposity, family history, obesity, drinking habit, and age |

Name of the Dataset | Best ML Model | Best Calibration Method | Uncalibrated Brier Score | Calibrated Brier Score |
---|---|---|---|---|
Person_Gender_Height_Weight_Index | Decision Tree | Isotonic | 0.000 | 0.000 |
Insurance | Decision Tree | Isotonic | 0.000 | 0.000 |
Eating-health-module-dataset | Decision Tree | Isotonic, Sigmoid | 0.000 | 0.000 |
Pima-Indians-diabetes-database | Logistic Regression | Isotonic | 0.144 | 0.143 |
Cardiovascular-disease-dataset | Logistic Regression | Isotonic | 0.198 | 0.187 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chatterjee, A.; Gerdes, M.W.; Martinez, S.G. Identification of Risk Factors Associated with Obesity and Overweight—A Machine Learning Overview. *Sensors* **2020**, *20*, 2734.
https://doi.org/10.3390/s20092734
