1. Introduction
Since obesity contributes substantially to the global burden of chronic illnesses such as metabolic syndrome, type 2 diabetes, and cardiovascular conditions, it has emerged as a significant public health concern. Although BMI (Body Mass Index) and other traditional techniques for assessing obesity are widely used, they often fail to consider body composition and other important criteria, which leads to misclassification [
1]. One promising tool is ML (machine learning), which draws on diverse datasets to uncover subtle patterns and the interplay of causal factors related to obesity. Recent studies have shown how well machine learning algorithms predict obesity by considering variables including metabolic markers, physical activity, and eating patterns. However, most current methods rely on either a small number of datasets or a narrow set of algorithms, leaving gaps in achieving the best possible predictive accuracy. By evaluating the performance of four machine learning algorithms, decision tree, naïve Bayes, random forest, and K-NN (K-Nearest Neighbor), on a carefully chosen dataset containing several traits associated with obesity, this work seeks to close these gaps. The aim of this study is to identify the algorithm that best predicts obesity, with an emphasis on appropriate feature selection and data processing. The findings add to the expanding corpus of research on machine learning applications in obesity management and lay the groundwork for creating practical, evidence-based strategies to address the obesity pandemic. The dataset, methodology, results, and potential avenues for future research are covered in depth in the sections that follow. The workflow of our research paper is illustrated in
Figure 1.
2. Literature Review
By addressing the shortcomings of BMI, which relies only on height and weight while ignoring body composition and thus hinders accurate obesity management, Jeon et al. uncovered a gap in the use of extensive machine learning models. Their framework combines 3D body scans, dual-energy X-ray absorptiometry (DXA), and bioelectrical impedance analysis (BIA) data and is enhanced through feature selection with a genetic algorithm. The authors’ model outperformed BMI- and BIA-based techniques by a significant margin, reaching an accuracy of 80%. With its robust machine learning methodology, this study exhibited better obesity classification than earlier research with narrower scopes or fewer model comparisons [
2].
Jeon et al. addressed the limitations of models that rely only on lifestyle variables to predict obesity by choosing to emphasize metabolic parameters specific to age and gender. They detected a research gap in the adoption of metabolic data for individualized obesity categorization and employed six machine learning models to predict obesity risk. With an accuracy of more than 70% for younger groups and less for older ones, their results demonstrate the significance of triglycerides, glycated hemoglobin, alanine aminotransferase, i.e., ALT (SGPT), and uric acid as predictors. Unlike earlier studies that concentrated on a smaller number of models or lifestyle factors, this metabolic approach enables more thorough predictions over a wide range of demographics [
3]. Du et al. designed a machine learning-based obesity risk prediction approach in response to the need for accurate and affordable treatments for obesity and its associated chronic conditions.
This strategy evaluates 10 models on extensive data, selecting XGBoost for its better performance compared to conventional approaches, which have limited predictive ability and frequently treat obesity as a binary outcome. By using SHAP (SHapley Additive exPlanations) analysis to identify important variables such as hip circumference and triglycerides, the system demonstrates strong prediction accuracy. In contrast to earlier research, it provides a more comprehensive, multi-stage classification of obesity and improves practical applications for medical practitioners [
4]. Ref. [
5]’s study tackles the problem of differentiating between obese and overweight people, which is a crucial gap in the literature on obesity since most studies focus on comparisons with normal-weight individuals. The study used a machine learning technique that integrates metabolic and neuroimaging analyses to identify unique brain connectivity patterns and gut-derived chemicals as important distinctions. Their models show accuracy above 90%, suggesting that obesity may involve different brain-gut processes than those seen in overweight people. This points to the possibility of targeted intervention techniques based on unique neuro-metabolic markers, in contrast to earlier research that suggested obesity was just an extension of the overweight phenotype [
6].
The need to predict obesity-related conditions like hypertension, dyslipidemia, and type 2 diabetes based on eating habits has increased as a result of the growth in cardiometabolic illnesses. This issue is tackled in the research performed by Ref. [
7]. This study addresses a research gap by utilizing a deep neural network (DNN) to uncover complex relationships in dietary data collected from the Korea National Health and Nutrition Examination Survey, where previous research often relied on traditional statistical approaches. Regarding conditions like hypertension and type 2 diabetes, the DNN model demonstrated superior prediction accuracy compared to logistic regression and decision tree models. By implementing deep learning for more accurate classification and forecasting across various health conditions, this research expands upon previous studies conducted by Panaretos and Choe [
8]. The difficulty of predicting obesity and providing personalized meal suggestions to prevent obesity-related conditions is covered in a study by [
4]. This research fills the gap by integrating obesity prediction with meal planning, whereas earlier studies used standard machine learning techniques only for forecasting obesity. The authors used two machine learning approaches: gradient boosting (GB) and XGBoost. GB achieved a notable accuracy of 98.11% based on a 90:10 train/test split. Furthermore, they recommended meal options based on caloric needs using the Harris–Benedict equation. This method improves upon previous approaches, including those of Pang et al. and Ferdowsy et al., as it offers a comprehensive tool for obesity management that combines forecasting with practical dietary recommendations [
9].
This study covers the worldwide issue of overweight and obesity and the lifestyle-related diseases, such as cardiovascular problems and type 2 diabetes, that people suffer from as a result. The study focuses on a research gap related to the inadequate use of data-driven machine learning methods to analyze obesity risk factors using public health data. The author examines datasets from Kaggle and the UCI (University of California, Irvine) repository with various machine learning models to identify risk factors and underscores the application of regression and visualization methods to enhance strategies for obesity prevention.
This research provides a valuable perspective on data-driven health interventions by evaluating the predictive effectiveness of different machine learning techniques [
7]. The main issues related to childhood obesity, which increases the likelihood of developing long-term health problems such as diabetes and cardiovascular diseases, are examined in the study by Singh and Tawfik. They pointed to issues such as inconsistent BMI standards and data imbalance as restrictions on the use of sophisticated machine learning algorithms for identifying obesity in children. Their method forecasts obesity at age 14 based on past BMI records and improves prediction accuracy by utilizing techniques like SMOTE (Synthetic Minority Over-sampling Technique) and analyzing data from the UK Millennium Cohort Study. The outcome demonstrates that, for “at-risk” children, their forecasting accuracy outperforms more basic statistical models. This research differs from other studies in that it uses sophisticated machine learning techniques to enhance obesity risk prediction [
10]. Colmenarejo’s research focuses on the rising issue of obesity among teenagers, which carries significant health risks such as metabolic disorders and cardiovascular diseases.
This study points to the limitations of traditional statistical models, which struggle to capture the complex and nonlinear relationships present in obesity data. By exploring various models, including those that utilize deep learning methods and extensive datasets, the research demonstrates the superior predictive capabilities of machine learning for identifying obesity risk. When dealing with high-dimensional, multivariate data, machine learning (ML) outperforms simpler statistical methods by offering enhanced insights into risk factors and delivering more accurate predictions [
11,
12,
13].
The growing issue of obesity was tackled by Ref. [
8], who highlighted the importance of identifying which type of lifestyle results in the majority of weight gain across different populations [
14]. The author points out a gap in existing research by employing interpretable machine learning techniques on data from the U.S. National Health and Nutrition Survey (CHNS), keeping in mind the limitations of previous studies that are based on only a single country or on conventional statistical methods [
15]. Using a gradient boosting decision tree, their work indicates that the main factors contributing to obesity risk are protein intake, alcohol consumption, and lack of physical activity. This research aligns with earlier studies recognizing alcohol and inactivity as obesity risk factors but surpasses them in cross-national scope and predictive performance.
3. Research Methodology
In this work, machine learning classifiers are applied to a carefully chosen dataset to predict obesity. The goal of this project is to develop an efficient and understandable machine learning-based system that predicts obesity using RapidMiner. The methodology includes feature selection, data processing, model training, and evaluation. The following algorithms were used in our experiment:
The dataset was split on feature values using a decision tree classifier, thereby creating a tree structure to categorize obesity levels. The tree repeatedly divided the dataset into subgroups according to the characteristics that decrease impurity or maximize information gain, as defined in Equation (1):
IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)    (1)
where H(S) is the entropy of the dataset S, |S_v|/|S| is the proportion of samples where feature A takes value v, and H(S_v) is the entropy of the subset S_v.
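For illustration only, the following Python sketch (our own example, not part of the RapidMiner workflow; the class labels and feature values are hypothetical) computes the entropy and information gain of Equation (1) for a single categorical feature:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """IG(S, A) = H(S) - sum over v of (|S_v|/|S|) * H(S_v)."""
    total = entropy(labels)
    weighted_child_entropy = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted_child_entropy += len(subset) / len(labels) * entropy(subset)
    return total - weighted_child_entropy

# Hypothetical example: obesity class vs. a binary lifestyle feature
labels = np.array(["Obese", "Obese", "Normal", "Normal", "Obese", "Normal"])
feature = np.array([1, 1, 0, 0, 1, 0])
print(information_gain(labels, feature))  # 1.0, since the feature separates the classes perfectly
```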
Random forest is an ensemble learning method that builds several decision trees during training and takes the most common class for classification tasks, or the average prediction for regression tasks, as the final result. The random forest prediction for classification is given in Equation (2):
\hat{y} = mode\{h_1(x), h_2(x), \ldots, h_N(x)\}    (2)
where \hat{y} is the final predicted output, h_i(x) is the prediction of the i-th decision tree for input x, and N is the number of decision trees.
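As an illustrative, hedged sketch of the majority vote in Equation (2), the code below trains a scikit-learn random forest on synthetic data (a stand-in for the obesity dataset, which is not reproduced here) and compares a manual vote over the individual trees with the forest's own prediction; scikit-learn internally averages class probabilities rather than counting hard votes, so the two can occasionally differ:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the obesity features (sizes are illustrative only)
X, y = make_classification(n_samples=200, n_features=8, n_classes=3,
                           n_informative=5, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Equation (2): the final prediction is the mode of the individual tree predictions
sample = X[:1]
tree_votes = [tree.predict(sample)[0] for tree in forest.estimators_]
majority_vote = Counter(tree_votes).most_common(1)[0][0]

print(int(majority_vote), forest.predict(sample)[0])
```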
The K-NN algorithm classifies a data point based on the majority class of its k-nearest neighbors. Distance metric: the degree of resemblance between data points is determined using the Euclidean distance, given in Equation (3) below:
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (3)
where d is the distance measure, d(p, q) is the distance (a non-negative real number) between data points p and q, and p_i and q_i are the i-th coordinates (feature values) of data points p and q, respectively.
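The following sketch, which uses hypothetical feature values rather than the study's data, computes the Euclidean distance of Equation (3) directly and then lets a scikit-learn K-NN classifier assign a label by the majority class of the three nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two hypothetical records with numeric features (e.g., age, height in m, weight in kg)
p = np.array([25.0, 1.70, 68.0])
q = np.array([30.0, 1.65, 95.0])

# Equation (3): Euclidean distance between the two data points
d = np.sqrt(np.sum((p - q) ** 2))
print(d)

# K-NN classification: the label is the majority class of the k nearest neighbors
X_train = np.array([[25, 1.70, 68], [30, 1.65, 95], [22, 1.80, 72], [45, 1.60, 102]])
y_train = np.array(["Normal", "Obese", "Normal", "Obese"])
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)
print(knn.predict([[28, 1.68, 90]]))
```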
Assuming feature independence, the naïve Bayes classification technique is based on Bayes’ theorem. It works well with large datasets and uses feature likelihoods to estimate the probability that a data point belongs to a class, as shown in Equation (4) below:
P(A|K) = \frac{P(K|A) \, P(A)}{P(K)}    (4)
where
P(A|K) = posterior probability of class A given feature vector K;
P(K|A) = likelihood of K given class A;
P(A) = prior probability of class A;
P(K) = marginal probability of K.
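To make Equation (4) concrete, the sketch below first evaluates the posterior with made-up probabilities and then fits a Gaussian naïve Bayes classifier on toy numeric features; the Gaussian variant is one possible choice, since the paper does not state which naïve Bayes implementation RapidMiner applies:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Equation (4) with made-up numbers: P(A|K) = P(K|A) * P(A) / P(K)
p_k_given_a = 0.60   # likelihood of feature vector K given class A
p_a = 0.30           # prior probability of class A
p_k = 0.45           # marginal probability of K
print(p_k_given_a * p_a / p_k)  # posterior P(A|K) = 0.40

# Gaussian naive Bayes on toy data (weight in kg, height in m), assuming feature independence
X = np.array([[68, 1.70], [95, 1.65], [72, 1.80], [102, 1.60]])
y = np.array(["Normal", "Obese", "Normal", "Obese"])
model = GaussianNB().fit(X, y)
print(model.predict([[90, 1.68]]))
```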
3.1. Framework
The proposed framework, which integrates several machine learning classifiers for predicting obesity, proceeds as follows: gathering and preprocessing obesity-related data; feature engineering, i.e., handling missing values and selecting relevant predictors; model training and evaluation, in which machine learning algorithms are trained and validated; and generating predictions with the best-performing model.
Figure 2 shows the framework of the proposed methodology, illustrating the sequence of processes from data collection and preprocessing through feature selection, data splitting, model training, validation, prediction, performance evaluation, and final conclusion.
3.2. Attribute with Description
The dataset used for this investigation includes twenty-one characteristics, covering physical stature, family history of obesity, and lifestyle choices. The target variable categorizes individuals into classes such as normal, overweight, and obese.
Table 1 shows the features used in this research along with feature type and description.
3.3. Replace Missing Values
We used different techniques to handle the missing values in the dataset in order to achieve better performance and accuracy. For numeric features, missing values were replaced with the mean of the respective feature; for categorical features, they were replaced with the most frequent category. This ensures that no data point is lost and that the dataset maintains its integrity.
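A minimal pandas sketch of this imputation strategy is shown below; the column names and values are illustrative and not taken from the actual dataset:

```python
import pandas as pd

# Hypothetical slice of an obesity dataset with missing entries
df = pd.DataFrame({
    "Weight": [68.0, None, 95.0, 102.0],
    "Height": [1.70, 1.65, None, 1.60],
    "Transport": ["Walking", None, "Automobile", "Automobile"],
})

# Numeric features: replace missing values with the column mean
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())

# Categorical features: replace missing values with the most frequent category
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```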
3.4. Split Data
To guarantee proper model training and evaluation, the dataset was divided into two sections: training and testing. The workflow is shown in
Figure 3. Twenty percent of the data were set aside for performance evaluation, and the remaining eighty percent were used for training.
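The 80:20 split can be reproduced with scikit-learn's train_test_split, as in the illustrative sketch below; the arrays are toy stand-ins for the preprocessed features and obesity labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the preprocessed feature matrix and obesity labels
X = np.arange(100).reshape(50, 2)
y = np.array(["Obese", "Normal"] * 25)

# 80% of the records are used for training and 20% are held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 40 10
```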
3.5. Machine Learning
Four distinct machine learning algorithms, decision tree, random forest, K-NN, and naive Bayes, are used in the suggested methodology to forecast obesity based on the carefully selected dataset. Each of these algorithms was implemented within RapidMiner, and their respective performances were evaluated based on accuracy, as shown in
Table 2. The best accuracy of 98.33% was achieved by the decision tree algorithm, which uses a tree-based structure to classify the data. Random forest, an ensemble technique in which many decision trees are aggregated for improved robustness, achieved 98.27% accuracy. The K-NN algorithm, which classifies a sample using the majority class of its neighbors, achieved a high accuracy of 98.03%. Finally, naive Bayes, the probabilistic classifier based on Bayes’ theorem, obtained only 90.08% accuracy; this lower score is attributable to its independence assumption among the features. The results show that tree-based models predict obesity effectively; therefore, for this dataset, the decision tree is the most appropriate algorithm.
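Although the experiments in this study were run in RapidMiner, an equivalent comparison can be sketched in Python with scikit-learn; the synthetic data below merely illustrates the evaluation loop and will not reproduce the reported accuracies:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the curated obesity dataset
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}")
```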
4. Results
Random forest, K-NN, naive Bayes, and decision tree were the four machine learning algorithms examined in this study. At 98.33% accuracy, decision tree had the highest score, followed by random forest at 98.27% and K-NN at 98.03%. Naive Bayes, which assumes feature independence, performed the worst, with a score of 90.08%. Overall, tree-based models performed well, making decision tree the ideal option for this task.