Machine Learning Approaches for Predicting Fatty Acid Classes in Popular US Snacks Using NHANES Data

In the US, people frequently snack between meals, consuming calorie-dense foods such as baked goods (cakes), sweets, and desserts (ice cream) that are high in lipids, salt, and sugar. Monounsaturated fatty acids (MUFAs) and polyunsaturated fatty acids (PUFAs) are reasonably healthy; however, excessive consumption of foods high in saturated fatty acids (SFAs) has been linked to an elevated risk of cardiovascular disease. The National Health and Nutrition Examination Survey (NHANES) uses a 24 h recall to collect information on the food habits of people in the US. The complexity of the NHANES data necessitates machine learning (ML) methods, a branch of data science that uses algorithms to process large structured and unstructured datasets and identify correlations between the data variables. This study focused on determining the ability of ML regression models, including artificial neural networks (ANNs), decision trees (DTs), k-nearest neighbors (KNNs), and support vector machines (SVMs), to assess the variability in total fat content with respect to the fatty acid classes (SFA, MUFA, and PUFA) of US-consumed snacks between 2017 and 2018. KNN and DT predicted SFA, MUFA, and PUFA with mean squared errors (MSEs) of 0.707, 0.489, 0.612 and 1.172, 0.846, 0.738, respectively. SVM failed to predict the fatty acids accurately, whereas ANN performed satisfactorily. Using ensemble methods, DT (MSE of 10.635, 5.120, and 7.075) showed higher error values than linear regression (LiR) (9.086, 3.698, and 5.820) for SFA, MUFA, and PUFA prediction, respectively. R² scores ranged from −0.541 to 0.983 and from 0.390 to 0.751 for models one and two, respectively. Extreme gradient boosting (XGR), light gradient boosting (LightGBM), and random forest (RF) performed better than LiR, with RF having the lowest MSE in predicting all the fatty acid classes.


Introduction
Energy-dense snacks are most often consumed between meals. They are frequently low in micronutrients but high in calories, lipids, salt, or sugar [1,2]. Hunger, availability, food culture, boredom, and interruption are significant drivers of snacking [2]. In addition, due to urbanization, most consumers are desk-bound and inactive and do not expend the energy obtained from energy-dense foods. This contributes to obesity and other diet-related health problems and has become a growing concern worldwide [3]. Furthermore, consuming a diet low in sodium and saturated fatty acids (SFAs) helps maintain healthy blood pressure and low blood cholesterol levels, respectively [4]. The high incidence of essential hypertension has led to increased efforts to reduce sodium and certain classes of fatty acids.

Modeling
The Orange software, version 3.31, used in this study has tools known as widgets. The data were uploaded into the file widget, and the parameters (target and features) were selected from the various categories in the data. The independent variables included energy, snack type, ingredient weights, total fat content, method of preparation, major nutrient, and ingredient description, while the dependent variables were SFA, MUFA, or PUFA. The default pre-processor included in the models was used. Pre-processing included replacing all missing values (0.1%) with averages, transforming categorical variables into numbers, and scaling by standardization. Standardization was selected because it is less sensitive to outliers, preserves the original data distribution, and is compatible with the selected algorithms, such as KNN and ANN, promoting stability during training. The data were then split into 70% and 30% for training and testing, respectively. The ML algorithms KNN, ANN, SVM, and DT were used to train and test the data. The rectified linear unit (ReLU) activation function with 200 iterations and 100 neurons in the hidden layer was used for the ANN model. The radial basis function (RBF) kernel with 100 iterations was used for SVM. For KNN, distance weighting with the Euclidean metric was selected. These combinations resulted in the best performance for each model. Finally, the model output was evaluated based on the following metrics: root mean squared error (RMSE), mean absolute error (MAE), mean squared error (MSE), and coefficient of determination (R²). Python software version 3.10.6 was used to analyze the same data and create a second model using 500 selected popular snacks from various categories. The 500 selected snacks came from the categories with relatively high amounts of saturated fatty acids. Using different sets of data provides an additional "cushion" in addressing model uncertainty and the stability of the models.
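The Orange workflow above can be approximated in scikit-learn. The sketch below is illustrative only: it uses synthetic stand-in data rather than the NHANES snack table, and mirrors the stated choices of mean imputation, standardization, a 70/30 split, and a distance-weighted Euclidean KNN.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
# Synthetic stand-in for the snack data: total fat (g) plus two other features.
X = rng.uniform(0, 45, size=(300, 3))
y = 0.33 * X[:, 0] + rng.normal(0, 0.5, size=300)  # SFA roughly a third of total fat
X[rng.random(X.shape) < 0.001] = np.nan            # sprinkle ~0.1% missing values

# Mean imputation and standardization, mirroring the Orange default pre-processor.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    KNeighborsRegressor(weights="distance", metric="euclidean"),
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(mean_squared_error(y_test, pred), r2_score(y_test, pred))
```

The pipeline keeps imputation and scaling inside the fitted object, so the test split never leaks into the pre-processing statistics.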
After importing libraries, including pandas (version 1.3.5), seaborn (version 0.11.2), and NumPy (version 1.21.6), the data were uploaded and pre-processed. Pre-processing involved missing value treatment, variable transformation, and reduction. A summary of the data, including shape, range, mean, and first, second, and third quartiles, was obtained. One row with missing values was deleted. Then, a box and whisker plot was generated to determine the data distribution. The features included snack type, flavor type (chocolate, vanilla, and fruits), major nutrients (proteins, carbohydrates, and fats), primary preparation methods (fermentation: yogurt; baking: cakes and doughnuts; frying: chips and popcorn), energy, total fat, SFA, MUFA, and PUFA. Processing methods such as frying have been reported to contribute to the fatty acid content in foods, and flavors such as chocolate contribute to the overall fatty acid content. Feature transformation was done by replacing the non-numerical categories (flavor type, major nutrient, snack type, and preparatory method) with unique integer values using the LabelEncoder function, since the variables had no order or hierarchy. Feature selection was made using the permutation_importance function from the scikit-learn library to compute the most significant variables contributing to the final model. The less essential features, which did not contribute to the modeling (food code, category description, sequence number), were dropped. A heat map was generated to find the correlation between the dependent and independent variables. K-fold cross-validation was used to train and test the datasets using a k of 5 and a random state of 100. The ML algorithms extreme gradient boosting regressor (XGR), DT, light gradient boosting model (LightGBM), and RF were used for predicting the SFA, MUFA, and PUFA contents of snacks from the total fat content. The models were evaluated with the metrics mentioned above.
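The encoding and feature-selection steps described above can be sketched as follows. The column names and data are hypothetical stand-ins for the 500-snack subset: LabelEncoder maps unordered categories to integers, and permutation_importance ranks the features' contributions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
# Hypothetical frame mimicking the selected-snack features.
df = pd.DataFrame({
    "snack_type": rng.choice(["cake", "chips", "yogurt", "candy"], n),
    "flavor_type": rng.choice(["chocolate", "vanilla", "fruit"], n),
    "total_fat": rng.uniform(0, 45, n),
})
# Synthetic target: SFA driven mainly by total fat, slightly by chocolate flavor.
df["sfa"] = (0.33 * df["total_fat"]
             + (df["flavor_type"] == "chocolate") * 2
             + rng.normal(0, 0.5, n))

# Encode the unordered categories as unique integers, as described in the text.
for col in ["snack_type", "flavor_type"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df[["snack_type", "flavor_type", "total_fat"]], df["sfa"]
model = RandomForestRegressor(random_state=100).fit(X, y)

# Permutation importance: shuffle each feature and measure the score drop.
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1])
print(ranking)
```

On data constructed this way, total fat should rank first, mirroring the study's finding that it was the dominant predictor.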
Other ML models include random forest (RF), partial least squares (PLS), partial least squares discriminant analysis (PLS-DA), principal component analysis (PCA), linear regression (LiR), logistic regression (LR), and KNN. PLS is a dimension reduction algorithm that minimizes multicollinearity in a dataset and can be used for multivariate outcomes [39]. Multicollinearity occurs when the independent variables in the dataset correlate with each other, and it results in errors when building models. PLS regression is used when the dependent variable is continuous, and PLS-DA is applied for categorical dependent variables [39]. PCA is an unsupervised ML algorithm that recognizes patterns and is a feature extraction tool. It also gives knowledge of the data's suitability for model creation through correlation [40] by restructuring data into principal components (PCs) based on the variance. If the first few components (PC1 and PC2) together explain 95% of the variance or more, the data are regarded as dimensionally reduced with minimal loss of information [41]. Since PCA distinctively separates similar variables into groups, classification accuracy is reduced when these groups overlap [27]. Linear regression mainly predicts the relationship between a continuous (quantitative) dependent variable and an independent variable by fitting a regression line that reduces the residual error [42]. Its modeling performance is enhanced with datasets devoid of multicollinearity [39]. Although regularization can decrease overfitting in LR, the major demerit is its tendency to oversimplify practical problems [43]. The experimental workflow for the modeling process is illustrated in Figure 1.
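The 95% cumulative-variance criterion for PCA mentioned above can be checked directly. The sketch below uses synthetic, deliberately collinear nutrient-style columns (not the study's dataset) so that the first components absorb most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Correlated nutrient-style features: total fat drives the three FA columns.
fat = rng.uniform(0, 45, 500)
X = np.column_stack([
    fat,
    0.33 * fat + rng.normal(0, 1, 500),   # SFA-like column
    0.35 * fat + rng.normal(0, 1, 500),   # MUFA-like column
    0.25 * fat + rng.normal(0, 1, 500),   # PUFA-like column
])

# Standardize, then inspect how much variance PC1 and PC2 capture together.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(cumvar)
```

For strongly collinear data like this, PC1 and PC2 exceed the 95% threshold, so the four columns could be replaced by two components with minimal information loss, exactly the criterion cited from [41].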


Results and Discussion
The first modeling was done using the entire snack dataset. The results indicated that popular snacks sold in the US contain all three FA classes, with 32.62% SFA, 34.84% MUFA, and 25.57% PUFA. Most of the snacks consumed in India likewise contain all three FA classes [44]. Research has shown that a high-fat diet, especially one high in SFAs, increases the risk of diseases such as obesity, diabetes mellitus, cancers, and cardiovascular diseases. Frozen fruit juice bars in the 'gelatin, ice, and sorbet' category had between 20 and 60 kcal and 0-0.01 g total fat, compared to chocolate-covered and nut candies with about 500-590 kcal and the highest total fat content, ranging between 30 and 43.47 g. Chocolate-covered and nut candies therefore had the least favorable composition. The ML approach was thus used as a high-throughput computational technique to extract essential features from the dataset used in this study (NHANES 2017-2018) and rapidly predict the type of FA in a food product based on its total fat content. All the models for all the FA classes (Table 1) showed a slight difference between the values obtained for the training and testing dataset metrics. Low error rates in models depict high prediction abilities (Table 1). The error (MSE, RMSE, and MAE) margin obtained for the SVM model was very high in all the predictions of the FA classes (SFA > PUFA > MUFA). KNN consistently outperformed DT, SVM, and ANN across the performance metrics for all FA classes, achieving lower errors (MSE, RMSE, MAE) and higher R² scores, indicating better accuracy and a better ability to explain the variance in the data (Table 1). ANN showed moderate performance but fell short of KNN in accuracy and ability to explain the variance. To enhance the prediction performance of the ANN algorithm, the model must be optimized using algorithms, such as gradient descent, to test many configurations and select the one with the least error.
For this study, the ANN model was not optimized, which could explain why its performance was only satisfactory rather than superior. In this study, KNN and DT predicted values close to the values of interest [33]. Feature reduction is one characteristic that improves data quality; for instance, models such as KNN perform well when the number of columns/features is reduced [45]. In this study, eliminating variables such as food code, sequence number, and ingredient description, which were not needed for prediction, reduced the number of features and improved the data. This may explain why KNN was the most suitable model for this dataset, given its consistent and superior performance. KNN also uses all the instances in the training dataset to make predictions when a new dataset (test data) is given [35] and is a straightforward algorithm that makes no assumptions about datasets. It only stores the information from the training data and uses the similarities between the training data and new data to predict values [35].
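The instance-based behavior described for KNN, storing the training data and predicting from the similarity of new points to stored ones, can be written from scratch in a few lines. The values below are illustrative toy numbers, not the study's data.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Distance-weighted KNN regression: nothing is fitted; the stored
    training instances themselves are the model (illustrative sketch)."""
    d = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:k]                       # k nearest neighbours
    w = 1.0 / np.maximum(d[idx], 1e-12)           # closer points weigh more
    return float(np.average(y_train[idx], weights=w))

# Toy data: SFA roughly one third of total fat (g).
X_train = np.array([[3.0], [10.0], [20.0], [30.0], [43.0]])
y_train = np.array([1.0, 3.3, 6.6, 9.9, 14.2])
print(knn_predict(X_train, y_train, np.array([21.0])))  # close to 6.6
```

Because the query point 21.0 sits almost on top of the stored instance at 20.0, the inverse-distance weighting pulls the prediction close to that neighbour's value, which is the behaviour the text attributes to KNN.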
SVM was unsuitable for this work compared to the other algorithms, as indicated by higher MSE, RMSE, and MAE values for all fatty acids in training and testing. The R² scores for SVM were negative or near zero, suggesting that the model did not sufficiently capture the underlying patterns and explain the data variance. SVM often overfits, minimizing its predictive power, which could be a reason for its low scores in this study [30]. SVM's performance might also have been affected by the choice of hyperparameters. Additionally, SVM is best suited for binary classification problems [46] and is mainly used as a classifier rather than a regression model, which could explain its inability to predict the FAs, as the data used in this study were continuous. These techniques can help generate ML models suitable for predicting the content of different nutrients in foods. Table 2 summarizes the quantitative dependent and independent variables in the dataset. The NHANES data are non-linear, and it can be inferred from the summary that the dataset is skewed, because a large part of the data falls within the third quartile while only a few observations occupy the first quartile across all the FA classes. Building a potent model involves problem formulation or data acquisition, tidying data, pre-processing, train-test splitting or cross-validation, model building, validation, prediction, and evaluation of model accuracy. For example, k-fold cross-validation is an evaluation technique that avoids overfitting by splitting a dataset into a specified number of folds for training and testing to accurately assess a model's performance [47]. Creating a second model using selected snacks and ensemble methods yielded slightly different results.
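A minimal sketch of the 5-fold cross-validation scheme just described, using the k of 5 and random state of 100 stated in the Methods, but synthetic stand-in data in place of the snack table:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 45, size=(200, 1))            # total fat (g)
y = 0.35 * X[:, 0] + rng.normal(0, 0.5, 200)     # MUFA-like target

# 5 folds with a fixed random state, as in the study; each fold serves
# once as the test set while the other four train the model.
cv = KFold(n_splits=5, shuffle=True, random_state=100)
scores = cross_val_score(
    RandomForestRegressor(random_state=0), X, y, cv=cv, scoring="r2"
)
print(scores, scores.mean())
```

Averaging R² over five disjoint test folds gives a less optimistic, more stable performance estimate than a single train/test split, which is the overfitting safeguard cited from [47].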
The magnitude of the correlation coefficient (r) reveals how strong an association is. The correlation matrix also aids in feature selection by detecting multicollinearity among independent variables, which affects the final model's performance (Figure 2) [48]. Two independent variables should not have a high correlation. The correlation matrix (Figure 2) showed a strong positive correlation between the independent variable, total fat, and the dependent variables MUFA, PUFA, and SFA, with the highest correlation (0.83) between total fat and MUFA. A robust positive correlation of 0.70 and above means that one variable increases with a corresponding increase in the other [49]; hence, MUFA (0.83), PUFA (0.70), and SFA (0.71) increased in the snacks as the total fat content increased, with MUFA having the highest increase.
Outliers are easily visualized using box and whisker plots, as illustrated (Figure 3), facilitating their treatment in the dataset. Points that lie beyond 1.5 times the interquartile range of the dataset are outliers. For example, from Figure 3, features such as SFA, major nutrients, and ingredient weight had outliers. Ensemble and gradient boosting methods, including RF, XGR, and LightGBM, can analyze data (both linear and non-linear) without linear model assumptions, obtaining accurate results while tolerating outliers [50]. Therefore, treatment of the outliers was not necessary. Linear models, such as LiR, on the other hand, make assumptions, and data variables in which multicollinearity and outliers exist might result in lower algorithm performance; such models may therefore not be suitable for NHANES data. The model performance is summarized in Table 3. Ensemble techniques act as a "black box" that uses numerous features from a dataset and mixed models, typically tree models, to enhance them [54]. LiR was used as a baseline model for comparison. LiR had the highest scores for MAE in all FA classes; however, its error rate was slightly lower than DT for RMSE and MSE. RF showed the lowest error scores for RMSE, MSE, and MAE in predicting the content of all the individual FA classes and was, therefore, the best prediction model. Higher R² values are also associated with decreased error scores, as seen in the MAE, RMSE, and MSE in all the fatty acid models, which indicates that these models were better at prediction. R² ranges from 0 to 1, with 1 being a perfect fit capturing a significant portion of the data variance and 0 meaning no relationship between the predicted and actual values. The R² scores ranged from 0.695 to 0.724, 0.390 to 0.514, 0.635 to 0.722, 0.503 to 0.622, and 0.686 to 0.751 in XGR, DT, LightGBM, LiR, and RF, respectively, for the FA classes.
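The box-plot outlier rule referenced above (points beyond the 1.5 × IQR whiskers) can be expressed directly; the SFA values below are hypothetical illustrations, not study data.

```python
import numpy as np

def iqr_outliers(x):
    """Flag points outside Tukey's 1.5*IQR whiskers, i.e. the points a
    box-and-whisker plot draws individually."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lo) | (x > hi)]

sfa = np.array([1.2, 2.0, 2.5, 3.1, 3.3, 4.0, 4.4, 5.0, 19.5])  # one extreme snack
print(iqr_outliers(sfa))
```

Only the extreme value survives the filter; tree ensembles such as RF tolerate such points, which is why the study left them untreated.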
XGR, LightGBM, and RF exhibited higher R² values, indicating a better ability to explain the variance in the data and to accurately capture trends between the input features and the target variable. The success of XGR, LightGBM, and RF can be attributed to their ability to grasp complex relationships and handle non-linear patterns in the data. RF involves an ensemble of decision trees, which reduces overfitting [30]. Additionally, the performance of XGR and LightGBM can be attributed to the power of gradient boosting algorithms, which iteratively improve the model's predictions by combining weak individual models in an ensemble to provide faster and more accurate predictions [54,55]. DT and LiR had low R² values, meaning that the underlying patterns in the data were not captured effectively by these two models, which resulted in larger error rates. DT and LiR, being simpler models, may struggle to capture the complexities and non-linearities in the data, leading to inferior performance. RF showed the highest R² (Table 3) in predicting all the fatty acids, except for PUFA, where the R² of XGR was slightly higher (Table 3). Model 2's parameter tuning and evaluation of essential attributes allowed it to find the ideal configuration that provided the highest accuracy [51]. The contribution of the independent variables to the various predictions is illustrated in Figure 4. The feature that contributed most to the accurate prediction of all three FA classes was total fat. Variations in total fat in food are closely related to changes in SFA, MUFA, and PUFA levels; as a result, the models perceived total fat as a strong predictor of these specific FA classes. Therefore, using significant parameters as inputs that directly affect the targets provides robust ML models [52]. The prediction scores also differed depending on the fatty acid class (Tables 1 and 3). This may be attributed to the data distribution, with MUFA achieving the best scores across all the models. As the variable whose distribution is closest to normal, MUFA had the highest R² values compared to the other FA classes (Tables 1 and 3), with correspondingly lower MSE, RMSE, and MAE rates [29,53].


Study Limitation
The NHANES data on snacks from 2017 to 2018 were collected on different days across the country, and manufacturers continually improve or develop new products. The data therefore do not fully represent all snacks consumed in the US.

Conclusions
Machine learning techniques can model complex non-linear datasets, incorporating interactions between sparse matrices of nutrition survey data and content variables such as fatty acid classes and snacks, to find non-linear relationships that conventional regression models might miss. Based on raw total fat content, KNN and DT could predict the classes of fatty acids in popular snacks with significant accuracy. In the second model, RF, followed by XGR, most accurately predicted the fatty acid classes, while DT was the least effective. However, choosing the best model depends on various factors, such as the specific problem, dataset size, interpretability requirements, and computational constraints. Faster determination of these nutrients in foods through such models could support intervention strategies by regulatory bodies and the formulation of new or combined ingredients that minimize consumers' calorie intake from snacks. It would also increase awareness of the healthiness of different foods and cater to consumers' demand for personalized nutrition. Deep learning concepts could be developed for other foods that rely on tedious analytical/instrumentation methods, to save time and minimize waste. Given sufficient data, ML algorithms could serve as a faster and more cost-effective means of predicting nutrient content in food.