Machine Learning Algorithms for Predicting Stunting among Under-Five Children in Papua New Guinea

Shen, Hao; Zhao, Hang; Jiang, Yi

doi:10.3390/children10101638


by Hao Shen, Hang Zhao and Yi Jiang *

School of Public Health, Chongqing Medical University, Chongqing 400016, China

* Author to whom correspondence should be addressed.

Children 2023, 10(10), 1638; https://doi.org/10.3390/children10101638

Submission received: 14 August 2023 / Revised: 27 September 2023 / Accepted: 28 September 2023 / Published: 30 September 2023

(This article belongs to the Special Issue Nutrition to Improve Child and Adolescent Health)


Abstract

Preventing stunting is particularly important for healthy development across the life course. In Papua New Guinea (PNG), the prevalence of stunting in children under five years old has persistently failed to improve. Therefore, the primary objective of this study was to employ multiple machine learning algorithms to identify the most effective model and the key predictors of stunting in children in PNG. The study used data from the 2016–2018 Papua New Guinea Demographic and Health Survey, including 3380 children with complete height-for-age data. The least absolute shrinkage and selection operator (LASSO) and random-forest-recursive feature elimination were used for feature selection. Logistic regression, a conditional decision tree, a support vector machine with a radial basis function kernel, and an extreme gradient boosting machine (XGBoost) were employed to construct the prediction model. The performance of the final model was evaluated using accuracy, precision, recall, F1 score, and area under the curve (AUC). The results showed that LASSO-XGBoost had the best performance for predicting stunting in PNG (AUC: 0.765; 95% CI: 0.714–0.819), with accuracy, precision, recall, and F1 scores of 0.728, 0.715, 0.628, and 0.669, respectively. Combined with the SHAP value method, the optimal prediction model identified living in the Highlands Region, the age of the child, being in the richest family, and having a larger or smaller birth size as the top five important characteristics for predicting stunting. Based on the model, the findings support the necessity of preventing stunting early in life. Emphasizing the nutritional status of vulnerable maternal and child populations in PNG is recommended to promote maternal and child health and overall well-being.

Keywords: stunting; children; machine learning; Papua New Guinea

1. Introduction

Stunting is defined as low height relative to age in children [1] and is the most prevalent form of child malnutrition [2]. It occurs mainly during the critical window of 0–24 months [3], which is the most sensitive period of child growth and development. Growth in this period has been found to be especially vulnerable to modifiable environmental factors [4]. The growth deficit continues to accumulate and worsen during early childhood (0–5 years) through continued exposure to adverse environmental factors related to feeding, infections, and psychosocial conditions [5].

The consequences of stunting observed within the first five years of life are far-reaching, encompassing increased morbidity and mortality, impaired cognitive development, poorer academic performance, physical developmental deficits, and diminished economic productivity [6]. Despite some studies suggesting the possibility of catch-up growth in stunted children, there is no conclusive evidence to support the full reversal of the early-life effects of stunting [7,8].

As of 2020, approximately 149 million children under the age of five remain affected by stunting worldwide with the overwhelming majority of cases (96.7%) occurring in low- and middle-income countries [9]. It is evident that stunting in children poses a significant global health challenge [1]. In response, Target 2.2 of the Sustainable Development Goals (SDGs) states that all forms of malnutrition should be eliminated by 2030, which includes stunting in children under five years of age [10].

Despite impressive achievements in reducing stunting in the Western Pacific Region, progress remains slow in some countries [11]. Papua New Guinea (PNG) is among the countries where stunting rates among children under five years old have persistently failed to improve, rising from 47.2% in 2000 to 48.4% in 2020. Surprisingly, this trend contradicts that of PNG’s rapid economic growth [12]. The increase in resources and wealth has not improved the nutritional status of children [13]. Consequently, there is a need to address stunting in children under five years of age in PNG as a serious public health issue.

Previous studies from PNG have explored factors associated with stunting, such as regional disparities, wealth indices, maternal education level, and childhood vaccinations [14,15,16,17]. However, these studies often relied on limited data and lacked national representativeness, limiting the generalizability of their results to the wider PNG population. A few studies have applied nationally representative data from the 2009–2010 Papua New Guinea Household Income and Expenditure Survey (PNG HIES) [18,19] to examine stunting prevalence variations across different regions in PNG. However, the timeliness of the data restricted their scope, and they only adjusted for a limited number of confounding variables.

Machine learning (ML) has emerged as a powerful data-mining technique that is particularly adept at handling high-dimensional data and nonlinear relationships [20,21], and it can outperform classical statistical models in these settings. As a result, ML algorithms have found widespread application in the exploration of the social determinants of health (SDHs) [21]. Algorithms such as decision trees (DTrees), random forests (RFs), support vector machines (SVMs), gradient boosting machines (GBMs), extreme gradient boosting machines (XGBoost), and artificial neural networks (ANNs) are commonly used in studies exploring the factors associated with stunting in children [22,23,24,25,26,27,28]. Evidence from Ethiopia, Tanzania, and Bangladesh [23,26,27,28] showed that traditional logistic regression (LR) often fails to achieve optimal performance in predicting stunting in children compared to other ML algorithms. Consequently, comparing multiple ML algorithms is necessary to identify the best predictive model.

Feature selection (FS), a technique aimed at reducing dimensionality, plays a crucial role in optimizing an algorithm’s predictive performance by eliminating redundant, irrelevant, and noisy features [29]. FS is usually categorized into filter, embedded, wrapper, and hybrid methods [30]. Embedded methods, such as decision trees and the least absolute shrinkage and selection operator (LASSO), perform feature selection as part of model fitting while optimizing an objective function or classifier [31]. Conversely, wrapper methods use repeated learning steps and resampling to evaluate feature usefulness, which can enhance predictive capability but at a higher computational cost [32].

Given that the prevalence of stunting in children under five years of age in PNG remains high, there is a need for targeted programs and effective interventions. Therefore, the main objective of this study is to apply FS techniques with ML algorithms to train, evaluate, and select the best model for predicting stunting in children under five years of age in PNG, based on the nationally representative 2016–2018 Papua New Guinea Demographic and Health Survey (2016–2018 PNG DHS), and to identify the most important features for predicting stunting. The study’s findings will provide evidence for PNG policy makers to plan scientifically sound programs with integrated interventions to prevent child stunting and protect the health of the most vulnerable subgroups of children. This will help accelerate PNG’s progress toward the SDGs related to children’s health.

2. Materials and Methods

2.1. Data Source

The cross-sectional data used in this study were obtained from the 2016–2018 PNG DHS, which was conducted by the PNG National Statistics Office (NSO). This comprehensive national survey covered individuals aged 15–49 years in PNG with the aim to provide current information on key demographic and health indicators. The survey employed a two-stage stratified sampling method to select approximately 19,200 households, with 18,175 women aged 15–49 from the surveyed households eligible for individual interviews. A total of 15,198 women completed the interviews with a response rate of 84%. Child information was collected from mothers or primary caregivers. Structured questionnaires were applied for data collection, and details about the survey can be found in the 2016–2018 PNG DHS final report [33]. For households where male participants were selected for interviews, the 2016–2018 PNG DHS conducted height, length, weight, and mid-upper arm circumference (MUAC) measurements for eligible children under five years of age using equipment provided by UNICEF [33]. Ultimately, all children under five years of age with complete and valid height-for-age data were included in this study with a total of 3380 children meeting the inclusion criteria.

The 2016–2018 Papua New Guinea Demographic and Health Survey (PNG DHS) received ethical approval from the Institutional Review Board of Inner City Fund International. Additionally, informed consent was obtained from respondents for all interviews. On 17 August 2023, the DHS program approved the use of this dataset for this study. All data were desensitized (anonymized by removing all personal identifiers) before being received by the authors. This study was conducted in accordance with relevant guidelines and regulations regarding the published use of DHS datasets and did not require additional ethical review documentation or informed consent due to the use of open secondary data. Further information about DHS data and ethical standards is available at https://dhsprogram.com/methodology/Protecting-the-Privacy-of-DHS-Survey-Respondents.cfm (accessed on 17 July 2023).

2.2. Outcome Variable and Potential Risk Factors

Our outcome variable of interest was stunting, coded as a binary variable. According to criteria developed by the WHO in 2006, children with height-for-age z-scores (HAZs) more than 2 standard deviations below the median of the WHO growth standards are recognized as stunted [1] and were coded as 1, while all others were coded as 0. The conceptual framework proposed by the United Nations Children’s Fund (UNICEF) illustrates that stunting is attributed to complex contextual, underlying, and direct causes [34]. Therefore, based on the results of previous studies, this study incorporated potential factors and divided them into four main categories: individual characteristics, maternal characteristics, family characteristics, and community characteristics.
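The outcome coding described above can be sketched in a few lines of Python. This is an illustration only: the function name, the handling of missing values, and the example z-scores are hypothetical, and the strict inequality at exactly −2 SD is an assumption about how the boundary is treated.

```python
from typing import Optional

STUNTING_CUTOFF = -2.0  # HAZ below -2 SD of the WHO growth standards median


def code_stunting(haz: Optional[float]) -> Optional[int]:
    """Return 1 if stunted, 0 if not, and None when the HAZ is missing."""
    if haz is None:
        return None
    # Assumption: a child exactly at -2 SD is not counted as stunted.
    return 1 if haz < STUNTING_CUTOFF else 0


# Hypothetical z-scores, not taken from the survey data.
hypothetical_haz = [-3.1, -2.0, -1.4, None, 0.6]
labels = [code_stunting(z) for z in hypothetical_haz]
print(labels)  # [1, 0, 0, None, 0]
```

Children with a missing HAZ would be dropped before modeling, which matches the inclusion criterion of complete and valid height-for-age data.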

Individual characteristics included the child’s gender, age, birth size, birth order, duration of breastfeeding, early breastfeeding, and occurrence of diarrhea and fever in the last two weeks. Maternal characteristics included the mother’s age (years), employment status, occupation, marital status, education level, age at first birth (years), exposure to mass media, and their partner’s age (years), employment status, and education level. Following WHO recommendations [35], early breastfeeding was defined as the initiation of breastfeeding within one hour of delivery. Breastfeeding duration was categorized as never breastfeeding, a breastfeeding duration < 6 months, and a breastfeeding duration ≥ 6 months [36]. Exposure to mass media was based on the frequency of women reading newspapers, watching television, and listening to the radio; access to at least one of these media was considered exposure to mass media [37].

The household characteristics encompassed the sex of the household head, the number of children under five years of age, the number of household members, the type of latrine, the source of drinking water, the type of fuel for the kitchen, and the distance to the health facility. Community characteristics covered the place of residence as well as the region. Based on the WHO/UNICEF guidelines [38], the source of drinking water was categorized as unimproved or improved, and the type of latrine was categorized as unfurnished, unimproved, or improved. Based on WHO indoor air quality guidelines [39], kitchen fuel types were categorized as clean or polluting fuels, where clean fuels included electricity and liquefied petroleum gas. Household wealth was a composite index constructed by the 2016–2018 PNG DHS, where a principal component analysis was applied based on the household’s consumer goods and housing characteristics, forming the corresponding household wealth quintiles: poorest, poorer, middle, richer, and richest [33].

2.3. Analytic Strategy

2.3.1. Preprocessing

Data preprocessing was performed using STATA 17.0 statistical software. We conducted an initial bivariate screening using the χ2 test for categorical variables and the Wilcoxon rank sum test for continuous variables, and variables with p > 0.05 were excluded. Descriptive analyses were reported as frequencies for categorical variables and means for continuous variables.

To prepare categorical features for machine learning input, the classical one-hot coding method was employed. After the initial screening, multicategorical variables were converted into multiple binary feature vectors using one-hot coding. This approach ensured that the algorithm did not make erroneous assumptions about variable relationships. Furthermore, the missing indicator method (MIM) was utilized to add indicator metrics to categorical variables containing missing values. The brief analysis steps of this study are shown in Figure 1.
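As an illustration of the encoding step, the following minimal Python sketch combines one-hot coding with a missing indicator for a single categorical variable. The function name, the column naming scheme, and the example values are hypothetical and do not reproduce the study's preprocessing code.

```python
from typing import Dict, List, Optional


def one_hot_with_missing(
    values: List[Optional[str]], name: str
) -> List[Dict[str, int]]:
    """One-hot encode a categorical column; missing values get their own indicator."""
    levels = sorted({v for v in values if v is not None})
    columns = [f"{name}={level}" for level in levels]
    has_missing = any(v is None for v in values)
    rows = []
    for v in values:
        row = {col: 0 for col in columns}
        if has_missing:
            # Missing indicator method: flag the gap instead of dropping the row.
            row[f"{name}=missing"] = 1 if v is None else 0
        if v is not None:
            row[f"{name}={v}"] = 1
        rows.append(row)
    return rows


# Hypothetical birth-size responses, with one missing answer.
birth_size = ["small", "average", None, "large"]
encoded = one_hot_with_missing(birth_size, "birth_size")
for row in encoded:
    print(row)
```

Each child ends up with exactly one active indicator per variable, so no ordering between categories is implied to the learning algorithm.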

2.3.2. Feature Selection

All subsequent analyses were conducted using RStudio 4.2.3 statistical software. The samples were randomly split into test and training sets at a ratio of 1:9, and feature selection was performed only in the training sets to prevent leakage of test data. To mitigate the risk of overfitting, the AUC value under 10-fold cross-validation (CV) served as the performance evaluation metric [40].
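The partitioning scheme can be sketched as follows. The seed, the function names, and the use of a simple (non-stratified) shuffle are assumptions made for illustration; the source does not state whether the split was stratified, and the resulting set sizes are only what a 1:9 split of 3380 rows would give.

```python
import random
from typing import List, Tuple


def split_test_train(
    n: int, test_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[:n_test], indices[n_test:]


def k_folds(indices: List[int], k: int) -> List[List[int]]:
    """Partition the training indices into k nearly equal validation folds."""
    return [indices[i::k] for i in range(k)]


test_idx, train_idx = split_test_train(n=3380, test_fraction=0.1, seed=2023)
folds = k_folds(train_idx, k=10)
print(len(test_idx), len(train_idx))  # 338 3042
print([len(fold) for fold in folds])
```

Feature selection and hyperparameter tuning would then touch only `train_idx` and its folds, leaving `test_idx` untouched until the final evaluation.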

We selected the embedded LASSO and the wrapper-based random-forest-recursive feature elimination (RF-RFE) for feature selection. The LASSO controls model shrinkage via the penalty parameter (λ). By selecting the λ value that produced the highest AUC value, we identified the features with non-zero regression coefficients, which formed the optimal feature subset. RF-RFE, in contrast, measured feature importance using the Gini-based mean decrease in impurity (MDI) after fitting the RF model. The process was repeated to recursively eliminate the least important features until the combination of features with the highest model AUC value was obtained. For this analysis, ntree was set to 500, and mtry was set to the recommended \sqrt{p} [41], where p is the number of candidate features.
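For reference, the LASSO penalty for a binary outcome is commonly written as below. This is the standard L1-penalized logistic regression objective, not a formula taken from the source; the symbols n, x_i, y_i, \pi_i, and \beta are introduced here for illustration, and the 1/n scaling of the log-likelihood is one common convention among several.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0, \beta}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n}
    \left[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \right]
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\pi_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top} \beta)}}
```

Larger values of λ shrink more coefficients \beta_j exactly to zero, which is why the features with non-zero coefficients at the chosen λ can be read off as the selected subset.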

2.3.3. Machine Learning Algorithms and Hyperparameter Tuning

Grid search (GS) is a traversal search for predefined hyperparameter values performed by a given algorithm. While it is suitable for low-dimensional hyperparameter tuning, it incurs high computational costs [42]. Bayesian optimization based on Gaussian process regression (BO-GPR) leverages a priori information from Gaussian process regression to rapidly converge to the global optimal solution, making it more adept at handling high-dimensional hyperparameter optimization problems with limited iterations [43].

In this study, AUC values were used as performance evaluation metrics for hyperparameter combinations. The GS and BO-GPR strategies under 10-fold cross-validation were applied for the hyperparameter tuning of the following models. In addition, the logistic regression (LR) model, fitted with the generalized linear function (GLM) as a binomial family, required no hyperparameter tuning due to its inherent simplicity and well-defined structure.

The conditional inference tree (CTree) is a special kind of decision tree [44] which embeds a tree regression model into a well-defined conditional inference process. Therefore, easily interpretable classification results can be produced [45]. The CTree included two hyperparameters for controlling the size of the tree’s growth, namely the 1-p value (mincriterion) and the maximum depth of the tree (maxdepth) [46]. In selecting the predefined hyperparameter values, caution was exercised, and insights from the relevant literature [47,48,49] were taken into consideration. GS was applied to search for the optimal values of maxdepth and mincriterion, which were confined to the following range:

\mathrm{maxdepth} = [1, 30]

(1)

\mathrm{mincriterion} = \{0.900, 0.950, 0.990\}

(2)
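A grid search over the ranges in Equations (1) and (2) can be sketched as follows. The scoring function is a hypothetical stand-in for the 10-fold cross-validated AUC and does not reflect the study's results; integer steps for maxdepth are also an assumption.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def grid_search(
    grid: Dict[str, Iterable],
    cv_auc: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Score every hyperparameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cv_auc(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


ctree_grid = {
    "maxdepth": range(1, 31),               # Equation (1), integer depths assumed
    "mincriterion": (0.900, 0.950, 0.990),  # Equation (2)
}


def toy_cv_auc(params: Dict[str, float]) -> float:
    """Hypothetical stand-in for the 10-fold CV AUC; not the study's results."""
    depth_penalty = 0.001 * abs(params["maxdepth"] - 5)
    criterion_penalty = abs(params["mincriterion"] - 0.95)
    return 0.65 - depth_penalty - criterion_penalty


best, score = grid_search(ctree_grid, toy_cv_auc)
print(best, score)
```

With 30 depths and 3 criterion values the search evaluates 90 combinations, which illustrates why exhaustive traversal is practical only for low-dimensional tuning.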

The support vector machine (SVM) is a versatile algorithm used for addressing classification and regression problems. It can classify data linearly and can also employ kernel tricks to handle nonlinear data [50]. For instance, the radial basis function (RBF) kernel transforms the input space into a high-dimensional feature space, facilitating the modeling of nonlinear data [51]. In this study, an SVM with an RBF kernel, which has relatively few hyperparameters, was used to classify the data. The hyperparameters requiring tuning were the penalty parameter (C) and the kernel parameter (σ). Following the recommendations of related studies [52,53,54], we applied GS to search for the best C and σ in the following predefined sets:

C \in \{2^{k} : k = -5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15\}

(3)

σ \in \{2^{k} : k = -15, -13, -11, -9, -7, -5, -3, -1, 1, 3, 5, 7, 9, 11\}

(4)
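The two exponent grids in Equations (3) and (4) are easy to enumerate, which makes the size of the search explicit. The variable names below are illustrative only.

```python
c_exponents = range(-5, 16, 2)       # -5, -3, ..., 15 (Equation 3)
sigma_exponents = range(-15, 12, 2)  # -15, -13, ..., 11 (Equation 4)

c_values = [2.0 ** e for e in c_exponents]
sigma_values = [2.0 ** e for e in sigma_exponents]

n_combinations = len(c_values) * len(sigma_values)
print(len(c_values), len(sigma_values), n_combinations)  # 11 14 154
print(c_values[0], c_values[-1])  # 0.03125 32768.0
```

Spacing the candidates by powers of two covers several orders of magnitude with only 154 combinations, each of which is scored by cross-validated AUC.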

The extreme gradient boosting machine (XGBoost) is a scalable ensemble algorithm based on gradient boosting decision trees, which is known for pushing the computational limits of boosted trees [55]. The performance of XGBoost depends heavily on tuning a large number of hyperparameters, summarized as follows: the maximum number of boosting iterations (nrounds), the learning rate (eta), the minimum loss reduction required for a split (gamma), the minimum sum of instance weights in a child node (min_child_weight), the maximum tree depth (max_depth), the fraction of training rows sampled per tree (subsample), and the fraction of columns sampled per tree (colsample_bytree). Following the recommendations of related studies [56,57,58], we employed BO-GPR to search for the hyperparameters of XGBoost in the ranges presented in Table 1.

2.3.4. Model Performance Evaluation

The final performance of the model in the test set was measured with the AUC, accuracy, precision, recall, and F1 score. As the main performance metric, the AUC summarizes overall model performance across all possible classification thresholds. The confusion matrix is a square matrix containing the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), allowing the one-dimensional performance metrics below to be derived from it [59].
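The threshold-free nature of the AUC can be seen from its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below uses that definition directly; the function name and the labels and scores are hypothetical.

```python
from typing import List


def auc_by_pairs(labels: List[int], scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for six children (1 = stunted).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.4]
toy_auc = auc_by_pairs(toy_labels, toy_scores)
print(round(toy_auc, 3))  # 0.722
```

Because only the ordering of scores matters, the AUC does not change when the classification threshold is moved, unlike the confusion-matrix metrics.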

Accuracy, defined as the ratio of the number of correct predictions to the total number of predictions, was the most common measure of overall prediction performance.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(5)

Precision was defined as the ratio of the number of correct positive predictions to the total number of positive predictions, reflecting the consistency of the predictions with the positive cases in the test set.

\mathrm{Precision} = \frac{TP}{TP + FP}

(6)

Recall, also known as sensitivity, was defined as the ratio of the number of correct positive predictions to the total number of positives, reflecting the effectiveness of the model in predicting positive cases.

\mathrm{Recall} = \frac{TP}{TP + FN}

(7)

The F1 score was the harmonic mean of precision and recall, balancing the two in a single measure of how well the predicted positives match the positive cases in the test set [60].

\mathrm{F1\ score} = \frac{2TP}{2TP + FP + FN}

(8)
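Equations (5)–(8) translate directly into code. The confusion-matrix counts below are hypothetical. The last lines check that the harmonic mean of the precision and recall reported for LASSO-XGBoost in this article (0.715 and 0.628) reproduces the reported F1 score of 0.669.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the metrics in Equations (5)-(8) from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, not the study's confusion matrix.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
print(metrics)

# F1 as the harmonic mean of the reported precision and recall.
reported_precision, reported_recall = 0.715, 0.628
f1_check = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(round(f1_check, 3))  # 0.669
```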

Under the default classification threshold, a predicted probability p > 0.50 is classified as positive. However, this default is often unsuitable for imbalanced data. To estimate the optimal threshold, we used the closest top-left method, which selects the point on the ROC curve closest to the upper-left corner, and reported the above one-dimensional performance metrics at that threshold [61].
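A minimal sketch of the closest top-left criterion follows: for each candidate threshold, it computes the squared distance (1 − sensitivity)² + (1 − specificity)² to the upper-left corner of the ROC plot and keeps the threshold with the smallest distance. Using the observed scores as candidate thresholds and classifying scores at or above the threshold as positive are simplifying assumptions; the labels and scores are hypothetical.

```python
from typing import List


def closest_topleft_threshold(labels: List[int], scores: List[float]) -> float:
    """Return the threshold whose ROC point is nearest to (FPR=0, TPR=1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_threshold, best_distance = None, float("inf")
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        sensitivity = tp / n_pos
        specificity = 1 - fp / n_neg
        distance = (1 - sensitivity) ** 2 + (1 - specificity) ** 2
        if distance < best_distance:
            best_threshold, best_distance = t, distance
    return best_threshold


# Hypothetical predictions for seven children (1 = stunted).
toy_labels = [1, 1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1]
optimal = closest_topleft_threshold(toy_labels, toy_scores)
print(optimal)  # 0.6
```

In this toy example the threshold 0.6 captures all three positives at the cost of one false positive, which places it closer to the corner than any stricter or looser cut-off.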

3. Results

3.1. Descriptive Results

Of the 3380 children under five years of age in this study, 1342 (39.70%) were stunted, and the mean age was 29.73 months. Most were boys (53.11%) and had been breastfed for ≥6 months (Table 2). Regarding maternal characteristics, most (62.59%) mothers were not employed. Approximately half of the mothers (50.41%) had received a primary education, as had 45.40% of their partners. Household and community characteristics indicated that 16.45% of the children came from the poorest households, almost half (46.19%) did not have access to improved water sources, the majority (76.36%) resided in rural areas, and about one-third (30.86%) came from the Southern Region (Table 2).

The prevalence of stunting was highest in the Highlands Region (58.97%) compared to other regions, and stunting prevalence among children from the poorest households (53.60%) was almost twice as high as that of children from the richest households (25.39%). Furthermore, the results of the χ2 test and Wilcoxon rank sum test showed that variables such as children’s birth order, early breastfeeding, occurrence of diarrhea and fever in the last two weeks, and the age and marital status of the mother and her partner were not significantly associated with child stunting, and thus, they were excluded from the follow-up study (Table 2).

The prevalence of stunting among children in PNG varied across provinces, with the Southern Highlands showing the highest rate, while Manus and the National Capital District were less affected. The Highlands Region provinces, including Southern Highlands, Enga, Hela, Western Highlands, Jiwaka, Chimbu, and Eastern Highlands, also exhibited higher stunting rates compared to other provinces (Figure 2).

3.2. Feature Selection Results

Figure 3 presents the process of feature selection using the LASSO and RF-RFE methods. For LASSO, the model achieved the best AUC value (AUC: 0.669) at λ = 0.0051, resulting in the shrinkage of regression coefficients for 34 features to 0 and representing approximately 59.6% of all features. On the other hand, using RF-RFE, the model attained the best AUC value (AUC: 0.672) after removing the first 27 least important features, accounting for about 47.4% of all features.
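The reported percentages imply the size of the candidate feature set, which can be checked with a few lines of arithmetic. The total of 57 features is inferred here from the figures above and is not stated explicitly in the text.

```python
lasso_removed, lasso_share = 34, 0.596
rfe_removed = 27

# Infer the total number of candidate features from the LASSO figures.
total_features = round(lasso_removed / lasso_share)

lasso_pct = round(100 * lasso_removed / total_features, 1)
rfe_pct = round(100 * rfe_removed / total_features, 1)
lasso_kept = total_features - lasso_removed
rfe_kept = total_features - rfe_removed

print(total_features)         # 57
print(lasso_pct, rfe_pct)     # 59.6 47.4
print(lasso_kept, rfe_kept)   # 23 30
```

Under this inferred total, the LASSO subset would retain 23 features and the RF-RFE subset 30, consistent with both reported percentages.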

3.3. Hyperparameter Tuning Results

Table 3 summarizes the best hyperparameters of CTree, SVM-RBF, and XGBoost models under 10-fold cross-validation using the GS or BO-GPR strategy. With the LASSO optimal feature subset, SVM-RBF demonstrated the best prediction performance in the training set (AUC: 0.671). Furthermore, the performance of CTree, SVM-RBF, and XGBoost in the training set improved after FS.

3.4. Evaluation of the Prediction Models

Table 4 and Figure 4 summarize the final performance of LR, CTree, SVM-RBF, and XGBoost using the test set. The results indicate that XGBoost, under the LASSO FS method, provided the best prediction performance (AUC: 0.765; 95% CI: 0.714–0.819), and the model’s accuracy, precision, recall, and F1 scores at the optimized threshold (0.487) were 0.728, 0.715, 0.628, and 0.669. Moreover, CTree exhibited the worst performance without using the FS method (AUC: 0.695; 95% CI: 0.639–0.750) (Table 4 and Figure 4).

The final performance of all models improved after using feature selection, indicating that the FS method effectively eliminated noise or redundant information while preserving crucial features of the original model (59.6% dimensionality reduction for LASSO and 47.4% dimensionality reduction for RF-RFE). The impact of feature selection varied depending on the optimized model: for CTree and XGBoost, the performance was best with LASSO, while for LR and SVM-RBF, the performance was optimized using RF-RFE (see Table 4 and Figure 4).

3.5. Model Interpretation

SHapley Additive exPlanations (SHAP) is a feature attribution method based on a game-theoretic framework that helps reveal the decision-making process of complex “black-box models” such as XGBoost. As mentioned above, we used the SHAP value method to explain the XGBoost prediction model under the LASSO optimal feature subset.
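The game-theoretic idea behind SHAP can be illustrated with exact Shapley values on a toy model: each feature's attribution is its marginal contribution to the prediction, averaged over every order in which features could be added. The value function and its numbers below are entirely hypothetical and are not derived from the study. Tree-based SHAP implementations use faster algorithms than this brute-force enumeration.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_risk(present: FrozenSet[str]) -> float:
    """Hypothetical risk score with one interaction term (illustrative only)."""
    risk = 0.0
    risk += 0.20 if "highlands" in present else 0.0
    risk += 0.08 if "poorest" in present else 0.0
    risk += 0.10 if "small_birth_size" in present else 0.0
    if "highlands" in present and "poorest" in present:
        risk += 0.06  # interaction shared equally by the two features
    return risk


names = ["highlands", "poorest", "small_birth_size"]
phi = shapley_values(names, toy_risk)
print({k: round(v, 3) for k, v in phi.items()})
```

The attributions sum to the difference between the full prediction and the empty baseline (0.44 here), which is the additivity property that allows SHAP values to be read as per-feature contributions to the model output.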

3.5.1. SHAP Summary Plots

The SHAP summary chart sorted the characteristics vertically from highest to lowest based on the mean absolute SHAP values. We selected the top 15 characteristics to illustrate their relative importance in predicting stunting in children (refer to Figure 5). Notably, living in the Highlands Region, the child’s age, belonging to the wealthiest family, and having a larger or smaller birth size were identified as the top five most significant factors.

Additionally, the SHAP summary chart represents each child’s features as points, which are colored according to their feature values, ranging from low (blue) to high (red). For binary feature vectors, red dots indicate the presence of the corresponding feature in the individual child. The SHAP value on the horizontal axis reflects the contribution of the feature to the model output, with higher SHAP values indicating a greater predicted likelihood of stunting. Specifically, children from the Highlands Region, those with smaller birth sizes, and those in the poorest households had SHAP values > 0 for the corresponding characteristics, indicating a higher probability of stunting. In contrast, children from the wealthiest families, female children, and those of larger birth size had SHAP values < 0 for the corresponding characteristics, indicating a lower probability of stunting (see Figure 5).

3.5.2. SHAP Dependence Plot of Child’s Age

To provide a more intuitive view of the relationship between feature values and the model’s expected output, we constructed a dependency plot for child age (a continuous variable) versus SHAP values. The plot included points representing different individual children. The smoothed line of partial regression demonstrated a positive association between child age and SHAP values when children were ≤24 months old. After 24 months of age, SHAP values stabilized and remained positive for the vast majority of children (Figure 6).

4. Discussion

Based on the nationally representative PNG DHS 2016–2018 dataset, our study suggests that the LASSO-XGBoost combination had the best performance in predicting stunting among children under five years old in PNG (AUC: 0.765; 95% CI: 0.714–0.819). The optimal model identified living in the Highlands Region, the age of the child, being in the wealthiest household, and having a larger or smaller birth size as the top five most important characteristics for predicting stunting in children, reflecting the complexity of the causes of stunting. Critical findings of the study include the following.

Firstly, the study found that children residing in the Highlands Region were at a very high risk of stunting, and stunting was also most prevalent in the region (58.97%), which is similar to the findings of an earlier study using the 2009–2010 PNG HIES [19]. Food insecurity is one of the key underlying causes of child malnutrition [34]. In the Highlands Region, children are extremely vulnerable to food insecurity caused by events such as extreme weather and social conflict [62,63]. Long-term food deprivation [64] is associated closely with chronic malnutrition. Diets in the Highlands excessively rely on mono-foods (such as starchy foods like sweet potatoes and sago, among others) [63,65], and nutritionally unbalanced feeding practices could also contribute to linear growth deficits in children [14,64].

Secondly, the study highlights the strong association between household wealth, child age, and stunting. Children from wealthier households face a lower risk of stunting, which is potentially due to their better resilience against food insecurity [66], improved access to healthcare facilities [67], and ability to access high-protein foods [68,69]. Furthermore, the study observed a rapid increase in the risk of stunting in children aged 0–24 months, which is in line with previous cross-country studies [3]. This emphasized the urgency of early intervention to prevent stunting from exacerbating the cycle of deprivation, especially among the most vulnerable groups of children living in poverty [70].

Finally, the study underscores the significance of birth size in determining a child’s growth potential. A smaller birth size is associated with a higher risk of stunting in children, while a larger birth size is a protective factor. Maternal malnutrition during pregnancy could be a potential cause of smaller birth sizes, leading to altered fetal and placental growth patterns and contributing to impaired fetal growth [71,72]. Existing studies have emphasized the importance of intrauterine health in preventing stunting in children [73]. Therefore, it is necessary to focus on and improve the nutritional status of pregnant women and women of childbearing age (15–49 years) in PNG. However, it is important to emphasize that exploring the relationship between birth size and stunting still requires caution due to the subjective assessment of the child’s birth size by mothers.

Moreover, being female, mothers’ exposure to mass media, mothers’ secondary education level, and their partners’ higher education level were discovered to be protective factors against stunting. Evidence from the Highlands Margins of PNG suggests that gender heterogeneity in stunting may be attributed to girls’ growth strategies, which prioritize growth over maintenance to meet future reproductive potential [74]. Meanwhile, mothers and their partners with high levels of education were likely to have better incomes, leading to improved nutrition for their children [75]. And mothers exposed to mass media were more likely to acquire knowledge about proper modern healthcare practices and correct inappropriate attitudes [76].

Overall, our results show that the combination of ML and FS techniques provides a better classification of stunting. After BO-GPR hyperparameter tuning with 10-fold cross-validation, the LASSO-XGBoost model achieved the best predictive performance (AUC: 0.765; 95% CI: 0.714–0.819) and outperformed traditional logistic regression (LR). In particular, LASSO and RF-RFE facilitated efficient model learning by removing redundant and noisy information, resulting in substantial dimensionality reductions of 59.6% and 47.4%, respectively. Thus, we suggest prioritizing the combination of LASSO and XGBoost when predicting stunting among children in PNG.

This study had several important strengths. Firstly, the data were derived from the nationally representative PNG DHS 2016–2018. Secondly, the study used ML algorithms and FS techniques to make better predictions; these have not been widely used in related research in PNG and could provide lessons for researchers working on similar topics in PNG and other Pacific Island countries. Nevertheless, some potential limitations remain. Firstly, the SHAP value method employed in the study describes associations learned by the model and cannot establish causal inferences; therefore, the interpretability of the results is still limited. Secondly, although we tried to include as comprehensive a set of variables as possible, we could not exclude residual confounding caused by unmeasured variables such as the mother’s height and weight. Moreover, some of the children’s information was derived from their mother’s recollections (for instance, the occurrence of diarrhea and fever in the child in the last two weeks), so there may be recall bias.

5. Conclusions

Based on cross-sectional data from the nationally representative PNG DHS 2016–2018, this study used ML algorithms with FS techniques to identify the optimal model and crucial factors for predicting stunting in children under five years of age in PNG. The results show that the combination of LASSO and XGBoost had the best predictive performance. Living in the Highlands Region, the child’s age, being in the richest household, and having a larger or smaller birth size emerged as the top five important characteristics for predicting stunting. The findings emphasize the importance of early-life interventions to prevent stunting, especially for the most vulnerable groups of children in the marginalized Highlands Region. Therefore, more robust public health policies and interventions are needed to enhance maternal nutrition and disseminate accurate knowledge of modern healthcare practices, thereby promoting maternal and child health and well-being in PNG.

Author Contributions

Conceptualization, H.S. and H.Z.; methodology, H.S. and H.Z.; software, H.S. and H.Z.; validation, H.S. and H.Z.; formal analysis, H.S. and H.Z.; investigation, H.S. and H.Z.; resources, H.S. and H.Z.; data curation, H.S. and H.Z.; writing—original draft preparation, H.S. and H.Z.; writing—review and editing, H.S. and H.Z.; visualization, H.S. and H.Z.; supervision, Y.J.; project administration, Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable, as this study was a secondary analysis of existing data. For information on ethical approval of primary data collection, please visit https://dhsprogram.com/methodology/Protecting-the-Privacy-of-DHS-Survey-Respondents.cfm (accessed on 17 July 2023).

Informed Consent Statement

Informed consent was received from all subjects participating in the study. Informed consent forms for the original data collection can be accessed at https://dhsprogram.com/methodology/Protecting-the-Privacy-of-DHS-Survey-Respondents.cfm (accessed on 17 July 2023).

Data Availability Statement

Access to the dataset is available at https://dhsprogram.com/data/dataset/Papua-New-Guinea_Standard-DHS_2017.cfm?flag=0 (accessed on 17 July 2023).

Acknowledgments

We express our sincere gratitude to all the staff and participants of the 2016–2018 PNG DHS for their invaluable cooperation.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. World Health Organization. WHO Child Growth Standards: Length/Height-for-Age, Weight-for-Age, Weight-for-Length, Weight-for-Height and Body Mass Index-for-Age: Methods and Development; World Health Organization: Geneva, Switzerland, 2006.
2. De Onis, M.; Branca, F. Childhood stunting: A global perspective. Matern. Child Nutr. 2016, 12, 12–26.
3. Victora, C.G.; De Onis, M.; Hallal, P.C.; Blössner, M.; Shrimpton, R. Worldwide timing of growth faltering: Revisiting implications for interventions. Pediatrics 2010, 125, e473–e480.
4. Black, M.M.; Walker, S.P.; Fernald, L.C.; Andersen, C.T.; DiGirolamo, A.M.; Lu, C.; McCoy, D.C.; Fink, G.; Shawar, Y.R.; Shiffman, J. Early childhood development coming of age: Science through the life course. Lancet 2017, 389, 77–90.
5. Leroy, J.L.; Ruel, M.; Habicht, J.-P.; Frongillo, E.A. Linear growth deficit continues to accumulate beyond the first 1000 days in low-and middle-income countries: Global evidence from 51 national surveys. J. Nutr. 2014, 144, 1460–1466.
6. Prendergast, A.J.; Humphrey, J.H. The stunting syndrome in developing countries. Paediatr. Int. Child Health 2014, 34, 250–265.
7. Leroy, J.L.; Frongillo, E.A.; Dewan, P.; Black, M.M.; Waterland, R.A. Can children catch up from the consequences of undernourishment? Evidence from child linear growth, developmental epigenetics, and brain and neurocognitive development. Adv. Nutr. 2020, 11, 1032–1041.
8. Leroy, J.L.; Frongillo, E.A. Perspective: What does stunting really mean? A critical review of the evidence. Adv. Nutr. 2019, 10, 196–204.
9. World Health Organization. Levels and Trends in Child Malnutrition: UNICEF; World Health Organization: Geneva, Switzerland, 2021.
10. Osborn, D.; Cutter, A.; Ullah, F. Universal sustainable development goals. Underst. Transform. Chall. Dev. Ctries. 2015, 2, 1–25.
11. World Health Organization. The Health-Related Sustainable Development Goals: Progress Report of the Western Pacific Region, 2020; World Health Organization: Geneva, Switzerland, 2021.
12. Hou, X. Stagnant stunting rate despite rapid economic growth in Papua New Guinea. In Factors Correlated with Malnutrition Among Children Under Five; World Bank Policy Research Working Paper; World Bank: Washington, DC, USA, 2015.
13. Banks, G. Papua New Guinea National Human Development Report; United Nations Development Programme: New York, NY, USA, 2014.
14. Pham, B.N.; Silas, V.D.; Okely, A.D.; Pomat, W. Measuring wasting and stunting prevalence among children under 5 years of age and associated risk factors in Papua New Guinea: New evidence from the Comprehensive Health and Epidemiological Surveillance System. Front. Nutr. 2021, 8, 622660.
15. Samiak, L.; Emeto, T.I. Vaccination and nutritional status of children in Karawari, East Sepik Province, Papua New Guinea. PLoS ONE 2017, 12, e0187796.
16. Wand, H.; Lote, N.; Semos, I.; Siba, P. Investigating the spatial variations of high prevalences of severe malnutrition among children in Papua New Guinea: Results from geoadditive models. BMC Res. Notes 2012, 5, 288.
17. Hall, J.; Walton, M.; Van Ogtrop, F.; Guest, D.; Black, K.; Beardsley, J. Factors influencing undernutrition among children under 5 years from cocoa-growing communities in Bougainville. BMJ Glob. Health 2020, 5, e002478.
18. van der Meulen Rodgers, Y.; Kassens, A.L. Women’s asset ownership and children’s nutritional status: Evidence from Papua New Guinea. Soc. Sci. Med. 2018, 204, 100–107.
19. Hou, X. Stagnant Stunting Rate despite Rapid Economic Growth—An Analysis of Cross Sectional Survey Data of Undernutrition among Children under Five in Papua New Guinea. AIMS Public Health 2016, 3, 25.
20. Juarez-Orozco, L.E.; Martinez-Manzanera, O.; Storti, A.E.; Knuuti, J. Machine learning in the evaluation of myocardial ischemia through nuclear cardiology. Curr. Cardiovasc. Imaging Rep. 2019, 12, 5.
21. Khourdifi, Y.; Bahaj, M. Heart disease prediction and classification using machine learning algorithms optimized by particle swarm optimization and ant colony optimization. Int. J. Intell. Eng. Syst. 2019, 12, 242–252.
22. Kino, S.; Hsu, Y.-T.; Shiba, K.; Chien, Y.-S.; Mita, C.; Kawachi, I.; Daoud, A. A scoping review on the use of machine learning in research on social determinants of health: Trends and research prospects. SSM-Popul. Health 2021, 15, 100836.
23. Khan, J.R.; Tomal, J.H.; Raheem, E. Model and variable selection using machine learning methods with applications to childhood stunting in Bangladesh. Inform. Health Soc. Care 2021, 46, 425–442.
24. Talukder, A.; Ahammed, B. Machine learning algorithms for predicting malnutrition among under-five children in Bangladesh. Nutrition 2020, 78, 110861.
25. Haris, M.S.; Anshori, M.; Khudori, A.N. Prediction of stunting prevalence in east java province with random forest algorithm. J. Tek. Inform. Jutif 2023, 4, 11–13.
26. Rahman, S.J.; Ahmed, N.F.; Abedin, M.M.; Ahammed, B.; Ali, M.; Rahman, M.J.; Maniruzzaman, M. Investigate the risk factors of stunting, wasting, and underweight among under-five Bangladeshi children and its prediction based on machine learning approach. PLoS ONE 2021, 16, e0253172.
27. Lucy Lawrence, S. Predicting Stunting Status among Children under Five Years: The Case Study of Tanzania. Ph.D. Thesis, University of Rwanda, Kigali, Rwanda, 2021.
28. Fenta, H.M.; Zewotir, T.; Muluneh, E.K. A machine learning classifier approach for identifying the determinants of under-five child undernutrition in Ethiopian administrative zones. BMC Med. Inform. Decis. Mak. 2021, 21, 291.
29. Gutkin, M.; Shamir, R.; Dror, G. SlimPLS: A method for feature selection in gene expression-based disease classification. PLoS ONE 2009, 4, e6416.
30. Abiodun, E.O.; Alabdulatif, A.; Abiodun, O.I.; Alawida, M.; Alabdulatif, A.; Alkhawaldeh, R.S. A systematic review of emerging feature selection optimization methods for optimal text classification: The present state and prospective opportunities. Neural Comput. Appl. 2021, 33, 15091–15118.
31. Venkatesh, B.; Anuradha, J. A review of feature selection and its methods. Cybern. Inf. Technol. 2019, 19, 3–26.
32. Yan, K.; Zhang, D. Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sens. Actuators B Chem. 2015, 212, 353–363.
33. National Statistical Office; ICF. Papua New Guinea Demographic and Health Survey 2016-18; ICF: Rockville, MD, USA, 2019.
34. Akanji, O.O. UNICEF: The State of the World’s Children 1998. Econ. Financ. Rev. 1998, 36, 6.
35. World Health Organization. Indicators for Assessing Infant and Young Child Feeding Practices: Part 1: Definitions: Conclusions of a Consensus Meeting Held 6–8 November 2007 in Washington DC, USA; World Health Organization: Geneva, Switzerland, 2008.
36. Campos, A.P.; Vilar-Compte, M.; Hawkins, S.S. Association between breastfeeding and child stunting in Mexico. Ann. Glob. Health 2020, 86, 145.
37. Fatema, K.; Lariscy, J.T. Mass media exposure and maternal healthcare utilization in South Asia. SSM Popul. Health 2020, 11, 100614.
38. WHO/UNICEF Joint Water Supply; Sanitation Monitoring Programme. Progress on Drinking Water and Sanitation: 2014 Update; World Health Organization: Geneva, Switzerland, 2014.
39. World Health Organization. Burning Opportunity: Clean Household Energy for Health, Sustainable Development, and Wellbeing of Women and Children; World Health Organization: Geneva, Switzerland, 2016.
40. Berrar, D. Cross-validation. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019.
41. Darst, B.F.; Malecki, K.C.; Engelman, C.D. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet. 2018, 19, 65.
42. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059.
43. Wu, J.; Chen, X.-Y.; Zhang, H.; Xiong, L.-D.; Lei, H.; Deng, S.-H. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
44. Hothorn, T.; Hornik, K.; Zeileis, A. Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Stat. 2006, 15, 651–674.
45. Hothorn, T.; Hornik, K.; Zeileis, A. ctree: Conditional inference trees. Compr. R Arch. Netw. 2015, 8, 1–34.
46. Hothorn, T.; Hornik, K.; Strobl, C.; Zeileis, A.; Hothorn, M.T. Package ‘party’. Packag Ref Man Party, Version 0.9-998. 2015; Volume 16, 37.
47. Sarda-Espinosa, A.; Subbiah, S.; Bartz-Beielstein, T. Conditional inference trees for knowledge extraction from motor health condition data. Eng. Appl. Artif. Intell. 2017, 62, 26–37.
48. Mantovani, R.G.; Horváth, T.; Cerri, R.; Junior, S.B.; Vanschoren, J.; de Carvalho, A.C.P.d.L.F. An empirical study on hyperparameter tuning of decision trees. arXiv 2018, arXiv:1812.02207.
49. Nembrini, S. Prediction or interpretability? Emerg. Themes Epidemiol. 2019, 16, 4.
50. Ghosh, S.; Dasgupta, A.; Swetapadma, A. A study on support vector machine based linear and non-linear pattern classification. In Proceedings of the 2019 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, India, 21–22 February 2019; pp. 24–28.
51. Han, S.; Qubo, C.; Meng, H. Parameter selection in SVM with RBF kernel function. In Proceedings of the World Automation Congress 2012, Puerto Vallarta, Mexico, 24–28 June 2012; pp. 1–4.
52. Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification; National Taiwan University: Taipei City, China, 2003.
53. Thombre, A.M. Effect of outlier removal on grid search and distance between two classes (the techniques to find hyperparameter, sigma of support vector machine). In Proceedings of the 2019 IEEE Pune Section International Conference (PuneCon), Pune, India, 18–20 December 2019; pp. 1–8.
54. Duarte, E.; Wainer, J. Empirical comparison of cross-validation and internal metrics for tuning SVM hyperparameters. Pattern Recognit. Lett. 2017, 88, 6–11.
55. Dong, W.; Huang, Y.; Lehane, B.; Ma, G. XGBoost algorithm-based prediction of concrete electrical resistivity for structural health monitoring. Autom. Constr. 2020, 114, 103155.
56. Ogunleye, A.; Wang, Q.-G. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019, 17, 2131–2140.
57. Kavzoglu, T.; Teke, A. Advanced hyperparameter optimization for improved spatial prediction of shallow landslides using extreme gradient boosting (XGBoost). Bull. Eng. Geol. Environ. 2022, 81, 201.
58. Anggoro, D.A.; Mukti, S.S. Performance Comparison of Grid Search and Random Search Methods for Hyperparameter Tuning in Extreme Gradient Boosting Algorithm to Predict Chronic Kidney Failure. Int. J. Intell. Eng. Syst 2021, 14, 198–207.
59. Luque, A.; Carrasco, A.; Martín, A.; de Las Heras, A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. 2019, 91, 216–231.
60. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
61. Ruopp, M.D.; Perkins, N.J.; Whitcomb, B.W.; Schisterman, E.F. Youden Index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. J. Math. Methods Biosci. 2008, 50, 419–430.
62. Food Assistance to El-Niño Affected Populations in Papua New Guinea; World Food Programme: Rome, Italy, 2016.
63. Gwatirisa, P.R.; Pamphilon, B.; Mikhailovich, K. Coping with drought in rural Papua New Guinea: A western highlands case study. Ecol. Food Nutr. 2017, 56, 393–410.
64. Jacka, J.K.; Posner, S. How the Enga Cope with Frost in the 21st Century: Food Insecurity, Migration, and Development in the Papua New Guinea Highlands. Hum. Ecol. 2022, 50, 273–286.
65. Benjamin, A.; Mopafi, I.; Duke, T. A Perspective on Food and Nutrition in the PNG Highlands. Food Secur. Papua New Guin. 2001, 11, 94.
66. Schmidt, E.; Dorosh, P.; Gilbert, R. Impacts of COVID-19 induced income and rice price shocks on household welfare in Papua New Guinea: Household model estimates. Agric. Econ. 2021, 52, 391–406.
67. Jayanthan, J.; Irava, W.; Anuranga, C.; Rannan-Eliya, R. Impact of out-of-pocket expenditures on families and barriers to use of maternal and child health services in Papua New Guinea: Evidence from the Papua New Guinea household survey 1996 and household income and expenditure survey 2009–2010. In Country Brief; ADB RETA-6515 Country Brief Series; 2012; Available online: https://www.adb.org/sites/default/files/publication/30344/impact-oop-expenditures-mnch-services-png.pdf (accessed on 13 August 2023).
68. Schmidt, E.; Mueller, V.; Rosenbach, G. Rural households in Papua New Guinea afford better diets with income from small businesses. Food Policy 2020, 97, 101964.
69. Mueller, I.; Vounatsou, P.; Allen, B.; Smith, T. Spatial patterns of child growth in Papua New Guinea and their relation to environment, diet, socio-economic status and subsistence activities. Ann. Hum. Biol. 2001, 28, 263–280.
70. Stephenson, L.S.; Latham, M.C.; Ottesen, E. Malnutrition and parasitic helminth infections. Parasitology 2000, 121, S23–S38.
71. Thame, M.; Wilks, R.J.; McFarlane-Anderson, N.; Bennett, F.I.; Forrester, T.E. Relationship between maternal nutritional status and infant’s weight and body proportions at birth. Eur. J. Clin. Nutr. 1997, 51, 134–138.
72. Barker, D.J.; Godfrey, K.M.; Gluckman, P.D.; Harding, J.E.; Owens, J.A.; Robinson, J.S. Fetal nutrition and cardiovascular disease in adult life. Lancet 1993, 341, 938–941.
73. Victora, C.G.; Villar, J.; Barros, F.C.; Ismail, L.C.; Chumlea, C.; Papageorghiou, A.T.; Bertino, E.; Ohuma, E.O.; Lambert, A.; Carvalho, M. Anthropometric characterization of impaired fetal growth: Risk factors for and prognosis of newborns with stunting or wasting. JAMA Pediatr. 2015, 169, e151431.
74. Decaro, J.A.; Decaro, E.; Worthman, C.M. Sex differences in child nutritional and immunological status 5–9 years post contact in fringe highland Papua New Guinea. Am. J. Hum. Biol. 2010, 22, 657–666.
75. Frost, M.B.; Forste, R.; Haas, D.W. Maternal education and child nutritional status in Bolivia: Finding the links. Soc. Sci. Med. 2005, 60, 395–407.
76. Haile, D.; Azage, M.; Mola, T.; Rainey, R. Exploring spatial variations and factors associated with childhood stunting in Ethiopia: Spatial and multilevel analysis. BMC Pediatr. 2016, 16, 49.

Figure 1. Analysis flow diagram.

Figure 2. Spatial distribution of stunting rates for children under five years of age by provincial-level divisions in Papua New Guinea; 2016–2018 PNG DHS.

Figure 3. Feature selection process for LASSO and RF-RFE. (a) feature selection process for LASSO; (b) feature selection process for RF-RFE.

Figure 4. ROC curve and AUC performance of the prediction models using the test set.

Figure 5. SHAP summary plot with top 15 contributing features for XGBoost models.

Figure 6. SHAP dependence plot of child’s age.

Table 1. Hyperparameter tuning range for XGBoost.

Hyperparameter	Range	Type
Eta	(0.01, 0.3)	Real
Gamma	(0, 0.2)	Real
Subsample	(0.1, 1)	Real
Colsample bytree	(0.1, 1)	Real
Nrounds	[1, 200]	Integer
Maxdepth	[1, 20]	Integer
Min child weight	[1, 20]	Integer
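The search space in Table 1 can be written down as a small data structure. The sketch below transcribes the ranges and draws configurations with plain random search, which is used here only as a simple stand-in for Bayesian optimization; the key names, the function names, and `placeholder_objective` are illustrative assumptions. In practice the objective would be the cross-validated AUC of a model trained with the sampled configuration.

```python
import random

# Ranges transcribed from Table 1: (low, high, type).
SEARCH_SPACE = {
    "eta": (0.01, 0.3, "real"),
    "gamma": (0.0, 0.2, "real"),
    "subsample": (0.1, 1.0, "real"),
    "colsample_bytree": (0.1, 1.0, "real"),
    "nrounds": (1, 200, "integer"),
    "max_depth": (1, 20, "integer"),
    "min_child_weight": (1, 20, "integer"),
}

def sample_config(space, rng):
    """Draw one configuration uniformly from the search space."""
    config = {}
    for name, (low, high, kind) in space.items():
        if kind == "integer":
            config[name] = rng.randint(low, high)  # both ends inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config

def random_search(objective, space, n_trials=50, seed=0):
    """Random search (a stand-in, not Bayesian optimization): keep the best trial."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(space, rng)
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def placeholder_objective(config):
    """Hypothetical score; a real run would return cross-validated AUC."""
    return -abs(config["eta"] - 0.1) - abs(config["max_depth"] - 6) / 20

best_config, best_score = random_search(placeholder_objective, SEARCH_SPACE)
```

Bayesian optimization differs in that each new configuration is proposed from a surrogate model of the previous trials rather than drawn independently, which usually needs fewer evaluations to reach a comparable score.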

Table 2. Prevalence of stunting in children under 5 in Papua New Guinea by characteristics; PNG DHS 2016–2018.

Variables	N	Frequency (%)/Mean (SD)	Not stunted (%)	Stunted (%)	p-Values
Individual characteristics
Child’s age (months)	3380	29.73			<0.001
Child’s gender					<0.01
Male	1795	53.11	57.83	42.17
Female	1585	46.89	63.09	36.91
Birth size					<0.001
Average	1215	38.36	65.93	34.07
Large	1337	42.22	60.13	39.87
Small	615	19.42	51.22	48.78
Birth order	3380	3.15			0.069
Duration of breastfeeding					<0.001
Never breastfed	168	7.19	60.71	39.29
<6 months	367	15.71	76.02	23.98
≥6 months	1801	77.10	59.74	40.26
Early breastfeeding					0.280
No	919	41.81	63.87	36.13
Yes	1279	58.19	61.61	38.39
Had diarrhea in the past 2 weeks					0.925
No	2683	84.74	60.27	39.73
Yes	483	15.26	60.04	39.96
Had fever in the past 2 weeks					0.867
No	2489	78.72	60.39	39.61
Yes	673	21.28	60.03	39.97
Maternal characteristics
Maternal age (years)	3380	30.16			0.848
Partner’s age (years)	2961	35.03			0.547
Maternal employment status					<0.001
Not employed	2101	62.59	57.45	42.55
Employed	1256	37.41	64.81	35.19
Partner’s employment status					<0.001
Not employed	1328	43.60	55.72	44.28
Employed	1718	56.40	63.50	36.50
Maternal occupation					<0.001
No occupation	2124	63.46	57.63	42.37
Professional/technical/managerial	161	4.81	78.88	21.12
Clerical	66	1.97	69.70	30.30
Sales	161	4.81	72.67	27.33
Agricultural	560	16.73	56.43	43.57
Services	257	7.68	68.87	31.13
Skilled manual	7	0.21	85.71	14.29
Unskilled manual	11	0.33	54.55	45.45
Maternal marital status					0.456
Never Married/divorced/separated	274	8.11	62.41	37.59
Married/living together	3106	91.89	60.11	39.89
Maternal religion					0.843
Non-Christian/no religion	29	0.86	58.62	41.38
Christian	3343	99.14	60.42	39.58
Maternal education level					<0.001
No education	647	19.14	48.53	51.47
Primary education	1704	50.41	59.10	40.90
Secondary education	918	27.16	69.17	30.83
Higher education	111	3.28	73.87	26.13
Partner’s education level
No education	458	15.17	46.72	53.28
Primary education	1371	45.40	58.35	41.65
Secondary education	953	31.56	64.85	35.15
Higher education	238	7.88	76.05	23.95
Exposure to mass media					<0.001
No	1646	49.03	53.95	46.05
Yes	1711	50.97	66.69	33.31
Maternal age of first birth (years)	3380	21.17			<0.01
Household characteristics
Sex of househead					0.061
Male	2892	85.56	59.65	40.35
Female	488	14.44	64.14	35.86
Household wealth					<0.001
Poorest	556	16.45	46.40	53.60
Poorer	531	15.71	49.91	50.09
Middle	653	19.32	60.49	39.51
Richer	809	23.93	61.80	38.20
Richest	831	24.59	74.61	25.39
Number of under-5 children	3380	3.35			<0.05
Number of household members	3380	6.93			<0.05
Type of toilet facility					<0.001
No facility	683	20.47	61.35	38.65
Unimproved	1579	47.33	54.40	45.60
Improved	1074	32.19	68.53	31.47
Source of drinking water					<0.001
Unimproved	1558	46.18	54.36	45.64
Improved	1816	53.82	65.42	34.58
Type of cooking fuels					<0.001
Polluting fuels	3058	91.64	58.70	41.30
Clean fuels	279	8.36	78.85	21.15
Distance to health facility					<0.001
Not a big problem	1509	45.14	64.88	35.12
Big problem	1834	54.86	56.32	43.68
Community characteristics
Region					<0.001
Southern Region	663	19.62	65.68	34.32
Highlands Region	1043	30.86	41.03	58.97
Momase Region	799	23.64	59.32	40.68
Islands Region	875	25.89	69.37	30.63
Area					<0.001
Rural	2581	76.36	56.95	43.05
Urban	799	23.64	71.09	28.91
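As a worked check of how a p-value in Table 2 can arise, the sketch below rebuilds the 2 × 2 counts for child’s gender from the reported N and row percentages and applies Pearson’s chi-square test. Several assumptions are made for illustration: the test is taken to be an unweighted chi-square test without continuity correction, counts are recovered from rounded percentages, and survey weights are ignored.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

def counts_from_row(n, pct_stunted):
    """Recover (not stunted, stunted) counts from a row total and a rounded percentage."""
    stunted = round(n * pct_stunted / 100)
    return n - stunted, stunted

# Child's gender rows of Table 2: N and percentage stunted.
male_no, male_yes = counts_from_row(1795, 42.17)
female_no, female_yes = counts_from_row(1585, 36.91)

chi2, p_value = chi_square_2x2(male_no, male_yes, female_no, female_yes)
```

Under these assumptions the statistic is close to 9.7 and the p-value is about 0.002, which agrees with the <0.01 shown for child’s gender.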

Table 3. Optimal value of each hyperparameter searched by the optimization strategy.

Models	Optimal Hyperparameters	AUC (training set, cross-validation)
No feature selection
CTree	maxdepth = 5, mincriterion = 0.950	0.639
XGBoost	nrounds = 12, eta = 0.153, gamma = 0.091, subsample = 0.807, colsample bytree = 0.995, maxdepth = 6, min child weight = 5	0.644
SVM-RBF	C = 2⁻⁵, σ = 2⁻¹⁵	0.658
LASSO
CTree	maxdepth = 7, mincriterion = 0.900	0.642
XGBoost	nrounds = 12, eta = 0.012, gamma = 0.199, subsample = 0.694, colsample bytree = 0.811, maxdepth = 7, min child weight = 13	0.653
SVM-RBF	C = 2¹⁵, σ = 2⁻¹⁵	0.671
RF-RFE
CTree	maxdepth = 4, mincriterion = 0.990	0.646
XGBoost	nrounds = 19, eta = 0.149, gamma = 0.058, subsample = 0.909, colsample bytree = 1, maxdepth = 20, min child weight = 18	0.666
SVM-RBF	C = 2⁻¹, σ = 2⁻⁵	0.666
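For readers who want to reuse the tuned values, the sketch below parses the LASSO-XGBoost row of Table 3 into a parameter dictionary laid out for the Python xgboost package. The helper `parse_hyperparameters` is hypothetical, and the `objective` and `eval_metric` entries are assumptions suited to a binary outcome scored by AUC; they are not taken from the table.

```python
def parse_hyperparameters(cell):
    """Turn 'name = value, name = value' text into a typed dictionary."""
    params = {}
    for item in cell.split(","):
        name, value = item.split("=")
        key = name.strip().replace(" ", "_")
        text = value.strip()
        params[key] = float(text) if "." in text else int(text)
    return params

# LASSO-XGBoost row of Table 3.
row = ("nrounds = 12, eta = 0.012, gamma = 0.199, subsample = 0.694, "
       "colsample bytree = 0.811, maxdepth = 7, min child weight = 13")

params = parse_hyperparameters(row)

# The number of boosting rounds is a training argument, not a booster parameter.
num_boost_round = params.pop("nrounds")
# The Python package spells this parameter max_depth.
params["max_depth"] = params.pop("maxdepth")
# Assumed settings for a binary stunted / not stunted outcome.
params.update({"objective": "binary:logistic", "eval_metric": "auc"})

# With xgboost installed and a DMatrix named dtrain (not defined here), one could call:
#   xgboost.train(params, dtrain, num_boost_round=num_boost_round)
```

Keeping the number of rounds separate from the booster parameters mirrors how the package's training function takes its arguments.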

Table 4. Performance summary of the prediction models.

Models (test set)	AUC (95% CI)	Accuracy	Precision	Recall	F1 Score	Threshold
No feature selection
LR	0.728 (0.672–0.785)	0.675	0.731	0.559	0.633	0.370
CTree	0.695 (0.639–0.750)	0.630	0.669	0.515	0.582	0.426
XGBoost	0.744 (0.690–0.798)	0.707	0.762	0.593	0.667	0.400
SVM-RBF	0.704 (0.646–0.761)	0.672	0.692	0.559	0.619	0.363
LASSO
LR	0.730 (0.674–0.787)	0.692	0.708	0.582	0.639	0.391
CTree	0.736 (0.682–0.789)	0.683	0.700	0.572	0.630	0.459
XGBoost	0.767 (0.714–0.819)	0.728	0.715	0.628	0.669	0.487
SVM-RBF	0.722 (0.666–0.778)	0.672	0.677	0.561	0.613	0.346
RF-RFE
LR	0.731 (0.676–0.785)	0.695	0.685	0.589	0.633	0.394
CTree	0.726 (0.672–0.781)	0.681	0.723	0.566	0.635	0.343
XGBoost	0.752 (0.698–0.806)	0.710	0.723	0.603	0.657	0.388
SVM-RBF	0.729 (0.674–0.785)	0.692	0.615	0.597	0.606	0.367
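The columns of Table 4 are linked by simple formulas. The sketch below shows how the F1 score follows from precision and recall, using the LASSO-XGBoost row as a check, and how a classification threshold could be chosen by maximizing Youden's J. The use of Youden's J for the Threshold column is an assumption here, and the labels and scores are toy values rather than study data.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def youden_threshold(labels, scores):
    """Return the cut-point maximizing J = sensitivity + specificity - 1.

    A case is classified as positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j

# Check against the LASSO-XGBoost row: precision 0.715, recall 0.628.
f1_lasso_xgb = f1_score(0.715, 0.628)

# Toy predicted probabilities (hypothetical): 1 = stunted, 0 = not stunted.
labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3]
threshold, j_stat = youden_threshold(labels, scores)
```

The recomputed F1 rounds to the tabulated 0.669. Because the tabulated precision and recall are themselves rounded, recomputed values for other rows can differ from the table in the third decimal place.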

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Shen, H.; Zhao, H.; Jiang, Y. Machine Learning Algorithms for Predicting Stunting among Under-Five Children in Papua New Guinea. Children 2023, 10, 1638. https://doi.org/10.3390/children10101638

