1. Introduction
Understanding the factors that influence child well-being has become an important research priority in public health, developmental psychology, and social policy. Beyond traditional measures of physical health, recent research has increasingly emphasized the concept of child flourishing, which captures positive developmental outcomes such as curiosity, persistence, emotional regulation, and engagement in learning activities. Flourishing indicators aim to measure whether children are developing the social, emotional, and behavioral capacities necessary for long-term well-being and successful life trajectories [1,2].
Large-scale population surveys provide a valuable source of information for studying child well-being. One of the most comprehensive data sources in the United States is the National Survey of Children’s Health (NSCH), which collects nationally representative information about children’s physical health, emotional development, family environments, and access to healthcare services. The survey is conducted annually by the U.S. Census Bureau and funded by the Health Resources and Services Administration (HRSA), and it has been widely used in research on child health disparities and developmental outcomes [3].
Although surveys such as the NSCH contain rich information about children’s lives, they are primarily designed for epidemiological and policy research rather than for machine learning applications. Raw survey datasets typically include logical skip codes, complex sampling structures, and large numbers of categorical variables that require extensive preprocessing before predictive modeling can be performed. As a result, applying machine learning methods to survey data often requires substantial data preparation and methodological decisions that are rarely documented in sufficient detail.
At the same time, the increasing availability of large-scale social datasets has created new opportunities for applying machine learning techniques to the study of human development and well-being. Machine learning methods have been widely used for predictive modeling across a variety of domains, including healthcare, education, and social sciences [4]. However, their application to survey-based datasets raises several methodological challenges. In particular, researchers must decide how to handle survey weights, how to construct predictive outcomes from survey items, how to prevent target–predictor overlap, and how to transform complex survey structures into machine-learning-compatible datasets.
Another important challenge concerns reproducibility. In recent years, reproducibility has become a central concern in computational science and data-driven research. Reproducible research practices require that datasets, analytical pipelines, and computational procedures be transparently documented so that other researchers can replicate the results of a study [5,6]. Despite the growing importance of reproducibility, many studies using large survey datasets still provide limited documentation of the filtering, preprocessing, feature-selection, and evaluation steps required to obtain analytical datasets suitable for machine learning.
The present study addresses these challenges by constructing a machine-learning-ready benchmark dataset derived from the 2023 National Survey of Children’s Health and by providing a fully reproducible pipeline for generating that dataset and conducting baseline predictive analyses. The workflow focuses on a school-age analytical subset for which the flourishing outcome can be defined consistently from four survey items related to curiosity, task persistence, emotional self-regulation, and interest in doing well in school.
In contrast to studies that focus primarily on developing new predictive algorithms, this work emphasizes the creation of a reusable and transparent data resource for computational research. Transforming a complex survey into a machine-learning-ready benchmark can substantially reduce the barriers faced by researchers who wish to apply data-driven methods to questions related to child well-being, while also making the methodological choices behind that transformation explicit.
More specifically, this study makes four main contributions:
1. Reproducible construction of a school-age child flourishing outcome from four NSCH survey items.
2. Release of a machine-learning-ready analytical dataset derived from the 2023 NSCH.
3. A documented feature-selection workflow designed to support benchmarking while avoiding direct overlap between predictors and the target variable.
4. Transparent baseline experiments, including hold-out evaluation, cross-validation, and a cautious comparison between weighted and unweighted learning.
The main value of the study therefore lies not in proposing a new predictive model, but in providing a curated dataset and a reproducible workflow that can be reused, audited, and extended by other researchers. In this sense, the paper is positioned primarily as a data-resource contribution with empirical benchmarking value.
The remainder of the paper is organized as follows. Section 2 describes the methodological framework used to construct the dataset, define the target, select predictors, and implement the reproducible evaluation pipeline. Section 3 summarizes the released dataset and its benchmark structure. Section 4 presents the empirical results, including sample construction, feature-selection results, baseline predictive performance, and the exploratory comparison between weighted and unweighted models. Section 5 discusses the methodological and substantive implications of the findings, together with the main limitations of this study. Finally, Section 6 concludes this paper and outlines directions for future research.
2. Methodology
This study develops a reproducible machine-learning framework for analyzing child flourishing using the 2023 National Survey of Children’s Health (NSCH). The pipeline comprises sample construction, target definition, predictor filtering, feature selection, benchmark modeling, and reproducible export of tables and figures. All steps were implemented through a structured Python workflow that transforms the original public-use survey file into a machine-learning-ready benchmark dataset and automatically generates the analytical artifacts reported in this paper.
2.1. Data Source
The data used in this study come from the publicly available National Survey of Children’s Health (NSCH) 2023 dataset. The dataset was downloaded from the Child and Adolescent Health Measurement Initiative (CAHMI) Data Resource Center for Child and Adolescent Health website (https://www.childhealthdata.org, accessed on 22 October 2025), which provides public access to NSCH survey data and documentation. The specific file used in this study was NSCH_2023e_Topical_CAHMI_DRC.csv, obtained from the CAHMI Data Resource Center portal.
The empirical analysis is based on the 2023 wave of the NSCH, a nationally representative survey conducted annually in the United States to monitor multiple dimensions of child health and well-being. The NSCH is administered by the U.S. Census Bureau and funded by the Health Resources and Services Administration (HRSA). The survey collects information from parents or guardians about children aged 0–17 years and includes variables related to health status, emotional well-being, family characteristics, healthcare access, and social context [3,7].
The original dataset contains 55,162 observations. However, several survey items are only applicable to specific age groups, and some responses correspond to logical skip or nonresponse codes. As a result, explicit sample construction rules are required before defining the analytical outcome and preparing predictors for machine learning.
2.2. Construction of the Child Flourishing Indicator
Child flourishing was operationalized using four NSCH survey items for the school-age population. These items correspond to variables K6Q71_R, K7Q84_R, K7Q85_R, and K7Q82_R, which measure, respectively, interest and curiosity in learning new things, working to finish tasks that are started, staying calm and in control when faced with challenges, and caring about doing well in school.
Following established approaches in the literature on child well-being measurement [1,2], and using the four school-age items summarized in Table 1, we constructed a binary indicator called flourishing_all4. The indicator takes the value 1 when the child is reported as responding “Always” or “Usually” to all four items, and 0 otherwise. This operationalization captures children who consistently display positive developmental behaviors across multiple domains.
The decision to dichotomize the outcome was made for methodological and practical reasons. First, the main goal of the study is to release a transparent and reproducible benchmark dataset for baseline binary classification rather than to model a more complex latent or ordinal flourishing structure. Second, the “all four positive” definition is directly interpretable and closely aligned with prior flourishing-oriented formulations in the child well-being literature, thereby facilitating comparability across studies [1,2]. Third, dichotomization yields a benchmark target that is straightforward to reproduce, audit, and reuse in future machine-learning experiments. This choice does not imply that flourishing is inherently binary; rather, it reflects a deliberate benchmark-design decision intended to prioritize reproducibility, interpretability, and comparability in the present study.
Because the school-related item is only meaningful for the school-age population, the analytical sample was restricted to children aged 6–17 years. Responses coded outside the substantive response range were treated as invalid for outcome construction. Specifically, only response codes 1–4 were considered valid for the four flourishing items, while logical skip and nonresponse codes were excluded before the target was created.
Table 2 summarizes the resulting sample-construction process.
After applying the age-eligibility rule and removing records with invalid responses in any of the four flourishing items, 32,934 valid observations remained for outcome construction. Among these observations, 18,990 children satisfied the flourishing condition and 13,944 did not (see Table 3).
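The target-construction rules described above can be sketched in a few lines of pandas. This is a minimal illustration rather than the released script: the four item names come from the text, whereas the age column name (SC_AGE_YEARS) and the coding 1 = “Always”, 2 = “Usually” are assumptions that should be checked against the NSCH codebook.

```python
import pandas as pd

# Item names are taken from the text. AGE_COL and POSITIVE_CODES are
# assumptions made for this sketch, not details confirmed by the paper.
FLOURISHING_ITEMS = ["K6Q71_R", "K7Q84_R", "K7Q85_R", "K7Q82_R"]
AGE_COL = "SC_AGE_YEARS"        # assumed name of the age-in-years column
VALID_CODES = [1, 2, 3, 4]      # substantive response range stated in the text
POSITIVE_CODES = [1, 2]         # assumed coding: 1 = "Always", 2 = "Usually"


def build_target(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep age-eligible rows with four valid items and add flourishing_all4."""
    items = raw[FLOURISHING_ITEMS].apply(pd.to_numeric, errors="coerce")
    age = pd.to_numeric(raw[AGE_COL], errors="coerce")

    age_ok = age.between(6, 17)                       # inclusive on both ends
    all_valid = items.isin(VALID_CODES).all(axis=1)   # skip/nonresponse -> False

    out = raw.loc[age_ok & all_valid].copy()
    all_positive = items.loc[out.index].isin(POSITIVE_CODES).all(axis=1)
    out["flourishing_all4"] = all_positive.astype(int)
    return out.reset_index(drop=True)


# Toy rows: two are dropped (age 4; an invalid code 99), three are kept.
toy = pd.DataFrame({
    "SC_AGE_YEARS": [8, 10, 4, 12, 17],
    "K6Q71_R": [1, 1, 1, 1, 2],
    "K7Q84_R": [2, 3, 1, 1, 2],
    "K7Q85_R": [1, 1, 1, 99, 2],
    "K7Q82_R": [2, 1, 1, 1, 2],
})
analytic = build_target(toy)
print(analytic[["SC_AGE_YEARS", "flourishing_all4"]])
```

Filtering on validity before deriving the target keeps a record with a skipped item from being silently counted as non-flourishing.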
2.3. Dataset Preparation and Filtering
The dataset preparation process involved several stages designed to ensure analytical consistency, reduce target leakage, and improve reproducibility. First, the raw NSCH file was filtered to retain only age-eligible observations with valid responses in the four flourishing items. Second, a clean analytical identifier (child_id) was created to support reproducible splitting and downstream joins across scripts. Third, candidate predictors were screened to exclude variables that should not be used for benchmark prediction.
In particular, we excluded identifiers, survey-design variables used only for weighting or descriptive purposes, the four source variables used to construct the target, and variables directly derived from flourishing composites or closely related recodings. This step was necessary to prevent direct target–predictor overlap. We also excluded variables with insufficient substantive coverage and variables with no substantive variability in the analytical subset.
The excluded flourishing-related variables comprised the four source items used to construct the benchmark target (K6Q71_R, K7Q84_R, K7Q85_R, and K7Q82_R), the target itself (flourishing_all4), and closely related derived or target-proximal variables, including flrsh6to17ct, flourish6mto17_23, flrish6to17_23, nomFlrish6to17_23, resil6to17_23, curious6to17_23, finishes_23, cares_23, SchlEngage_23, homework_23, and K7Q83_R. Additional age-misaligned or flourishing-related recodings such as flrsh0to5ct, flrish0to5_23, and nomFlrish6mto5_23 were also excluded. This explicit exclusion rule was applied before feature ranking to ensure that the benchmark predictors could not trivially reconstruct the target.
In addition, variables whose usable content was dominated by logical skip patterns, nonresponse categories, or other survey-specific nonsubstantive codes were not retained as benchmark predictors when those coding structures compromised substantive interpretability for machine-learning use. At the variable level, blank strings and nonsubstantive response codes were recoded as missing during preprocessing; at the dataset-construction level, variables with insufficient substantive coverage after this recoding were removed from the benchmark candidate set.
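The exclusion and screening rules above could be expressed roughly as follows. The excluded variable names are those listed in the text; the set of nonsubstantive codes and the coverage threshold are hypothetical placeholders, since the paper does not state the values used.

```python
import pandas as pd

# Variables named in the text as excluded before feature ranking.
TARGET_SOURCE_ITEMS = ["K6Q71_R", "K7Q84_R", "K7Q85_R", "K7Q82_R"]
TARGET_PROXIMAL = [
    "flrsh6to17ct", "flourish6mto17_23", "flrish6to17_23", "nomFlrish6to17_23",
    "resil6to17_23", "curious6to17_23", "finishes_23", "cares_23",
    "SchlEngage_23", "homework_23", "K7Q83_R",
    "flrsh0to5ct", "flrish0to5_23", "nomFlrish6mto5_23",
]
NON_PREDICTORS = ["child_id", "flourishing_all4", "FWC", "STRATUM"]

# Hypothetical values for illustration only; not taken from the paper.
NONSUBSTANTIVE_CODES = [90, 95, 96, 99]
MIN_COVERAGE = 0.50


def filter_predictors(df: pd.DataFrame) -> pd.DataFrame:
    """Drop leakage-prone columns, recode nonsubstantive values, screen columns."""
    excluded = set(TARGET_SOURCE_ITEMS) | set(TARGET_PROXIMAL) | set(NON_PREDICTORS)
    X = df[[c for c in df.columns if c not in excluded]]

    # Blank or non-numeric strings become NaN through coercion.
    X = X.apply(pd.to_numeric, errors="coerce")
    # Survey-specific nonsubstantive codes are recoded as missing.
    X = X.mask(X.isin(NONSUBSTANTIVE_CODES))

    coverage = X.notna().mean()              # share of substantive values
    varies = X.nunique(dropna=True) > 1      # at least two observed values
    return X.loc[:, (coverage >= MIN_COVERAGE) & varies]


toy = pd.DataFrame({
    "child_id": [1, 2, 3, 4],
    "flourishing_all4": [1, 0, 1, 0],
    "FWC": [310.2, 95.0, 1210.7, 88.4],
    "K6Q71_R": [1, 3, 2, 4],
    "finishes_23": [1, 2, 1, 2],
    "kept_item": ["1", "2", " ", "1"],
    "mostly_skipped": [99, 99, 95, 1],
    "constant_item": [1, 1, 1, 1],
})
predictors = filter_predictors(toy)
print(list(predictors.columns))
```

A single global list of nonsubstantive codes is a simplification: a count variable could legitimately take one of those values, so per-variable code lists drawn from the codebook would be safer in practice.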
Although the original NSCH dataset contains hundreds of variables, many are redundant, highly sparse, age-specific, or unsuitable for benchmark predictive modeling. Reducing the dimensionality of the candidate predictor set improves interpretability and decreases the risk of unstable models or artificial signal inflation [4]. The resulting reduced dataset therefore represents a curated subset of variables selected for their analytical usefulness and their suitability for reproducible machine-learning experiments.
2.4. Feature Selection and Benchmark Dataset Construction
Feature relevance was evaluated using mutual information, a nonparametric measure of statistical dependency that quantifies the information shared between each predictor and the target outcome [8]. Mutual information was chosen because the candidate predictors include mixed survey-coded variables and because it can capture both linear and nonlinear associations without requiring a specific predictive model.
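For reference, the standard definition of mutual information for a discrete predictor and the binary target can be written as below. The symbols are introduced here for illustration: X denotes one survey-coded predictor with value set \mathcal{X}, and Y denotes flourishing_all4.

```latex
I(X;Y) \;=\; \sum_{x \in \mathcal{X}} \sum_{y \in \{0,1\}}
  p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```

The quantity is nonnegative and equals zero exactly when X and Y are independent. In practice the probabilities are replaced by estimates from the training data, so the resulting scores are themselves estimates.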
We selected mutual information as the primary filter method for three reasons. First, unlike simple correlation-based screening, it is not restricted to linear dependence and is therefore better suited to heterogeneous survey-coded predictors that may relate to the target in nonlinear ways. Second, compared with correlation analysis, mutual information can be applied more flexibly to mixed discrete and ordinal survey variables without assuming a monotonic or approximately Gaussian association structure. Third, unlike recursive feature elimination (RFE), which is inherently model-dependent, mutual information provides a model-agnostic ranking criterion that is more appropriate for a benchmark paper whose main contribution is a reusable dataset rather than optimization of a single classifier [4,8,9]. For this reason, mutual information was used as the primary ranking tool, whereas more model-specific selectors such as RFE are left for future benchmarking extensions.
To avoid information leakage, feature selection was performed only on the training partition after the train/test split had been established. Candidate predictors were first cleaned by recoding blank strings and survey-specific nonsubstantive codes as missing values, coercing all variables to numeric form, and imputing the remaining missing values with −1 for downstream feature ranking and model fitting.
All predictors in the released benchmark were represented using their native NSCH survey-coded values and then coerced to numeric form for modeling. No one-hot encoding was applied in the baseline pipeline. This choice reflects the aim of providing a simple, transparent, and fully reproducible benchmark representation across models rather than an exhaustively engineered feature space. Accordingly, the baseline encoding should be interpreted as a pragmatic benchmark decision, not as a claim that every survey-coded predictor should necessarily be treated as ordinal in all downstream applications.
The value −1 was chosen as the imputation sentinel because the substantive NSCH response codes retained in the benchmark are nonnegative, which makes −1 a convenient out-of-range marker that preserves the distinction between observed and imputed entries in a dense design matrix. This simple deterministic imputation keeps the baseline pipeline reproducible and identical across the feature-ranking and model-fitting steps. It should therefore be interpreted as a benchmark-oriented preprocessing choice rather than as a universally optimal missing-data strategy.
The analytical subset was divided into training and test partitions using a stratified 70/30 split with a fixed random seed. This procedure yielded 23,053 training observations and 9881 test observations. Mutual information scores were then computed using only the training partition.
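A minimal sketch of this step with scikit-learn is shown below, using synthetic data in place of the NSCH predictors. The seed value and the choice to treat every predictor as discrete (discrete_features=True) are assumptions; the text specifies only a stratified 70/30 split, a fixed seed, −1 imputation, and scoring on the training partition.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

SEED = 42  # assumption: the paper states only that the seed was fixed


def split_and_rank(X: pd.DataFrame, y, seed: int = SEED):
    """Stratified 70/30 split, then mutual information on the training part only."""
    X_imp = X.fillna(-1)  # constant sentinel, so no information crosses the split
    X_train, X_test, y_train, y_test = train_test_split(
        X_imp, y, test_size=0.30, stratify=y, random_state=seed
    )
    scores = mutual_info_classif(
        X_train, y_train, discrete_features=True, random_state=seed
    )
    ranking = pd.Series(scores, index=X_train.columns).sort_values(ascending=False)
    return ranking, (X_train, X_test, y_train, y_test)


# Synthetic stand-in: one informative survey-coded item and one noise item.
rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
agrees = rng.random(n) < 0.9
X = pd.DataFrame({
    "informative_item": np.where(agrees, y, 1 - y) + 1,   # codes 1 and 2
    "noise_item": rng.integers(1, 5, size=n),              # codes 1 to 4
}).astype(float)
X.iloc[::50, 1] = np.nan                                   # a few missing values

ranking, (X_train, X_test, y_train, y_test) = split_and_rank(X, y)
print(ranking)
```

Because the imputation value is a fixed constant, applying it before the split does not leak test information; a data-dependent imputer would instead need to be fitted on the training partition alone.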
Rather than selecting the final number of predictors arbitrarily, we evaluated a grid of candidate subset sizes using five-fold stratified cross-validation on the training partition. A logistic regression baseline was used to assess predictive stability as the number of top-ranked variables increased. The final benchmark subset was defined as the smallest feature set whose mean cross-validated ROC-AUC was within a tolerance of 0.002 of the best observed value. Under this criterion, the final benchmark dataset retained 150 predictors.
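The selection rule in the preceding paragraph reduces to a short function. The AUC values in the example are invented toy numbers used only to exercise the rule; they are not results from the study.

```python
from typing import Mapping


def smallest_k_within_tolerance(
    mean_auc_by_k: Mapping[int, float], tol: float = 0.002
) -> int:
    """Return the smallest subset size whose mean CV AUC is within tol of the best."""
    best = max(mean_auc_by_k.values())
    eligible = [k for k, auc in mean_auc_by_k.items() if auc >= best - tol]
    return min(eligible)


# Toy values for illustration only (not taken from the paper).
toy_curve = {10: 0.700, 20: 0.750, 40: 0.7985, 80: 0.800}
chosen_k = smallest_k_within_tolerance(toy_curve)
print(chosen_k)
```

The tolerance trades a negligible loss in mean ROC-AUC for a smaller feature set; in the toy curve, the size-40 subset is preferred over size 80 because it falls within 0.002 of the best value.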
To evaluate potential multicollinearity among the selected predictors, we computed a correlation matrix for the top benchmark variables and visualized it using a correlation heatmap (Figure 1). The heatmap shows that many pairwise correlations are modest, although several conceptually related variables exhibit moderate associations. This pattern is expected in survey-based data and does not suggest pervasive redundancy severe enough to invalidate interpretable baseline models [9].
The final reduced benchmark subset therefore provides a compact and performance-preserving representation of the survey data suitable for reproducible baseline machine-learning experiments.
2.5. Machine Learning Models and Evaluation Protocol
Two baseline machine learning models were implemented to evaluate the predictability of the flourishing indicator: logistic regression and random forest classifiers. Logistic regression is a widely used statistical learning method for binary classification and provides interpretable coefficients describing the relationship between predictors and the outcome variable [10]. Random forests are ensemble learning methods that combine multiple decision trees to improve predictive performance and robustness against overfitting [11].
The models were evaluated using accuracy, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC). These metrics provide complementary perspectives on model performance and are commonly used in machine learning evaluations [12].
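The three metrics can be computed with scikit-learn as sketched below. The 0.5 decision threshold used for accuracy and F1 is an assumption, since the text does not state one; ROC-AUC is computed from the predicted probabilities and does not depend on a threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Accuracy and F1 from thresholded labels; ROC-AUC from the raw scores."""
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)  # assumed threshold of 0.5
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
    }


# Tiny hand-checkable example: 3 of 5 labels correct, 5 of 6 pairs ranked correctly.
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.4, 0.8, 0.9]
metrics = evaluate(y_true, y_prob)
print(metrics)
```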
For logistic regression, predictors were standardized before model fitting, and the classifier was estimated using the liblinear solver with max_iter = 10,000. For random forest, we used 500 trees with a fixed random seed. Logistic regression was fitted on the cleaned numeric matrix after standardization, whereas random forest was trained on the same cleaned numeric matrix without scaling. Model selection was not the objective of the study; accordingly, these algorithms were used as transparent baselines rather than as heavily tuned predictive systems.
Model stability was assessed through five-fold stratified cross-validation performed on the training partition only. Final model performance was then reported on the held-out test partition. This separation between feature selection, cross-validation, and final hold-out evaluation was introduced to ensure that the reported benchmark results remained free of test-set leakage.
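The baseline configuration and evaluation protocol can be sketched as follows. The solver, iteration limit, tree count, 70/30 split, and five-fold stratified cross-validation follow the text; the seed value, the shuffling of folds, and the synthetic data are assumptions for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # assumption: the paper states only that the seed was fixed


def build_baselines(n_trees: int = 500, seed: int = SEED) -> dict:
    """Standardized logistic regression and an unscaled random forest."""
    logit = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver="liblinear", max_iter=10_000),
    )
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    return {"logistic_regression": logit, "random_forest": forest}


# Synthetic stand-in for the cleaned numeric benchmark matrix.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=SEED
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
cv_auc, test_auc = {}, {}
# Fewer trees than the 500 used in the paper, only to keep this demo fast.
for name, model in build_baselines(n_trees=50).items():
    # Cross-validation touches the training partition only.
    cv_auc[name] = cross_val_score(
        model, X_train, y_train, cv=cv, scoring="roc_auc"
    ).mean()
    # The hold-out partition is used once, after fitting on the full training set.
    model.fit(X_train, y_train)
    test_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(cv_auc, test_auc)
```

Placing the scaler inside the pipeline means it is refitted within each training fold, so scaling statistics are never computed from validation or test rows.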
2.6. Survey Weighting Experiment
A key methodological question addressed in this study concerns the role of survey weights in machine-learning models trained on complex survey data. Survey weights are designed to ensure that statistical estimates are representative of the underlying population by correcting for sampling design and nonresponse bias [13].
However, the objectives of predictive modeling differ from those of population inference. While survey weights improve population representativeness for inferential purposes, they do not necessarily improve discriminative performance in predictive tasks. To examine this issue empirically, we trained both logistic regression and random forest models under two conditions: with and without survey weights.
This comparison was treated as a secondary methodological experiment rather than as the central contribution of the study. The aim was to document how weighting affected the benchmark models under the fixed analytical pipeline described above.
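One common way to run such a comparison in scikit-learn is to pass the survey weight as sample_weight when fitting, as sketched below. This is an assumption about the mechanics rather than a description of the released scripts: the text does not say how the weights entered training or whether the evaluation metrics were also weighted. The synthetic weights stand in for FWC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_with_optional_weights(model, X, y, weights=None):
    """Fit a model, routing sample_weight to the final estimator if given."""
    if weights is None:
        return model.fit(X, y)
    if hasattr(model, "steps"):  # Pipeline: address the last step by its name
        final_step = model.steps[-1][0]
        return model.fit(X, y, **{f"{final_step}__sample_weight": weights})
    return model.fit(X, y, sample_weight=weights)


# Synthetic data; w is a stand-in for the survey weight FWC.
rng = np.random.default_rng(0)
n, n_train = 400, 300
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
w = rng.uniform(0.5, 5.0, size=n)

factories = {
    "logistic_regression": lambda: make_pipeline(
        StandardScaler(), LogisticRegression(solver="liblinear", max_iter=10_000)
    ),
    "random_forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, make_model in factories.items():
    for label, weights in {"unweighted": None, "weighted": w[:n_train]}.items():
        model = fit_with_optional_weights(
            make_model(), X[:n_train], y[:n_train], weights
        )
        prob = model.predict_proba(X[n_train:])[:, 1]
        results[(name, label)] = roc_auc_score(y[n_train:], prob)

print(results)
```

In this sketch only the classifier receives the weights; the scaler is fitted unweighted, and the test ROC-AUC is computed without weights. Either choice could reasonably be made differently.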
2.7. Geographic Information
The curated dataset preserves the variable FIPSST, which identifies the U.S. state associated with each observation. Retaining this geographic identifier increases the analytical flexibility of the released dataset and allows future descriptive or contextual analyses of regional variation in child flourishing.
However, geographic comparisons were not treated as a primary analytical contribution in the benchmark design. The state identifier was therefore preserved mainly as a contextual variable for future reuse rather than as the basis for strong substantive claims in the main empirical analysis.
2.8. Reproducible Research Pipeline
All data preparation, target construction, predictor filtering, feature selection, model training, cross-validation, and table-generation procedures were implemented through a structured Python pipeline composed of modular scripts. The workflow automatically performs dataset cleaning, benchmark-set construction, model evaluation, and LaTeX export of tables and figures.
Reproducibility is a fundamental principle of modern computational research. Providing transparent and reproducible data-processing workflows improves scientific reliability and facilitates the reuse of research outputs by other scholars [5,6]. The automated pipeline developed in this study ensures that the results reported in the paper can be regenerated directly from the original dataset using the released scripts.
3. Data Description
This study is based on the 2023 wave of the National Survey of Children’s Health (NSCH), a large-scale survey designed to monitor the health and well-being of children in the United States. As described in Section 2, the raw NSCH public-use file contains 55,162 observations and a broad set of demographic, socioeconomic, healthcare, and behavioral variables [3,7]. However, the main contribution of the present work is not the raw survey itself, but a curated analytical resource derived from it through a reproducible benchmark-construction pipeline.
The released resource corresponds to a school-age analytical subset for which the flourishing outcome can be defined consistently from four valid survey items. As shown in Table 2, restricting the data to age-eligible observations and valid responses in all four flourishing items yields a final analytical dataset of 32,934 observations. This distinction is important: the released benchmark dataset should be interpreted as a reproducibly constructed analytical subset of NSCH 2023 rather than as a direct machine-learning translation of the entire raw survey.
3.1. Released Benchmark Dataset
One of the primary contributions of this work is the release of a machine-learning-ready benchmark dataset derived from the NSCH survey. The final benchmark files are generated automatically by the reproducible pipeline and include cleaned identifiers, the derived target, preserved survey-design variables, and the selected predictor subsets used in the baseline experiments.
The released dataset includes the following key variables:
- child_id: reproducible analytical identifier for each observation
- flourishing_all4: binary flourishing outcome
- FWC: survey sampling weight
- STRATUM: survey stratification variable
- FIPSST: state-level geographic identifier
- benchmark predictor variables retained after filtering and feature selection
In practical terms, the released benchmark resource is organized into multiple layers. First, a cleaned analytical dataset is created after age filtering and valid outcome construction. Second, a reduced machine-learning-ready predictor set is generated after excluding identifiers, direct target source items, and target-proximal derived variables. Third, a final benchmark subset is produced after mutual-information-based ranking on the training partition and cross-validated subset-size selection. This organization allows other researchers to reuse either the final benchmark dataset or intermediate reproducibility artifacts, depending on their objectives.
3.2. Benchmark Structure and Analytical Scope
The final benchmark subset retains 150 predictors selected to preserve predictive signal while reducing dimensionality and direct target overlap. The selected variables cover multiple aspects of child context, including health-related indicators, family and household conditions, functional difficulties, school-related measures, and broader developmental characteristics. In this sense, the benchmark is not limited to a single conceptual domain, but instead reflects the multidimensional nature of child flourishing.
At the same time, the released benchmark should not be interpreted as a population-inference dataset in the usual survey-statistical sense. Because the analytical subset depends on age eligibility and valid outcome measurement, it reflects a filtered study population. This makes the dataset especially useful for reproducible computational benchmarking while also requiring caution when extrapolating beyond the analytically eligible sample.
3.3. Descriptive Characteristics
Table 4 reports descriptive statistics for the key variables retained in the benchmark dataset, including the flourishing outcome, the survey weight, the stratification variable, and the state identifier. Table 3 reports the class distribution of flourishing_all4. The target is not perfectly balanced, but both classes are well represented in the final analytical subset, which makes the benchmark suitable for standard binary classification experiments without extreme class-imbalance corrections.
To support transparency and reuse, the project also exports a codebook describing the variables included in the benchmark dataset, their data types, missing-value counts, and observed cardinalities. Although the full codebook is not reproduced in the main text, it is made available as part of the released project artifacts.
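A codebook with the fields mentioned above (data type, missing-value count, and observed cardinality) can be derived directly from a data frame. The sketch below shows one possible layout; the column names of the codebook and the toy values are assumptions and may differ from the exported file.

```python
import pandas as pd


def build_codebook(df: pd.DataFrame) -> pd.DataFrame:
    """One row per variable: dtype, number of missing values, distinct values."""
    return pd.DataFrame({
        "variable": list(df.columns),
        "dtype": [str(t) for t in df.dtypes],
        "n_missing": df.isna().sum().to_numpy(),
        "n_unique": df.nunique(dropna=True).to_numpy(),
    })


# Toy frame for illustration; values are not taken from the released dataset.
toy = pd.DataFrame({
    "child_id": [1, 2, 3],
    "flourishing_all4": [1, 0, 1],
    "K8Q31": [1.0, None, 2.0],
})
codebook = build_codebook(toy)
print(codebook)
```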
3.4. Contextual and Geographic Variables
The dataset preserves selected survey-design and contextual fields that may support future methodological extensions. In particular, FWC and STRATUM are retained to enable weighted versus unweighted modeling comparisons and related methodological analyses. Likewise, FIPSST is preserved as a state-level contextual identifier.
Although this paper does not report geographic ranking results, preserving the state identifier increases the long-term analytical value of the released resource. Future work may use this field to integrate external contextual data or to conduct descriptive analyses of regional heterogeneity in child well-being.
3.5. Dataset Availability and Reproducibility
Another key contribution of this work is the development of a fully reproducible data-preparation and benchmarking pipeline. The curated dataset is not produced manually; instead, it is generated automatically through a sequence of Python scripts that perform sample construction, target generation, predictor filtering, feature selection, model training, cross-validation, and table and figure generation.
Reproducibility is increasingly recognized as a fundamental requirement in computational science and data-driven research [5,6]. By releasing both the processed dataset and the scripts used to generate it, this work enables other researchers to reproduce the complete analytical workflow and extend the analysis in future studies.
The curated dataset and the complete reproducible pipeline used in this study are publicly available in the project repository (https://github.com/miguelarcosa/NSCH-CuratedDataSet, accessed on 18 March 2026). The repository contains the scripts used for sample construction, feature selection, machine-learning experiments, and automatic generation of the tables and figures reported in this article. Providing the full computational workflow together with the processed benchmark dataset enables independent regeneration, auditing, and extension of the released analytical resource.
Taken together, the curated benchmark dataset and reproducible pipeline transform the NSCH survey into a reusable resource for machine-learning research on child flourishing and related methodological studies.
4. Results
This section presents the empirical results obtained from the NSCH-derived benchmark dataset and the reproducible machine-learning pipeline described in the previous sections. The results are organized into four components: (i) sample construction and descriptive characteristics of the released benchmark dataset, (ii) feature-selection results, (iii) baseline predictive performance, and (iv) the exploratory comparison between weighted and unweighted learning.
4.1. Sample Construction and Dataset Characteristics
The benchmark-construction pipeline transformed the raw NSCH 2023 file into a school-age analytical dataset suitable for machine-learning experiments. The workflow first restricted the data to age-eligible children and then retained only records with valid responses in all four flourishing items.
Table 2 summarizes the resulting sample-construction process.
Starting from 55,162 observations in the raw NSCH 2023 file, the benchmark-construction workflow retained 33,638 age-eligible observations for children aged 6–17 years. After removing records with invalid responses in any of the four flourishing items, the final analytical dataset contained 32,934 observations. This benchmark subset therefore represents a large and analytically coherent school-age sample rather than a narrowly filtered residual subset.
Table 3 reports the class distribution of flourishing_all4, and Table 4 reports the descriptive characteristics of the benchmark dataset. In the final analytical sample, 18,990 observations were classified as flourishing and 13,944 were classified as non-flourishing, corresponding to proportions of 0.5766 and 0.4234, respectively. The target is therefore somewhat imbalanced, but both classes remain well represented for standard binary classification experiments.
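The class proportions follow directly from the reported counts, and the same arithmetic gives two trivial reference points for reading the baseline metrics. These reference values are derived here from the full analytical sample and are not results reported by the study; the stratified test partition should have nearly the same class shares.

```python
# Counts reported in the text for the final analytical sample.
n_flourishing = 18_990
n_non_flourishing = 13_944
n_total = n_flourishing + n_non_flourishing

p_pos = n_flourishing / n_total
p_neg = n_non_flourishing / n_total

# A classifier that always predicts "flourishing" has accuracy equal to p_pos,
# precision equal to p_pos, and recall equal to 1, hence F1 = 2p / (1 + p).
majority_accuracy = max(p_pos, p_neg)
always_positive_f1 = 2 * p_pos / (1 + p_pos)

print(n_total, round(p_pos, 4), round(p_neg, 4))
print(round(majority_accuracy, 4), round(always_positive_f1, 4))
```

Any accuracy or F1-score reported for a trained model is best read against these trivial values rather than against 0.5.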
In addition to the binary flourishing target, the released benchmark dataset preserves key contextual and survey-design variables such as FWC, STRATUM, and FIPSST. Preserving these fields increases the flexibility of the benchmark for future methodological extensions while keeping the present analysis focused on transparent baseline prediction.
4.2. Feature Selection Results
To construct a compact benchmark dataset, feature relevance was evaluated using mutual information computed on the training partition only. This procedure was adopted to avoid test-set leakage and to ensure that the final ranking reflected only information available during model development [8].
Table 5 reports the top 15 predictors from the final benchmark subset.
The highest-ranked variables include indicators such as K8Q31, DiffCare_23, K8Q32, bother_23, and MedRiskct_23. More generally, the selected predictors span multiple domains, including health complexity, functional difficulties, emotional or behavioral burden, school-related functioning, and broader child-context variables. This pattern is consistent with the multidimensional nature of child flourishing and suggests that useful predictive signal is distributed across several domains rather than concentrated in a single factor [1, 2].
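The training-only mutual information ranking described in this subsection can be sketched as follows. This is a minimal illustration using only the standard library and synthetic data. The feature names (strong_signal, weak_signal, pure_noise), the 80/20 split, and the plug-in estimator for discrete variables are assumptions made for the example, not the implementation used to build the benchmark.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def rank_by_mutual_information(rows, feature_names, target):
    """Return (feature, score) pairs sorted from most to least informative."""
    ys = [row[target] for row in rows]
    scores = {
        name: mutual_information([row[name] for row in rows], ys)
        for name in feature_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic stand-in for a categorical survey extract (hypothetical columns).
rng = random.Random(0)
rows = []
for _ in range(2000):
    y = rng.randint(0, 1)
    rows.append({
        "target": y,
        "strong_signal": y if rng.random() < 0.80 else 1 - y,
        "weak_signal": y if rng.random() < 0.55 else 1 - y,
        "pure_noise": rng.randint(0, 3),
    })

# Split first, then rank on the training rows only, so the held-out rows
# never influence which features are kept.
rng.shuffle(rows)
split = int(0.8 * len(rows))
train_rows, test_rows = rows[:split], rows[split:]

features = ["strong_signal", "weak_signal", "pure_noise"]
ranking = rank_by_mutual_information(train_rows, features, "target")
for name, score in ranking:
    print(f"{name:15s} {score:.4f}")
```

The key design point is the order of operations: the partition is fixed before any relevance score is computed. A library estimator such as scikit-learn's mutual_info_classif could be substituted for the hand-written function; the text does not state which estimator was used.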
Using the cross-validated subset-size selection procedure described in Section 2, the final benchmark dataset retained 150 predictors. This reduced representation preserves predictive signal while substantially lowering dimensionality relative to the broader candidate pool.
4.3. Baseline Machine-Learning Performance
To evaluate the predictive potential of the benchmark dataset, we implemented logistic regression and random forest as transparent baseline classifiers. These models were selected because they represent widely used baseline approaches in predictive modeling and provide complementary perspectives on linear and nonlinear classification behavior [4, 11].
Table 6 reports the hold-out test performance of the baseline models, and Table 7 reports the corresponding five-fold cross-validation mean ROC-AUC values computed on the training partition only.
The results indicate that the revised benchmark dataset supports stable and reasonably strong predictive performance. On the held-out test partition, logistic regression without survey weights achieved an accuracy of 0.7727, an F1-score of 0.8147, and a ROC-AUC of 0.8470. Random forest without survey weights achieved very similar performance, with an accuracy of 0.7742, an F1-score of 0.8131, and a ROC-AUC of 0.8447.
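For readers less familiar with the three reported metrics, the following sketch computes accuracy, the F1-score, and ROC-AUC from scratch on a small toy example. The labels, scores, and the 0.5 decision threshold are illustrative assumptions and are unrelated to the benchmark results.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Uses the rank-sum (Mann-Whitney) formulation with average ranks for ties.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, t in zip(ranks, y_true) if t == positive)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: 5 positives, 3 negatives, hypothetical predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.90, 0.80, 0.40, 0.35, 0.60, 0.70, 0.20, 0.55]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

print("accuracy:", accuracy(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("ROC-AUC: ", roc_auc(y_true, scores))
```

Accuracy and the F1-score depend on the chosen threshold, whereas ROC-AUC depends only on the ordering of the scores. When the positive class is the majority, as it is for the flourishing target, an F1-score above the accuracy is not unusual.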
The cross-validation results are consistent with the hold-out evaluation. Mean ROC-AUC on the training partition ranged from 0.8373 to 0.8451 across the four model configurations, indicating that the benchmark signal is stable across resampling folds and that the reported hold-out results are not driven by an anomalous split. These results also suggest that the final 150-feature benchmark subset preserves substantial predictive information.
Importantly, the main value of the paper does not lie in claiming a single superior predictive model. Instead, the empirical results show that the released benchmark dataset supports reproducible and nontrivial prediction of school-age child flourishing using standard baseline models. In that sense, the results provide an empirical anchor for the data-resource contribution of the paper.
4.4. Effect of Survey Weighting on Predictive Performance
A secondary objective of the study was to explore how survey weighting affects benchmark predictive performance in this NSCH-based setting [13]. The weighted and unweighted results should therefore be interpreted as a methodological comparison within the present benchmark rather than as a universal conclusion about all survey-based machine-learning applications.
For logistic regression, the weighted model produced slightly lower performance than the unweighted version on the held-out test set, with the ROC-AUC decreasing from 0.8470 to 0.8394 and the F1-score decreasing from 0.8147 to 0.8098. A similar pattern was observed in cross-validation, where the mean ROC-AUC decreased from 0.8443 to 0.8373.
For random forest, the differences between weighted and unweighted fitting were smaller. On the held-out test set, the weighted model produced a slightly lower ROC-AUC than the unweighted model (0.8440 versus 0.8447), while accuracy and the F1-score were marginally higher. In cross-validation, the mean ROC-AUC values for the weighted and unweighted random forest models were nearly identical (0.8447 and 0.8451, respectively).
Taken together, these results suggest that weighting did not improve discrimination as measured by ROC-AUC in the present benchmark setting, although the magnitude of the difference depended on the model and was modest for random forest. Accordingly, we adopt a cautious interpretation: survey weights remain important for inferential representativeness, but their effect on predictive discrimination in this benchmark is limited and not uniformly detrimental across all evaluation metrics.
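The size of the weighting effect can be made explicit by differencing the values reported in this subsection. The figures below are copied from the text, and the dictionary layout is only an illustrative convenience.

```python
# (unweighted, weighted) pairs as reported in Sections 4.3 and 4.4.
reported = {
    ("logistic_regression", "holdout_roc_auc"): (0.8470, 0.8394),
    ("logistic_regression", "holdout_f1"): (0.8147, 0.8098),
    ("logistic_regression", "cv_mean_roc_auc"): (0.8443, 0.8373),
    ("random_forest", "holdout_roc_auc"): (0.8447, 0.8440),
    ("random_forest", "cv_mean_roc_auc"): (0.8451, 0.8447),
}

# Negative values mean the weighted model scored lower than the unweighted one.
deltas = {
    key: round(weighted - unweighted, 4)
    for key, (unweighted, weighted) in reported.items()
}

for (model, metric), delta in deltas.items():
    print(f"{model:20s} {metric:16s} weighted - unweighted = {delta:+.4f}")
```

Every listed difference is negative, but the logistic regression gaps (about 0.005 to 0.008) are roughly an order of magnitude larger than the random forest gaps (below 0.001).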
4.5. Geographic Information Preserved for Future Use
The released benchmark dataset preserves the state identifier FIPSST, which may support future descriptive or contextual extensions. However, geographic comparisons are not emphasized in the revised main results because the primary contribution of the paper is the benchmark dataset and the reproducible analytical workflow, not a state-ranking exercise. For this reason, geographic outputs are treated as supplementary extensions rather than as central empirical findings.
5. Discussion
The primary objective of this study was not to introduce a new machine-learning algorithm, but rather to construct a reproducible benchmark dataset and analytical pipeline for studying child flourishing using the 2023 National Survey of Children’s Health (NSCH). The revised results provide several insights regarding the predictability of child flourishing, the methodological implications of applying machine learning to survey data, and the value of releasing curated and reproducible data resources for computational research.
5.1. Predictability of Child Flourishing
One of the central findings of the revised analysis is that school-age child flourishing can be predicted with reasonably strong and stable baseline performance when the outcome is constructed consistently, leakage is controlled, and the benchmark is built on a large analytical subset. In the present benchmark, the best-performing baseline configurations achieved held-out ROC-AUC values around 0.84–0.85, with closely aligned cross-validation results on the training partition. These results are clearly stronger than the preliminary values reported in the earlier version of the workflow, and they indicate that the released dataset contains substantial predictive signal.
At the same time, these results should be interpreted carefully. Even though the benchmark supports strong baseline discrimination, child flourishing remains a multidimensional construct influenced by emotional development, family environment, educational context, and broader social conditions [1, 2, 14]. The fact that competitive performance can be achieved with transparent baseline models does not imply that flourishing is reducible to a single narrow set of predictors. Rather, it suggests that a reproducibly constructed school-age benchmark can capture meaningful patterns related to flourishing while still leaving room for richer contextual and methodological extensions.
This interpretation is consistent with previous work showing that predictive modeling of social and behavioral outcomes depends on the quality of feature construction, sample definition, and evaluation design, rather than only on algorithmic complexity [4, 15]. In that sense, the revised findings reinforce the importance of well-documented benchmark construction in addition to model choice.
5.2. Machine Learning and Survey Data
A second important contribution of this study concerns the interaction between machine learning models and complex survey data. The NSCH survey was designed primarily for population-level statistical inference rather than predictive modeling [3]. This creates a methodological tension when the same dataset is repurposed for machine-learning benchmarks.
In the revised experiments, survey weighting did not improve discrimination as measured by ROC-AUC for either logistic regression or random forest. For logistic regression, weighting was associated with a modest but consistent reduction in ROC-AUC in both hold-out evaluation and cross-validation. For random forest, the differences were smaller: ROC-AUC remained slightly lower under weighting, whereas accuracy and F1-score changed only marginally and were not uniformly worse across all settings.
This pattern highlights an important methodological distinction between two analytical objectives. In survey statistics, weights are used to improve representativeness of population estimates. In predictive modeling, however, the objective is to optimize out-of-sample classification performance rather than to estimate population parameters. Because weighting may alter the effective learning objective and variance structure, its effect on prediction is not guaranteed to be beneficial. Similar tensions between design-based inference and predictive modeling have been discussed in the survey-methodology and applied-statistics literature [13, 16].
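One way to make the phrase "alter the effective learning objective and variance structure" concrete is to write the two fitting criteria side by side. The formulation below is a sketch under the assumption that survey weights enter the model as per-observation sample weights in the training loss, which the text does not state explicitly. The symbols (the loss \ell, the model f_\theta, the weights w_i) are introduced here for illustration.

```latex
% Unweighted fit: every sampled child contributes equally to the loss.
\hat{\theta}_{\mathrm{unw}}
  = \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, f_{\theta}(x_i)\bigr)

% Weighted fit: each child contributes in proportion to the survey weight w_i.
\hat{\theta}_{\mathrm{w}}
  = \arg\min_{\theta} \; \frac{\sum_{i=1}^{n} w_i \, \ell\bigl(y_i, f_{\theta}(x_i)\bigr)}{\sum_{i=1}^{n} w_i}

% Kish's approximate effective sample size under unequal weights.
n_{\mathrm{eff}} = \frac{\bigl(\sum_{i=1}^{n} w_i\bigr)^{2}}{\sum_{i=1}^{n} w_i^{2}} \;\le\; n
```

The weighted criterion targets the population distribution rather than the sample distribution, and highly variable weights reduce the effective sample size. If the test metric is computed without weights, a model tuned to the weighted objective is optimized for a slightly different target than the one it is scored on, which is one plausible reading of the small ROC-AUC differences reported here.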
Accordingly, the weighting comparison in this study should be interpreted as a benchmark-specific methodological observation rather than as a universal rule. We therefore adopt a cautious position: survey weights remain essential for many inferential purposes, but their contribution to predictive discrimination in this benchmark is limited and model-dependent.
5.3. Value of a Machine-Learning-Ready Survey Benchmark
A central contribution of this work is the release of a curated machine-learning-ready benchmark dataset derived from the NSCH survey. Although the original NSCH dataset is a valuable resource for epidemiological and population-health research, its raw structure is not optimized for machine-learning applications.
Transforming complex datasets into reusable research resources is increasingly recognized as a meaningful contribution in data-driven science [5]. The benchmark released in this study addresses several practical challenges commonly encountered when applying machine learning to survey data, including age-specific applicability of outcome items, the presence of logical skip and nonresponse codes, the need to control target–predictor overlap, and the absence of a directly reusable benchmark target in the raw file.
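The kind of preprocessing these challenges call for can be sketched as a single filtering pass that enforces age eligibility, drops records with invalid item codes, and derives a binary target. Everything specific in the example is an assumption made for illustration: the field names (age_years and the four item_* fields), the set of valid response codes, and the rule that a child is coded as flourishing only when all four items fall in a hypothetical "positive" set. The actual item names, codes, and threshold are those defined in the outcome-construction section of the paper.

```python
AGE_MIN, AGE_MAX = 6, 17  # school-age range used in the benchmark

# Hypothetical field names for the four flourishing items.
ITEMS = (
    "item_curiosity",
    "item_persistence",
    "item_self_regulation",
    "item_school_interest",
)
VALID_CODES = {1, 2, 3, 4}   # assumed substantive response codes
POSITIVE_CODES = {1, 2}      # assumed codes counted as a positive response


def build_target(records):
    """Keep age-eligible records with valid items and attach a binary target."""
    kept = []
    for rec in records:
        age = rec.get("age_years")
        if age is None or not (AGE_MIN <= age <= AGE_MAX):
            continue  # outside the age range in which the items apply
        answers = [rec.get(item) for item in ITEMS]
        if any(a not in VALID_CODES for a in answers):
            continue  # skip, nonresponse, or missing code on at least one item
        out = dict(rec)
        out["flourishing_all4"] = int(all(a in POSITIVE_CODES for a in answers))
        kept.append(out)
    return kept


# Four toy records; the code 99 stands in for a hypothetical nonresponse code.
records = [
    {"age_years": 5, "item_curiosity": 1, "item_persistence": 1,
     "item_self_regulation": 1, "item_school_interest": 1},
    {"age_years": 10, "item_curiosity": 1, "item_persistence": 99,
     "item_self_regulation": 2, "item_school_interest": 1},
    {"age_years": 12, "item_curiosity": 1, "item_persistence": 2,
     "item_self_regulation": 2, "item_school_interest": 1},
    {"age_years": 16, "item_curiosity": 1, "item_persistence": 3,
     "item_self_regulation": 2, "item_school_interest": 1},
]

benchmark_rows = build_target(records)
print([row["flourishing_all4"] for row in benchmark_rows])
```

In this toy run the first record is dropped for age, the second for an invalid item code, and the remaining two receive targets of 1 and 0. To control target–predictor overlap, the four item fields would also need to be excluded from the candidate predictor pool before feature selection.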
By documenting the construction of the flourishing outcome and releasing the processed benchmark dataset together with a reproducible pipeline, this study transforms a complex public-health survey into a resource that can be directly used in machine-learning research. Reproducibility has become an increasingly important principle in computational science, particularly in fields that rely on large-scale data analysis [5, 6]. Providing both the processed dataset and the scripts used to generate it enables other researchers to reproduce the results of this study, audit the benchmark-construction choices, and extend the analysis in future work.
5.4. Interpretation of the Selected Predictors
The revised feature-selection results also provide a substantive insight into the structure of the benchmark. The highest-ranked predictors are not confined to a single domain; instead, they span multiple aspects of the child context, including functional difficulties, health complexity, school-related measures, and broader developmental conditions. This pattern is consistent with the conceptual literature on flourishing, which emphasizes that positive child development is multidimensional rather than narrowly behavioral or purely clinical [1, 2, 14].
At the same time, the selected predictors should not be interpreted causally. Mutual information was used as a model-agnostic filter for benchmark construction, not as a causal discovery tool. Accordingly, the importance of a variable in the benchmark reflects predictive association under the present analytical design rather than evidence of direct causal influence on flourishing.
5.5. Limitations
Several limitations should be considered when interpreting the results of this study. First, although the revised analytical dataset is much larger than in the earlier workflow, it still represents an analytically eligible school-age subset rather than the full NSCH population. The benchmark depends on age eligibility and valid measurement of all four flourishing items, which means that the released dataset may still be affected by sample-selection mechanisms relative to the raw survey.
Second, the benchmark is intended for reproducible predictive analysis, not for causal inference. The predictors retained in the final subset may be highly informative for classification, but they should not be interpreted as causal determinants of child flourishing without a different research design.
Third, although the revised benchmark achieves stronger predictive performance than the earlier version, the study intentionally focuses on transparent baseline models rather than aggressive model optimization. The reported results therefore provide a solid benchmark reference point, but not an upper bound on achievable performance.
Fourth, the preserved geographic information increases the flexibility of the released dataset, but the present study does not treat geographic comparisons as a central empirical contribution. Future geographic or contextual analyses should incorporate appropriate safeguards for varying state-level sample sizes and should avoid overinterpreting descriptive rankings.
5.6. Future Research
The benchmark dataset and reproducible pipeline introduced in this study open several avenues for future research. First, researchers may apply additional machine-learning models, including gradient boosting methods or neural networks, to evaluate whether more complex algorithms improve predictive performance beyond the transparent baselines reported here [4].
Second, future studies could integrate additional contextual datasets, such as regional socioeconomic indicators, education statistics, or community-level variables, in order to better capture the broader environmental determinants of child flourishing.
Third, the released benchmark may serve as a testbed for methodological comparison studies, including alternative feature-selection strategies, different missing-data treatments, interpretable machine-learning approaches, calibration analysis, or fairness-oriented evaluation. Because the full benchmark-construction workflow is reproducible, such studies can focus on methodological innovation without having to reconstruct the analytical subset from scratch.
6. Conclusions
This study presents a reproducible machine-learning-ready benchmark dataset derived from the 2023 National Survey of Children’s Health (NSCH) together with a transparent computational pipeline for studying school-age child flourishing. The work addresses an important gap between large-scale population surveys and machine-learning research by transforming a complex public-use survey file into a benchmark resource that can be directly reused for predictive modeling and computational analysis.
A key contribution of the study is the reproducible construction of a school-age flourishing outcome derived from four NSCH survey items measuring curiosity, task persistence, emotional self-regulation, and interest in doing well in school. Because the NSCH dataset was originally designed for population health monitoring rather than machine-learning applications, constructing a benchmark target requires explicit handling of age eligibility, survey-specific response codes, and valid-response filtering. By documenting these steps and providing a standardized outcome definition, this work facilitates consistent future analyses of child well-being using NSCH data.
The study also releases a curated benchmark dataset that includes the flourishing outcome, the final selected predictor subset, and key survey-design and contextual variables. Converting a complex survey dataset into a machine-learning-ready benchmark substantially reduces the barriers faced by researchers who wish to apply predictive models to child well-being data while preserving enough structure to support methodological extensions.
Another contribution of the study is the development of a fully reproducible computational pipeline that performs sample construction, predictor filtering, feature selection, model training, cross-validation, and automatic export of tables and figures used in the paper. Reproducibility is increasingly recognized as a fundamental requirement for reliable computational research [5, 6]. By releasing both the processed benchmark dataset and the scripts required to regenerate the analytical results, this study supports transparent and replicable research practices.
The baseline machine-learning experiments reported in this study show that the revised benchmark supports stable and reasonably strong predictive performance under transparent baseline models. Logistic regression and random forest achieved closely comparable results, with held-out ROC-AUC values around 0.84–0.85 and consistent cross-validation performance on the training partition. These findings indicate that the released benchmark contains meaningful predictive signals and can serve as a useful reference task for future methodological work.
The comparison between weighted and unweighted models provides an additional methodological insight into the interaction between machine learning and complex survey data. In the present benchmark, weighting did not improve discriminative performance as measured by ROC-AUC, although the magnitude of the effect was modest and depended on the model. This result reinforces an important methodological distinction between predictive modeling and design-based inference: survey weights remain essential for many inferential purposes, but their benefit for predictive discrimination is not guaranteed in benchmark classification settings.
Taken together, the curated dataset, reproducible pipeline, and baseline benchmarking results presented in this study provide a foundation for future computational research on child flourishing. The released benchmark can serve as a resource for evaluating predictive models, comparing feature-selection strategies, exploring interpretable machine-learning approaches, and integrating additional contextual datasets related to education, health, and socioeconomic conditions.
Future work may extend this research in several directions. Researchers may evaluate additional machine-learning algorithms, incorporate external contextual data, study calibration and fairness properties, or investigate richer methodological designs for benchmark construction and validation. The released resource may also support interdisciplinary work connecting public health, social science, and machine-learning research.
By transforming the NSCH survey into a reproducible benchmark resource and documenting the full analytical workflow, this study contributes to the development of open and reusable data resources for studying child well-being.