Peer-Review Record

Predicting College Enrollment for Low-Socioeconomic-Status Students Using Machine Learning Approaches

Big Data Cogn. Comput. 2025, 9(4), 99; https://doi.org/10.3390/bdcc9040099
by Surina He 1,*, Mehrdad Yousefpoori-Naeim 1, Ying Cui 2 and Maria Cutumisu 3,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 6 March 2025 / Revised: 4 April 2025 / Accepted: 10 April 2025 / Published: 12 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In the proposed manuscript, the authors apply machine learning techniques to predict college enrollment among high school students from low socioeconomic status (SES) backgrounds. Using data from the High School Longitudinal Study of 2009 (HSLS:09), they evaluate several machine learning models, such as logistic regression and k-nearest neighbors (KNN), to determine the most accurate predictor of college enrollment. The results show that the random forest classifier outperforms the other models, achieving an accuracy of approximately 67%. The study identifies key predictors of college enrollment, including high school GPA, parental educational expectations, and the influence of peers. Notably, students with higher GPAs, supportive parental expectations, and friends planning to attend college were more likely to enroll in higher education. The study provides insights into modifiable factors that educators and policymakers can leverage to encourage college enrollment among low-SES students.

The paper is a well-structured and methodologically sound exploration of predictive factors influencing college enrollment. It effectively integrates machine learning with educational research, providing a data-driven approach to identifying key determinants of postsecondary enrollment. The use of SHAP (SHapley Additive exPlanations) enhances the interpretability of the model, offering valuable insights into the relative importance of different predictors. The study also contributes to the field by focusing specifically on low-SES students, a group that is often underrepresented in higher education research.

Nevertheless, the presentation has some drawbacks that currently undermine the quality of the paper. Consequently, I suggest that the authors take into account the following comments:

  • A more comprehensive overview of methods and problems in social science and explainability in general should be provided; indeed, the review of the related literature is not really comprehensive in its current form. I suggest that the authors consider problems and scenarios of different natures (such as social network analysis, see [1], and novel explainability models, see [2]) and integrate them into a more general overview of the related literature.
  • Although the authors use ML, the approach remains heavily descriptive rather than prescriptive. The study identifies key predictors but does not explore how interventions based on these findings could be implemented in real-world settings. Without an experimental or quasi-experimental validation of these findings, it is unclear whether targeting these predictors (e.g., parental expectations) would actually increase college enrollment rates.
  • The dataset is very old (2009), thus raising concerns about its applicability to current socioeconomic and educational contexts, especially given recent disruptions in higher education trends due to COVID-19, economic shifts, and changes in financial aid policies. Some more details should be provided.
  • The machine learning methodology, while rigorous, lacks a thorough justification for model selection beyond accuracy comparisons. More details on this should also be provided.
  • The discussion lacks concrete policy recommendations or practical applications. While the study identifies important factors, it does not suggest specific interventions or educational policies that could be implemented to address disparities in college enrollment. More discussion on how school administrators, teachers, and policymakers can use these findings to design effective support programs for low-SES students would enhance the study’s contribution.

References:

[1] "A model-agnostic, network theory-based framework for supporting XAI on classifiers." Expert Systems with Applications 241 (2024): 122588.

Author Response

Comment 1: A more comprehensive overview of methods and problems in social science and explainability in general should be provided; indeed, the review of the related literature is not really comprehensive in its current form. I suggest that the authors consider problems and scenarios of different natures (such as social network analysis, see [1], and novel explainability models, see [2]) and integrate them into a more general overview of the related literature.

Response 1: We have added a new subsection (Section 2.4: Explainable Artificial Intelligence Techniques) to provide a more comprehensive overview of explainability-related methods and challenges in social science. This section also includes the categorization of prevalent XAI approaches in computer science (page 4, lines 157–179, and page 5, lines 180–193).

Comment 2: Although the authors use ML, the approach remains heavily descriptive rather than prescriptive. The study identifies key predictors but does not explore how interventions based on these findings could be implemented in real-world settings. Without an experimental or quasi-experimental validation of these findings, it is unclear whether targeting these predictors (e.g., parental expectations) would actually increase college enrollment rates.

Response 2: We have addressed the limitations of our findings in the Limitations section and recommend that future studies could obtain more practical results through experimental or quasi-experimental designs (page 14, lines 521–525).

Comment 3: The dataset is very old (2009), thus raising concerns about its applicability to current socioeconomic and educational contexts, especially given recent disruptions in higher education trends due to COVID-19, economic shifts, and changes in financial aid policies. Some more details should be provided.

Response 3: Regarding the age of the dataset, we acknowledge that large-scale college enrollment datasets are difficult to obtain, since nationwide data collection is time-consuming and costly. HSLS:09 remains the most recent publicly available dataset suitable for our study.

Comment 4: The machine learning methodology, while rigorous, lacks a thorough justification for model selection beyond accuracy comparisons. More details on this should also be provided.

Response 4: We have provided a more detailed rationale for selecting the five classifiers in the Machine Learning Model-Building and Testing subsection (page 7, lines 286–303).

Comment 5: The discussion lacks concrete policy recommendations or practical applications. While the study identifies important factors, it does not suggest specific interventions or educational policies that could be implemented to address disparities in college enrollment. More discussion on how school administrators, teachers, and policymakers can use these findings to design effective support programs for low-SES students would enhance the study’s contribution.

Response 5: We have added more practical recommendations in the Discussion section (page 13, lines 444–453 and 492–495; page 14, lines 496–501).

Reviewer 2 Report

Comments and Suggestions for Authors

Briefly mention the theoretical framework (Ecological Systems Theory) used, to provide additional context.

The introduction clearly states the research gaps, justifying the current study. However, it might be beneficial to explicitly emphasize why machine learning was specifically chosen as a methodological approach in the introduction itself.

In the Literature Review, you might want to further justify the unique contribution of this study by explicitly identifying how the chosen predictors differ from those already explored in past research, or by explicitly stating the gap addressed by this machine learning approach.

Further clarify why specifically 2-fold inner cross-validation was chosen (which is uncommon) instead of more traditional values such as 5 or 10 folds.


Consider clarifying the reason behind the relatively modest accuracy (67%–70%). Does this suggest the problem is inherently challenging, or is there room for methodological improvements?

More information is needed on running the algorithm multiple times (e.g., 50+ iterations) within an X-fold cross-validation framework, reporting performance metrics as mean ± standard deviation to demonstrate consistency. Can this be applied to the ML models?
Author Response

Thank you for your valuable feedback. We have carefully addressed your comments and revised the manuscript accordingly. Below is a point-by-point response detailing the changes made.

Comment 1: Briefly mention the theoretical framework (Ecological Systems Theory) used, to provide additional context. The introduction clearly states the research gaps, justifying the current study; however, it might be beneficial to explicitly emphasize why machine learning was specifically chosen as a methodological approach in the introduction itself. In the Literature Review, you might want to further justify the unique contribution of this study by explicitly identifying how the chosen predictors differ from those already explored in past research, or by explicitly stating the gap addressed by this machine learning approach.

Response 1: We have incorporated ecological systems theory to strengthen the framing of the research gaps, study purpose, and justification for using machine learning and XAI methods. This revision also clarifies the study’s contributions in both the Introduction section (page 2, lines 45–52) and The Present Study subsection (page 5, lines 208–215).

Comment 2: Further clarify why specifically 2-fold inner cross-validation was chosen (which is uncommon) instead of more traditional values such as 5 or 10 folds.

Response 2: As suggested, we adopted the more robust 5-fold cross-validation with 5 repetitions to ensure reliable model evaluation.
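For illustration, this evaluation scheme can be set up as in the minimal sketch below; the synthetic data and random-forest estimator are placeholders, not the exact pipeline from the manuscript:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data standing in for the preprocessed HSLS:09 features.
X, y = make_classification(n_samples=1000, n_features=28, random_state=42)

# 5-fold stratified cross-validation repeated 5 times = 25 fold scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```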

Comment 3: Consider clarifying the reason behind the relatively modest accuracy (67%–70%).

Response 3: We have added an explanation regarding the modest accuracy of our models in the Limitations section (page 14, lines 506–509).

Comment 4: More information is needed on running the algorithm multiple times (e.g., 50+ iterations) within an X-fold cross-validation framework, reporting performance metrics as mean ± standard deviation to demonstrate consistency.

Response 4: To enhance the reliability of our results, we implemented a nested cross-validation approach with a 5×5 (inner) × 5×5 (outer) design (totaling 625 fits), combined with a grid search for hyperparameter tuning. This ensures stable and accurate performance estimation. Detailed methodological explanations are provided in Section 3.4 (Machine Learning Model-Building and Testing; page 7, lines 286–330; page 8, lines 331–337). Model performance metrics (mean ± standard deviation) for both inner and outer loops are now reported in Table 1 (page 10).
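For readers who wish to reproduce this design, a hedged sketch in scikit-learn follows; the hyperparameter grid shown here is an illustrative assumption, not the grid reported in the manuscript:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=1000, n_features=28, random_state=42)

# Inner loop: grid search tunes hyperparameters on each outer training split.
inner_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=inner_cv, scoring="accuracy",
)

# Outer loop: estimates performance of the tuned model on unseen folds.
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=1)
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(f"outer accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Because the grid search runs entirely inside each outer training split, the outer scores remain untouched by hyperparameter tuning, which keeps the reported mean ± standard deviation an honest performance estimate.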

Reviewer 3 Report

Comments and Suggestions for Authors

The paper titled "Using Machine Learning Approaches to Predict College Enrollment for Low Socioeconomic Status Students" investigates the factors influencing college enrollment among low-SES high school students using five machine learning algorithms. The study leverages data from the High School Longitudinal Study of 2009 (HSLS:09) and identifies key predictors through a rigorous machine learning pipeline, including data preprocessing, feature selection, and model evaluation. However, the paper still has the following problems:

The authors used a combination of Spearman correlation and Lasso regression to reduce the number of predictors from 100 to 28. However, it is unclear whether other feature selection methods (e.g., recursive feature elimination or principal component analysis) could yield different or more robust results. Additionally, the rationale for choosing a specific Lasso regularization strength should be explained in more detail.

While the SHAP values provide insights into feature importance, the interpretation of these values could be enhanced. For instance, the authors could discuss how the directionality and magnitude of SHAP values align with theoretical expectations from the ecological systems theory. Additionally, exploring partial dependence plots or interaction values might offer deeper insights into the relationships between predictors and college enrollment.

The study compares five machine learning algorithms but lacks a comparison with simpler baseline models (e.g., logistic regression without feature selection or a basic decision tree). Including such baselines would help contextualize the performance gains achieved through more complex methods like random forests.

The authors report accuracy, precision, recall, F1-score, and ROC-AUC for model evaluation. However, given the imbalanced nature of the dataset (before applying SMOTE), it would be beneficial to include additional metrics such as the confusion matrix for the test set and class-specific performance metrics to better understand the model's predictive power across different enrollment statuses.

The study relies on data from the HSLS:09, which is specific to the United States. While the findings are valuable, the authors should discuss the potential limitations in generalizing these results to other regions or populations. Future work could involve validating the identified predictors using datasets from other countries or demographic groups.

The study concludes with practical implications for educational practitioners, but it could benefit from more detailed recommendations. For example, the authors could suggest specific interventions or programs that target the identified predictors (e.g., GPA improvement initiatives or parental engagement workshops) to directly support low-SES students.

Given the longitudinal nature of the HSLS:09 dataset, it would be informative to explore the temporal dynamics of the predictors. For instance, how do the influences of GPA, parental expectations, and peer networks change over time? A longitudinal analysis could provide deeper insights into the developmental trajectories leading to college enrollment.

The authors could conduct a sensitivity analysis to assess the robustness of their findings. For example, how do the results change when different hyperparameters are chosen for the random forest algorithm, or when the oversampling technique (SMOTE) is varied? This analysis would strengthen the reliability of the identified predictors.

The study is grounded in ecological systems theory, but the discussion could more explicitly link the findings to this theoretical framework. For example, how do the identified predictors align with the micro-, meso-, and exosystem levels described by Bronfenbrenner? A more detailed theoretical integration would enhance the study's coherence and provide a stronger foundation for future research.

The papers cited in the Introduction are old and insufficient, and the background description needs to cite more papers. The following paper needs to be cited: "From Sample Poverty to Rich Feature Learning: A New Metric Learning Method for Few-Shot Classification".

Author Response

Thank you for your detailed suggestions. We have carefully addressed each of your comments and revised the manuscript accordingly. Below is a point-by-point response detailing the changes made.

Comment 1: The authors used a combination of Spearman correlation and Lasso regression to reduce the number of predictors from 100 to 28. However, it is unclear whether other feature selection methods (e.g., recursive feature elimination or principal component analysis) could yield different or more robust results. Additionally, the rationale for choosing a specific Lasso regularization strength should be explained in more detail.

Response 1: We have further clarified the rationale for selecting Lasso regression in the Data Preprocessing subsection (page 6, lines 269–272). Compared to Recursive Feature Elimination (RFE), Lasso is computationally efficient, as it does not require an external model or iterative feature removal. Additionally, while Principal Component Analysis (PCA) is primarily a dimensionality reduction technique that constructs less interpretable features, Lasso retains the original features (shrinking some coefficients to zero), ensuring better interpretability.
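To make the two-stage screening concrete, a minimal sketch is given below; the correlation cutoff (0.05) and the Lasso regularization strength (alpha = 0.01) are hypothetical stand-ins, not the settings used in the manuscript:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Placeholder data: 100 candidate predictors, as in the manuscript.
X, y = make_classification(n_samples=1000, n_features=100, random_state=42)

# Stage 1: drop features whose absolute Spearman correlation with the
# outcome falls below a (hypothetical) cutoff.
rho = np.array([spearmanr(X[:, j], y)[0] for j in range(X.shape[1])])
X_screened = X[:, np.abs(rho) >= 0.05]

# Stage 2: Lasso shrinks uninformative coefficients to exactly zero;
# SelectFromModel keeps only the features with nonzero coefficients.
selector = SelectFromModel(Lasso(alpha=0.01, random_state=42))
X_selected = selector.fit_transform(X_screened, y)
print(X.shape[1], "->", X_screened.shape[1], "->", X_selected.shape[1])
```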

Comment 2: While the SHAP values provide insights into feature importance, the interpretation of these values could be enhanced. For instance, the authors could discuss how the directionality and magnitude of SHAP values align with theoretical expectations from the ecological systems theory. Additionally, exploring partial dependence plots or interaction values might offer deeper insights into the relationships between predictors and college enrollment.

Response 2: As suggested, we have linked the interpretation of feature importance (derived from SHAP values) to ecological systems theory in the Discussion section (page 13, lines 446–453).
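As a rough sketch of the SHAP workflow underlying these interpretations (assuming the shap Python package; the data, model, and class labels below are placeholders):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Depending on the shap version, the result is a per-class list or a
# 3-D array; either way, select the attributions for the positive
# ("enrolled") class before plotting.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Beeswarm plot: features are ranked by mean |SHAP|, and the sign shows
# whether high feature values push predictions toward enrollment.
shap.summary_plot(sv, X)
```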

Comment 3: The study compares five machine learning algorithms but lacks a comparison with simpler baseline models (e.g., logistic regression without feature selection or a basic decision tree). Including such baselines would help contextualize the performance gains achieved through more complex methods like random forests.

Response 3: In this study, logistic regression was chosen as the baseline model due to its simplicity, low variance, and widespread use in educational research. Additional justification is provided in the Machine Learning Model-Building and Testing subsection (page 7, lines 300–303).

Comment 4: The authors report accuracy, precision, recall, F1-score, and ROC-AUC for model evaluation. However, given the imbalanced nature of the dataset (before applying SMOTE), it would be beneficial to include additional metrics such as the confusion matrix for the test set and class-specific performance metrics to better understand the model's predictive power across different enrollment statuses.

Response 4: We have added a confusion matrix and class-specific ROC-AUC plots (Figures 3 and 4) in the Results section (page 10).
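A hedged sketch of this class-specific evaluation is shown below, with SMOTE applied only to the training data via an imbalanced-learn pipeline; the class ratio, split, and model are illustrative, not the manuscript’s exact configuration:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data (70/30) standing in for enrollment labels.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE inside the pipeline oversamples the training data only, so the
# test set keeps its natural class distribution.
model = Pipeline([("smote", SMOTE(random_state=0)),
                  ("rf", RandomForestClassifier(random_state=0))]).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Per-class precision, recall, and F1 expose what accuracy alone hides.
print(classification_report(y_te, y_pred,
                            target_names=["not enrolled", "enrolled"]))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred)
```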

Comment 5: The study relies on data from the HSLS:09, which is specific to the United States. While the findings are valuable, the authors should discuss the potential limitations in generalizing these results to other regions or populations. Future work could involve validating the identified predictors using datasets from other countries or demographic groups.

Response 5: We have also acknowledged the limited generalizability of the U.S. sample and the restricted feature set in the Limitations section (page 14, lines 506–512).

Comment 6: The study concludes with practical implications for educational practitioners, but it could benefit from more detailed recommendations. For example, the authors could suggest specific interventions or programs that target the identified predictors (e.g., GPA improvement initiatives or parental engagement workshops) to directly support low-SES students.

Response 6: More practical recommendations have been added in the Discussion section (page 13, lines 444–453 and 492–495; page 14, lines 496–501).

Comment 7: Given the longitudinal nature of the HSLS:09 dataset, it would be informative to explore the temporal dynamics of the predictors. For instance, how do the influences of GPA, parental expectations, and peer networks change over time? A longitudinal analysis could provide deeper insights into the developmental trajectories leading to college enrollment.

Response 7: While exploring longitudinal changes in feature importance is an interesting direction, the current study focuses on identifying predictors of college enrollment among low-SES students. We may investigate temporal dynamics in future work.

Comment 8: The authors could conduct a sensitivity analysis to assess the robustness of their findings. For example, how do the results change when different hyperparameters are chosen for the random forest algorithm, or when the oversampling technique (SMOTE) is varied? This analysis would strengthen the reliability of the identified predictors.

Response 8: The main focus of this study is to determine the importance of features in the best predictive model, and the hyperparameter search ranges are already sufficiently wide. Moreover, to obtain a reliable predictive model, this study applied a nested cross-validation approach with a 5×5 (inner) × 5×5 (outer) design (totaling 625 fits), combined with grid search for hyperparameter tuning. We believe this is sufficient to ensure stable and accurate performance estimation without an additional sensitivity analysis. Detailed methodological explanations are provided in Section 3.4 (Machine Learning Model-Building and Testing) (page 7, lines 286–330; page 8, lines 331–337).

Comment 9: The study is grounded in ecological systems theory, but the discussion could more explicitly link the findings to this theoretical framework. For example, how do the identified predictors align with the micro-, meso-, and exosystem levels described by Bronfenbrenner? A more detailed theoretical integration would enhance the study's coherence and provide a stronger foundation for future research.

Response 9: The relative importance of features is now explicitly discussed under the ecological systems theory framework in the Discussion section (page 13, lines 444–453).

Comment 10: The papers cited in the Introduction are old and insufficient, and the background description needs to cite more papers. The following paper needs to be cited: "From Sample Poverty to Rich Feature Learning: A New Metric Learning Method for Few-Shot Classification".

Response 10: We have updated the Related Work section with recent literature and data (page 2, lines 70–79).

Reviewer 4 Report

Comments and Suggestions for Authors

In this manuscript, the authors report a machine learning study on predicting college enrollment for low-socioeconomic-status students. The study examines the factors that predict college enrollment among high-school students with lower socioeconomic status (SES) using five different machine learning algorithms on data from the High School Longitudinal Study of 2009 (HSLS:09). The three most important factors for predicting student enrollment in 2-year and 4-year colleges were overall high-school GPA, parental educational expectations, and the number of close friends who plan to attend a 4-year college.

The manuscript presents good results with sufficient novelty and research progress for BDCC. However, the following revisions are suggested before the manuscript can be considered for publication:

  1. The writing (technical, grammatical, and typos) of the manuscript should be improved.
  2. The title should be changed to, “Predicting College Enrollment for Low Socioeconomic Status Students Using Machine Learning Approaches”.
  3. The first citation has been numbered as [29]; the numbering should be corrected and authors should carefully proofread the manuscript.
  4. The “Introduction” should be extended; it did not provide sufficient background for the demonstrated study.
  5. Motivation of the demonstrated study should further be highlighted to improve the significance of the study.
  6. All the details about machine learning work should be added to the methods to reproduce the study.
  7. Most of the literature is old. Recent reports should be included in the study.
  8. I suggest the authors deepen their critical analysis to improve the comparison of current findings with those already reported in the literature.
  9. I still did not understand the physical interpretation of the negative scale for “Students School Engagement” in Fig. 4.
  10. Authors should mention any limitations in the machine learning model and methods that could affect the reliability of the results.
Comments on the Quality of English Language

The minor grammatical errors should be corrected.

Author Response

We sincerely appreciate your thoughtful suggestions, which have significantly improved our manuscript. Below we provide a detailed response to how we have addressed each of your concerns:

Comment 1: The writing (technical, grammatical, and typos) of the manuscript should be improved. The title should be changed to, “Predicting College Enrollment for Low Socioeconomic Status Students Using Machine Learning Approaches”. The first citation has been numbered as [29]; the numbering should be corrected and authors should carefully proofread the manuscript.

Response 1: We have implemented editorial improvements throughout the manuscript, including title revision, enhanced writing clarity, and corrected reference formatting.

Comment 2: The “Introduction” should be extended; it did not provide sufficient background for the demonstrated study. Motivation of the demonstrated study should further be highlighted to improve the significance of the study.

Response 2: The Introduction section has been expanded to better articulate the research gaps, study purpose, and significance of our work (page 2, lines 45–52).

Comment 3: All the details about machine learning work should be added to the methods to reproduce the study.

Response 3: We have added all the details of the machine learning work in Section 3.4 (Machine Learning Model-Building and Testing) (page 7, lines 286–330; page 8, lines 331–337).

Comment 4: Most of the literature is old. Recent reports should be included in the study.

Response 4: We have updated the Related Work section with recent literature and data (page 2, lines 70–79).

Comment 5: I suggest the authors deepen their critical analysis to improve the comparison of current findings with those already reported in the literature.

Response 5: In the Discussion, we explicitly compare our findings with prior research (page 12, lines 441–442; page 13, lines 443–444).

Comment 6: I still did not understand the physical interpretation of the negative scale for “Students School Engagement” in Fig. 4.

Response 6: The negative scale for “Students School Engagement” has been further explained in the Results section (page 11, lines 421–424).

Comment 7: Authors should mention any limitations in the machine learning model and methods that could affect the reliability of the results.

Response 7: The Limitations section acknowledges the constraints of the current ML approach and the limited number of features (page 14, lines 506–512 and 521–525).

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed my concerns.

Reviewer 3 Report

Comments and Suggestions for Authors

I have no further comments.

Reviewer 4 Report

Comments and Suggestions for Authors

The revision of the manuscript is satisfactory. It can now be accepted for publication in BDCC.

 
