Article
Peer-Review Record

Predicting Early Employability of Vietnamese Graduates: Insights from Data-Driven Analysis Through Machine Learning Methods

Big Data Cogn. Comput. 2025, 9(5), 134; https://doi.org/10.3390/bdcc9050134
by Long-Sheng Chen 1,2, Thao-Trang Huynh-Cam 2,*, Van-Canh Nguyen 3, Tzu-Chuen Lu 2 and Dang-Khoa Le-Huynh 4
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 9 April 2025 / Revised: 13 May 2025 / Accepted: 14 May 2025 / Published: 19 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The literature review compares previous studies (Table 1) but doesn't benchmark the current results against these in the discussion. For instance, how does 93.6% accuracy compare to other studies using similar methods in different regions?
  2. The paper states that CART can handle both numeric and categorical data, but the dataset includes categorical variables (e.g., major, gender). How were categorical features encoded (e.g., one-hot encoding, ordinal encoding) before being fed into CART? Did this preprocessing impact model performance?
  3. The authors mention that DTs can address overfitting when "properly pruned." What pruning strategies (e.g., cost-complexity pruning, max depth settings) were applied to the CART model? Were hyperparameters (e.g., max_depth, min_samples_split) tuned via cross-validation?
  4. The paper mentions using under-sampling and over-sampling but doesn't elaborate on why these methods were chosen over others (e.g., SMOTE). The effectiveness of oversampling in Case 3 leading to high accuracy might be due to overfitting, which isn't discussed. Evaluate potential overfitting in Case 3 (oversampling) and validate results using cross-validation or holdout datasets to ensure robustness.
  5. In line 278, the authors refer to '21 rules'; however, 23 rules are actually presented. Please clarify or correct.
  6. The paper states CART outperformed SVM and C5.0 but doesn't delve into why CART is more suitable for this dataset. There's no discussion on model assumptions or potential biases, especially since decision trees can be prone to overfitting. Were there differences in preprocessing, hyperparameter tuning, or implementation between the two models that could explain this performance gap?
  7. The paper compares CART, C5.0, and SVM but does not explore ensemble methods (e.g., Random Forest, Gradient Boosting) or deep learning models, which might offer better performance. A broader comparison would strengthen the methodological rigor.
  8. The decision rules extracted (Table 8) are quite specific but lack a discussion on how universities can operationalize these findings. Practical steps for curriculum changes or policy adjustments are mentioned but not detailed. Adding a section on actionable steps or pilot programs of how universities can operationalize the extracted decision rules would bridge the gap between research and practice.
  9. While CART is chosen for its interpretability, the extracted rules (23 rules) are complex. Simplifying these rules or using visual aids (e.g., decision tree diagrams) could enhance clarity for policymakers. Simplify or visualize the 23 decision rules (e.g., through decision tree diagrams or flowcharts) to enhance accessibility for policymakers. Highlight the most impactful rules and their implications for curriculum design.
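Points 3 and 4 both come down to tuning and validating the tree honestly. A minimal sketch of what the reviewer is asking for, assuming scikit-learn's CART implementation (the data and parameter grid below are placeholders, not the authors' actual dataset or settings):

```python
# Hypothetical sketch: tuning CART pruning hyperparameters with 5-fold
# cross-validation. Synthetic data stands in for the graduate dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=42)

param_grid = {
    "max_depth": [3, 5, 7, None],       # limits tree depth
    "min_samples_split": [2, 10, 20],   # minimum samples needed to split a node
    "ccp_alpha": [0.0, 0.001, 0.01],    # cost-complexity pruning strength
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Reporting the cross-validated score of the tuned tree, rather than accuracy on the oversampled training data, would directly address the overfitting concern raised for Case 3.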

Author Response

Thank you very much for taking the time to review this manuscript, and for your comments on the submitted paper. The corresponding revisions and corrections have been made according to your suggestions. All revisions and corrections are highlighted in red or shown in track changes in the revised manuscript. We also provide a point-by-point response to your comments in the attached file. Please check the revised manuscript and the attached file.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Include a brief discussion on the computational cost of the proposed models, especially for real-world deployment in universities.

Expand the comparison with other ML models (e.g., Random Forest, XGBoost) to justify the selection of CART over alternatives.

Address how the oversampling/undersampling techniques impact model generalizability beyond the studied dataset.

Strengthen the originality claim by contrasting the study with prior work in Southeast Asia (e.g., Philippines, Thailand) to highlight regional uniqueness.

Simplify some statistical explanations (e.g., Gini impurity) for non-technical readers.
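One way to simplify the Gini explanation for non-technical readers is a tiny worked example; the class counts below are illustrative only:

```python
# Illustrative only: Gini impurity measures how mixed a tree node is.
def gini(counts):
    """Gini impurity = 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([50, 50]))   # a 50/50 "employed vs. not" node is maximally mixed: 0.5
print(gini([100, 0]))   # a node where everyone is employed is pure: 0.0
```

CART picks the split that most reduces this impurity, which is the intuition a lay reader needs.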

Define "XAI" (Explainable AI) at first mention in the Limitations section.

Add references to Vietnamese labor market studies (e.g., government reports, local university research) to contextualize findings.

The paper is strong but would benefit from addressing the minor suggestions above.

Share anonymized data/code (if ethically permissible) to enhance reproducibility.

Fix typo: "men may be unsure" → "students may be unsure" (Section 5.1).

Clarify Vietnamese GPA scale (0–10) in the Abstract or Introduction.

A valuable contribution to AI in education policy. With minor revisions, this paper is ready for publication.

This research has significant practical utility for universities and policymakers.

Author Response

Thank you very much for taking the time to review this manuscript, and for your comments on the submitted paper. The corresponding revisions and corrections have been made according to your suggestions. All revisions and corrections are highlighted in red or shown in track changes in the revised manuscript. We also provide a point-by-point response to your comments in the attached file. Please check the revised manuscript and the attached file.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This study analyzed data from 610 recent graduates of a public university in Vietnam’s Mekong Delta, applying AI-driven Classification and Regression Trees (CART) to predict employment outcomes within six months after graduation. Key predictors were identified.

This study has some limitations, namely those acknowledged by the authors, such as the sample being drawn exclusively from students of a single university. Additionally, the conclusions do not adequately reflect the results presented in Chapter 4 or the discussions in Chapter 5. Chapter 5 lists a large number of rules that are not easy to understand in practice. For the article to serve as a guide for future studies, it is suggested that the conclusions be rewritten so that the reader can grasp the main results without having to read the study in full.

Moreover, the article should be revised in terms of format and content to be more reader-friendly (Sections 3, 4 and 5). Some additional minor observations should also be taken into consideration:


- line 110: Table 1 columns should be explained;

- lines 152 to 156: rewrite;

- line 166: Replace "Table 1" by "Table 2";

- line 172: Replace "The dataset used in the present study were obtained from graduates who recently " by "The dataset used in the present study was obtained from graduates who recently";

- line 182: Replace "...all category data ..." by "...all categorical data ...";

- line 197: The correlation matrix is difficult to read; enlarge it.

Author Response

Thank you very much for taking the time to review this manuscript, and for your comments on the submitted paper. The corresponding revisions and corrections have been made according to your suggestions. All revisions and corrections are highlighted in red or shown in track changes in the revised manuscript. We also provide a point-by-point response to your comments in the attached file. Please check the revised manuscript and the attached file.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors claim to use CART (via Scikit-learn) to "extract knowledge rules" for policymaking. However, as per Scikit-learn’s own documentation, its CART implementation does not generate rule sets in the form of symbolic IF-THEN statements. Could the authors clarify how the 14 rules were extracted? This is critical for transparency and reproducibility.
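For reference, scikit-learn does not emit symbolic IF-THEN rules directly, but its `export_text` utility renders a fitted tree as nested threshold conditions, which can then be transcribed into explicit rules. A sketch on stand-in data (the iris dataset and feature names below are illustrative, not the study's):

```python
# Sketch: recovering rule-like structure from scikit-learn's CART trees.
# export_text prints the fitted tree as nested "feature <= threshold"
# conditions ending in leaf class labels.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)  # each root-to-leaf path reads as one IF-THEN rule
```

Clarifying whether the authors used such a utility, or traced root-to-leaf paths manually, would resolve the transparency question.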

Thank you for clarifying that manual ordinal encoding was applied to the categorical features (gender and major) for compatibility across CART, C5.0, and SVM. While this preprocessing choice is understandable for practical consistency, I would recommend the authors acknowledge a key limitation:

  • The major variable is nominal, not ordinal — i.e., there is no inherent ranked order among fields of study such as “Accounting”, “Tourism”, or “Agriculture”. Encoding them as ordinal integers (e.g., 1, 2, ..., 21) may inadvertently introduce artificial relationships or distances between categories during the CART split process, potentially affecting the interpretability and reliability of the model.
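The one-hot alternative implied above avoids imposing an artificial order on a nominal feature. A minimal sketch, with placeholder category labels rather than the study's actual majors:

```python
# Minimal sketch: one-hot encoding a nominal "major" column so CART
# splits on category membership rather than an arbitrary integer order.
import pandas as pd

df = pd.DataFrame({"major": ["Accounting", "Tourism", "Agriculture", "Tourism"]})
encoded = pd.get_dummies(df, columns=["major"])
print(encoded.columns.tolist())  # one indicator column per major
```

The trade-off is dimensionality: with 21 majors, one-hot encoding adds 21 binary columns, which is worth acknowledging alongside the ordinal-encoding limitation.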

Author Response

Thank you for your comments regarding the submitted paper. Please find our responses in the attached file. The corrections have been made according to your suggestions and comments. Please check the revised manuscript.

Author Response File: Author Response.pdf
