Peer-Review Record

Harnessing Large-Scale University Registrar Data for Predictive Insights: A Data-Driven Approach to Forecasting Undergraduate Student Success with Convolutional Autoencoders

Mach. Learn. Knowl. Extr. 2025, 7(3), 80; https://doi.org/10.3390/make7030080
by Mohammad Erfan Shoorangiz 1 and Michal Brylinski 2,3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 25 June 2025 / Revised: 29 July 2025 / Accepted: 6 August 2025 / Published: 8 August 2025
(This article belongs to the Section Learning)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a machine learning study that uses over a decade of student data from Louisiana State University (LSU) to predict undergraduate graduation outcomes. It uses a large and diverse dataset (94,931 records over 12 years), which increases the generalizability of the results. The paper is well-organized, with detailed descriptions of preprocessing, feature engineering, model architecture, and training.

Weakness:

  • The dependent variable (Graduation) is not formally defined in the paper; it should be explained how the graduation status label was determined (e.g., 4-year, 6-year, delayed graduation, dropout, etc.) to ensure readers understand the target definition.
  • While the literature review includes several references to prior work in graduation prediction, the paper lacks a direct benchmarking of its results against those studies. Including a comparative analysis (even if approximate) would help contextualize the model’s performance and highlight the true contribution of the proposed method.
  • The paper does not discuss demographic bias. Evaluating and mitigating potential bias would enhance the paper’s ethical rigor.
  • Further discussion of how the system could be implemented in academic advising or institutional decision-making would strengthen the impact of this study.
  • I recommend including a dedicated Conclusion section to strengthen the manuscript’s structure. Although the discussion addresses some reflective points, it does not provide a clear and focused summary of the main contributions, limitations, or future research directions. Adding a concise conclusion would help underscore the study’s significance and offer clearer insights for both researchers and practitioners.

Author Response

We sincerely thank the Reviewer for their thoughtful and constructive feedback. The comments and suggestions provided were highly valuable and led to meaningful improvements in the clarity, rigor, and overall quality of the manuscript. All revisions made in response to the Reviewer’s comments are clearly marked in red in the revised version of the paper. We appreciate the time and care taken in the review process.

Comment 1: “The dependent variable (Graduation) is not formally defined in the paper; it should be explained how the graduation status label was determined (e.g., 4-year, 6-year, delayed graduation, dropout, etc.) to ensure readers understand the target definition.”

Response 1: To clarify, graduation status was defined as a simple binary outcome: students were labeled as “graduates” if they had completed their degree by the time of data extraction, and “non-graduates” if they had not. We did not distinguish between on-time and delayed graduation or identify specific dropout patterns. This binary labeling approach was chosen to simplify the prediction task and align with institutional goals focused on overall degree completion. The paragraph in the manuscript has been revised accordingly to make this definition explicit (see pages 6, bottom and 7, top).
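For illustration only, the following is a minimal sketch of this binary labeling, assuming the registrar extract were available as a pandas DataFrame with a hypothetical degree_completion_date column; the actual field names and extraction logic in the LSU data may differ.

```python
import pandas as pd

# Hypothetical registrar extract; column names and values are illustrative only.
records = pd.DataFrame({
    "student_id": [101, 102, 103],
    "degree_completion_date": ["2018-05-12", None, "2020-12-18"],
})

# Binary target: 1 = degree completed by the time of data extraction, 0 = not completed.
records["graduated"] = records["degree_completion_date"].notna().astype(int)
print(records[["student_id", "graduated"]])
```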

Comment 2: “While the literature review includes several references to prior work in graduation prediction, the paper lacks a direct benchmarking of its results against those studies. Including a comparative analysis (even if approximate) would help contextualize the model’s performance and highlight the true contribution of the proposed method.”

Response 2: We sincerely appreciate this valuable suggestion. We agree that benchmarking against prior studies would strengthen the contextual understanding of our model performance. However, in practice, fair and direct comparisons are challenging due to the absence of publicly available benchmarking datasets for graduation prediction. Most existing studies use proprietary institutional data, which vary widely in terms of features, student demographics, and institutional structures. As a result, reported performance metrics across studies are not directly comparable. Nevertheless, in response to this comment, we have expanded our performance analysis by including additional comparisons with baseline models, specifically logistic regression and linear discriminant analysis. These additions, now included in Section 3.4. “Benchmarking random forest against traditional baseline models” (pages 16, bottom and 17) and Table 2, help situate our CAE-based approach relative to commonly used methods in the field. If the Reviewer is aware of a specific public dataset or standardized benchmark that would allow a meaningful and fair comparison, we would greatly appreciate the recommendation and would be happy to incorporate such an analysis in a revision or follow-up study.

Comment 3: “The paper does not discuss demographic bias. Evaluating and mitigating potential bias would enhance the paper’s ethical rigor.”

Response 3: We appreciate this thoughtful observation regarding demographic bias, which is a critical issue in educational data science and predictive modeling. We fully agree that addressing potential bias is essential for ensuring fairness and ethical rigor in student success prediction. However, as part of our data governance and compliance with university policies, we were not permitted to disaggregate or analyze student outcomes by demographic subgroups (e.g., race, ethnicity, gender identity). These restrictions are in place to protect student privacy and comply with institutional ethical standards. To help mitigate potential bias indirectly, we implemented several strategies: (1) feature preprocessing and encoding protocols ensured no proxy variables (e.g., ZIP code, income brackets) were treated in a way that could introduce artificial or unintended ordinal relationships, (2) model-based imputation and balanced accuracy metrics were used to minimize bias resulting from missing data and class imbalance, (3) exclusion of specific student populations (e.g., athletes, veterans) aimed to reduce confounding from atypical institutional pathways, and (4) use of interpretable models (e.g., random forests) allowed us to monitor feature importance for any unexpected associations with sensitive fields. We have added a dedicated paragraph in the Discussion section to explicitly address this limitation and our mitigation approach (please see pages 21, bottom and 22, top of the revised manuscript). We hope this clarification demonstrates our commitment to responsible, ethical modeling within the bounds of institutional policy.
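As an illustration of strategy (4), the sketch below shows how random forest feature importances can be inspected for unexpected weight on proxy fields; the data and column names are synthetic stand-ins, not the actual registrar features or the study's code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for preprocessed registrar features (column names are illustrative).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["hs_gpa", "best_math", "zip_latitude", "aid_amount"])
y = (X["hs_gpa"] + 0.5 * X["best_math"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank features by importance; unexpectedly high weight on geographic or financial
# proxies would prompt a manual review for potential indirect bias.
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
```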

Comment 4: “Further discussion of how the system could be implemented in academic advising or institutional decision-making would strengthen the impact of this study.”

Response 4: We agree that discussing potential applications of our predictive system enhances the relevance and practical utility of the study. In response, we have added a new paragraph to the Discussion section that outlines how the developed models and embeddings could be integrated into academic advising workflows and institutional decision-making processes (please see page 22, top of the revised manuscript). This includes use cases such as early identification of at-risk students, resource allocation planning, and support service targeting. We hope this addition clarifies the broader impact and applicability of our approach in higher education settings.

Comment 5: “I recommend including a dedicated Conclusion section to strengthen the manuscript’s structure. Although the discussion addresses some reflective points, it does not provide a clear and focused summary of the main contributions, limitations, or future research directions. Adding a concise conclusion would help underscore the study’s significance and offer clearer insights for both researchers and practitioners.”

Response 5: Following this Reviewer’s suggestion, we have added a dedicated Conclusion section to the manuscript. This section provides a focused summary of the main contributions, reflects on key limitations, and outlines potential directions for future research and practical application. We believe this addition strengthens the overall structure and clarity of the paper. Please see the new Conclusion section on page 22 of the revised manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes a convolutional autoencoder (CAE) to predict undergraduate graduation outcomes from large-scale university student data. The authors compiled a dataset of courses, grades, and demographics of tens of thousands of students, and designed a multi-layer CAE to learn latent representations and make predictions. The experimental evaluation lacks a baseline for comparison with other studies. This paper needs to be revised and reviewed again.

Comments for Article revision:

1. In Section 3.4, please explain what the F1 score is. What is a confusion matrix?

2. In Section 4, ROC-AUC is misspelled.

3. The format errors in references 7,12 need to be corrected.

 

The specific comments about the paper are as follows:

  1. The paper investigates whether a convolutional autoencoder applied to massive registrar datasets can accurately forecast undergraduate student success metrics such as retention, GPA trajectories, and time-to-degree completion. It seeks to determine if deep unsupervised feature extraction can uncover latent patterns in enrollment and performance histories that predict individual outcomes more reliably.

  2. Student success prediction is a central concern in learning analytics and higher education policy. Large registrar datasets are increasingly available. Applying CAEs—typically used for images or time series—to tabular, sparse academic records is novel in this domain. It signals a shift from hand-crafted feature engineering toward automated representation learning. Prior work often uses logistic regression, decision trees, or recurrent nets. Few studies have explored spatial/structural deep representations (CAE) on registrar data. The paper could better justify why CAEs are a good choice for this kind of data.

  3. The current paper uses a CAE to reduce the dimensionality of registrar data. This adds to the field by showing whether such unsupervised feature learning can uncover nonlinear relationships in the data that simpler models miss. Second, the paper applies this method to large-scale institutional data. Prior studies often considered limited subsets of students or specific courses, whereas this paper’s CAE approach is expected to handle much larger and more complex inputs. It demonstrates forecasting accuracy. In summary, the added value is combining CAE-driven feature extraction with extensive registrar data, potentially yielding more powerful predictors than prior ML methods.

  4. The study should compare the CAE's performance against simpler models (e.g., logistic regression, random forest) to establish the added value of this architecture. Detail imputation or masking strategies and assess sensitivity to missing data handling. Registrar data is inherently time-series. Incorporating temporal dynamics (e.g., using LSTM layers) might improve predictive accuracy.

  5. The reported drop from 85% (raw input) to 83% (CAE embedding) accuracy, along with 0.90→0.87 AUC, supports claims that CAE embeddings perform comparably while reducing dimensionality. The authors conclude that CAEs are effective for capturing meaningful latent representations from registrar data and that these can be used to forecast student success. While the results seem promising, the conclusions are not fully substantiated due to lack of baseline comparisons. Without knowing how CAEs perform relative to other models, we cannot determine if their performance is truly impressive.

Author Response

We sincerely thank the Reviewer for their thoughtful and constructive feedback. The comments and suggestions provided were highly valuable and led to meaningful improvements in the clarity, rigor, and overall quality of the manuscript. All revisions made in response to the Reviewer’s comments are clearly marked in red in the revised version of the paper. We appreciate the time and care taken in the review process.

Comment 1: “In Section 3.4, please explain what the F1 score is. What is a confusion matrix?”

Response 1: To enhance accessibility for readers who may be less familiar with machine learning evaluation metrics, we have revised the relevant section in the Results to include brief explanations of the F1 score and the confusion matrix (please see page 17, top for F1 score and middle for the confusion matrix). These definitions clarify the purpose and interpretation of these metrics in the context of model performance evaluation.
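For readers who want a concrete reference point, here is a minimal sketch of both metrics computed with scikit-learn on illustrative labels (1 = graduate, 0 = non-graduate); this is not the authors' evaluation code.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Illustrative true and predicted labels.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix: rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# F1 score: the harmonic mean of precision and recall, 2 * P * R / (P + R).
print(f1_score(y_true, y_pred))
```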

Comment 2: “In Section 4, ROC-AUC is misspelled.”

Response 2: Thank you for your careful reading of the manuscript. We have reviewed the manuscript to ensure that ROC-AUC is spelled correctly throughout.

Comment 3: “The format errors in references 7,12 need to be corrected.”

Response 3: Format errors in these references have been corrected in the revised version of our manuscript.

Comment 4: “Student success prediction is a central concern in learning analytics and higher education policy. Large registrar datasets are increasingly available. Applying CAEs—typically used for images or time series—to tabular, sparse academic records is novel in this domain. It signals a shift from hand-crafted feature engineering toward automated representation learning. Prior work often uses logistic regression, decision trees, or recurrent nets. Few studies have explored spatial/structural deep representations (CAE) on registrar data. The paper could better justify why CAEs are a good choice for this kind of data.”

Response 4: We agree that the application of CAEs to tabular registrar data is relatively novel and warrants clearer justification. In response, we have expanded the manuscript to include a rationale for selecting CAEs, emphasizing their capacity for automated representation learning, local pattern detection, and scalability to high-dimensional inputs. We also contrast their strengths with traditional models such as logistic regression and decision trees. This addition helps clarify why CAEs are well-suited for modeling large-scale, heterogeneous educational datasets. The new content can be found in the revised manuscript on page 20, middle.
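To make the architectural idea concrete, below is a minimal PyTorch sketch of a 1D convolutional autoencoder over a flat student feature vector. The 276-feature input and 141-dimensional embedding follow the manuscript, but the class name, layer sizes, and kernel choices are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class StudentCAE(nn.Module):
    """Minimal 1D convolutional autoencoder sketch (illustrative, not the study's exact model)."""

    def __init__(self, n_features: int = 276, embed_dim: int = 141):
        super().__init__()
        half = n_features // 2
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),    # (B, 1, 276) -> (B, 8, 276)
            nn.ReLU(),
            nn.MaxPool1d(2),                               # -> (B, 8, 138)
            nn.Conv1d(8, 16, kernel_size=3, padding=1),    # -> (B, 16, 138)
            nn.ReLU(),
            nn.Flatten(),                                  # -> (B, 16 * 138)
            nn.Linear(16 * half, embed_dim),               # -> (B, 141)
        )
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 16 * half),
            nn.Unflatten(1, (16, half)),                   # -> (B, 16, 138)
            nn.ConvTranspose1d(16, 8, kernel_size=2, stride=2),  # -> (B, 8, 276)
            nn.ReLU(),
            nn.Conv1d(8, 1, kernel_size=3, padding=1),     # -> (B, 1, 276)
        )

    def forward(self, x):
        z = self.encoder(x)           # compact embedding
        return self.decoder(z), z     # reconstruction and embedding

model = StudentCAE()
x = torch.randn(4, 1, 276)            # batch of 4 synthetic student vectors
recon, z = model(x)
print(recon.shape, z.shape)           # torch.Size([4, 1, 276]) torch.Size([4, 141])
```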

Comment 5: “The study should compare the CAE's performance against simpler models (e.g., logistic regression, random forest) to establish the added value of this architecture. Detail imputation or masking strategies and assess sensitivity to missing data handling. Registrar data is inherently time-series. Incorporating temporal dynamics (e.g., using LSTM layers) might improve predictive accuracy. The reported drop from 85% (raw input) to 83% (CAE embedding) accuracy, along with 0.90→0.87 AUC, supports claims that CAE embeddings perform comparably while reducing dimensionality. The authors conclude that CAEs are effective for capturing meaningful latent representations from registrar data and that these can be used to forecast student success. While the results seem promising, the conclusions are not fully substantiated due to lack of baseline comparisons. Without knowing how CAEs perform relative to other models, we cannot determine if their performance is truly impressive.”

Response 5: In response to the suggestion to include baseline comparisons, we have incorporated logistic regression (LR) and linear discriminant analysis (LDA) as additional benchmark models to assess the performance of our approach. These models were selected as representative traditional classifiers for tabular data and serve as valuable baselines for evaluating the added value of the CAE architecture. The results of these comparisons are now presented in Table 2 and discussed in detail in Section 3.4 “Benchmarking random forest against traditional baseline models” (pages 16, bottom and 17). This expanded analysis confirms that while all models achieve competitive results, the random forest (RF) classifier, used throughout our downstream experiments, offers a strong balance between precision, recall, and generalizability. This context helps clarify the relative utility of CAE embeddings and supports our conclusions regarding their effectiveness for compact, scalable student outcome prediction.
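A minimal sketch of this kind of baseline comparison with scikit-learn is shown below; synthetic data stands in for the registrar records, which cannot be shared, and the printed numbers are illustrative rather than the results reported in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the tabular student data.
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.35, 0.65], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, clf in models.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: balanced acc = {balanced_accuracy_score(y_te, pred):.3f}, "
          f"F1 = {f1_score(y_te, pred):.3f}")
```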

Reviewer 3 Report

Comments and Suggestions for Authors

Overall, this is a well-structured and interesting study applying modern machine learning techniques to a critical problem in higher education. Its strengths include a methodologically rigorous approach to model evaluation using a temporal gap strategy, which simulates a realistic deployment scenario; an honest and transparent comparison of model performance on raw data versus compressed embeddings, correctly concluding that a trade-off exists between performance and efficiency; and a detailed and clear description of the extensive data preprocessing and feature engineering steps, which is often a critical but under-reported part of such studies.

However, there are some issues. The choice for hyperparameters like the CAE’s latent dimension size seems to be based on a gut feeling, not data. And the talk about the “interpretability” of the embeddings is an overstatement; it’s still a black box, and the main benefit is just efficiency. Also missing is a breakdown of how the dataset was filtered.
The CAE architecture diagram in Figure 2 is useful, but the layer descriptions like 1x180x8 Conv1d are confusing for people not as familiar with CNNs.
The authors should probably define the terms more explicitly.
There is no real evidence for choosing 141 latent dimensions. This is a critical missing piece; the authors need to show a plot of reconstruction loss vs. dimension or something similar to justify this.

The authors should include a table showing how many records were removed at each filtering step in Section 2.1.

Following are some specific comments:

Abstract: “While models trained on embeddings showed slightly reduced performance compared to raw input data, with accuracies of 83% and 85% respectively, their efficiency and interpretability highlight their potential for large-scale analyses.” -> The claim of “interpretability” for CAE embeddings is not strongly supported in the text. While t-SNE provides a visualization, the 141 latent features themselves are not inherently interpretable. Consider rephrasing to focus on the demonstrated benefits of efficiency and dimensionality reduction.

Line 136: “This study utilized a dataset containing 94,931 student records with 276 features... In total, the final dataset included 55,215 student records...” -> For transparency, it would be beneficial to include a small table or a sentence detailing the number of students excluded at each step (e.g., post-Spring 2020, transfer students, athletes, veterans). This clarifies the composition of the final cohort.

Line 163: “Specifically, ZIP codes were translated into geographic coordinates.” -> This is a good approach. However, it assumes that geographic proximity correlates with student characteristics. Did the authors consider potential confounders, such as the fact that a single ZIP code can contain significant socioeconomic diversity? A brief acknowledgment of this limitation would strengthen the paper.

Line 199: “...if a ‘best math’ score was missing for a student ranked in the top 10, it was replaced with the median ‘best math’ score from other students in the top 10 category.” -> This is a clever contextual imputation method. This should be highlighted as a methodological strength, as it is more sophisticated than simple global median imputation.

Line 291: “...this architecture was adapted to handle 1D data, aligning with the structure of student records.” -> This is a key aspect of the study’s novelty. To strengthen this point, the authors could briefly mention that the adaptation of convolutional networks for 1D sequential or tabular data is an emerging trend in various fields beyond educational data mining. This would frame their work within a broader methodological context. For instance, similar 1D-CNN approaches are being used for feature extraction from sensor data in advanced manufacturing (e.g., for acoustic emission analysis in additive manufacturing as explored in studies e.g. data-driven approach to identify acoustic emission source motion and positioning effects in laser powder bed fusion with frequency analysis), medical (e.g., classifying cardiac arrhythmia from ECG signal using 1D CNN deep learning model, or a review paper like a review of non-fully supervised deep learning for medical image segmentation), traffic, robot control, etc.

Line 295: “This dimensionality was selected after systematically testing various configurations...” -> This justification is vague. The manuscript would be significantly strengthened by including a graph or table showing the trade-off between embedding dimensionality and reconstruction error (MSE). This would provide a clear, data-driven rationale for choosing 141 dimensions.

Line 428: “While some overlap occurred, likely due to shared characteristics or similar patterns between the groups, the embeddings retained sufficient information to effectively distinguish between the two classes.” -> This is an honest assessment. The discussion could be expanded slightly to speculate on what these “shared characteristics” might be. For example, do they represent students at the margin whose outcomes are genuinely difficult to predict?

Line 512: “The results of this study provide valuable insights into the effectiveness of modeling strategies, preprocessing approaches, and evaluation techniques for predicting student graduation outcomes.” -> This is a generic opening sentence for a discussion section. It could be more impactful by leading with a key finding, e.g., “This study demonstrates that while complex models like CAEs offer computational benefits, rigorous temporal validation reveals challenges in generalizing predictions over time.”

Line 525: “Incorporating additional feature engineering or hybrid approaches could mitigate this limitation, enhancing the representational power of embeddings.” -> This is a good point for future work. Could the authors be more specific? For example, suggesting the use of attention mechanisms in the CAE to highlight more salient features.

Line 535: “Addressing these changes through incremental learning or adaptive modeling could enhance model robustness over time.” -> This is a crucial insight. The authors might consider citing works that specifically deal with concept drift in machine learning to ground this suggestion in established literature.

 

Author Response

We sincerely thank the Reviewer for their thoughtful and constructive feedback. The comments and suggestions provided were highly valuable and led to meaningful improvements in the clarity, rigor, and overall quality of the manuscript. All revisions made in response to the Reviewer’s comments are clearly marked in red in the revised version of the paper. We appreciate the time and care taken in the review process.

Comments 1: “The choice for hyperparameters like the CAE’s latent dimension size seems to be based on a gut feeling, not data.” And “There’s no real evidence for choosing 141 latent dimensions. This is a critical missing piece; the authors need to show a plot of reconstruction loss vs. dimension or something similar to justify this.” And Line 295: “This dimensionality was selected after systematically testing various configurations...” -> This justification is vague. The manuscript would be significantly strengthened by including a graph or table showing the trade-off between embedding dimensionality and reconstruction error (MSE). This would provide a clear, data-driven rationale for choosing 141 dimensions.

Response 1: We fully agree that the selection of hyperparameters, particularly the embedding size, should be supported by empirical evidence rather than subjective reasoning. In response to this helpful suggestion, we have revised the manuscript to include a systematic evaluation of the reconstruction performance across a range of embedding sizes, from 180 down to 64 dimensions. As suggested by this Reviewer, we now report validation mean squared error (MSE) for each configuration in Table 1, and we discuss the trade-off between dimensionality and reconstruction fidelity in Section 3.1. “Optimization and reconstruction performance of the convolutional autoencoder” (pages 14, bottom and 15, top). This analysis demonstrates that the reconstruction errors across the top range of embedding sizes (180–141) were very similar, indicating that the performance of the CAE was not highly sensitive to the precise latent dimensionality in that range and confirming the robustness of the architecture. Furthermore, to strengthen our justification and provide a task-specific evaluation of embedding effectiveness, we also assessed the downstream classification performance of a random forest model trained on embeddings of different sizes. These results are presented in the revised Section 3.5. “Comparison of model performance using input data vs. embeddings” (page 18) and summarized in Table 3. The embedding size of 141 was selected as optimal due to its strong balance between reconstruction quality and classification performance, along with its suitability for visualization (e.g., t-SNE and ROC curves in Figures 5 and 6). We appreciate this Reviewer’s recommendation, which allowed us to significantly strengthen the empirical foundation of our model design decisions.
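A minimal sketch of such a sweep protocol is shown below. PCA stands in for the CAE purely to keep the example self-contained and runnable, and the synthetic data and printed numbers are illustrative, not the values reported in Tables 1 and 3.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student feature matrix.
X, y = make_classification(n_samples=3000, n_features=180, n_informative=40, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for dim in [180, 160, 141, 120, 100, 80, 64]:
    # PCA is used here only to illustrate the sweep; the study trains a CAE instead.
    compressor = PCA(n_components=dim).fit(X_tr)
    Z_tr, Z_val = compressor.transform(X_tr), compressor.transform(X_val)

    # Reconstruction fidelity on the validation split.
    mse = mean_squared_error(X_val, compressor.inverse_transform(Z_val))

    # Downstream classification performance on the compressed representation.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Z_tr, y_tr)
    acc = rf.score(Z_val, y_val)
    print(f"dim={dim:3d}  val MSE={mse:.4f}  RF accuracy={acc:.3f}")
```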

Comments 2: “And the talk about the ‘interpretability’ of the embeddings is an overstatement; it’s still a black box, and the main benefit is just efficiency.” And “Abstract: ‘While models trained on embeddings showed slightly reduced performance compared to raw input data, with accuracies of 83% and 85% respectively, their efficiency and interpretability highlight their potential for large-scale analyses.’ -> The claim of ‘interpretability’ for CAE embeddings is not strongly supported in the text. While t-SNE provides a visualization, the 141 latent features themselves are not inherently interpretable. Consider rephrasing to focus on the demonstrated benefits of efficiency and dimensionality reduction.”

Response 2: We fully agree that the CAE-generated embeddings, while useful for downstream tasks and visualization, do not offer direct interpretability in the traditional sense. The individual latent dimensions are not readily mapped to semantically meaningful input features, and we recognize that this distinction is important for readers seeking transparency and explainability in model outputs. In response, we have revised the abstract and discussion to remove or rephrase references to “interpretability.” Instead, we now emphasize the demonstrated benefits of dimensionality reduction and computational efficiency as the primary advantages of using embeddings. We also clarify that while t-SNE provides visual insight into the structure of the latent space, the embeddings themselves remain abstract representations. These changes can be found in the revised abstract (page 2, middle) and discussion section 4 (page 21, top).
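For context, here is a minimal sketch of the kind of t-SNE projection referred to above, run on synthetic 141-dimensional vectors rather than the actual CAE embeddings; the class separation shown is purely illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Synthetic 141-dimensional "embeddings" for two classes (illustrative only).
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0.0, 1.0, (300, 141)), rng.normal(0.8, 1.0, (300, 141))])
labels = np.array([0] * 300 + [1] * 300)

# Project to 2D for visual inspection of the latent structure.
Z_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(Z)
plt.scatter(Z_2d[:, 0], Z_2d[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("t-SNE of embeddings (synthetic illustration)")
plt.show()
```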

Comments 3: “Also missing a breakdown of how the dataset was filtered.” And Line 136: “This study utilized a dataset containing 94,931 student records with 276 features... In total, the final dataset included 55,215 student records...” -> For transparency, it would be beneficial to include a small table or a sentence detailing the number of students excluded at each step (e.g., post-Spring 2020, transfer students, athletes, veterans). This clarifies the composition of the final cohort. And “The authors should include a table showing how many records were removed at each filtering step in Section 2.1.”

Response 3: We agree that a detailed breakdown of filtering steps is essential for understanding how the final cohort was derived. In response, we have added Section 2.7. “Cohort selection, data filtering, and dataset partitioning” (page 10, middle) to clearly describe the sequential data cleaning process. We now include the number of student records removed at each step due to students still within the graduation window, non-degree-seeking trajectories, athlete/veteran status, unresolved inconsistencies, and missing financial data, starting from the original dataset of 94,931 observations and concluding with the final cleaned dataset of 55,215 records (94,931 − 33,962 − 3,138 − 747 − 803 − 428 − 638 = 55,215). These revisions make the composition of the final dataset more transparent and reproducible, as requested. Thank you for prompting this important improvement.
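A minimal sketch of how such a sequential filter with per-step counts can be implemented in pandas is shown below; the exclusion criteria and column names in the commented usage are hypothetical placeholders, not the actual registrar fields.

```python
import pandas as pd

def apply_filters(df: pd.DataFrame, filters: dict) -> pd.DataFrame:
    """Apply exclusion criteria sequentially and report records removed at each step."""
    print(f"start: {len(df):,} records")
    for reason, keep_mask in filters.items():
        before = len(df)
        df = df[keep_mask(df)]
        print(f"{reason}: removed {before - len(df):,}, remaining {len(df):,}")
    return df

# Hypothetical usage; the column names below are illustrative only.
# filters = {
#     "still within graduation window": lambda d: ~d["within_grad_window"],
#     "non-degree-seeking":             lambda d: d["degree_seeking"],
#     "athlete or veteran":             lambda d: ~(d["athlete"] | d["veteran"]),
#     "unresolved inconsistencies":     lambda d: d["record_consistent"],
#     "missing financial data":         lambda d: d["financial_complete"],
# }
# final = apply_filters(raw_records, filters)
```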

Comment 4: “The CAE architecture diagram in Figure 2 is useful, but the layer descriptions like 1x180x8 Conv1d are confusing for people not as familiar with CNNs. The authors should probably define the terms more explicitly.”

Response 4: We agree that more explicit definitions of layer shapes and terminology would improve clarity for readers who may be less familiar with convolutional neural networks. In response, we have revised the caption of Figure 2 to explain the format of the layer dimensions and to clarify what each element (e.g., “1×180×8” and “Conv1d”) represents in the context of one-dimensional student data. We hope this revision enhances the accessibility and interpretability of the architectural diagram.

Comment 5: Line 163: “Specifically, ZIP codes were translated into geographic coordinates.” -> This is a good approach. However, it assumes that geographic proximity correlates with student characteristics. Did the authors consider potential confounders, such as the fact that a single ZIP code can contain significant socioeconomic diversity? A brief acknowledgment of this limitation would strengthen the paper.

Response 5: We agree that while ZIP codes often correlate with socioeconomic characteristics, they are inherently limited as they are based on postal delivery routes and can encompass significant heterogeneity in income, education, and housing conditions. In response to this important observation, we have revised the relevant section to acknowledge this limitation and clarify that geographic coordinates were used as a proxy to support pattern detection, while recognizing the potential for within-ZIP code variability. The revised text appears in Section 2.2 “Numerical and geographic data representation” (pages 7, bottom and 8, top).
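For illustration, a minimal sketch of such a ZIP-to-coordinate mapping is shown below; the lookup table is hypothetical, with approximate centroids only, and in practice a published ZIP-code centroid dataset (e.g., from the U.S. Census Bureau) would supply the coordinates.

```python
# Hypothetical lookup table with approximate ZIP-code centroids (illustrative only).
zip_centroids = {
    "70803": (30.414, -91.177),  # Baton Rouge, LA (approximate)
    "70112": (29.957, -90.077),  # New Orleans, LA (approximate)
}

def zip_to_coords(zip_code: str):
    """Return (latitude, longitude) for a ZIP code, or (None, None) if unknown."""
    return zip_centroids.get(zip_code, (None, None))

print(zip_to_coords("70803"))
```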

Comment 6: Line 199: “...if a ‘best math’ score was missing for a student ranked in the top 10, it was replaced with the median ‘best math’ score from other students in the top 10 category.” -> This is a clever contextual imputation method. This should be highlighted as a methodological strength, as it is more sophisticated than simple global median imputation.

Response 6: We thank the Reviewer for recognizing the contextual imputation strategy used in our preprocessing pipeline. In response to this helpful suggestion, we have revised the discussion to explicitly highlight this as a methodological strength. Specifically, we clarify that missing values were handled not through simple global medians, but through context-aware strategies that preserved relationships within subgroups, such as imputing high school performance variables based on student rank categories, and financial variables based on ZIP code medians. This approach helped minimize bias and retain structural patterns in the data. The addition appears in the revised Discussion section (page 20, top).
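A minimal sketch of this kind of context-aware imputation with pandas is shown below; the column names and values are illustrative stand-ins for the rank categories and test scores described above, not the study's actual data.

```python
import pandas as pd

# Illustrative data: "rank_group" mimics the high-school rank category,
# "best_math" is a test score with missing values.
df = pd.DataFrame({
    "rank_group": ["top10", "top10", "top10", "top25", "top25"],
    "best_math":  [31.0, None, 29.0, 24.0, None],
})

# Context-aware imputation: fill missing scores with the median of the same
# rank category rather than a single global median.
df["best_math"] = df["best_math"].fillna(
    df.groupby("rank_group")["best_math"].transform("median")
)
print(df)
```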

Comment 7: Line 291: “...this architecture was adapted to handle 1D data, aligning with the structure of student records.” -> This is a key aspect of the study’s novelty. To strengthen this point, the authors could briefly mention that the adaptation of convolutional networks for 1D sequential or tabular data is an emerging trend in various fields beyond educational data mining. This would frame their work within a broader methodological context. For instance, similar 1D-CNN approaches are being used for feature extraction from sensor data in advanced manufacturing (e.g., for acoustic emission analysis in additive manufacturing as explored in studies e.g. data-driven approach to identify acoustic emission source motion and positioning effects in laser powder bed fusion with frequency analysis), medical (e.g., classifying cardiac arrhythmia from ECG signal using 1D CNN deep learning model, or a review paper like a review of non-fully supervised deep learning for medical image segmentation), traffic, robot control, etc.

Response 7: We agree that highlighting the broader applicability of 1D-CNNs helps situate our approach within a growing methodological trend across domains. In response, we have revised the Introduction section (page 6, top) to explicitly acknowledge the use of 1D-CNNs in diverse fields such as medical diagnostics, advanced manufacturing, intelligent transportation, and neural signal processing. This framing strengthens the novelty of applying such architectures to educational data mining and emphasizes the potential for cross-domain innovation. 

Comment 8: Line 428: “While some overlap occurred, likely due to shared characteristics or similar patterns between the groups, the embeddings retained sufficient information to effectively distinguish between the two classes.” -> This is an honest assessment. The discussion could be expanded slightly to speculate on what these “shared characteristics” might be. For example, do they represent students at the margin whose outcomes are genuinely difficult to predict?

Response 8: We agree that expanding on the possible nature of overlapping cases strengthens the discussion. In response, we have revised the relevant section to point out that these overlapping instances likely correspond to students at the decision boundary, those whose characteristics place them near the margin between graduates and non-graduates in the high-dimensional feature space. These cases may exhibit mixed or ambiguous signals (e.g., fluctuating GPA, moderate engagement, or borderline financial indicators) that make their outcomes inherently more difficult to classify with high certainty. The revised text appears in Section 3.3. “Visualizing latent representations with t-SNE” (page 16, middle).

Comment 9: Line 512: “The results of this study provide valuable insights into the effectiveness of modeling strategies, preprocessing approaches, and evaluation techniques for predicting student graduation outcomes.” -> This is a generic opening sentence for a discussion section. It could be more impactful by leading with a key finding, e.g., “This study demonstrates that while complex models like CAEs offer computational benefits, rigorous temporal validation reveals challenges in generalizing predictions over time.”

Response 9: Following this recommendation, we revised the introductory sentence to highlight a central finding of the study: the trade-off between model complexity and temporal generalizability. The revised text leads with the observation that, while CAEs offer clear benefits in terms of dimensionality reduction and computational efficiency, their performance, like that of simpler models, is sensitive to temporal shifts in student populations. This change helps to better anchor the discussion in the study’s core contributions. Please see the revised opening of the Discussion section (page 21, middle).

Comment 10: Line 525: “Incorporating additional feature engineering or hybrid approaches could mitigate this limitation, enhancing the representational power of embeddings.” -> This is a good point for future work. Could the authors be more specific? For example, suggesting the use of attention mechanisms in the CAE to highlight more salient features.

Response 10: We agree that being more specific about potential directions for improving the representational power of the CAE embeddings strengthens the conclusion. In response, we have revised the sentence to include attention mechanisms as a promising enhancement for future work. This revision appears on pages 20, bottom and 21, top of the updated manuscript.

Comment 11: Line 535: “Addressing these changes through incremental learning or adaptive modeling could enhance model robustness over time.” -> This is a crucial insight. The authors might consider citing works that specifically deal with concept drift in machine learning to ground this suggestion in established literature.

Response 11: To ground our suggestion in relevant literature, we have revised the sentence to reference established work on concept drift in machine learning. This provides context for the need to adapt models in response to temporal changes in student data. The updated sentence appears on page 21, middle of the revised manuscript, and we have included an appropriate citation to support this point.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

This study used over ten years of Louisiana State University student data to extract latent features with a convolutional autoencoder (CAE) and predict undergraduate graduation outcomes. While CAE compression improves computational efficiency and scalability, the accuracy of a random forest (RF) model trained on the embedded features was 83%, slightly lower than the 85% achieved with the original data. Compared to traditional models such as logistic regression (LR) and linear discriminant analysis (LDA), RF demonstrated a more balanced performance. This study demonstrates the potential of CAEs in dynamic educational data environments. The revised paper has clearly discussed the unclear points in the original article, allowing readers to clearly understand the method and its contribution. It is recommended that the paper be accepted.

Reviewer 3 Report

Comments and Suggestions for Authors

All my comments were well addressed. 
