1. Introduction
The rapid expansion of virtual learning environments, dramatically accelerated by the COVID-19 pandemic, has led to an unprecedented surge in online course enrollments, with some institutions reporting increases exceeding 300% [
1]. While this shift offers increased accessibility and flexibility, it has also amplified inherent challenges in online education, particularly regarding student engagement, assessment, and the delivery of personalized learning experiences. In online settings, students frequently experience reduced social presence and a sense of isolation, which negatively impacts their motivation, commitment, and overall learning outcomes [
2]. Moreover, traditional assessment methods, often reliant on standardized metrics, fail to capture the nuanced dynamics of digital engagement, and instructors struggle to provide timely, individualized feedback at scale. These issues underscore the urgent need for innovative, automated solutions that can support personalized feedback.
In response to these challenges, Artificial Intelligence (AI) technologies have emerged as promising tools for enhancing online education. In particular, deep learning (DL) techniques and large language models (LLMs) offer significant potential. Deep learning, which can model complex patterns in large educational datasets, has shown potential for greater accuracy in predicting student performance than traditional statistical methods [
3]. Models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have proven effective in forecasting academic outcomes and identifying students at risk of dropout [
4]. Simultaneously, LLMs like OpenAI’s GPT-4 are revolutionizing automated, personalized feedback by generating natural language responses, understanding intricate prompts, and adapting their communication style to individual student needs [
5]. However, it is crucial to acknowledge that the application of DL and LLMs in education is not without challenges. Potential limitations include biases in training data, the “black box” nature of many deep learning models, and the risk of LLMs generating generic or contextually inappropriate feedback. Addressing these issues requires careful consideration and mitigation strategies, such as focusing on diverse and representative training data, incorporating interpretability techniques into model design, and implementing human oversight in the feedback generation process.
The significance of effective feedback for learning is well-established. The seminal work by Hattie and Timperley [
6] demonstrated that the type and modality of feedback are critical determinants of student performance, specifically, that corrective and detailed written feedback is more effective than simple praise or grades. Building on these insights, the field of Artificial Intelligence in Education (AIED) has been at the forefront of developing intelligent tutoring systems (ITS) and adaptive learning environments that leverage AI to automate and personalize feedback [
7,
8,
9,
10]. Early ITS approaches were based on rule-based systems and knowledge tracing, whereas modern methods have evolved to incorporate advanced techniques such as Bayesian networks and deep knowledge tracing, providing a more refined basis for adaptive feedback [
8]. This rich history in AIED research lays a robust foundation for integrating contemporary DL and LLM technologies to further enhance feedback mechanisms in online education.
In this study, we introduce the Adaptive Feedback System (AFS), a novel system designed to provide personalized and timely feedback to students in online courses. AFS utilizes a deep learning-based predictive model, specifically, a recurrent neural network (RNN) trained on historical student engagement and performance data from previous online courses in Digital Arts for Film and Video Games. This model analyzes key metrics, such as synchronous session attendance, digital content interaction, forum participation, and portfolio task completion, to identify students who could benefit from targeted interventions. The predictions are then interpreted by OpenAI’s GPT-4, which generates tailored feedback messages addressing individual student needs. For instance, AFS might offer encouraging messages and gamified incentives to students at risk of disengagement, while providing detailed, constructive criticism and targeted resources to those facing specific academic challenges. AFS is notably designed as a dynamic and adaptive system that continuously refines its feedback strategies based on ongoing student interactions.
The central research questions guiding this study are as follows: Can recurrent neural network-based deep learning models, applied to extensive datasets of quantitative learning analytics data, predict student performance in online environments more effectively than traditional methods? Furthermore, how effective are LLMs like OpenAI’s GPT-4 in interpreting these predictions and generating personalized feedback that enhances student engagement and learning outcomes? Finally, what is the overall impact of the integrated AFS on student performance and engagement in a real-world online course setting?
To address these questions and evaluate the effectiveness of AFS, this paper is structured as follows. First, we review relevant literature on AI in education—focusing on DL-based prediction models and LLMs for feedback—and critically discuss both their potential and limitations within educational contexts. Next, we detail the methodology of our study, including the experimental design, participant demographics, and data collection and analysis procedures. The Results section presents a comparative analysis of student performance and engagement metrics between the experimental group (using AFS) and a control group (receiving traditional feedback). The Discussion section interprets these findings in the context of established AIED research and broader educational literature, and the Conclusion section reflects on the challenges, limitations, and future directions for research in AI-enhanced feedback systems.
1.1. Personalized Feedback with ChatGPT
Building upon decades of research in the field of Artificial Intelligence in Education (AIED) on personalized and Adaptive Feedback Systems [
7,
9], recent studies have begun to explore the potential of large language models (LLMs) like ChatGPT to further enhance feedback in online learning environments. Wang et al. [
11] found that ChatGPT significantly enhances the quality of student argumentation by offering specific and timely feedback, key characteristics of effective feedback according to Hattie and Timperley [
6]. In contrast, Messer et al. [
12], in their systematic review of automated grading and feedback tools for programming education, offer a broader perspective, indicating that automated feedback tools, including those powered by LLMs, are generally effective in improving learning outcomes, reinforcing the more specific findings of Wang et al. [
11]. Meyer et al. [
13] explored the collaborative generation of feedback between students and AI, highlighting the potential of this collaboration to enrich educational feedback. In a related study, Li Wang et al. [
14] demonstrated that ChatGPT exhibits impressive accuracy (91.8%) and a retrieval rate of 63.2% when providing feedback on undergraduate student argumentation. However, they also found that the length of the arguments and the use of discourse markers significantly influenced the effectiveness of the feedback, suggesting that the quality of ChatGPT’s feedback may depend on the characteristics of the student’s input. Meyer et al. [
13], for their part, found that feedback generated by LLMs led to enhanced performance in text revisions, as well as increased motivation and positive emotions among high school students, indicating the potential multifaceted benefits of ChatGPT-based feedback beyond academic performance.
1.2. Deep Learning Models for Performance Prediction
In the field of learning analytics (LA) and educational data mining (EDM), deep learning has emerged as a powerful set of techniques for student performance prediction. Researchers have explored various deep learning models, including recurrent neural networks (RNNs) and, particularly, long short-term memory (LSTM) networks, due to their ability to model sequential learning data [
4]. These models have proven effective in predicting academic performance and in identifying students at risk of dropping out of courses [
3]. For example, Messer et al. [
12] used neural network algorithms and LSTM models to analyze large volumes of data to predict performance in programming education, although their primary focus was the review of automated feedback tools. Alnasyan et al. [
15], while not directly focused on prediction, also found that the integration of generative AI, which often employs underlying deep learning models, can improve the academic performance and motivation of business students, suggesting that accurate performance prediction, enabled by deep learning, could be a valuable component for more adaptive and personalized educational systems. In addition to RNNs and LSTMs, other deep learning architectures, such as convolutional neural networks (CNNs) and multilayer perceptron (MLP) networks, have also been investigated for prediction tasks in LA/EDM, each with strengths and weaknesses depending on the data type and specific prediction task.
1.3. Evaluation of LLM-Based Educational Applications and Their Implicit Comparison with Traditional Feedback Methods
While direct comparisons between feedback generated by deep learning models and traditional feedback methods remain scarce, research on LLMs in education often adopts an approach of evaluating the effectiveness and potential of LLMs per se in various educational roles, leaving the comparison with traditional methods implicit. For example, Hellas et al. [
16] investigated LLM responses to beginner programmers’ help requests, demonstrating the models’ effectiveness in providing helpful assistance, compared to the assistance traditionally expected from a human tutor or conventional educational resources. Shaikh et al. [
17] analyzed the potential of LLMs for sentiment analysis in student feedback, finding that these models can effectively identify students’ emotions and attitudes, a task that would traditionally require laborious qualitative human analysis. Golchin et al. [
18] examined the use of LLMs as evaluators in MOOCs, finding that they can provide accurate and consistent assessments, suggesting a scalable and automated alternative to human assessment in massive courses.
Extance [
19] offers a broader analysis of the growing use of ChatGPT in the educational field, highlighting both its potential and risks. A case study at Arizona State University showed that LLM-based chatbots can foster creativity and help students generate more ideas than they would normally consider, suggesting an improvement over traditional methods of individual or small group brainstorming. Furthermore, tools like Khanmigo, developed by Khan Academy in collaboration with OpenAI, use GPT-4 to offer personalized tutoring, guiding students through problems without giving direct answers, promoting “productive struggle” in learning, in a style that emulates individualized human tutoring, but with the scalability potential of AI.
Nevertheless, Extance [
19] also highlights the risks associated with the use of LLMs, such as the possibility of students over-relying on these tools to complete tasks without fully understanding the underlying concepts, which could be a greater problem than dependence on “solutions” from traditional sources like web search engines. Privacy concerns are also mentioned, as data entered into these platforms are stored and used to train future models, a risk not present with traditional feedback methods. Additionally, LLMs are prone to errors and “hallucinations”, where they can generate incorrect or fictitious information, requiring human validation and oversight that would not be necessary with feedback generated by human experts. Despite these challenges, Extance [
19] argues that the careful integration and proper supervision of LLMs in education can offer significant benefits, such as the ability to provide immediate and personalized feedback and improve access to high-quality tutoring, especially in resource-limited contexts, where personalized human tutoring is often infeasible.
1.4. Challenges and Opportunities
Despite these promising advances, significant challenges persist that require careful attention, including data privacy management and ethical dilemmas related to publishing AI-generated works. Furthermore, revisiting a historical concern in the field of AIED, the interpretability of complex deep learning models remains a crucial challenge to ensure transparency and trust in these educational systems [
20]. Specifically, concerns about data privacy are highlighted by Lund et al. [
21] and UNESCO [
20], who emphasize the need to safeguard student information in AI environments. UNESCO [
20] and Farrokhnia et al. [
22] also point to biases in AI-generated feedback as a significant risk, underscoring the need for human intervention to mitigate these biases and ensure academic integrity and educational equity, fundamental principles in AIED. Jeon and Lee [
23] emphasize the need for rigorous methods to translate these findings into effective educational practices and understand their potential long-term impacts. Finally, UNESCO [
20] also underscores the importance of validation of AI tools by educational institutions to ensure their effectiveness and fairness, aligning with AIED’s tradition of rigorous evaluation of technology-based educational interventions.
Table 1 summarizes the key challenges and opportunities in the use of deep learning models and LLMs in education.
While advances in the use of LLMs and deep learning in education are promising, further research is needed to address current limitations and maximize their potential, especially in the context of the AIED tradition and its ethical and pedagogical principles. This study seeks to fill some of these gaps by addressing a still under-explored integration: using LLMs to interpret performance predictions made with deep learning models and translate them into personalized and timely feedback.
2. Methodology
The primary aim of this experimental study is to assess the effectiveness of integrating advanced technologies, such as deep learning and large language models (LLMs), in enhancing student performance in a Digital Art for Film and Video Games course. The specific objectives of this study are as follows:
Evaluate the accuracy of student performance predictions using deep learning models trained on historical student data.
Determine the impact of personalized feedback generated by LLMs, such as GPT-4, on enhancing academic performance and student engagement.
Compare participation and performance metrics between a control group and an experimental group that receives performance predictions and personalized recommendations.
Explore the feasibility of integrating technological tools into daily educational routines to facilitate effective and timely communication that supports educational decision-making.
Identify challenges and opportunities associated with implementing these technologies in an educational setting, including ethical and privacy considerations.
2.1. Participants
The participants in this study were students enrolled in a Digital Art for Film and Video Games Specialized Course, offered as part of Academy by PolygonUs, a highly competitive, non-traditional educational program specifically designed to address the immediate needs of the digital art industry. This program operates within an informal education context, focusing on developing high-impact digital skills relevant to the film and video game industries. The course employs a flipped classroom methodology, delivered 100% virtually through PolygonUs’s custom-built Learning Content Management System (LCMS). Instruction includes three weekly synchronous online sessions, complemented by asynchronous pre-recorded lectures and practical activities. The 16-month program is structured into four levels, encompassing 22 industry-specific courses, including technical art skills, industry-focused English, and power skills. Notably, 80% of the curriculum is practical, emphasizing hands-on project development and portfolio building, while 20% is theoretical, delivered primarily through pre-recorded sessions on the LCMS. Throughout the program, students present their work to industry professionals for feedback and portfolio refinement, preparing them for direct entry into the digital art job market.
A significant characteristic of the student population in this program is their high level of motivation. All participants are scholarship recipients, having been selected through a rigorous admission process from a pool of approximately 3600 applicants competing for limited scholarship positions. This highly selective admission process ensures that admitted students demonstrate a strong intrinsic motivation and commitment to pursuing a career in digital art. Furthermore, as part of the admission process, all applicants undergo a structured interview designed to assess their motivation and commitment to the program, in addition to the aforementioned diagnostic technical test. This multi-faceted selection process aims to identify and enroll students with a high aptitude and intrinsic drive for digital art skill development.
Due to institutional scheduling and enrollment dynamics inherent in this industry-driven and scholarship-based program, participants were not randomly assigned to groups; instead, existing cohorts were utilized. The control group consisted of 39 students enrolled in the 2022 iteration of the course, while the experimental group comprised 42 students enrolled in the 2023 iteration. As illustrated in
Figure 1A, initial demographic analysis revealed some gender imbalances between the groups. The control group had a higher proportion of female students compared to the experimental group. Furthermore, as shown in
Figure 1B, analysis of initial diagnostic test scores indicated that the control group exhibited slightly higher baseline performance levels compared to the experimental group at the beginning of the course. The age range of participants was broadly similar across groups, with the control group ranging from 17 to 43 years and the experimental group from 17 to 36 years.
Prior to the course commencement, all participants undertook a diagnostic technical test designed to assess their foundational skills in character concept design, a core skill in the Digital Art for Film and Video Games domain. This test required students to submit an initial character concept, which was evaluated by program instructors to determine their entry-level proficiency. The majority of students in both groups were categorized as having a “low and basic level of performance” based on this diagnostic assessment, indicating limited prior experience in character design upon entering the program. A small number of students in each group demonstrated “advanced level” skills at the outset of the training (see
Figure 1B for detailed distribution of diagnostic test scores). It is important to note that these pre-existing group differences are acknowledged as a limitation of the quasi-experimental design and will be statistically addressed through covariance analysis in subsequent data analysis stages to mitigate potential confounding effects.
To statistically assess the significance of these baseline differences in diagnostic test scores between the experimental and control groups, a Chi-square test of independence was performed. This test examined the distribution of students across different proficiency levels (low, basic, and advanced) as determined by the diagnostic character concept test. The Chi-square test revealed a statistically significant difference in the distribution of baseline knowledge between the two groups (χ2 = 18.49, df = 2, p < 0.001).
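As an illustration, the following minimal sketch shows how such a chi-square test of independence can be computed with SciPy. The contingency counts are illustrative placeholders (the full per-group counts, including the advanced level, are not reported in this section), so the output does not reproduce the reported statistic.

```python
# Minimal sketch of the chi-square test of independence on baseline proficiency.
# The contingency counts below are illustrative placeholders, not the study's data.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: control, experimental; columns: low, basic, advanced proficiency
observed = np.array([
    [17, 19, 3],   # control group (illustrative counts)
    [37,  5, 0],   # experimental group (illustrative counts)
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```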
Specifically, as depicted in
Figure 1B, the experimental group presented a higher frequency of students categorized at the low proficiency level in the diagnostic test compared to the Control Group. Conversely, the control group exhibited a greater proportion of students classified as having basic and advanced baseline knowledge. These findings quantitatively confirm the initial observation that the experimental group, on average, commenced the course with a lower level of prior technical skills in digital art compared to the control group.
These statistically confirmed differences in baseline knowledge, while acknowledging an initial disadvantage for the experimental group, also present a significant opportunity to rigorously evaluate the potential of adaptive and personalized learning interventions. Specifically, these pre-existing disparities underscore the importance and potential impact of the AI-powered feedback system implemented in the experimental group to level the playing field. This initial disadvantage allows us to examine whether the personalized feedback can effectively mitigate these initial gaps in skills and knowledge and potentially enable the experimental group to achieve comparable, or even superior, performance outcomes compared to the control group.
Therefore, rather than being a confounding factor, these baseline differences become a critical context for assessing the true value and effectiveness of AI-driven personalized learning in addressing diverse student needs and promoting equitable educational outcomes. The subsequent sections of this methodology will detail the identical instructional design applied to both groups, alongside a comprehensive description of the AI-driven feedback system uniquely introduced to the experimental group, setting the stage for analyzing the intervention’s impact within this context of initial group differences.
2.2. Design and Procedure
Due to the logistical constraints inherent in random assignment within real-world, scholarship-based educational programs, this research employed a quasi-experimental design. However, to maximize methodological validity, the study focused on maintaining strict procedural equivalence between the experimental group and the control group. Consequently, both groups received identical instruction in all aspects of the course: the same curricular content, the same instructors, and the same pedagogical methodology based on a flipped classroom approach that integrated synchronous online sessions, pre-recorded asynchronous lectures, and weekly feedback from industry experts.
The only systematic and planned difference between the groups was the implementation of the Adaptive Feedback System (AFS), exclusive to the experimental group. This system provided students in the experimental cohort with personalized weekly messages containing performance predictions and specific recommendations for improvement. While the control group participated in the program between August 2022 and August 2023 and the experimental group between April 2023 and May 2024, it is crucial to emphasize that the content and delivery of instruction were rigorously standardized across both cohorts.
This quasi-experimental design, while acknowledging the limitations of non-randomization, strengthens the ecological validity of the study by evaluating the AI-based intervention within the authentic setting of a demanding, industry-oriented digital arts program. By minimizing procedural variations beyond the AFS intervention, this design aims to isolate the specific effect of the AI-powered feedback while preserving the authenticity of the real-world educational context.
2.3. Measurements
This study employed a comprehensive measurement system designed to evaluate multiple dimensions of student performance throughout the four levels of the program. In order to robustly and thoroughly capture the impact of the AI-powered Adaptive Feedback System, the following key performance indicators (KPIs) were recorded and analyzed recurrently at each program level. These measured variables served as key input features for the deep learning model used to generate student performance predictions within the Adaptive Feedback System:
Cognitive Learning (Quiz Scores): To measure knowledge acquisition and conceptual understanding, a total of 84 quizzes were administered throughout the program. These formative assessments, primarily composed of multiple-choice and short-answer questions, were systematically repeated in each of the four levels, with 21 quizzes administered per level. The average quiz scores, calculated for each level (Average Quiz Score), provided a granular measure of cognitive learning gains at each stage of the program.
Behavioral Engagement (Session Engagement): Student engagement with learning materials was quantified by tracking attendance and interaction in both recorded and live sessions, conducted at each level of the program. The percentage of recorded sessions accessed (Recorded Percentage) and the percentage of live sessions attended (Live Percentage) served as indicators of behavioral engagement per level. It was hypothesized that the AFS would increase engagement in both learning modalities at each level.
Social Learning and Collaboration (Forum Activity): Participation in collaborative learning was evaluated by quantifying forum activity across four key evaluation periods within each level of the program, resulting in a total of 16 forum activity evaluations throughout the program. The frequency and quality of contributions (Forums 1 to Forums 4, Forum Percentage), measured in each evaluation period within each level, indicated the level of social learning and collaborative engagement per level. It was expected that the AFS would foster greater participation and quality in forums at each level of the program.
Practical Skill Mastery (Practical Activities Scores): To assess the development of practical skills, numerous industry-deliverable challenges were designed and implemented as practical activities in each course and level of the program. These practical activities, based on real industry requirements, were evaluated by industry experts through personalized feedback and quantitatively scored using predefined rubrics. An average score of practical activities per level (Average of Practical Activities) was recorded. This KPI measured the progress in mastering industry-relevant practical skills at each level of the program.
Practical Skill Mastery (Portfolio Assessment Scores): To evaluate the ability to apply skills in complex projects and demonstrate mastery of industry-level competencies, portfolio submissions delivered by students at the end of each level of the program were evaluated, resulting in a total of eight portfolio evaluations throughout the program. Industry experts assigned scores in two portfolio evaluations per level (Evaluation 1, Evaluation 2), and their percentage average (Evaluations Percentage) was calculated to obtain a measure of practical skill mastery in portfolio projects at the end of each level. It was anticipated that the AFS would improve portfolio quality and, therefore, scores in these evaluations at each level of the program.
The collection and analysis of these KPIs, systematically repeated at each of the four program levels, provided a rich and granular database to exhaustively evaluate student performance and the impact of the AI-powered Adaptive Feedback System across multiple dimensions of learning.
It is important to note that while all variables were collected for pedagogical and evaluation purposes, only a subset was selected as input features for the predictive model, based on their availability in the early stages of each program level. Specifically, the portfolio evaluation scores were excluded from model training due to their late availability in the course timeline, which would have introduced data leakage. However, these scores were retained as valuable outcome indicators to assess skill mastery at the end of each level.
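To make this feature-selection step concrete, the sketch below (in pandas) illustrates how the per-level KPIs could be aggregated into the four model inputs while excluding portfolio scores. Column names are hypothetical and do not correspond to the actual LCMS export schema.

```python
# Minimal sketch of assembling per-level input features for the predictive model.
# Column names are hypothetical; portfolio scores are deliberately excluded to
# avoid data leakage, as described above.
import pandas as pd

def build_feature_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Aggregate LCMS records into one row per student and program level."""
    quiz_cols = [c for c in raw.columns if c.startswith("quiz_")]
    forum_cols = [c for c in raw.columns if c.startswith("forum_")]
    # Portfolio evaluations (Evaluation 1/2) are intentionally omitted here:
    # they become available too late in each level and would leak outcome data.
    return pd.DataFrame({
        "student_id": raw["student_id"],
        "level": raw["level"],
        "avg_quiz_score": raw[quiz_cols].mean(axis=1),   # cognitive learning
        "recorded_pct": raw["recorded_sessions_pct"],    # behavioral engagement
        "live_pct": raw["live_sessions_pct"],            # behavioral engagement
        "forum_pct": raw[forum_cols].mean(axis=1),       # social learning
    })
```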
2.4. Data Collection
To ensure transparent and replicable data collection, this study relied on both automated data logging and expert evaluation procedures. Automated data collection was facilitated by a proprietary Learning Content Management System (LCMS) platform, developed by PolygonUs for its Academy educational programs. This LCMS was instrumented to automatically capture student activity data relevant to the study’s metrics. Specifically, the following data streams were automatically recorded via the LCMS API:
Student Performance in Quizzes: The LCMS automatically logged and stored individual student scores for each quiz administered within the platform (Q1Punta to Q21Punta), allowing for the calculation of average quiz scores (Average Quiz Score).
Student Engagement in Sessions: The LCMS tracked student access and participation in learning sessions, automatically recording attendance in live sessions (Live Percentage) and engagement with recorded sessions (Recorded Percentage) based on platform login and session activity logs.
Student Activity in Forums: The LCMS automatically logged student posts and interactions within the platform’s forum modules, enabling the quantification of forum activity based on frequency and timestamps of contributions (Forums 1 to Forums 4).
Data collected via the LCMS were securely stored in an Amazon Web Services (AWS) database infrastructure. Weekly data reports were generated and extracted from this AWS database, providing structured datasets for feeding the AI-powered Adaptive Feedback System (AFS). Importantly, the data stored in the AWS database could not be modified after capture, ensuring the integrity and auditability of the collected data. These data were queried through a secure web interface and integrated into the Python 3.11 algorithm that executed the predictive model weekly.
In addition to automated data, expert-assessed data were collected for metrics related to practical skill mastery. Industry experts, defined as senior artists with extensive experience in the digital art production pipeline, including areas such as concept art, 3D modeling, texturing, rendering, animation, retopology, UVs, and compositing with After Effects, conducted evaluations of student portfolio submissions and practical activity deliverables. These experts, trained by the PolygonUs education team for this purpose, performed the evaluations using standardized rubrics integrated within the LCMS platform. Practical activities, consisting of industry-deliverable challenges, were reviewed by experts every 15 days, given the practical nature of the program. Additionally, final projects, designed to integrate knowledge from each level, were also evaluated by experts at the end of each level. Students received personalized feedback from experts in 2 h sessions, twice a week. The experts provided both quantitative scores (recorded as Evaluation 1, Evaluation 2, Average of Practical Activities) and detailed qualitative feedback comments for each student submission, enhancing the quality of the assessment data. This expert evaluation data were then queried through the secure web interface and used to generate the portfolio and practical activity evaluation metrics.
Data collection for all metrics was performed periodically and continuously throughout the four levels of the program, allowing for the longitudinal analysis of student performance and engagement. The complete data flow, from student interaction in the LCMS to data integration into the prediction algorithm and feedback communication through the website and other channels, is illustrated in
Figure 2.
2.5. Adaptive Feedback System (AFS): Architecture and Implementation
The Adaptive Feedback System (AFS) developed in this study was designed as a multi-layered architecture that integrates predictive modeling and automated feedback generation in a scalable, secure, and pedagogically informed manner. This section outlines the four core layers of the system and their interconnections:
2.5.1. Data Collection Layer
This layer interfaces directly with the Learning Content Management System (LCMS) through its API, retrieving the following:
Engagement metrics (live attendance, recorded class access, forum participation);
Performance metrics (quiz scores).
Data are anonymized at collection and securely stored in AWS databases, with Google Sheets serving as an intermediary interface for real-time visualization and instructor access. The system ensured that the AWS-stored data could not be altered post-capture, thus preserving the auditability of performance logs.
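The snippet below is a minimal sketch of the anonymization step at collection time. Because the LCMS API is proprietary, a weekly CSV export is used here as a stand-in, and the salt, file names, and column names are illustrative assumptions rather than the system's actual configuration.

```python
# Minimal sketch of anonymizing student identifiers before storage. The CSV file,
# salt, and column names are illustrative; the actual pipeline pulls from the
# proprietary LCMS API and writes to AWS.
import hashlib
import pandas as pd

SALT = "replace-with-a-secret-salt"  # kept outside the analysis environment

def pseudonymize(student_id: str) -> str:
    """Map a personally identifiable ID to a stable, non-reversible token."""
    return hashlib.sha256((SALT + student_id).encode("utf-8")).hexdigest()[:16]

weekly = pd.read_csv("weekly_lcms_export.csv")                      # hypothetical export
weekly["student_id"] = weekly["student_id"].astype(str).map(pseudonymize)
weekly = weekly.drop(columns=["name", "email"], errors="ignore")    # strip direct PII
weekly.to_csv("weekly_anonymized.csv", index=False)                 # staged for AWS upload
```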
2.5.2. Prediction Layer
The system’s prediction layer utilized a deep learning model (a feedforward neural network with dropout regularization) trained to estimate the probability of each student passing the course. The model used four behavioral and performance variables as input and was optimized using binary cross-entropy loss and the Adam optimizer. Weekly inference runs produced class probability predictions, which were later used to generate feedback messages.
To illustrate the data flow through the prediction layer:
Original input from LCMS: quiz = 63.91%; recorded attendance = 82.5%; live attendance = 66.7%; and forum activity = 41.3%.
Normalized input vector for the model: [0.6391, 0.825, 0.667, 0.413].
Output: Probability of success = 0.698 → classified as “likely to pass” (threshold = 0.6048).
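As a concrete illustration, the sketch below implements a network of this kind in Keras and applies it to the worked example above. The framework choice, layer sizes, and dropout rate are assumptions (they are not specified here); only the four inputs, the sigmoid output, the loss, the optimizer, and the 0.6048 threshold follow the description in the text.

```python
# Minimal sketch of the prediction layer, assuming a Keras implementation.
# Layer sizes and dropout rate are illustrative; inputs, loss, optimizer, and
# the decision threshold follow the description in the text.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),               # quiz, recorded, live, forum (normalized)
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.3),                    # dropout regularization
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of passing
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# Weekly inference on the worked example (weights here are untrained; shown for flow only)
x = np.array([[0.6391, 0.825, 0.667, 0.413]])
p_pass = float(model.predict(x, verbose=0)[0, 0])
label = "likely to pass" if p_pass >= 0.6048 else "at risk"
print(f"P(pass) = {p_pass:.3f} -> {label}")
```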
To improve interpretability and transparency, a logistic regression model and several baseline classifiers (decision tree and random forest) were also tested, as reported in
Section 3.4. Model performance was evaluated using metrics such as F1 score, ROC-AUC, and cross-validation scores. The final neural network model achieved a ROC-AUC of 0.945 and was selected based on its ability to generalize across cohorts while avoiding overfitting.
2.5.3. Feedback Generation Layer
The feedback layer was powered by GPT-4, which received structured prompts engineered using a chain-of-thought style and context-aware variables. Prompts were designed to generate multi-part feedback that included:
A motivational greeting;
An interpretation of the AI prediction;
Detailed analyses of quiz scores, session engagement, and forum activity;
A list of personalized recommendations.
Pop culture preferences (e.g., Star Wars, Marvel) and mentor archetypes (e.g., Yoda, Dumbledore) were used to personalize tone and content style. Prompts were crafted to follow pedagogical best practices, and output quality was reviewed by the education team. Importantly, GPT-4 operated as a one-way feedback generator—students could not interact with the AI. In cases where feedback contained errors or ambiguity, human team members manually reviewed and corrected the messages.
Prompt design was aligned with formative assessment principles and cognitive scaffolding strategies to enhance self-regulated learning.
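The sketch below illustrates how such a structured prompt and GPT-4 call could look using the OpenAI Python SDK. The prompt wording, persona fields, and metric names are illustrative assumptions; the actual AFS templates are not reproduced here, and generated messages were reviewed by the education team before delivery.

```python
# Minimal sketch of the feedback-generation call (OpenAI Python SDK). Prompt text,
# persona, and field names are illustrative, not the AFS's actual templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(student: dict) -> str:
    return (
        f"You are a supportive mentor in the style of {student['mentor_archetype']}.\n"
        f"Predicted probability of passing: {student['p_pass']:.2f}.\n"
        f"Weekly metrics: quizzes {student['quiz']}%, recorded sessions {student['recorded']}%, "
        f"live sessions {student['live']}%, forums {student['forum']}%.\n"
        "Write feedback with: (1) a motivational greeting, (2) an interpretation of the "
        "prediction, (3) an analysis of each metric, and (4) two or three personalized recommendations."
    )

def generate_feedback(student: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(student)}],
        temperature=0.7,
    )
    return response.choices[0].message.content  # stored for human review before delivery
```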
2.5.4. Delivery and Notification Mechanism
Once generated, the feedback messages were stored in the student profile and made available through the platform every Friday. Additionally, a short WhatsApp notification was automatically sent to each student, summarizing the feedback in under 200 characters and linking back to the platform. Although the system tracked delivery, it did not verify whether students consciously engaged with the messages. This limitation is acknowledged and discussed in
Section 4.6.
Figure 3 illustrates the complete architecture of the AFS, from data acquisition to predictive modeling and feedback delivery. For further details on the practical application of the predictive model, see
Appendix A.
2.6. Model Selection Process
To estimate the likelihood of academic success and generate personalized feedback, multiple machine learning models were trained and evaluated on a dataset comprising 445 labeled instances. The selected input variables reflect behavioral indicators: performance on in-class tests, attendance at pre-recorded sessions, attendance at live classes, and participation in discussion forums. The target variable was a binary label (pass/fail) based on the final course outcomes.
Four different classifiers were implemented: logistic regression, decision tree, random forest, and a compact-architecture neural network. Models were trained using an 80/20 split for training and test sets and evaluated using standard metrics such as precision, recall, F1 score, and area under the ROC curve (ROC-AUC). To estimate generalization capacity, a five-fold cross-validation with three repetitions was applied.
To support reproducibility,
Table 2 presents a detailed summary of the hyperparameters used for the neural network.
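The following sketch outlines this comparison protocol with scikit-learn. The hyperparameters shown are placeholders (the values actually used are those reported in Table 2), and synthetic data stand in for the 445 labeled instances.

```python
# Minimal sketch of the model comparison protocol (80/20 split, four classifiers,
# F1 / ROC-AUC on the test set, repeated 5-fold cross-validation). Hyperparameters
# and the synthetic data below are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((445, 4))                    # placeholder for the 445 labeled instances
y = (X.mean(axis=1) > 0.5).astype(int)      # placeholder pass/fail labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "neural_network": MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=42),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
for name, clf in candidates.items():
    cv_auc = cross_val_score(clf, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    clf.fit(X_train, y_train)
    f1 = f1_score(y_test, clf.predict(X_test))
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: F1={f1:.3f}, ROC-AUC={auc:.3f}, CV ROC-AUC={cv_auc:.3f}")
```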
2.7. Ethical Considerations
This study was conducted with the formal approval of Academy by PolygonUs, the organization responsible for the “Academy” training program. Given that the research was carried out in the real-world context of an educational program within a private company, obtaining this institutional approval was a fundamental step to ensure the feasibility and ethical conduct of the study. Additionally, this research was undertaken as part of the first author’s doctoral candidacy at the Universitat Politècnica de València (UPV), framing the study within a rigorous academic context and with implicit ethical oversight.
All participants provided their informed consent prior to participating in the study. They were provided with detailed information about the research objectives, the nature of the intervention (AFS), the data collection procedures, and the intended use of the data. They were explicitly guaranteed the confidentiality of their personal and academic data, ensuring that their information remained anonymous and protected. They were also clearly informed about their right to withdraw from the study at any time, without any negative consequences for their participation in the educational program.
To protect participant privacy, the collected data were anonymized at the point of collection, removing any personally identifiable information. The anonymized data were securely stored on encrypted servers, complying with data protection regulations. Access to the data was restricted solely to the principal investigators of the study.
3. Results
This section presents the results obtained after implementing the Adaptive Feedback System (AFS), which combines a deep learning model (specifically, a recurrent neural network pre-trained on historical data) with GPT-4. The system provided weekly personalized feedback to the experimental group (n = 42) throughout all four learning blocks (levels) of the program, while the control group (n = 39) completed the same four-level program without this intervention. Key findings include (1) a cumulative effect resulting in a significant performance difference in the fourth learning block (+12.63 percentage points); (2) a notable reduction in performance disparities between students with varying levels of prior knowledge in the experimental group (−56.5%) versus an increase in the control group (+103.3%); (3) a particularly beneficial effect for students with low prior knowledge; (4) an overcoming effect whereby up to 42.9% of students surpassed initial negative predictions; and (5) an especially positive impact on components related to active participation. These results remained robust across various complementary analyses.
3.1. Initial Characteristics of the Groups
The experimental (
n = 42) and control (
n = 39) groups presented significant differences in their initial characteristics, an important factor to consider in this quasi-experimental design, as shown in
Table 3.
Chi-square tests confirmed statistically significant differences in both gender distribution (χ2 = 4.63, p = 0.031) and prior knowledge level (χ2 = 20.87, p < 0.001). These results are consistent with the diagnostic assessment described in the methodology (χ2 = 18.49, df = 2, p < 0.001).
These differences, while representing a methodological limitation, offer an opportunity to evaluate the capacity of the AFS to mitigate initial knowledge gaps, aligning with the literature on adaptive educational systems [
7,
9], which suggests their leveling potential. To address these differences statistically, analysis of covariance (ANCOVA) will be employed in subsequent sections.
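As a sketch of this planned adjustment, the snippet below runs an ANCOVA with statsmodels, using the baseline diagnostic score as the covariate. The file and variable names are hypothetical.

```python
# Minimal sketch of the ANCOVA controlling for baseline differences. File and
# variable names are hypothetical: 'performance' is the level score, 'group' the
# cohort, and 'diagnostic' the baseline diagnostic test score.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("level_scores.csv")                           # one row per student
model = smf.ols("performance ~ C(group) + diagnostic", data=df).fit()
print(anova_lm(model, typ=2))                                  # group effect adjusted for baseline
```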
3.2. Evolution of Academic Performance
At Level 2, both groups experienced a decrease in performance, but the experimental group maintained a significantly higher level (69.67% vs. 58.05%;
p = 0.017) with a small to moderate effect size (η2 = 0.073), as shown in
Table 4.
Level 3 presented an interesting reversal, with the control group recovering (69.82%) while the experimental group continued to decline (61.83%). This difference, however, did not reach statistical significance (p = 0.102).
The most striking difference emerged at Level 4, where the experimental group showed a dramatic recovery (79.30%), exceeding even their initial performance, while the control group declined to 66.67%. This difference was highly significant (p < 0.001) with a large effect size (η2 = 0.183).
This performance pattern across all four levels indicates a cumulative effect of the AFS, which becomes most evident in the program’s final stage. This aligns with prior research on the variable and cumulative impact of Adaptive Feedback Systems [
11,
13].
3.3. Performance Transitions Between Consecutive Levels Across the Program
The analysis of transitions between consecutive levels reveals important patterns in how students progressed through the four-level program, as shown in
Table 5.
In the transition from Level 1 to 2, both groups experienced a decrease in performance, with the decline being significantly smaller in the experimental group (−8.88% vs. −16.56%; p = 0.032). A significantly higher proportion of students in the experimental group managed to improve their performance (31.0% vs. 2.6%; p < 0.001).
In the transition from Level 2 to 3, the patterns reversed. The control group showed a significant recovery (+11.77%), while the experimental group continued its downward trend (−7.84%, p < 0.001).
The most striking difference occurred in the transition from Level 3 to 4. The experimental group showed a dramatic improvement (+17.47%), while the control group experienced a slight decline (−3.15%, p < 0.001). A vast majority of experimental group students improved their performance (84.8% vs. 30.3% in the control group, p < 0.001).
These transition patterns across all four levels suggest that the AFS had different effects at different stages of the program. The system appeared to provide initial resilience (Level 1 to 2), faced challenges in the middle stage (Level 2 to 3), but showed its greatest impact in the final transition (Level 3 to 4). This pattern of delayed effectiveness across a four-level program has been documented in other studies of AI-mediated educational interventions [
11,
12].
3.4. Model Results and Selection Rationale
Table 6 summarizes the performance metrics obtained by each model on the test set. The neural network achieved the highest F1 score (0.923) and ROC-AUC (0.946), closely followed by logistic regression (F1 = 0.917; ROC-AUC = 0.942) and the random forest, with differences likely not statistically significant. In contrast, the decision tree showed comparatively lower performance.
Figure 4 presents the ROC curves for the four evaluated models. Both the neural network and logistic regression display curves approaching the ideal, with a slight advantage observed for the former.
Interpretability was also assessed through feature importance analysis. For logistic regression, decision tree, and random forest models, feature importance was computed using model-specific internal mechanisms (e.g., coefficient magnitudes or impurity-based scores).
Figure 5 shows the relative importance assigned to each feature across these models. The results reveal distinct distribution patterns: logistic regression exhibits a 3.6-fold difference between its most and least influential variables; the random forest model shows a 7.6-fold variation; and the decision tree displays the most imbalanced distribution, with a nearly 29-fold difference between performance in in-class tests and forum participation.
For the neural network, interpretability was addressed using SHAP values derived from a KernelExplainer. This model-agnostic approach estimates the marginal contribution of each input feature to the model’s output. As seen in
Figure 5, the neural network distributes importance more evenly, with only a twofold variation between the most and least influential variables. While more modest than in other models, this behavior may be desirable in educational contexts where student performance depends on a diverse set of factors. Nevertheless, a clear feature hierarchy remains identifiable within the network.
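The snippet below sketches this SHAP analysis, assuming the trained network and feature matrices from the earlier sketches. The background sample size and nsamples setting are illustrative.

```python
# Minimal sketch of the SHAP analysis for the neural network, using a model-agnostic
# KernelExplainer. Assumes 'model', 'X_train', and 'X_test' from earlier sketches;
# background and nsamples values are illustrative.
import numpy as np
import shap

feature_names = ["quiz", "recorded", "live", "forum"]

def predict_fn(x):
    return model.predict(x, verbose=0).ravel()         # probability of passing

background = shap.sample(X_train, 50)                  # small reference sample
explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(X_test, nsamples=200)

mean_abs = np.abs(np.asarray(shap_values)).reshape(-1, len(feature_names)).mean(axis=0)
for name, value in zip(feature_names, mean_abs):
    print(f"{name}: mean |SHAP| = {value:.4f}")         # relative feature importance
```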
Interestingly, the models differ in identifying the most influential variable: both logistic regression and the neural network prioritize attendance at pre-recorded sessions, while the tree-based models emphasize performance on in-class tests. This divergence reflects the different inductive biases of the algorithms when modeling student behavior.
The selection of the neural network as the core model for the Adaptive Feedback System (AFS) was not based solely on its marginal performance advantages or on its more balanced feature attribution. Instead, it was primarily motivated by its greater flexibility for future scalability. Neural networks are better equipped to integrate heterogeneous variables (e.g., sociodemographic data or learning styles) and to uncover nonlinear relationships—capabilities that are critical for the next phase of system development.
Furthermore, this performance was achieved without requiring specialized computational infrastructure: training was completed in approximately 16 s using only a free Google Colab instance. This computational efficiency underscores the feasibility of deploying advanced models in educational environments with limited technological resources, facilitating broader accessibility and adoption.
3.5. Analysis by Learning Components
To understand how the AFS affected different aspects of learning throughout the four-level program, we analyzed its impact on six key components.
Table 7 presents the evolution of these components across all four levels.
This comprehensive analysis reveals several important patterns:
Test performance: The experimental group maintained higher test scores throughout all four levels, with the gap widening at Level 2 (84.14% vs. 49.87%) and Level 4 (82.45% vs. 74.06%).
Pre-recorded classes: The experimental group consistently showed higher attendance, with the largest difference at Level 1 (95.50% vs. 75.82%) and Level 2 (87.50% vs. 53.49%).
Live classes: After an initial advantage at Level 1 (59.12% vs. 26.13%), the experimental group fell behind at Levels 2 and 3 but showed a dramatic recovery at Level 4 (73.70% vs. 43.48%).
Forum participation: The experimental group showed substantially higher participation at Level 1 (61.90% vs. 5.13%) and Level 2 (60.71% vs. 28.38%), a slight drop at Level 3, and recovered at Level 4 (70.45% vs. 60.67%).
Portfolio components: The control group initially performed better in the portfolio assessment (Level 1), but by Level 4, the experimental group showed superior performance in both submissions (74.33% vs. 70.39%) and assessment (79.30% vs. 67.27%).
The observed evolution of these components across the four levels is consistent with the hypothesis that the AFS may have contributed to improvements in multiple aspects of learning, although alternative explanations cannot be ruled out due to the study’s quasi-experimental design. By Level 4, the experimental group showed superior performance across all components, with the most notable differences in live class attendance (+30.21 points), portfolio assessment (+12.03 points), and forum participation (+9.79 points).
These findings indicate that the AFS, when sustained across a four-level program, can have a comprehensive positive impact on multiple dimensions of learning, particularly those related to active participation and engagement.
Figure 6 provides a multidimensional visualization of learning components at Level 4, clearly illustrating how the experimental group (green area) consistently outperformed the control group (orange area) across all dimensions. This radar representation highlights the comprehensive nature of the AFS’s impact, with the most substantial differences observed in live class attendance (+30.2pp), portfolio assessment (+12.0pp), and forum participation (+9.8pp). The pattern revealed in this visualization reinforces our finding that adaptive feedback had the greatest impact on components related to active engagement and participation.
3.6. Understanding the Overcoming Effect: Predictive Model Efficacy
The core contribution of this study revolves around what we have termed the Overcoming Effect—the ability of students receiving adaptive feedback to surpass negative predictions and alter their expected learning trajectory across a four-level program.
Figure 7 conceptualizes this phenomenon.
The conceptual model illustrates how adaptive feedback transforms predicted academic trajectories across all four levels. When students receive risk predictions accompanied by personalized and actionable weekly feedback, they can modify their academic behavior, generating a progressive divergence between the initially predicted trajectory (red dotted line) and the actual trajectory (green line). The green shaded area represents the “Overcoming Effect,” highlighting the ability of the AFS to positively transform initial expectations and enhance academic achievement, particularly for students initially identified as at-risk.
Table 8 presents the quantitative evolution of this Overcoming Effect across all four levels of the program.
The predictive accuracy of the model improved progressively throughout the four-level program, reaching its peak at Level 4 (78.6%). This increase suggests that the model was learning and adapting with more data, a finding aligned with the literature on the potential for improvement of deep learning models with continuous feedback [
3].
A particularly interesting finding is the steady increase in the percentage of students who managed to surpass an initial negative prediction across all four levels. At Level 1, only 16.7% of students in the experimental group with an initial failure prediction managed to pass. This figure consistently increased through each level, reaching 42.9% at Level 4. This pattern may be indicative of what we describe as the overcoming effect—a phenomenon in which students, upon receiving information about their risk along with specific recommendations, appear to modify their academic behavior and exceed the initial expectations of the model.
The correlation analysis between the probability of passing assigned by the model and the final performance showed a moderate but significant correlation (r = 0.62, p < 0.001), indicating that, despite the initial limitations, the assigned probabilities had predictive value for final performance across the entire four-level program.
3.7. Educational Gap Reduction Effect Across the Four-Level Program
One of the most significant findings of this study was the AFS’s capacity to reduce educational gaps between students with different levels of prior knowledge as they progressed through all four levels of the program, as shown in
Table 9.
The tracking of educational gaps across all four levels reveals a fascinating pattern. At Level 1, the experimental group showed a larger initial gap between students with basic and low prior knowledge (14.36pp vs. 6.98pp in the control group). This gap widened at Levels 2 and 3 for both groups.
However, at Level 4, a dramatic reversal occurred: the experimental group’s gap narrowed significantly to 6.24pp (a 56.5% reduction from Level 1), while the control group’s gap continued to widen to 14.19pp (a 103.3% increase from Level 1).
At Level 4, students with low prior knowledge in the experimental group achieved a performance of 78.36%, while their counterparts in the control group only achieved 58.31%, a difference of 20.05 percentage points. This interaction was statistically significant (F = 4.75,
p = 0.033), suggesting that the AFS was particularly beneficial for students with lower levels of prior knowledge in the long term, a finding that reinforces the potential of Adaptive Feedback Systems to promote educational equity [
15].
3.8. Retention and Completion Across All Four Levels
Of the original participants, 78.6% of the experimental group (33 of 42) and 84.6% of the control group (33 of 39) completed all four levels of the program. This difference in completion rates was not statistically significant (χ2 = 0.51, p = 0.477).
In the experimental group, the completion rate was lower among students with basic prior knowledge (60.0%, 3 of 5) than among those with low prior knowledge (81.1%, 30 of 37). In the control group, the completion rate was similar between students with basic prior knowledge (84.2%, 16 of 19) and low prior knowledge (82.4%, 14 of 17).
Follow-up interviews with students who did not complete all four levels indicated that most dropouts were due to personal circumstances (such as changes in work schedules or family responsibilities) and time constraints related to the program’s demanding schedule, rather than factors related to the educational intervention itself. This finding aligns with research on dropout factors in online and blended learning environments [
13,
15], which consistently identify external factors as primary drivers of non-completion.
3.9. Robustness Analysis Across Different Thresholds
To verify the consistency of the findings, analyses were conducted with different passing thresholds, as presented in
Table 10.
The predictive accuracy improved notably with a more demanding threshold (70%), suggesting that the model is more effective at identifying students with very high or very low performance, but has difficulties in the intermediate ranges, a finding consistent with research on the variable accuracy of predictive models according to the classification threshold [
12].
A multiple regression analysis for performance across all four levels, controlling for multiple covariates, was statistically significant (F = 28.45, p < 0.001, adjusted R2 = 0.763), explaining approximately 76% of the variance. After controlling for all covariates, the effect of the group was statistically significant (β = 0.214, p = 0.018), reinforcing the conclusion that the AFS had a substantial impact on student performance throughout the four-level program.
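For transparency, the snippet below sketches how such a regression could be specified with statsmodels. The dataset, covariate set, and variable names are hypothetical stand-ins for those described in the text.

```python
# Minimal sketch of the multiple regression controlling for covariates. The dataset,
# covariate set, and variable names are hypothetical stand-ins.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("final_performance.csv")   # hypothetical file, one row per student
model = smf.ols(
    "overall_performance ~ C(group) + C(prior_knowledge) + C(gender) + age",
    data=df,
).fit()
print(f"Adjusted R^2 = {model.rsquared_adj:.3f}")
print(model.summary().tables[1])            # includes the coefficient for the group term
```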
These results suggest that the integration of a deep learning model (RNN) and GPT-4 within the AFS has the potential to support improved academic performance. However, further studies are needed to confirm the robustness and generalizability of these effects in generating personalized feedback across multi-level educational programs. Beyond overall performance gains, the system showed particularly positive outcomes in reducing educational gaps and enabling students to overcome initial negative predictions—findings that align with recent literature on the potential of AI technologies to foster more equitable and personalized learning environments [
3,
11,
13].
4. Discussion
4.1. Interpretations and Alignment with the Previous Literature
The results of this research provide substantial evidence on the potential of Adaptive Feedback Systems (AFS) based on deep learning and LLMs in online educational environments. A key outcome is the cumulative effect observed throughout the four levels of the program, becoming more evident in advanced stages. This is consistent with studies in adaptive feedback [
11,
13] that also report a long-term buildup of positive effects:
Students become accustomed to incorporating iterative feedback into their self-regulated learning strategies;
The system “learns” more about each student, thus refining its predictive power;
Repeatedly receiving feedback prompts progressive behavioral changes that accumulate over time.
The fact that the experimental group managed to reduce educational gaps and even outperform the control group by Level 4—despite starting with lower prior knowledge—points toward the equity-promoting potential of AFS-like interventions. Such findings echo pedagogical scaffolding principles [24,25], wherein adaptive and differentiated support helps level the playing field for students with varying backgrounds.
4.2. Delayed Recovery and the Overcoming Effect
An intriguing pattern that emerged is the delayed improvement in the experimental group, where performance dipped at Level 3 but rebounded significantly at Level 4. This dip-then-spike trajectory may be explained by a combination of increasing course difficulty during the mid-levels and the time students need to consistently internalize the system’s suggestions. Similar phenomena have been documented in prolonged adaptive feedback interventions, where students sometimes display a temporary plateau or regression before fully capitalizing on personalized recommendations [12].
Within this context, the overcoming effect—students surpassing initially negative predictions—was seen in up to 42.9% of learners by the final stage. Instead of demotivating at-risk students, early alerts seem to trigger behavior changes that reverse their predicted trajectory. This contrasts with concerns that risk labeling may lead to self-fulfilling prophecies and instead shows how timely, actionable feedback can empower students to outperform the system’s early forecasts.
4.3. Stronger Impact on Active Participation
Breaking down learning components revealed the strongest gains in active participation metrics, notably live class attendance (+30.21) and forum engagement (+9.79). Such findings support the view that social presence and interactive engagement are crucial predictors of success in online education [2]. By making students aware of how attendance and forum posts correlate with performance, the AFS seemed to foster a more proactive approach, encouraging them to attend synchronous sessions and contribute to collaborative discussions.
4.4. Potential Biases and Model Transparency
Although the Adaptive Feedback System (AFS) demonstrated high predictive accuracy (up to 78.6% in later stages), several interpretability issues and potential biases warrant attention. The system integrates two main components: (1) a compact deep learning model (specifically, a recurrent neural network) that forecasts student performance based on behavioral features, and (2) a GPT-4-based module that generates individualized feedback messages.
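To make the two-component architecture concrete, the sketch below shows one way such a pipeline can be wired together: a compact recurrent network scores weekly engagement sequences, and the resulting risk estimate is passed to GPT-4 to draft a feedback message. The feature names, hyperparameters, and prompt wording are illustrative assumptions rather than the deployed system.

```python
# Minimal sketch of the two-component pipeline described above: a compact RNN scores
# weekly engagement sequences, and the resulting risk estimate is handed to GPT-4 to
# draft feedback. Feature names, prompt wording, and hyperparameters are assumptions.
import numpy as np
import tensorflow as tf
from openai import OpenAI

FEATURES = ["attendance", "content_views", "forum_posts", "tasks_done"]  # per week

rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, len(FEATURES))),      # (weeks, features), any length
    tf.keras.layers.SimpleRNN(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of low performance
])
rnn.compile(optimizer="adam", loss="binary_crossentropy")
# rnn.fit(X_train, y_train, epochs=20)                # trained on historical cohorts

def feedback_for(student_weeks: np.ndarray, client: OpenAI) -> str:
    """Score one student's engagement history and ask GPT-4 for a tailored message."""
    risk = float(rnn.predict(student_weeks[np.newaxis], verbose=0)[0, 0])
    prompt = (
        f"A student in an online Digital Arts course has an estimated risk of "
        f"underperforming of {risk:.0%}. Their latest weekly engagement "
        f"({', '.join(FEATURES)}) was {student_weeks[-1].tolist()}. "
        "Write brief, encouraging, and actionable feedback."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```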
A major concern lies in the opacity of neural network models, which tend to function as black boxes. This makes it difficult for instructors and learners to understand how specific inputs—such as forum participation or attendance—contribute to the predicted outcomes. As Lipton [26] emphasizes, the lack of transparency can hinder trust and complicate efforts to detect fairness issues, biases, or model drift. While our current implementation prioritized performance, we mitigated this limitation by integrating SHAP (Shapley additive explanations) into the interpretability workflow. SHAP offers model-agnostic explanations by estimating the marginal contribution of each input feature to a prediction. This aligns with current efforts in educational AI to balance performance with interpretability [12,15].
On the feedback generation side, large language models such as GPT-4 may inherit and reproduce cultural, demographic, or linguistic biases embedded in their training data. This risk is especially relevant in educational settings, where the framing and tone of recommendations can significantly affect student engagement. Prior studies have raised concerns that unmonitored LLM outputs may reinforce inequities or yield inconsistent advice across student profiles [21,22,24].
Although our content review process did not identify overtly biased feedback during deployment, this risk underscores the need for future safeguards. These may include content audits, prompt engineering guidelines, and co-design sessions with educators to ensure feedback remains aligned with pedagogical intent and equity standards. Moreover, emerging studies propose combining human-in-the-loop strategies with automated moderation to increase transparency and ethical oversight in generative educational systems [16,19].
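As one illustration of how automated moderation could be paired with human oversight before feedback reaches students, the sketch below routes messages that trip simple audit rules into an instructor review queue; the rule list, class names, and routing logic are hypothetical and not part of the deployed AFS.

```python
# Illustrative sketch of an automated-moderation plus human-review gate for
# AI-generated feedback. The rule list, queue, and routing logic are hypothetical
# and not part of the deployed system.
from dataclasses import dataclass, field
from typing import List

DISCOURAGED_PHRASES = ["you will fail", "give up", "not capable"]  # example audit rules

@dataclass
class ReviewQueue:
    pending: List[str] = field(default_factory=list)

    def route(self, message: str) -> str:
        """Send clean messages directly; hold flagged ones for instructor review."""
        if any(p in message.lower() for p in DISCOURAGED_PHRASES):
            self.pending.append(message)
            return "held_for_review"
        return "approved"

queue = ReviewQueue()
print(queue.route("Great progress this week; try posting one reply in the forum."))
print(queue.route("At this pace you will fail the level."))  # goes to human review
```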
4.5. Generic or Uncontextualized Feedback
If the input data passed to GPT-4 does not adequately capture the nuances of a student’s context—such as low digital literacy, language barriers, or fluctuating motivation—the resulting feedback may lack personalization. This risk increases when the system is scaled to large cohorts, where real-time prompt monitoring and manual review become less feasible.
To address this issue, the system integrated explainable AI (XAI) techniques, specifically the SHAP (Shapley additive explanations) method. SHAP allowed the academic team to visualize which features had the greatest influence on each prediction made by the model, improving transparency and enabling more pedagogically informed interventions.
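A minimal sketch of this SHAP step is shown below, assuming a stand-in prediction function and placeholder engagement features; the study’s trained RNN and dashboards are not reproduced here.

```python
# Sketch of the SHAP interpretability step: model-agnostic estimates of how much
# each engagement feature contributed to a prediction. The prediction function,
# feature names, and data are placeholders, not the study's trained model.
import numpy as np
import shap

FEATURES = ["attendance", "content_views", "forum_posts", "tasks_done"]

def predict_low_performance(X: np.ndarray) -> np.ndarray:
    """Stand-in for the trained model's probability of low performance."""
    z = 3.0 - X @ np.array([0.04, 0.01, 0.06, 0.08])   # more engagement -> lower risk
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
background = rng.uniform(0, 20, size=(50, len(FEATURES)))   # reference behaviour sample
students = rng.uniform(0, 20, size=(5, len(FEATURES)))      # cases to explain

explainer = shap.KernelExplainer(predict_low_performance, background)
shap_values = explainer.shap_values(students)
shap.summary_plot(shap_values, students, feature_names=FEATURES)  # feature influence plot
```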
Although we did not conduct structured interviews or surveys specifically focused on the AI, anecdotal evidence of student engagement was observed in discussion forums and support chats. Some students referred directly to the AI-generated messages—calling it “the Oracle”—and requested manual edits when they felt the feedback did not reflect their actual progress. While informal, these instances suggest partial student appropriation and perceived relevance of the system.
Looking ahead, the next version of the system will include a conversational agent that allows students to reply directly to feedback via WhatsApp. This two-way interaction channel will not only enhance personalization but also enable the collection of valuable data on how students perceive, interpret, and act upon the feedback provided.
Based on these insights, we recommend that future research combine XAI outputs with more systematic qualitative analysis (e.g., surveys, interviews, behavioral trace analysis) to evaluate how students engage with AI-generated feedback and how it influences their learning trajectories.
4.6. Limitations
Despite the promising outcomes, this study has certain limitations that call for cautious interpretation.
4.6.1. Quasi-Experimental Design and Cohort Differences
The study employed a quasi-experimental design involving different cohorts (2022 for the control group and 2023 for the experimental group), which limits internal validity and restricts the ability to establish causal relationships. Although statistical controls were applied using ANCOVA to adjust for baseline disparities in prior knowledge and gender composition, unmeasured external factors—such as curricular updates or increased exposure to AI technologies in 2023—may have influenced the results. Nonetheless, since the control group did not receive AI-generated feedback, the potential impact of these external factors was likely reduced. In light of this limitation, findings are interpreted as consistent with the hypothesis that the Adaptive Feedback System may have contributed to the improvements observed, rather than as definitive proof of causality.
4.6.2. Sample Size and Power Analysis
While the experimental and control groups had 42 and 39 students, respectively, such sample sizes might limit generalizability. Future studies should consider larger cohorts or multi-institutional settings, ideally accompanied by a power analysis to ensure the sample is sufficient to detect meaningful differences.
4.6.3. Contextual Scope
Participants were part of a highly selective digital art program, which may not represent the full spectrum of online learners. Their motivation levels, scholarship context, and professional aspirations could have amplified their receptiveness to AI-driven feedback.
4.6.4. Model Overfitting or Narrow Feature Selection
Although cross-validation was performed, high accuracy values (particularly early in the pipeline) may raise concerns about potential overfitting. Moreover, the predictors used (attendance, quiz scores, forum activity, etc.) might not capture broader factors (e.g., emotional well-being, socio-economic conditions) that also shape student success.
4.6.5. Ethical and Privacy Considerations
AI systems handling student data must comply with stringent privacy regulations and ensure informed consent. Institutions adopting such systems should implement robust data protection measures and remain vigilant against inadvertently reinforcing biases, in alignment with guidelines [21].
5. Conclusions
5.1. Main Contributions
This study provides empirical evidence that integrating deep learning (RNN) and GPT-4 for personalized, weekly feedback can significantly enhance online learning outcomes. Notably, we identified the overcoming effect, wherein students who were initially projected to fail changed their learning trajectories through actionable feedback. The findings reinforce the notion that predictive analytics should be viewed not merely as a detection tool for at-risk students, but as an intervention that can actively redirect learners’ academic paths.
Additionally, the research shows that such AI-powered feedback can help close educational gaps, a critical advance in ensuring more equitable outcomes. By providing differential support to those with lower prior knowledge, the AFS aligned with Vygotskian scaffolding principles and effectively improved engagement and participation.
5.2. Practical Implications
For Educational Institutions: Adopting Adaptive Feedback Systems can bolster student engagement and performance, especially among those starting with lower skills. This has strong potential for promoting inclusion and equity in online courses.
For EdTech Developers: The demonstrated synergy between an RNN for predictions and GPT-4 for messaging can guide future product roadmaps. Emphasis on interpretability, data security, and robust prompting strategies will be crucial to maximize effectiveness.
For Policymakers: Balancing innovation with ethical frameworks is essential. Policymakers could incentivize institutions to pilot similar AI interventions, but accountability, transparency, and fairness during deployment must be ensured.
5.3. Future Directions
Looking ahead, several promising avenues exist for refining and expanding the potential of the Adaptive Feedback System (AFS). First, longitudinal research examining whether the observed improvements persist beyond the immediate course cycle—such as follow-up performance one year later—would provide insight into the sustainability of the system’s impact. Second, cross-domain studies could explore the adaptability of the AFS across diverse subject areas (e.g., STEM or humanities), assessing whether domain-specific customization is necessary.
Another key direction involves integrating multimodal data, such as video engagement and affective indicators, to improve the system’s predictive power and relevance—provided strict ethical and privacy safeguards are enforced. Importantly, we have already implemented explainable AI (XAI) methods such as SHAP to interpret feature contributions in the prediction model, and we plan to expand the use of such tools to increase transparency and foster trust among users.
Future iterations of the AFS will also incorporate a conversational agent that allows students to respond to feedback directly via WhatsApp. This two-way interaction will enable the collection of behavioral and perceptual data, providing valuable insight into how students interpret and react to feedback. Such data can inform continuous improvements to the system’s tone, structure, and personalization strategies.
Researchers may also explore differentiated feedback modalities—such as textual vs. visual formats, or directive vs. motivational tones—to identify what styles best support various learner profiles. Finally, embedding a co-design process with both educators and students will help adapt the system’s frequency, delivery methods, and pedagogical voice to the needs of real-world learning environments, ensuring continued relevance and effectiveness.