Article

Incremental Repair Feedback on Automated Assessment of Programming Assignments

by José Carlos Paiva 1,2,*, José Paulo Leal 1,2 and Álvaro Figueira 1,2
1 Centre for Research in Advanced Computing Systems, Institute for Systems and Computer Engineering, Technology and Science, 4169-007 Porto, Portugal
2 Department of Computer Science, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
* Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 819; https://doi.org/10.3390/electronics14040819
Submission received: 8 January 2025 / Revised: 11 February 2025 / Accepted: 18 February 2025 / Published: 19 February 2025
(This article belongs to the Special Issue Program Slicing and Source Code Analysis: Methods and Applications)

Abstract: Automated assessment tools for programming assignments have become increasingly popular in computing education. These tools offer a cost-effective and highly available way to provide timely and consistent feedback to students. However, when evaluating logically incorrect source code, there are reasonable concerns about the formative gap between the feedback generated by such tools and that of human teaching assistants. A teaching assistant either pinpoints logical errors, describes how the program fails to perform the proposed task, or suggests possible ways to fix mistakes without revealing the correct code. In contrast, automated assessment tools typically return a measure of the program’s correctness, possibly backed by failing test cases and, only in a few cases, fixes to the program. In this paper, we introduce a tool, AsanasAssist, which generates formative feedback messages that help students repair functionality mistakes in their submitted source code, based on the correct solution with the most similar algorithmic strategy. These suggestions are delivered with incremental levels of detail according to the student’s needs, from identifying the block containing the error to displaying the correct source code. Furthermore, we evaluate how well the automatically generated messages provided by AsanasAssist match those provided by a human teaching assistant. The results demonstrate that the tool achieves feedback comparable to that of a human grader while being able to provide it just in time.

1. Introduction

Automated assessment of programming assignments is an invaluable resource for computer science education. Automated assessment tools can provide instant, accurate, and highly available feedback to students while reducing the workload on teaching assistants. Consequently, the literature in this area is extensive, and interest continues to grow [1]. Works have primarily focused on exploring alternatives for guaranteeing the proper isolation of source code execution, increasing the breadth of evaluation to cover various aspects of programming beyond functional correctness (e.g., style, plagiarism, and vulnerabilities), investigating the effects on students’ learning and behavior, and improving evaluation outcomes from a boolean value (i.e., correct or not) to useful information about errors and tips [1,2,3,4]. The latter remains a largely open issue and the main driver of research in the area [1,5,6].
Among the most promising approaches to improve feedback, a common step is to compare the submitted code with a model solution [7]. This step allows for extracting a set of differences that transform the incorrect program into the model solution. Feedback consists of compiling these differences into a more granular and fair classification, which is accompanied by repair patches in the most advanced studies. The effectiveness of these approaches often depends on the quality of the model solution (e.g., its similarity with the submitted program), the representations of source code and the method used for comparing programs, the way of presenting the feedback to the student, and the response time.
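To make the idea concrete, the sketch below computes a line-level diff between a submission and a model solution and turns it into remove/add hints. This is a deliberately simplified illustration of the comparison step, not the representation used by the tools cited above; the example programs are made up.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Line-level LCS diff between an incorrect submission and a model solution,
// reported as the removals and additions that turn the former into the latter.
std::vector<std::string> diffLines(const std::vector<std::string>& sub,
                                   const std::vector<std::string>& model) {
    std::size_t n = sub.size(), m = model.size();
    std::vector<std::vector<std::size_t>> lcs(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = m; j-- > 0;)
            lcs[i][j] = (sub[i] == model[j]) ? lcs[i + 1][j + 1] + 1
                                             : std::max(lcs[i + 1][j], lcs[i][j + 1]);
    std::vector<std::string> edits;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (sub[i] == model[j]) { ++i; ++j; }                                   // unchanged line
        else if (lcs[i + 1][j] >= lcs[i][j + 1]) edits.push_back("remove: " + sub[i++]);
        else edits.push_back("add: " + model[j++]);
    }
    while (i < n) edits.push_back("remove: " + sub[i++]);
    while (j < m) edits.push_back("add: " + model[j++]);
    return edits;
}

int main() {
    std::vector<std::string> sub   = {"int s = 0;", "for (int i = 0; i < n; i++)", "s += i;"};
    std::vector<std::string> model = {"int s = 0;", "for (int i = 1; i <= n; i++)", "s += i;"};
    for (const auto& e : diffLines(sub, model)) std::cout << e << '\n';
    return 0;
}

The approaches cited above perform this comparison on richer representations (e.g., ASTs or program graphs), which makes the extracted differences align with program structure rather than with text layout.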
However, replacing a human teaching assistant with automated feedback from a tool is still pedagogically challenging [5]. On the one hand, a teaching assistant would start either by pinpointing the block of code with the error or the exact error location, summarizing the algorithmic strategy to solve the problem, or providing hints on how to repair the program semantically (i.e., explaining what the program is doing and what it should be doing). On the other hand, state-of-the-art automated assessment tools deliver the source code changes that transform the program into the correct one, sidelining pedagogical concerns [7].
This paper proposes a novel tool, AsanasAssist, to automatically generate feedback on how to progress from incorrect submissions on programming assignments. Following the research line exploring the approximation of the submitted program to a model solution, our approach innovates in three key aspects. First, we cluster accepted solutions as they enter the system. Second, we select the most similar previously submitted correct solution considering the adopted algorithmic strategy. Lastly, our process of producing feedback messages mimics a human teaching assistant. Specifically, feedback from AsanasAssist must achieve the following objectives:
O1: be nearly real-time, i.e., should be generated in less than one minute;
O2: be independent of the programming language in which students coded their solutions;
O3: be adjusted to the problem-solving (or algorithmic) strategy adopted by students;
O4: reveal details gradually as students repeat mistakes, similarly to human teaching assistants (i.e., starting by guiding the student to find the error in the code and ending by providing information on fixing the error);
O5: learn from submissions as they are processed.
Our ultimate goal is to automatically produce feedback indistinguishable from that of a human teaching assistant. This is required to be carried out in near-real-time to cope with automated assessment requirements. To validate that, we conduct an evaluation of the tool on a public dataset of programming assignments, as well as an experiment with human teaching assistants to assess how likely they would deliver the same feedback as that automatically generated. The results are described and discussed.
The remainder of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 introduces AsanasAssist, describing the process of selecting the model solution, the method used for computing the differences between the paired programs, and the way of presenting the feedback to the student. Section 4 presents the evaluation of AsanasAssist and its results. Finally, Section 5 summarizes the contributions of this work, discusses its limitations, and points out directions for future work.

2. Related Work

This work aims to build an automated feedback mechanism for the automated assessment of programming assignments that can deliver helpful messages to struggling students. Messages should be pedagogical, similar to those provided by a human teaching assistant, i.e., they should help students progress towards a correct solution while promoting their reflection, not just reveal the correct source code. To achieve our objective, AsanasAssist includes techniques from three different research branches in its feedback generation process, namely program clustering (and program similarity), automated program repair, and automated feedback generation for programming assignments. This section presents the literature on these three research topics. Moreover, we also cover recent works exploring generative AI (GenAI) to support programming education and explain why it is still unreliable for our specific task.

2.1. Program Clustering and Similarity

Grouping submissions by source code similarity has several applications in the industry, such as software maintenance, understanding complex software designs, and detecting vulnerabilities. In the automated assessment of programming assignments, clustering submissions with similar functionality [8], structure [9,10], or behavior [11] can also facilitate feedback generation. For instance, it enables targeted feedback on common errors and individualized feedback to improve a program based on a solution adopting a similar but correct strategy [7,12,13,14,15].
Earlier approaches to cluster source code in programming education compute the pairwise similarity between the programs’ abstract syntax trees (ASTs) using edit distances or reducing them to canonical forms. Codewebs [16] is a search engine for submissions to programming assignments which extracts semantically equivalent sub-trees of the ASTs to calculate a matching index of a pair of programs.
OverCode [17] is an application for visualizing students’ submissions that canonicalizes solutions to generate collections of identical cleaned solutions. Programs go through a transformation pipeline that, for instance, uses the program execution traces to identify common and unique variables to rename them to be consistent across multiple solutions. CLARA [7], a fully automated program repair tool for introductory programming assignments, also clusters submissions based on control and data-flow information, matching programs with the same looping structure and whose variables take the same values in the same order.
SemCluster [8] proposes a vector representation of programs based on semantic program features from control and data flow, which can be used with common clustering methods. This approach reveals better run-time performance than previous state-of-the-art tools by avoiding pairwise comparisons and improves precision in separating different algorithmic strategies by reducing emphasis on the syntactic details of programs.
Novel approaches to program clustering propose deep learning to learn program embeddings from different source code representations, such as abstract syntax trees, control flow structures, token sequences, and program execution traces [18]. Nevertheless, these approaches demand high training efforts, both in computing and selecting an appropriate dataset to train on.

2.2. Automated Program Repair

Automated program repair (APR) is a vast area of research with numerous works covering fault localization, patch generation, ranking, validation, and correctness phases [19,20]. These approaches mainly target repairing large programs without reference solutions, using techniques ranging from symbolic execution [21,22] to program mutation [23], genetic programming [24,25], and, recently, deep learning [26]. In the case of automatic generation of feedback for programming assignments, there is at least one correct solution from the exercise author, as well as multiple submissions from students that were previously accepted. Therefore, we describe only approaches applicable to programming education.
Yi et al. [27] explore the feasibility of four state-of-the-art APR tools for feedback generation in intelligent tutoring systems for introductory programming. GenProg [28] uses genetic programming to generate candidate fixes, which are applied and tested successively until one passes all tests or a time limit is exceeded. Similarly, AE [29] modifies the program repeatedly until a solution is achieved; however, it does so deterministically, applying mutation operators to the program, rather than patching it differently at each run. Prophet [30] follows a two-step process. Firstly, it looks up a transformation schema that can be applied to repair the program. Then, it instantiates this schema to generate a repair, using a model trained on successful manual patches to prioritize the instantiation accordingly. Angelix [31] searches for a set of angelic values (i.e., a set of concrete values for each symbol that makes the tests pass) such that the program passes all tests when these values substitute a potentially incorrect expression. Once these values are found, patch expressions that return them are synthesized. These four tools exhibit a low repair rate but can still generate partial repairs.
AutoGrader [32] applies program synthesis techniques to select a minimal set of fixes that make the student’s solution match the behavior of the model solution. This tool is not fully automatic, requiring the instructor to write a model solution and a set of potential corrections in the form of expression rewrite rules. REFAZER [33] learns program transformations from examples of code edits applied by students and uses them to fix incorrect submissions with similar faults. This tool does not guarantee that the program is completely repaired, but it is still capable of recommending fixes to common local faults. ITAP [34] canonicalizes programs through semantic-preserving syntax transformations. Then, the generated feedback consists of the syntax differences between the submitted program and the most similar model solution. However, ITAP is only evaluated in simple programs covering boolean logic, comparisons, and some conditional statements. CLARA [7] runs a trace-based repair procedure against the canonical representative of the closest cluster and selects the minimal local repair from the candidates. Sarfgen [35] searches for reference solutions similar to the submitted program, aligns each statement in the incorrect source code with a corresponding statement in the reference solutions and derives minimal fixes by patching the program.

2.3. Automated Feedback Generation vs. a Human Teaching Assistant

One of the core tasks of an automated assessment system is feedback generation. There have been multiple literature reviews of tools for the automatic assessment of programming exercises [1,2,36]. To the best of the authors’ knowledge, Ref. [1] is the most recent large literature survey, giving particular relevance to automatically generated feedback for overcoming mistakes. It highlights the importance and still evident lack of pedagogical effectiveness of automated feedback compared to human teaching assistants, which is also demonstrated in more recent studies [5,37].
There are only a few works that evaluate the proposed tools by comparing the automatically generated feedback to human-generated feedback. Leite et al. [38] compare students’ performance on programming assignments with automated and detailed feedback from a syntax-logic grading tool against manual feedback from a teaching assistant. Results revealed that students who received manual feedback performed better in the course. Feldman et al. [39] use model solutions provided by the instructor to compare against the student’s solution and provide feedback on whether the student is progressing in the right direction or not. The feedback generated by the tool was evaluated by expert teaching assistants, considering what they would do when reviewing incorrect submissions. Even though the potential of the tool to aid struggling students is recognized by teaching assistants, they had negative comments about the feedback and how it was presented. TEGCER [40] uses code fixes performed by other students facing similar errors to provide targeted examples of fixes for compilation errors to students. The evaluation of TEGCER consisted of measuring the time taken by students to repair compilation errors with and without the tool, both groups having access to human teaching assistants. Results show that students could resolve errors 25% faster on average with the tool.

2.4. GenAI Feedback for Programming Assignments

Generative AI (GenAI) is quickly becoming a part of various aspects of modern life, changing how we work and learn by leveraging advanced neural network models such as generative pre-trained transformers (GPTs), which are able to produce complex outputs. Recently, multiple studies have investigated the influence of GenAI on programming education in terms of learning and assessment. Finnie-Ansley et al. [41] concluded that OpenAI’s Codex can solve most Python programming exercises from an undergraduate CS class, while Denny et al. [42] assessed the performance of Copilot in solving Python programming tasks and discovered that it successfully solved half of them on the first try. Barke et al. [43] conducted a study involving 20 programmers to examine their interactions with Copilot and noted occasional dependence among first-time users. Similar results were found by Prather et al. [44] and Zastudil et al. [45], who conducted interviews and analyzed interactions, respectively, of novice programmers with Copilot. They found a positive stance on integrating GenAI into the educational process, along with concerns about excessive dependence, reliability, and academic integrity.
Even though the potential of GenAI for supporting code development is clear, it relies on statistical relationships learned from its training data, lacking a true understanding of code functionality and problem-solving logic. This often leads to inconsistent and unreliable feedback, as slight variations in code structure or syntax may result in significantly different judgments. Furthermore, programming assignments often involve critical thinking, problem-solving skills, and understanding requirements. Students’ attempts are not always the obvious or optimal solutions, but their strategies are valid, and the program works as expected except for some minor mistakes. Yet, GenAI may suggest changes to match code samples in its training dataset even if they do not solve the specific assignment, possibly confusing those who are learning programming [44,46,47]. To better assist students in getting “unstuck” in programming assignments, GenAI needs to be reliable (i.e., ensure that the set of suggested hints is minimal and leads to a correct solution), learn from correct solutions as they become available, and be capable of delivering feedback in small portions with incremental detail. This paper describes an approach to generate feedback without using GenAI in any phase of the process. However, our intention is to explore GenAI in the last phase of the feedback generation process (i.e., producing the messages after having selected a target solution) and compare it with our current approach. Having a target solution and specific “prompts” can significantly increase the reliability of GenAI.

2.5. Summary

Even though a few similar tools [7,8,35] have been proposed to address the issue of automatic feedback generation to help struggling students progress in programming assignments, none of them fulfills all our feedback objectives mentioned in Section 1.
CLARA [7], the most similar tool to the best of the authors’ knowledge, overly focuses on structural aspects of the source code in its clustering phase. This causes the generation of a large number of clusters even for small programming assignments, making the whole process too expensive in terms of time, i.e., it does not fulfill O1. For instance, in a programming assignment with an average of 38 lines of code per submission, it takes 104 min to generate repairs [8]. Similarly, OverCode [17] fails for the same reason (that is, it takes 112 min to perform the same task).
SemCluster [8] addresses this problem, achieving near-real-time performance in clustering programs by their algorithmic strategy. However, it is only a clustering tool and is not designed to generate program repairs or feedback. Sarfgen [35] generates personalized feedback from program repairs in nearly real-time and is language-agnostic (O1, O2, and O3), but requires significant effort in training and selecting training input. This is particularly relevant as the model needs to be re-trained if we want to include new correct submissions.
In addition, no existing tool (1) delivers personalized feedback incrementally with several levels of detail (O4) or (2) learns from correct submissions as they enter the system (O5). Table 1 summarizes the fulfilment of the objectives by similar tools.

3. AsanasAssist

AsanasAssist is a tool designed to automatically generate incremental feedback for programming assignments. It acts after the correctness of a submission has been determined by an automated assessment tool, typically through output comparison. Feedback aims to be personalized and pedagogical in that it helps students facing difficulties in solving an assignment to proceed towards a functionally correct solution, promoting thinking rather than revealing the code fixes directly. To achieve this, AsanasAssist organizes its work into three phases: program clustering, program comparison, and feedback generation. Each of these phases is described in the subsections below.

3.1. Program Clustering

One key characteristic of AsanasAssist is its novel approach to real-time clustering of source code solutions submitted to programming assignments. This approach has been implemented separately in a tool named AsanasCluster (further details can be found in [48]) and introduces a few improvements over the state of the art of program clustering in education: (1) grouping the source code by algorithmic strategy (i.e., a high-level perspective), generating fewer clusters; (2) working with a vector of features derived from semantic graph representations of the program, reducing the computational cost; (3) adopting an incremental clustering model, assigning solutions to clusters as they enter the system. The clustering process is depicted in Figure 1.
Given a new program as input, AsanasCluster starts by extracting its evaluation order graph (EOG) [49] and data flow graph (DFG). For that, we extended a Kotlin library [50], initially developed to build the code property graph (CPG) [51] from source code. The CPG is a data structure combining the abstract syntax tree (AST), DFG, and EOG, designed for searching programming patterns that represent security vulnerabilities in large code repositories. Our extension splits this representation into its constituent sub-graphs and exports them in the comma-separated value (CSV) format. As the EOG is a non-reduced variant of the control flow graph (CFG), it is further subjected to an edge contraction transformation to obtain a CFG.
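The sketch below illustrates the edge contraction step in isolation, assuming the EOG has already been exported as a plain successor map keyed by node identifier; the real export carries typed nodes and edges, so this is only a sketch of the graph transformation, not of the tool’s implementation:

#include <iostream>
#include <map>
#include <set>

// Contract linear chains of an evaluation order graph (EOG) to approximate a
// control flow graph (CFG): a node with a single successor whose successor has
// a single predecessor carries no branching information, so the pair is merged.
std::map<int, std::set<int>> contract(std::map<int, std::set<int>> succ) {
    std::map<int, std::set<int>> pred;                  // predecessor sets
    for (const auto& [u, outs] : succ)
        for (int v : outs) pred[v].insert(u);
    bool changed = true;
    while (changed) {
        changed = false;
        for (auto& [u, outs] : succ) {
            if (outs.size() != 1) continue;
            int v = *outs.begin();
            if (v == u || pred[v].size() != 1) continue;
            auto itv = succ.find(v);
            std::set<int> vOuts = (itv != succ.end()) ? itv->second : std::set<int>{};
            outs.erase(v);                              // merge v into u
            for (int w : vOuts) {
                outs.insert(w);
                pred[w].erase(v);
                pred[w].insert(u);
            }
            if (itv != succ.end()) succ.erase(itv);
            pred.erase(v);
            changed = true;
            break;                                      // iterators changed; rescan
        }
    }
    return succ;
}

int main() {
    // EOG of a straight-line block (1 -> 2 -> 3) ending in a branch (3 -> 4, 3 -> 5).
    std::map<int, std::set<int>> eog = {{1, {2}}, {2, {3}}, {3, {4, 5}}, {4, {}}, {5, {}}};
    for (const auto& [u, outs] : contract(eog)) {
        std::cout << u << " ->";
        for (int v : outs) std::cout << ' ' << v;
        std::cout << '\n';
    }
    return 0;
}

In the example, the three sequential nodes collapse into a single block with two outgoing edges, which is the branching structure the clustering features are computed from.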
The two final graphs obtained in the previous step, i.e., the DFG and CFG, are analyzed to compute the control flow and data flow features that compose the feature vector of a program. This vector includes the features presented in Table 2.
Each of these features has a different weight in the distance calculation. As our aim is to group programs by their algorithmic strategy, features related to the control flow have significantly more impact than those related to the data flow.
Finally, the resulting feature vector is the input to a k-means clustering algorithm [52]. This specific implementation of k-means randomly instantiates k centroids following a Gaussian distribution. We define k = 16 to limit the maximum number of formed clusters, which is a reasonable value considering the expected count of different algorithmic strategies in an academic-level programming assignment (note that this value can be defined explicitly per assignment). For each new observation, the algorithm measures the Euclidean distance from the new feature vector to all centroids to identify the closest one. If the submission is correct, the closest centroid’s position is updated according to the new element, using the product of their scalar distance and the algorithm’s learning rate (i.e., the inverse of the number of solutions assigned to the cluster during the process). Otherwise, the representative solution of the closest cluster is selected and returned for further processing.
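A minimal sketch of this incremental step is shown below: a weighted Euclidean distance over the feature vector, the selection of the closest centroid, and the centroid update with a learning rate equal to the inverse of the cluster’s element count. The concrete feature weights and the toy vectors are illustrative assumptions, not the values used by AsanasCluster:

#include <cmath>
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

struct Cluster {
    std::vector<double> centroid;
    std::size_t count;                 // solutions assigned to this cluster so far
};

// Weighted Euclidean distance; control flow features carry heavier weights
// than data flow features, so the algorithmic strategy dominates the grouping.
double distance(const std::vector<double>& a, const std::vector<double>& b,
                const std::vector<double>& w) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += w[i] * (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

std::size_t closest(const std::vector<Cluster>& clusters,
                    const std::vector<double>& x, const std::vector<double>& w) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        double d = distance(clusters[i].centroid, x, w);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

// Correct submission: move the closest centroid towards the new vector with a
// learning rate equal to the inverse of the cluster's element count.
void assign(Cluster& c, const std::vector<double>& x) {
    c.count += 1;
    double lr = 1.0 / static_cast<double>(c.count);
    for (std::size_t i = 0; i < c.centroid.size(); ++i)
        c.centroid[i] += lr * (x[i] - c.centroid[i]);
}

int main() {
    std::vector<double> w = {3.0, 3.0, 1.0};                       // illustrative weights
    std::vector<Cluster> clusters = {{{1, 1, 1}, 4}, {{3, 0, 2}, 2}};
    std::vector<double> x = {1, 2, 1};                             // new correct solution
    std::size_t i = closest(clusters, x, w);
    assign(clusters[i], x);
    std::cout << "assigned to cluster " << i << '\n';
    return 0;
}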
The model (or representative) solution of the closest cluster is selected on the basis of the number of lines of code, i.e., it is the correct solution whose number of lines of code is closest to that of the incorrect submission.
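This selection rule reduces to a simple argmin over the cluster’s members; the Solution record below is an assumed simplification of what the system stores for each accepted submission:

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

struct Solution {
    std::string code;
    int linesOfCode;
};

// Return the accepted solution whose length is closest to the incorrect submission's.
const Solution& representative(const std::vector<Solution>& cluster, int submissionLoc) {
    const Solution* best = &cluster.front();
    for (const Solution& s : cluster)
        if (std::abs(s.linesOfCode - submissionLoc) < std::abs(best->linesOfCode - submissionLoc))
            best = &s;
    return *best;
}

int main() {
    std::vector<Solution> cluster = {{"/* 12-line solution */", 12},
                                     {"/* 20-line solution */", 20},
                                     {"/* 16-line solution */", 16}};
    std::cout << representative(cluster, 15).linesOfCode << '\n';  // prints 16
    return 0;
}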

3.2. Program Comparison

AsanasAssist generates program repairs by approximating the incorrect source code to a model solution, i.e., the representative solution of the cluster with the most similar algorithmic strategies to the submitted program. For that, it reuses the code property graphs (CPGs) obtained from source code (for clustering) to compute the differences between the programs using an extended version of GumTree [53]. GumTree is a syntax-aware diff tool that operates at the abstract syntax tree (AST) level, enhancing traditional text-based diff tools by aligning edit actions with syntax and detecting moved or renamed elements in addition to deleted and inserted code. It is designed to work with various programming languages such as C, C++, Java, JavaScript, Python, R, and Ruby (see [54] for full language support information).
Our extension works on the CPG rather than the AST. An overview of the implemented process for comparing the program submitted by a student against the closest similar correct solution is presented in Figure 2. Firstly, the data flow information is used to identify and rename named elements (e.g., variables and method names) so that they match across the compared solutions. Then, the core algorithm of GumTree runs on the AST sub-component of the CPG, computing the edit operations, such as insert, update, move, and delete, that transform the CPG of the incorrect program into the CPG of the model solution.
For feedback purposes, differences are sorted by the type of operation and by their depth in the AST. In particular, edits closer to the root are performed first, and delete operations take priority, followed by insertions, moves, and updates. Such an ordering of edits aims to follow common patterns of code development, i.e., code is typically developed and executed sequentially from top to bottom. Furthermore, useless or wrong code is removed as the first step to simplify further edits. Finally, the nodes included in the differences selected to be delivered as feedback are translated back to code.
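One possible reading of this ordering, with the operation type as the primary key and the node depth as the secondary key, is sketched below; the Edit structure is an assumed simplification of the edit actions produced by the diff step, and the example fragments are made up:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

enum class Op { Delete = 0, Insert = 1, Move = 2, Update = 3 };  // priority order

struct Edit {
    Op op;
    int depth;            // depth of the affected node in the AST (root = 0)
    std::string snippet;  // code fragment translated back from the CPG node
};

// Deletions first, then insertions, moves, and updates; within the same
// operation type, edits closer to the root of the AST come first.
void sortEdits(std::vector<Edit>& edits) {
    std::stable_sort(edits.begin(), edits.end(), [](const Edit& a, const Edit& b) {
        if (a.op != b.op) return static_cast<int>(a.op) < static_cast<int>(b.op);
        return a.depth < b.depth;
    });
}

int main() {
    std::vector<Edit> edits = {{Op::Update, 3, "i % 2 == 0"},
                               {Op::Delete, 2, "int unused = 0;"},
                               {Op::Insert, 1, "cout << endl;"}};
    sortEdits(edits);
    for (const auto& e : edits)
        std::cout << static_cast<int>(e.op) << " (depth " << e.depth << "): " << e.snippet << '\n';
    return 0;
}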

3.3. Feedback Generation

Our feedback mechanism aims to be incremental and individualized. We define two feedback levels: error localization (first level) and repair suggestion (second level). Each level has inner levels corresponding to the amount of detail given in the message, which increases as the student repeats the same error. For instance, considering the incorrect source code in Listing 1 to print the first N even numbers, there are three inner levels at each level. Table 3 presents the feedback messages of each level.
Listing 1. Incorrect solution written in C++ to print the first N even numbers.
#include <iostream>

using namespace std;

int main()
{
    int N;
    cin >> N;
    for (int i = 1; i <= N; i++)
    {
        if (i == 0)
            cout << i << " ";
    }
    cout << endl;
    return 0;
}
To achieve this, we designed and implemented the process illustrated in Figure 3. Once a submission enters the system, its evaluation takes place. If the program is functionally correct, it is added to the clustering model, i.e., the submission is assigned to the closest cluster, and the cluster’s centroid is updated. Otherwise, the database is consulted for previously generated feedback. The database keeps track of feedback generated for previous submissions, as well as an automatically generated identifier that links students to their submissions. If no feedback has been given previously to the submitting student on the same programming assignment, the main flow of feedback generation executes the following: (1) identifying the cluster with algorithmic strategies most similar to the incorrect attempt (refer to Section 3.1); (2) computing syntactic and semantic differences between the representative solution of the closest cluster and the submitted source code (refer to Section 3.2); (3) sorting the obtained differences according to their relevance; (4) generating human-friendly messages from the top differences. Note that the maximum number of differences to include in the fourth step is a parameter set by the instructor.
On the next incorrect submission, the distance between the current and the last attempt is computed. If the distance is above a threshold value, the feedback is generated as for a first submission. Otherwise, the tool checks whether changes were made addressing the previously highlighted errors, increasing the level of detail if that was not the case or proceeding to the next difference. The threshold value is dynamically calculated from the average distance between centroids.
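The decision logic for a repeated incorrect submission can be summarized as below. This is only a sketch of the control flow described above; the inputs stand in for the distance computation, the dynamically derived threshold, and the check of whether the previously reported error was addressed, which are all provided by the earlier phases:

#include <iostream>
#include <string>

// Which kind of feedback to produce for an incorrect submission, given the
// outcome of the checks described in Section 3.3.
std::string nextFeedback(bool hasPreviousFeedback, double distanceToLastAttempt,
                         double threshold, bool previousErrorAddressed) {
    if (!hasPreviousFeedback || distanceToLastAttempt > threshold)
        return "run full pipeline: cluster, diff, sort, message";  // treat as a first attempt
    if (previousErrorAddressed)
        return "move on to the next difference";                   // same level of detail
    return "repeat the same difference with more detail";          // student repeated the error
}

int main() {
    // A student resubmits a slightly changed program without fixing the reported error.
    std::cout << nextFeedback(true, 0.2, 1.5, false) << '\n';
    return 0;
}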

4. Evaluation

The evaluation of the proposed tool is twofold. Firstly, we demonstrate the suitability of its run-time performance for the automated assessment of programming assignments in education. In the second phase, we compare the feedback automatically generated by the tool against what a human teaching assistant would provide. The next subsections describe both evaluation experiments and their outcomes, discuss the results of the evaluation, and present the threats to its validity.

4.1. Run Time

Automated assessment tools for programming assignments need to guarantee timely feedback to students. While no strict time limit is established for assessment using these tools, it is claimed that these systems deliver near real-time feedback [2], i.e., feedback should not take more than a couple of minutes. To evaluate this dimension, we run AsanasAssist on a public dataset of real undergraduate students’ solutions to programming assignments, PROGpedia [55]. PROGpedia is a heterogeneous dataset composed of 16 programming assignments of different complexities, covering several topics, with solutions written in C, C++, Java, and Python.
Experiments consist of measuring the time taken to process the following: (1) a correct solution, i.e., identify the closest cluster to this new solution, add it into the model, and re-compute clusters; (2) an incorrect solution, i.e., identify the cluster of solutions most similar to this newly submitted attempt, select the representative solution of the cluster as the model solution, and generate feedback as described in Section 3. Both experiments use clustering models built with all dataset solutions per assignment and programming language (only solutions compatible with C17, C++17, Java 8, or Python 3 have been taken into account) and consider a new submission. Table 4 presents the details of the clustering models, including, for each assignment and programming language, the number of submissions in the clustering model, the average number of lines of code in those solutions, and the number of formed clusters.
We measured run times on an Intel Core i7-8750H processor with 16 GB of RAM. The heatmaps of Figure 4 present the results of these experiments, in particular, the average time in seconds to process a functionally correct solution (on the left) and an incorrect one (on the right).
The maximum average time taken to process submissions is 3.7 and 27.9 s, while the minimum is 2.3 and 2.5 s, for correct and incorrect solutions, respectively. The processing time is correlated not only with the programs’ size (i.e., larger programs frequently imply more processing time) but also with the distance between the wrong source code and the closest accepted solution. There is no noticeable direct correlation between the programming language and the run-time performance other than through the size of the solution.

4.2. AsanasAssist vs. Teaching Assistants

The main goal of AsanasAssist is to support students in overcoming their mistakes, as a teaching assistant does when asked to review an incorrect code. Therefore, we aim to evaluate how well automatically generated feedback matches the feedback that a human teaching assistant would provide, similar to previous works on automated assessment [10,56]. To this end, we designed a questionnaire with numerous scenarios that teaching assistants may encounter when reviewing students’ code.
Scenarios cover different programming languages (C++, Java, and Python) and error types. Each scenario starts with a description of the programming assignment, followed by a series of questions presenting code submitted by past students and asking respondents which message they would deliver when reviewing that code. Questions follow the submission history of one or more students towards the solution, i.e., they evolve according to the consecutive errors from students, simulating a practice class. For instance, the first scenario presents the task of printing even numbers greater than 0 up to the number given as input (including it). Then, it provides the code submitted by a student containing two errors. The code is a former version of the program in Listing 1, with a wrong loop condition (i < N). Teaching assistants are asked to select the feedback message they would deliver from a set of valid messages covering both errors, individually and simultaneously, with multiple levels of detail. In the next question, the respondent is instructed to consider that the same student returns with a code where the loop condition is fixed (i.e., the code in Listing 1) and to decide which message would be provided in such a situation.
The questionnaire was delivered to 32 experienced teaching assistants (7 female), satisfying the minimum acceptable number of experts recommended for content validation [57]. Of these, 19 were Portuguese, 5 were Turkish, 2 were Lithuanian, 2 were Polish, 1 was British, 1 was Danish, 1 was Spanish, and 1 was Malaysian. The experience of participants as teaching assistants ranges from 2 up to 39 years. Responses were collected from 1 March 2024 to 11 June 2024. Figure 5 presents the average distance, in terms of content/detail, between teaching assistants’ feedback and automatically generated feedback by AsanasAssist (blue bars), as well as the standard deviation among teaching assistants’ responses (red dashed lines) for the use cases of the questionnaire. Positive distances mean that AsanasAssist gave messages with less detail than the average of teaching assistants’ responses, whereas negative values indicate that AsanasAssist delivered more detailed feedback. Possible distance values range between −2 (lowest detail) and 2 (highest detail). A value of 0 indicates a match on the feedback from all teaching assistants and AsanasAssist.
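The per-use-case statistics plotted in Figure 5 follow directly from this sign convention, as the short sketch below illustrates; the respondent values are made up for the example, and the detail scale is just an ordinal encoding of the message levels:

#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Detail levels chosen by teaching assistants for one use case (illustrative)
    // and the level chosen by AsanasAssist for the same use case.
    std::vector<int> taLevels = {2, 3, 3, 4, 2};
    int toolLevel = 3;

    // Signed distance per respondent: positive means the tool gave less detail.
    double mean = 0.0;
    for (int l : taLevels) mean += l - toolLevel;
    mean /= taLevels.size();

    // Standard deviation among the teaching assistants' responses.
    double var = 0.0;
    for (int l : taLevels) var += (l - toolLevel - mean) * (l - toolLevel - mean);
    var /= taLevels.size();

    std::cout << "average distance: " << mean
              << ", standard deviation: " << std::sqrt(var) << '\n';
    return 0;
}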
The Cronbach’s alpha coefficient for the ten questionnaire items is 0.865, suggesting a relatively high internal consistency among the questions. The maximum absolute distance is 0.438 out of 2, reached in use case Q1-2. One of the use cases (Q2-2) had a nearly matching decision on the feedback message by the respondents and the tool, with an average distance of 0.031. In general, teaching assistants give more detail than the tool, only providing less detail in 2 out of 10 use cases. For instance, in Q1-1, some respondents provided less detail than AsanasAssist. However, for Q1-2, in which the student repeated the exact same error as in the question immediately before (Q1-1), a few teaching assistants decided to reveal the correct code, while AsanasAssist and a few other respondents increased the detail slightly, not revealing the complete fix. For all use cases, the standard deviation among teaching assistants’ responses is greater than or equal to the average distance between teaching assistants and AsanasAssist.
We also explored whether the gender and experience of the teaching assistants influence the results. For that, we assume the heterogeneity of the groups (hypothesis) and calculate the Pearson’s chi-square (χ²) value for each questionnaire item. Table 5 shows the critical values with a significance level α = 0.05 and the obtained chi-square values, highlighting values higher than the critical value (ρ). Considering gender, the critical value is larger than all χ² values, which validates the hypothesis, i.e., we can conclude that the gender of the teaching assistants does not have statistical significance on the feedback delivered. For the experience level of participants, we considered intervals of 10 years (i.e., 0–10, 10–20, 20–30, and 30–40), finding a statistically significant difference only in Q2-3. From a closer look at the data, we conclude that while some teachers with less than 20 years of experience give slightly more or less detail than AsanasAssist, more experienced teachers completely agree with the level of detail offered by the tool. This seems to be generally applicable to all questionnaire items, although without statistical evidence.
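For completeness, the statistic referred to above can be computed from a contingency table of observed counts (groups of respondents by selected feedback message) as sketched below; the table values are illustrative, and the resulting χ² would then be compared against the critical value for (rows − 1) × (columns − 1) degrees of freedom at α = 0.05:

#include <cstddef>
#include <iostream>
#include <vector>

// Pearson chi-square statistic for an observed contingency table, with the
// expected counts derived from the row and column marginals.
double chiSquare(const std::vector<std::vector<double>>& obs) {
    std::size_t rows = obs.size(), cols = obs[0].size();
    std::vector<double> rowSum(rows, 0.0), colSum(cols, 0.0);
    double total = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) {
            rowSum[i] += obs[i][j];
            colSum[j] += obs[i][j];
            total += obs[i][j];
        }
    double chi2 = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) {
            double expected = rowSum[i] * colSum[j] / total;
            chi2 += (obs[i][j] - expected) * (obs[i][j] - expected) / expected;
        }
    return chi2;
}

int main() {
    // Illustrative 2x3 table: two respondent groups, three candidate messages.
    std::vector<std::vector<double>> obs = {{10, 12, 3}, {2, 4, 1}};
    std::cout << "chi-square = " << chiSquare(obs) << '\n';
    return 0;
}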
The last question, not included in the figure, aims to understand which detail levels the teaching assistants consider when reviewing students’ code. More than half of the respondents included messages of three distinct levels of detail, while the remaining considered the same five messages as AsanasAssist.

4.3. Discussion

Previous studies comparing feedback from automated assessment tools against teaching assistants’ feedback demonstrated that tools have better grading precision but lack feedback quality [5,56]. AsanasAssist brings incremental and individualized feedback into the automated assessment of programming assignments, aiming to achieve the pedagogical effectiveness of human teaching assistants while guaranteeing timely feedback. The conducted evaluation covers two objectives: measuring the time taken to generate feedback and the distance to the feedback delivered by human teaching assistants. Both yielded satisfactory results.
Generating feedback for incorrect solutions has a noticeable extra cost compared to adding a solution to the clustering model of known correct submissions (see Figure 4). Nevertheless, the total time spent generating feedback does not exceed 30 s in any of the evaluated use cases, which is in line with the run-time performance of automated assessment tools [2]. In comparison, CLARA [7] (the most similar tool described in the literature, to the best of the authors’ knowledge) can take 104.2 min on average for programs with less than 50 lines of code [8].
Feedback messages from AsanasAssist can be mixed with those of teaching assistants without standing out, as the average distance to teaching assistants’ responses is always lower than or equal to the standard deviation among teaching assistants. However, teaching assistants were shown to use slightly fewer “levels” of feedback detail. In particular, a few teaching assistants indicated a complete fix on the second or third feedback iteration. This can be due to multiple reasons, such as teaching assistants’ lack of time or the intention to prevent students’ frustration. Nonetheless, AsanasAssist allows teachers to skip some feedback levels through execution parameters.
More experienced teachers tended to provide feedback closer to that automatically generated by the tool, even though this is only statistically significant for one of the items. Also, female teaching assistants who participated in the evaluation provided more detailed feedback than male respondents (and AsanasAssist) to students repeating the same errors, but the difference between genders is not statistically significant. Nevertheless, in such a small sample, personal traits may distort group tendencies.

4.4. Threats to Validity

The major threat to the validity of this evaluation is the difference among the human teaching assistants, who may have separate strategies for handling students struggling with programming assignments. This is particularly relevant in small samples. We, however, selected teaching assistants of both genders and multiple nationalities who had lectured introductory programming for at least one year to mitigate this issue. The group differences based on gender and nationality were also investigated, finding no statistically significant relationship between the gender of the teaching assistant and the selected feedback message.
Moreover, we assessed our approach on the small to medium-sized programs that are usually encountered in introductory programming assignments. Although this aligns with previous research, we intend to further explore the applicability of our approach to larger programs, possibly spanning multiple files. Although CPG extraction is built to manage large projects, obtaining the set of differences between much larger graphs may be problematic in terms of time.

5. Conclusions and Future Work

In this paper, we present a novel approach and implementation to automatically generate feedback for introductory programming assignments. The key idea behind our approach is to use the existing correct solutions, either from instructors or students who have already solved the assignment, to pinpoint errors and provide hints on how to repair incorrect student attempts. Our evaluation shows that AsanasAssist can generate feedback messages similar to those that a teaching assistant would provide to help struggling students progress. Furthermore, feedback is delivered in under a minute in all tested scenarios, which guarantees seamless integration into the automated assessment systems from the student’s perspective.
While matching the feedback contents of human teaching assistants is a good indicator of the tool’s usefulness, only comparing two heterogeneous groups, one using AsanasAssist and the other using a common automated assessment tool supported by human teaching assistants, in a real educational setting can assess its pedagogical impact. We aim to do such an experimental evaluation in a one-semester introductory programming course with at least 60 undergraduate students randomly split into two groups. The evaluation should cover task effectiveness (objective) and student perspective (subjective).
As mentioned in Section 2.4, we intend to explore GenAI in the last phase of the feedback generation process to generate the actual feedback messages after having selected a target solution. We will use specific “prompts” to produce messages with the same levels of detail defined in Section 3. Another planned extension is the integration of program canonicalization techniques to reduce the representative program of each cluster. This can improve the computed set of differences, limiting it to the ones required for the correct functionality.
Finally, we aim to validate AsanasAssist on large programs spread over multiple files, such as those found in advanced programming courses.

Author Contributions

Conceptualization, J.C.P., J.P.L. and Á.F.; Data curation, J.C.P.; Formal analysis, J.C.P.; Funding acquisition, J.C.P.; Investigation, J.C.P.; Methodology, J.C.P., J.P.L. and Á.F.; Project administration, J.C.P., J.P.L. and Á.F.; Resources, J.C.P.; Software, J.C.P.; Supervision, J.P.L. and Á.F.; Validation, J.C.P., J.P.L. and Á.F.; Visualization, J.C.P., J.P.L. and Á.F.; Writing—original draft, J.C.P.; Writing—review and editing, J.C.P., J.P.L. and Á.F. All authors read and agreed to the published version of the manuscript.

Funding

J.C.P.’s work is funded by the FCT—Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology), Portugal, for the Ph.D. Grant 2020.04430.BD.

Data Availability Statement

Data are available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Paiva, J.C.; Leal, J.P.; Figueira, A. Automated Assessment in Computer Science Education: A State-of-the-Art Review. ACM Trans. Comput. Educ. 2022, 22, 1–40. [Google Scholar] [CrossRef]
  2. Ala-Mutka, K.; Uimonen, T.; Jarvinen, H.M. Supporting Students in C++ Programming Courses with Automatic Program Style Assessment. J. Inf. Technol. Educ. Res. 2004, 3, 245–262. [Google Scholar] [CrossRef] [PubMed]
  3. Souza, D.M.; Felizardo, K.R.; Barbosa, E.F. A Systematic Literature Review of Assessment Tools for Programming Assignments. In Proceedings of the 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), Dallas, TX, USA, 5–6 April 2016; pp. 147–156. [Google Scholar] [CrossRef]
  4. Keuning, H.; Jeuring, J.; Heeren, B. A Systematic Literature Review of Automated Feedback Generation for Programming Exercises. ACM Trans. Comput. Educ. 2019, 19, 1–43. [Google Scholar] [CrossRef]
  5. Kristiansen, N.G.; Nicolajsen, S.M.; Brabrand, C. Feedback on Student Programming Assignments: Teaching Assistants vs Automated Assessment Tool. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, New York, NY, USA, 12–17 November 2024. Koli Calling ’23. [Google Scholar] [CrossRef]
  6. Paiva, J.C.; Figueira, A.; Leal, J.P. Bibliometric Analysis of Automated Assessment in Programming Education: A Deeper Insight into Feedback. Electronics 2023, 12, 2254. [Google Scholar] [CrossRef]
  7. Gulwani, S.; Radiček, I.; Zuleger, F. Automated clustering and program repair for introductory programming assignments. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Philadelphia, PA, USA, 18–22 June 2018; PLDI 2018. pp. 465–480. [Google Scholar] [CrossRef]
  8. Perry, D.M.; Kim, D.; Samanta, R.; Zhang, X. SemCluster: Clustering of Imperative Programming Assignments Based on Quantitative Semantic Features. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, Phoenix, AZ, USA, 22–26 June 2019; PLDI 2019. pp. 860–873. [Google Scholar] [CrossRef]
  9. Gao, L.; Wan, B.; Fang, C.; Li, Y.; Chen, C. Automatic Clustering of Different Solutions to Programming Assignments in Computing Education. In Proceedings of the ACM Conference on Global Computing Education, Chengdu, China, 9–19 May 2019; CompEd ’19. pp. 164–170. [Google Scholar] [CrossRef]
  10. Koivisto, T.; Hellas, A. Evaluating CodeClusters for Effectively Providing Feedback on Code Submissions. In Proceedings of the 2022 IEEE Frontiers in Education Conference (FIE), Uppsala, Sweden, 8–11 October 2022; pp. 1–9. [Google Scholar] [CrossRef]
  11. Li, S.; Xiao, X.; Bassett, B.; Xie, T.; Tillmann, N. Measuring code behavioral similarity for programming and software engineering education. In Proceedings of the 38th International Conference on Software Engineering Companion, Austin, TX, USA, 14–22 May 2016; ICSE ’16. pp. 501–510. [Google Scholar] [CrossRef]
  12. Kaleeswaran, S.; Santhiar, A.; Kanade, A.; Gulwani, S. Semi-Supervised Verified Feedback Generation. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, WA, USA, 13–18 November 2016; FSE 2016. pp. 739–750. [Google Scholar] [CrossRef]
  13. Head, A.; Glassman, E.; Soares, G.; Suzuki, R.; Figueredo, L.; D’Antoni, L.; Hartmann, B. Writing Reusable Code Feedback at Scale with Mixed-Initiative Program Synthesis. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, Cambridge, MA, USA, 20–21 April 2017; L@S ’17. pp. 89–98. [Google Scholar] [CrossRef]
  14. Chow, S.; Yacef, K.; Koprinska, I.; Curran, J. Automated Data-Driven Hints for Computer Programming Students. In Proceedings of the Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, Bratislava, Slovakia, 9–12 July 2017; UMAP ’17. pp. 5–10. [Google Scholar] [CrossRef]
  15. Emerson, A.; Smith, A.; Rodriguez, F.J.; Wiebe, E.N.; Mott, B.W.; Boyer, K.E.; Lester, J.C. Cluster-Based Analysis of Novice Coding Misconceptions in Block-Based Programming. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, Portland, OR, USA, 11–14 March 2020; SIGCSE ’20. pp. 825–831. [Google Scholar] [CrossRef]
  16. Nguyen, A.; Piech, C.; Huang, J.; Guibas, L. Codewebs: Scalable homework search for massive open online programming courses. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; WWW ’14. pp. 491–502. [Google Scholar] [CrossRef]
  17. Glassman, E.L.; Scott, J.; Singh, R.; Guo, P.J.; Miller, R.C. OverCode: Visualizing Variation in Student Solutions to Programming Problems at Scale. ACM Trans. Comput.-Hum. Interact. 2015, 22, 1–35. [Google Scholar] [CrossRef]
  18. Wang, K.; Singh, R.; Su, Z. Dynamic Neural Program Embedding for Program Repair. arXiv 2017, arXiv:1711.07163. [Google Scholar]
  19. Goues, C.L.; Pradel, M.; Roychoudhury, A. Automated program repair. Commun. ACM 2019, 62, 56–65. [Google Scholar] [CrossRef]
  20. Gazzola, L.; Micucci, D.; Mariani, L. Automatic Software Repair: A Survey. IEEE Trans. Softw. Eng. 2019, 45, 34–67. [Google Scholar] [CrossRef]
  21. Könighofer, R.; Bloem, R. Automated error localization and correction for imperative programs. In Proceedings of the International Conference on Formal Methods in Computer-Aided Design, Austin, TX, USA, 30 October–2 November 2011; FMCAD ’11. pp. 91–100. [Google Scholar]
  22. Nguyen, H.D.T.; Qi, D.; Roychoudhury, A.; Chandra, S. SemFix: Program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, San Francisco, CA, USA, 18–26 May 2013; ICSE ’13. pp. 772–781. [Google Scholar]
  23. Debroy, V.; Wong, W.E. Using Mutation to Automatically Suggest Fixes for Faulty Programs. In Proceedings of the 2010 Third International Conference on Software Testing, Verification and Validation, Paris, France, 6–10 April 2010; pp. 65–74. [Google Scholar] [CrossRef]
  24. Forrest, S.; Nguyen, T.; Weimer, W.; Le Goues, C. A genetic programming approach to automated software repair. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, Montreal, QC, Canada, 8–12 July 2009; GECCO ’09. pp. 947–954. [Google Scholar] [CrossRef]
  25. Weimer, W.; Forrest, S.; Le Goues, C.; Nguyen, T. Automatic program repair with evolutionary computation. Commun. ACM 2010, 53, 109–116. [Google Scholar] [CrossRef]
  26. Zhang, Q.; Fang, C.; Ma, Y.; Sun, W.; Chen, Z. A Survey of Learning-based Automated Program Repair. ACM Trans. Softw. Eng. Methodol. 2023, 33, 1–69. [Google Scholar] [CrossRef]
  27. Yi, J.; Ahmed, U.Z.; Karkare, A.; Tan, S.H.; Roychoudhury, A. A feasibility study of using automated program repair for introductory programming assignments. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany, 4–8 September 2017; ESEC/FSE 2017. pp. 740–751. [Google Scholar] [CrossRef]
  28. Le Goues, C.; Nguyen, T.; Forrest, S.; Weimer, W. GenProg: A Generic Method for Automatic Software Repair. IEEE Trans. Softw. Eng. 2012, 38, 54–72. [Google Scholar] [CrossRef]
  29. Weimer, W.; Fry, Z.P.; Forrest, S. Leveraging program equivalence for adaptive program repair: Models and first results. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, Silicon Valley, CA, USA, 11–15 November 2013; ASE ’13. pp. 356–366. [Google Scholar] [CrossRef]
  30. Long, F.; Rinard, M. Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, St. Petersburg, FL, USA, 20–22 January 2016; POPL ’16. pp. 298–312. [Google Scholar] [CrossRef]
  31. Mechtaev, S.; Yi, J.; Roychoudhury, A. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 14–22 May 2016; ICSE ’16. pp. 691–701. [Google Scholar] [CrossRef]
  32. Singh, R.; Gulwani, S.; Solar-Lezama, A. Automated feedback generation for introductory programming assignments. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, WA, USA, 16–19 June 2013; PLDI ’13. pp. 15–26. [Google Scholar] [CrossRef]
  33. Rolim, R.; Soares, G.; D’Antoni, L.; Polozov, O.; Gulwani, S.; Gheyi, R.; Suzuki, R.; Hartmann, B. Learning syntactic program transformations from examples. In Proceedings of the 39th International Conference on Software Engineering, Buenos Aires, Argentina, 20–28 May 2017; ICSE ’17. pp. 404–415. [Google Scholar] [CrossRef]
  34. Rivers, K.; Koedinger, K.R. Data-Driven Hint Generation in Vast Solution Spaces: A Self-Improving Python Programming Tutor. Int. J. Artif. Intell. Educ. 2017, 27, 37–64. [Google Scholar] [CrossRef]
  35. Wang, K.; Singh, R.; Su, Z. Search, align, and repair: Data-driven feedback generation for introductory programming exercises. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Philadelphia, PA, USA, 18–22 June 2018; PLDI 2018. pp. 481–495. [Google Scholar] [CrossRef]
  36. Ihantola, P.; Ahoniemi, T.; Karavirta, V.; Seppälä, O. Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research, Koli, Finland, 28–31 October 2010; Koli Calling ’10. pp. 86–93.
  37. Messer, M.; Brown, N.C.C.; Kölling, M.; Shi, M. Automated Grading and Feedback Tools for Programming Education: A Systematic Review. ACM Trans. Comput. Educ. 2024, 24, 1–43.
  38. Leite, A.; Blanco, S.A. Effects of Human vs. Automatic Feedback on Students’ Understanding of AI Concepts and Programming Style. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, Portland, OR, USA, 11–14 March 2020; SIGCSE ’20. pp. 44–50.
  39. Feldman, M.Q.; Wang, Y.; Byrd, W.E.; Guimbretière, F.; Andersen, E. Towards answering “Am I on the right track?” automatically using program synthesis. In Proceedings of the 2019 ACM SIGPLAN Symposium on SPLASH-E, New York, NY, USA, 25 October 2019; SPLASH-E 2019. pp. 13–24.
  40. Ahmed, U.Z.; Sindhgatta, R.; Srivastava, N.; Karkare, A. Targeted example generation for compilation errors. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, CA, USA, 11–15 November 2019; ASE ’19. pp. 327–338.
  41. Finnie-Ansley, J.; Denny, P.; Luxton-Reilly, A.; Santos, E.A.; Prather, J.; Becker, B.A. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI’s Codex on CS2 Programming Exercises. In Proceedings of the 25th Australasian Computing Education Conference, Melbourne, Australia, 30 January–3 February 2023; ACE ’23. pp. 97–104.
  42. Denny, P.; Kumar, V.; Giacaman, N. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, Toronto, ON, Canada, 15–18 March 2023; SIGCSE 2023. pp. 1136–1142.
  43. Barke, S.; James, M.B.; Polikarpova, N. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proc. ACM Program. Lang. 2023, 7, 85–111.
  44. Prather, J.; Reeves, B.N.; Denny, P.; Becker, B.A.; Leinonen, J.; Luxton-Reilly, A.; Powell, G.; Finnie-Ansley, J.; Santos, E.A. “It’s Weird That it Knows What I Want”: Usability and Interactions with Copilot for Novice Programmers. ACM Trans. Comput.-Hum. Interact. 2023, 31, 1–31.
  45. Zastudil, C.; Rogalska, M.; Kapp, C.; Vaughn, J.; MacNeil, S. Generative AI in Computing Education: Perspectives of Students and Instructors. arXiv 2023.
  46. Dunder, N.; Lundborg, S.; Wong, J.; Viberg, O. Kattis vs ChatGPT: Assessment and Evaluation of Programming Tasks in the Age of Artificial Intelligence. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; LAK ’24. pp. 821–827.
  47. Prather, J.; Reeves, B.N.; Leinonen, J.; MacNeil, S.; Randrianasolo, A.S.; Becker, B.A.; Kimmel, B.; Wright, J.; Briggs, B. The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research—Volume 1, Melbourne, Australia, 13–15 August 2024; ICER ’24. pp. 469–486.
  48. Paiva, J.C.; Leal, J.P.; Figueira, Á. Clustering source code from automated assessment of programming assignments. Int. J. Data Sci. Anal. 2024.
  49. Weiss, K.; Banse, C. A Language-Independent Analysis Platform for Source Code. arXiv 2022.
  50. Fraunhofer AISEC. Code Property Graph. 2023. Available online: https://fraunhofer-aisec.github.io/cpg/ (accessed on 20 May 2023).
  51. Yamaguchi, F.; Golde, N.; Arp, D.; Rieck, K. Modeling and Discovering Vulnerabilities with Code Property Graphs. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 18–21 May 2014; pp. 590–604.
  52. Sculley, D. Web-Scale k-Means Clustering. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 29–30 April 2010; WWW ’10. pp. 1177–1178.
  53. Falleri, J.R.; Morandat, F.; Blanc, X.; Martinez, M.; Monperrus, M. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, Vasteras, Sweden, 15–19 September 2014; ASE ’14. pp. 313–324.
  54. Falleri, J.-R. GumTree Languages. 2023. Available online: https://github.com/GumTreeDiff/gumtree/wiki/Languages (accessed on 17 February 2025).
  55. Paiva, J.C.; Leal, J.P.; Figueira, Á. PROGpedia: Collection of source-code submitted to introductory programming assignments. Data Brief 2023, 46, 108887.
  56. Parihar, S.; Dadachanji, Z.; Singh, P.K.; Das, R.; Karkare, A.; Bhattacharya, A. Automatic Grading and Feedback using Program Repair for Introductory Programming Courses. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education, Bologna, Italy, 3–5 July 2017; ITiCSE ’17. pp. 92–97.
  57. Yusoff, M.S.B. ABC of Response Process Validation and Face Validity Index Calculation. Educ. Med. J. 2019, 11, 55–61.
Figure 1. An overview of the implemented program clustering approach.
Figure 2. An overview of the program comparison process of AsanasAssist.
Figure 3. Flowchart of the automatic feedback generation process.
Figure 4. Heatmaps of the performance run-time for feedback generation (times in seconds).
Figure 5. Average distance between responses from teaching assistants and AsanasAssist (blue bars) and standard deviation among teaching assistants’ responses (red dashed lines) for the use cases of the questionnaire.
Table 1. Fulfilment of the objectives by similar tools.
Tool | O1 | O2 | O3 | O4 | O5
CLARA [7]
Overcode [17]
SemCluster [8]
Sarfgen [35]
Table 2. Features included in the vector of a program.
Feature | Description | Origin
connected_components | Number of connected components in the intra-procedural control flow graph. | CFG
loop_statements | Number of loop statements. | CFG
conditional_statements | Number of conditional statements. | CFG
cycles | Number of cycles in the control flow graph. | CFG
paths | Number of different paths in the control flow graph. | CFG
cyclomatic_complexity | Quantitative measure of the number of possible execution paths. | CFG
variable_count | Number of used variables in the program. | DFG
total_reads | Total number of read operations on variables. | DFG
total_writes | Total number of write operations on variables. | DFG
max_reads | Maximum number of read operations on a single variable. | DFG
max_writes | Maximum number of write operations on a single variable. | DFG
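To illustrate how the features in Table 2 can be combined, the following Python sketch assembles them into a flat numeric vector suitable for clustering. The class name, field layout, and example values are illustrative assumptions for this sketch and do not reflect AsanasAssist's actual implementation.

# Hypothetical sketch: assemble the Table 2 feature vector from
# pre-computed control flow graph (CFG) and data flow graph (DFG) counts.
from dataclasses import dataclass, astuple

@dataclass
class ProgramFeatures:
    connected_components: int    # CFG
    loop_statements: int         # CFG
    conditional_statements: int  # CFG
    cycles: int                  # CFG
    paths: int                   # CFG
    cyclomatic_complexity: int   # CFG
    variable_count: int          # DFG
    total_reads: int             # DFG
    total_writes: int            # DFG
    max_reads: int               # DFG
    max_writes: int              # DFG

    def as_vector(self):
        """Return the features as a flat numeric vector for clustering."""
        return list(astuple(self))

# Example with made-up values for a small program.
vec = ProgramFeatures(1, 2, 3, 2, 5, 4, 6, 14, 9, 5, 3).as_vector()
print(vec)  # [1, 2, 3, 2, 5, 4, 6, 14, 9, 5, 3]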
Table 3. Incremental feedback messages for the code in Listing 1.
Level | Message
Localization | Some code is missing at or near the if block.
Localization | Some code is missing at or near line 11.
Localization | Some code is missing at or near line 11 column 10.
Repair | A binary operation is missing in the if condition at or near line 11 column 10.
Repair | A binary operation (%) is missing in the if condition at or near line 11 column 10.
Repair | The if condition must be ’i % 2 == 0’ at or near line 11.
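The following Python sketch shows one way the incremental levels in Table 3 could be modelled: messages ordered by increasing detail, with the next level released on each new request. The FeedbackLevel enumeration and the next_message helper are hypothetical names used only for this illustration, not the tool's API.

# Illustrative sketch: return a feedback message whose level of detail
# grows with the number of hints the student has already requested.
from enum import IntEnum

class FeedbackLevel(IntEnum):
    BLOCK = 0     # name only the enclosing block
    LINE = 1      # add the line number
    COLUMN = 2    # add the column
    KIND = 3      # name the kind of missing node
    OPERATOR = 4  # reveal the operator involved
    SOLUTION = 5  # reveal the expected code

# Messages mirroring Table 3 (Listing 1 example), ordered by detail.
MESSAGES = [
    "Some code is missing at or near the if block.",
    "Some code is missing at or near line 11.",
    "Some code is missing at or near line 11 column 10.",
    "A binary operation is missing in the if condition at or near line 11 column 10.",
    "A binary operation (%) is missing in the if condition at or near line 11 column 10.",
    "The if condition must be 'i % 2 == 0' at or near line 11.",
]

def next_message(requests_so_far: int) -> str:
    """Return the message for the next detail level, capped at the full repair."""
    level = min(requests_so_far, FeedbackLevel.SOLUTION)
    return MESSAGES[level]

print(next_message(0))  # block-level localization
print(next_message(5))  # full repair suggestion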
Table 4. Performance run-time for feedback generation.
ID | # of Submissions (C / C++ / J / PY) | Avg. LoC (C / C++ / J / PY) | Nr. of Clusters (C / C++ / J / PY)
0640-1006430-36224-54
1620-1053032-45174-44
181-61573-166571-31
192-6613988-141981-21
212-21112137-227891-33
223-526055-90283-75
23--7138141-189631-31
3417226205-503431-745-
357624140-606060-434-
397525154-967788-748-
425826138-676665-944-
437732178-524952-623-
455421148-495051-825-
482924136-494956-233-
53143152-110119148-143-
5612285-7695110-144-
Table 5. Critical (with α = 0.05) and chi-square values for gender vs. feedback message and experience vs. feedback message of each question. The value marked with an asterisk (*) indicates statistical significance.
Item | Gender: critical value (α = 0.05) | Gender: χ² | Experience: critical value (α = 0.05) | Experience: χ²
Q1-1 | 7.815 | 3.581 | 21.026 | 10.240
Q1-2 | 9.488 | 4.707 | 21.026 | 13.030
Q2-1 | 9.488 | 2.402 | 21.026 | 17.410
Q2-2 | 7.815 | 4.258 | 16.919 | 7.978
Q2-3 | 9.488 | 0.922 | 16.919 | 17.005 *
Q2-4 | 7.815 | 4.080 | 12.592 | 3.413
Q2-5 | 5.991 | 1.974 | 12.592 | 7.858
Q3-1 | 5.991 | 0.922 | 12.592 | 2.453
Q3-2 | 5.991 | 0.922 | 7.815 | 3.467
Q3-3 | 7.815 | 3.898 | 16.919 | 8.583
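For reference, the critical values in Table 5 are the standard upper-tail χ² quantiles at α = 0.05 for the corresponding degrees of freedom (e.g., 16.919 for 9 degrees of freedom). The short Python sketch below shows how such a critical value and a χ² test statistic can be computed with SciPy; the contingency table is made up purely for illustration and is not the study's data.

# Sketch: chi-square critical value and test of independence (SciPy).
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Upper-tail critical value at alpha = 0.05 for 9 degrees of freedom (~16.919),
# matching the Experience column of several items in Table 5.
print(round(chi2.ppf(0.95, df=9), 3))

# Chi-square test of independence on a hypothetical
# experience-vs-preferred-feedback-message contingency table.
observed = np.array([
    [5, 3, 2, 1],
    [4, 6, 3, 2],
    [2, 2, 5, 4],
])
stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")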
