Peer-Review Record

Assessment Automation of Complex Student Programming Assignments

Educ. Sci. 2024, 14(1), 54; https://doi.org/10.3390/educsci14010054
by Matija Novak * and Dragutin Kermek
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 29 September 2023 / Revised: 10 December 2023 / Accepted: 11 December 2023 / Published: 1 January 2024
(This article belongs to the Special Issue Application of New Technologies for Assessment in Higher Education)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper explores the difficulties of automating the evaluation of complex programming assignments and suggests a method using custom scripts. According to the authors, pre-built solutions are frequently limited in their capacity to grade challenging assignments, whereas custom scripts provide greater flexibility and control. They also show how to automate various components of the evaluation process, such as compilation, testing, and grading. 

The work may be improved by providing clearer evidence of the advantages of employing custom scripts. The authors may, for example, present data on the time saved by using custom scripts or the improvement in assessment quality through better and more insightful feedback. The articulation of the limitations of pre-existing tools is lacking in clarity. Including a comparison table would improve the clarity and strength of the argument presented in this study. The paper briefly mentions some non-functional requirements that cannot be automated, but it does not delve into the limitations of automation in detail. It would be valuable to explore the challenges and potential pitfalls of automation, such as false positives/negatives in the testing by the different tools over time. The quality of the paper can further be improved with more recent references, for example [1].

However, the work makes an important contribution to the assessment automation literature. The authors' insights into the difficulties of automating complex programming assignments are sound, and their recommended process is an excellent starting point for anyone wishing to automate the assessment of their own programming assignments.

[1] Messer, M., Brown, N. C., Kölling, M., & Shi, M. (2023). Automated Grading and Feedback Tools for Programming Education: A Systematic Review. arXiv preprint arXiv:2306.11722.

Comments on the Quality of English Language

Minor editing is required.

Some of the text within images is difficult to read. 

Author Response

Thank you for the review. We have revised the manuscript according to the comments. The new version of the manuscript with the changes is attached, and here are the point-by-point answers to all reviewers' comments.

Comment: The work may be improved by providing clearer evidence of the advantages of employing custom scripts. The authors may, for example, present data on the time saved by using custom scripts or the improvement in assessment quality through better and more insightful feedback.
Answer: We think this is an excellent suggestion. One of the authors has been measuring his time for over 10 years using the Pomodoro technique. Because of that, we know exactly how much time was spent on grading the assignment in case 1 from 2018-2019 to 2022-2023. In 2018-2019 and 2019-2020 grading was fully manual, while from 2020-2021 onwards we started building scripts. These data are presented in Table 1 in the Discussion section. For case 2, a course that is only two years old, we were fortunate that the review arrived at this moment, since we had just finished grading the assignment. To make an estimate, we afterwards graded three student submissions manually, measured the time, and calculated the average grading time per student. This number was multiplied by the number of students to get an estimated total time needed to grade everything manually. These data are presented in Table 2 in the Discussion section. We have added a paragraph describing this in the Discussion section when answering the question "What is the time benefit of using a semi-automated process over a fully manual process?".
For the question "How much time is required for a custom implementation script?" we have expanded the answer by adding: "In Tables 1 and 2, the Script creation column gives the exact amount of time spent on building a custom script for an assignment in a particular academic year."
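
For illustration, the estimate described above reduces to a simple average-and-multiply calculation; the following is a minimal Python sketch in which all numbers are hypothetical placeholders, not figures from the manuscript.

# Estimate of total manual grading time: grade a small sample manually,
# average the time per student, multiply by the cohort size.
# All numbers below are hypothetical placeholders.
sample_minutes = [42.0, 55.0, 48.0]   # manual grading times of 3 sampled students
avg_per_student = sum(sample_minutes) / len(sample_minutes)

num_students = 120                    # size of the whole cohort (placeholder)
estimated_total_hours = avg_per_student * num_students / 60

print(f"Average per student: {avg_per_student:.1f} min")
print(f"Estimated fully manual grading: {estimated_total_hours:.1f} h")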

Comment: The articulation of the limitations of pre-existing tools is lacking in clarity. Including a comparison table would improve the clarity and strength of the argument presented in this study.
Answer: We did not add a table comparing limitations since such a comparison has already been published. This was mentioned in the related work, but we agree it was not clearly stated, so we added the following to the Discussion: "Pre-built tools might have the benefit of a clearly defined interface, but they come with the restriction of a programming language and expectations of how and what to test. The biggest problem is that the existing pre-built tools do not have all the elements we needed, so a semi-automated option is the best approach. As stated in [8], 'They typically struggle, however, to grade design quality aspects and to identify student misunderstandings.' For an in-depth comparison of pre-built tools, we recommend Table 10 in [3], where 30 tools are compared. In addition, limitations of tools can be found in articles like [27]."

Comment: The paper briefly mentions some non-functional requirements that cannot be automated, but it does not delve into the limitations of automation in detail. It would be valuable to explore the challenges and potential pitfalls of automation, such as false positives/negatives in the testing by the different tools over time.
Answer, part 1: We have expanded on this in the Discussion while answering the question "Which parts can be automated?": "In our case, we first run the scripts to test correctness, and later we manually go through the web application interface to assess usability, the design of the web pages, and the overall user experience. Regarding design patterns, certain patterns were covered in lectures, and students are expected to use those and not some other design patterns. Regarding source-code quality, we look at how well the variables, functions, and classes are named, how easily we can find our way around the code, and how well the documentation (comments) is written."
Answer, part 2: Since we use custom scripts, there are no false positives or false negatives. To explain that, along with some pitfalls and challenges, we have added a description when answering the question "Are custom scripts difficult to implement?": "While implementing scripts is not difficult, there is one big question each year: does the script work correctly? To ensure that the script works, it first needs to be tested, and for that we use the submissions of five students. To choose the five students, we run a self-evaluation survey in which students estimate the completeness of their assignment. Students who rate themselves at 90% or more are considered. Since there are usually more than five such students, the most promising five are chosen based on the teacher's knowledge from exams or from activity during class. These students' submissions are tested first with the scripts; their solutions essentially serve as test cases for the script. If some of the script tests fail, the tests are checked manually, along with the source code of the student's submission. If there is an issue, the script is corrected; if not, it is tested with the next student. Usually all issues are resolved with the first three students. Typical issues are an unintentional mistake in the assignment description, a point that was not clear enough, or a misunderstanding by a student whose solution we nevertheless think should be accepted as correct. In rare cases a mistake is found while grading some other student; in that case the test case is corrected and rerun for all students. While all of this takes more time, it ensures that there are no mistakes in grading. After the grades are published, students receive feedback and can come to a consultation if they think some parts were not graded correctly. At this stage, we have never had a situation where a student did not get points because of a failure in the test script. This confirms to us that the semi-automated approach was done well."
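
The validation loop described in this answer can be sketched as follows; this is a minimal Python illustration in which the grading check, the manual-inspection step, and the submissions are hypothetical stand-ins for the course-specific scripts.

# Sketch of the script-validation loop: the strongest submissions act as
# test cases for the grading script itself. Everything here is a
# hypothetical stand-in for the course-specific scripts.
def grading_script(submission):
    # Return a list of failed checks for one submission.
    return [] if submission["output"] == "expected" else ["unexpected output"]

def script_is_at_fault(failures, submission):
    # In practice this is a manual inspection of the student's source code;
    # here we simply flag solutions marked as acceptable variants.
    return submission.get("acceptable_variant", False)

reference_submissions = [                                  # placeholder data
    {"name": "student1", "output": "expected"},
    {"name": "student2", "output": "variant", "acceptable_variant": True},
    {"name": "student3", "output": "expected"},
]

for sub in reference_submissions:
    failures = grading_script(sub)
    if failures and script_is_at_fault(failures, sub):
        print(f"{sub['name']}: correct the script/test case, rerun for all students")
    elif failures:
        print(f"{sub['name']}: genuine mistakes -> {failures}")
    else:
        print(f"{sub['name']}: passes")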

Comment: The quality of the paper can further be improved with more recent references, for example [1]: Messer, M., Brown, N. C., Kölling, M., & Shi, M. (2023). Automated Grading and Feedback Tools for Programming Education: A Systematic Review. arXiv preprint arXiv:2306.11722.
Answer: Thank you for that paper; we have included it in the related work section: "The most recent systematic review on the topic 'Automated Grading and Feedback Tools for Programming Education' can be found in [8]."

Comment: Some of the text within images is difficult to read.
Answer: Since we used high-resolution pictures, they can be zoomed in. In the print version this might be an issue, but since we used LaTeX, it is possible to put a picture in landscape mode over a whole page. We have not changed this at this point but can do so if the editor requires it.
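
One standard way to do this in LaTeX is the rotating package; the following is a minimal sketch in which the file name and caption are placeholders, not the manuscript's actual figure.

% In the preamble:
\usepackage{graphicx}
\usepackage{rotating}

% A figure rotated to landscape over a whole page (placeholder file name and caption):
\begin{sidewaysfigure}
    \centering
    \includegraphics[width=\textwidth]{process-overview.pdf}
    \caption{Overview of the semi-automated assessment process.}
\end{sidewaysfigure}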

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

An interesting study is presented on the automation of student assessment in complex programming tasks.

The work meets the requirements to be accepted after a minor review, in which the authors should contrast their conclusions with bibliography that substantiates their findings.

Author Response

Thank you for the review. We have revised the manuscript according to the comments. The new version of the manuscript with the changes is attached, and here are the point-by-point answers to all reviewers' comments.

Comment: The work meets the requirements to be accepted after a minor review, in which the authors should contrast their conclusions with bibliography that substantiates their findings.
Answer: We have expanded the paragraph in the conclusion to include references: "A lot of papers [13-15] focus on introductory programming courses, but there is an open space in the domain of more complex assignments. Also, researchers often focus on building different tools (as seen from [8]) rather than on the process that goes into building them. This paper fills this gap. Pre-built tools are useful, but they have their limits, especially in grading complex assignments, due to the complex environment that has to be established."

Author Response File: Author Response.pdf
