Mass Generation of Programming Learning Problems from Public Code Repositories
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents a database of programming problems suitable for populating an automated learning system about introductory programming. The novelty here is the automated generation of the problems from real-life code on GitHub, and the translation between languages (Python, C++ and Java). The translation is achieved by using a language-independent syntax tree representation of the problems in the database.
The work presented is limited to a single type of learning resource: one that has the student determine the order of evaluation of an expression in a programming language. It is not obvious that the tools described will generalise to other question types without significant extra work.
The paper presents what is undoubtedly a good deal of work.
- My first criticism of the paper is that I am not fully convinced of the need for this resource. The project produces 1.4M example problems of this one type - is there really a use case for a system to have quite that many? The introduction does suggest some motivations, but it is not clear to me that this level of scale is being demanded.
- My second criticism of the paper is that the evaluation done in Sections 4.3 and 5 is an evaluation of the particular dataset produced, rather than of the methodology used to produce it. Further, it is an evaluation without comparison - it should be compared against something, such as the template-based approach more commonly in use.
- My third criticism is the dismissal of an LLM approach to this system. There is some good discussion of the pros and cons vs LLMs, but it still seems to me that an LLM-based approach is likely to be the more feasible one, given my fears over generalisation to other problem types. Perhaps some comparative evaluation would be good.
The web system that asks me for this review allows me to download SQL files of the data in question. The availability of these files is not mentioned in the paper. If the idea is that the journal will host them alongside the paper, I still think that should be mentioned in the PDF. If not, can they be placed online somewhere else and pointed to? Finally, I note that these files were just the database and not the associated code to create it and use it to populate the ITS. I encourage the authors to release that openly too.
Notwithstanding the criticisms above, the paper is interesting and represents a good deal of work. Also, the paper is generally well presented, and the English language quality is good. A few minor points:
- The acronym ITS is defined as "Intelligent Tutoring Systems" (plural) but is often used in a singular context. I suggest defining it in the singular and then adding a small s to the acronym when using it as a plural.
- Although the acronym is defined on page 1, I think the idea of what an Intelligent Tutoring System is should be explained in a few sentences here at the start.
- Table 2 is overrunning into the margins quite badly. It can easily be split into two.
- p13: I have never heard the word "diapason" before: I looked it up and I don't think it is the correct word to use here. I suggest just "range".
- Table 3: How is difficulty calculated? Are certain problems not more difficult in some languages than others?
- Strange for the discussion of participant opinions to be in Section 5 rather than together with the data in Section 4.3?
Author Response
I want to thank you for a thorough and kind review of our manuscript, and for catching some important problems.
Comment 1: My first criticism of the paper is that I am not fully convinced of the need for this resource. The project produces 1.4M example problems of this one type - is there really a use case for a system to have quite that many? The introduction does suggest some motivations, but it is not clear to me that this level of scale is being demanded.
Response 1: Thank you for your comment; we might have underestimated the need to justify the size of the generated bank. We added three paragraphs, highlighted in yellow, to the end of Section 5 (Discussion) (see lines 615-658).
I also want to add that big banks of questions and learning problems are necessary to generate balanced exercises, as described in the article https://doi.org/10.3390/computers13060144, but I cannot add it to the manuscript because we already have a high self-citation count.
Comment 2: My second criticism of the paper is that the evaluation done in Sections 4.3 and 5 is an evaluation of the particular dataset produced, rather than of the methodology used to produce it. Further, it is an evaluation without comparison - it should be compared against something, such as the template-based approach more commonly in use.
Response 2: Thank you for your comment. Unfortunately, many researchers only publish data about the generated problems without sharing the datasets and software used to produce them. Also, the field of learning-problem generation is broad, and every research group concentrates on generating the learning problems supported by its own ITS, so the kinds of generated problems rarely overlap, as can be seen in the literature review. Our ability to compare methods is therefore limited: methods of automatic generation can be compared by their performance and by the sets of generated problems. We provide a brief comparison of the results of our methodology with the results of methods used by other researchers in Section 5 (Discussion), which was all we could accomplish using the available data.
In the future, we plan to expand the capabilities of our learning-problem generator to other kinds of learning problems, which might allow more direct comparisons.
Comment 3: My third criticism is the dismissal of an LLM approach to this system. There is some good discussion of the pros and cons vs LLMs, but it still seems to me that an LLM-based approach is likely to be the more feasible one, given my fears over generalisation to other problem types. Perhaps some comparative evaluation would be good.
Response 3: Thank you for your comment. We broadened the analysis of current work on LLM-based translation of program code from one language to another; such translation is an important part of our generator. The new description was added to Section 2 (Related works) and is highlighted in yellow (see lines 177-207). A major study conducted by IBM researchers (see reference 26) concluded that current LLMs make significant errors when translating program code, including expression translation.
Comment 4: The web system that asks me for this review allows me to download SQL files of the data in question. The availability of these files is not mentioned in the paper. If the idea is that the journal will host them alongside the paper, I still think that should be mentioned in the PDF. If not, can they be placed online somewhere else and pointed to? Finally, I note that these files were just the database and not the associated code to create it and use it to populate the ITS. I encourage the authors to release that openly too.
Response 4: Thank you for your comment. Unfortunately, the dataset is a few megabytes too big to be hosted by the journal. We added links to the dataset, the developed library and the generator to the article (see around lines 707-711). However, the generator is developed as part of the CompPrehension ITS and cannot be run separately from it. We will be happy if our code helps other researchers in the field.
Comment 5: The acronym ITS is defined as "Intelligent Tutoring Systems" (plural) but is often used in a singular context. I suggest defining it in the singular and then adding a small s to the acronym when using it as a plural.
Response 5: The question of plural forms of English acronyms ending with "s" is often discussed, and many sources concerning ITS use it as both a singular and a plural form. But for the readers' convenience, I changed the plural form to "ITSs".
Comment 6: Although the acronym is defined on page 1, I think the idea of what an Intelligent Tutoring System is should be explained in a few sentences here at the start.
Response 6: We added the definition of ITS and a reference to a review to the second paragraph of the manuscript, highlighted in yellow.
Comment 7: Table 2 is overrunning into the margins quite badly. It can easily be split into two.
Response 7: We tried to make that table a long vertical list, but it looked even worse. MDPI journals are published electronically and, in my experience, the margins matter only during the review. If the manuscript is accepted, we will work with the MDPI technical editors on formatting that table if necessary.
Comment 8: p13: I have never heard the word "diapason" before: I looked it up and I don't think it is the correct word to use here. I suggest just "range".
Response 8: Thank you for spotting this; indeed, we should have used "range". I changed the wording.
Comment 9: Table 3: How is difficulty calculated? Are certain problems not more difficult in some languages than others?
Response 9: Thank you for this question. I added a reference to the article describing how the difficulty function was constructed by approximating teachers' difficulty scores. It is highlighted in yellow (see lines 478-479).
Comment 10: Strange for the discussion of participant opinions to be in Section 5 rather than together with the data in Section 4.3?
Response 10: Thank you for spotting this. I moved the discussion of participant opinions to Section 4.3.
Reviewer 2 Report
Comments and Suggestions for Authors
In this study, the authors present an automatic approach for generating learning problems for teaching introductory programming in different programming languages.
The good points in this paper are:
1- The paper contains rich information about the topic.
2- The title is interesting.
3- The paper is organized professionally.
4- I like the way they presented the discussion and validation parts.
The points I believe they must improve:
1- The authors use the word "we" too much:
we, we, we ...
2- They use "in this study" too many times in different ways. Why?
3- They included a Threats to Validity section.
Where are the solutions?
For example, you said "Still, it is possible that some errors might have remained unnoticed, which poses a threat to the study’s validity." What did you do about that?
Also, you said "Still, by employing experts from different universities we partially mitigated that threat". Do you believe this is enough?
Author Response
Thank you for your review of our manuscript.
Comment 1: The authors use the word "we" too much:
we, we, we ...
Response 1: Thank you for your comment. We reduced the number of uses of the word "we" in the manuscript.
Comment 2: They use "in this study" too many times in different ways. Why?
Response 2: Thank you for pointing this out. We deleted some of the occurrences of "in this study" where they were unnecessary.
Comment 3:
They included a Threats to Validity section.
Where are the solutions?
For example, you said "Still, it is possible that some errors might have remained unnoticed, which poses a threat to the study’s validity." What did you do about that?
Also, you said "Still, by employing experts from different universities we partially mitigated that threat". Do you believe this is enough?
Response 3: Thank you for your comment. The section "Threats to Validity" is traditionally included in scientific articles to show that the authors are aware that no study is perfect and to make readers aware of where the potential imperfections lie in that particular study. These threats can be addressed in further work, but they are usually out of the scope of the current study. For example, please consider the published articles from the same journal https://www.mdpi.com/2504-2289/8/9/113 and https://www.mdpi.com/2076-3417/15/3/1026.
As for your question about employing experts from different universities: we generated learning problems in 3 programming languages. We asked teachers who used each of those languages to gather balanced opinions, which is enough for this study in my opinion. We will continue to evaluate the generated learning problems during further work.
Reviewer 3 Report
Comments and Suggestions for Authors
My few suggestions about this manuscript are listed below.
- Contributions are well listed. However, a bit more explanation is encouraged.
- The Related Works section needs a bit of extension. In this regard, a few more references from the year 2025, with their explanation, should be included.
- In Section 3, many details are jumbled. In this case, pseudocode should be included in the manuscript.
- Text in Figure 2 should be rectified. It appears very poor in the print version.
- Equations 1 and 2, and also a few equations in Figure 2, should be typed in math mode.
- Section 4 is well written. However, after the Discussion, limitations of the current study should be included.
- Section 6 is a bit short and needs refinement.
- The number of references should be increased.
- The authors should mention how they addressed the research gap during their study.
- What are the final outcomes of this study? How will the research community benefit from them?
- Technically, the manuscript is well written. However, the authors are encouraged to make the suggested changes.
Author Response
Thank you for your review of our manuscript.
Comment 1: Contributions are well listed. However, a bit more explanation is encouraged.
Response 1: We enhanced the description of contributions; the new text is highlighted in yellow in the relevant list (see lines 80-93).
Comment 2: The Related Works section needs a bit of extension. In this regard, a few more references from the year 2025, with their explanation, should be included.
Response 2: We enhanced the Related Work section with paragraphs about recent works from 2024 on code conversion between programming languages using LLMs; they are highlighted in yellow (see around lines 168-207). The year 2025 has just started and few articles have been published so far, so we had to resort to publications from 2024.
Comment 3: In Section 3, many details are jumbled. In this case, pseudocode should be included in the manuscript.
Response 3: We added Algorithm 1 (Learning Problem Generation) in pseudocode to enhance Section 3 (see page 6).
Comment 4: Text in Figure 2 should be rectified. It appears very poor in the print version.
Response 4: Thank you for catching this. We increased the scale of Figure 2, which made the text clearer. We will provide a high-resolution version of that figure to the journal team.
Comment 5: Equations 1 and 2, and also a few equations in Figure 2, should be typed in math mode.
Response 5: We use LaTeX, and the equations are typed in math mode. The journal team will receive the full source and can adjust them as they want.
Comment 6: Section 4 is well written. However, after the Discussion, limitations of the current study should be included.
Response 6: The limitations of the current study are included after Section 5 (Discussion), in Section 6 (Threats to Validity).
Comment 7: Section 6 is a bit short and needs refinement.
Response 7: We improved Section 6; the changes are highlighted in yellow (see lines 649-657 and 664-669). Sections like that are usually not long.
Comment 8: The number of references should be increased.
Response 8: We increased the number of references to 37.
Comment 9: The authors should mention how they addressed the research gap during their study.
Response 9: Thank you for this comment. We added a paragraph about the research gap to the end of the literature review; it is highlighted in yellow (lines 212-128).
Comment 10: What are the final outcomes of this study? How will the research community benefit from them?
Response 10: We added a paragraph on the final outcomes and their benefits to the Conclusions section, highlighted in yellow (about lines 707-713).
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Good work