Can ChatGPT Solve Undergraduate Exams from Warehousing Studies? An Investigation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Can ChatGPT Solve Undergraduate Exams from Logistics Studies? An Investigation.
In this paper, the authors investigate whether and how different ChatGPT models solve the exams of three different logistics undergraduate courses and contribute to the discussion of ChatGPT's existing logistics knowledge, particularly in the field of warehousing.
The topic addressed by the authors in this paper could be interesting and important, but I believe that the basic idea of the presented research is not the most fortunate. As I understand it, the presented starting point, "Yet, little is known about how much ChatGPT knows or can answer about warehousing", raises a series of opposing questions, such as: How much data about warehousing is available online for it to learn from (e.g., a causal link would be a good argument!?) How much of your information and data (for students) is available online from which an LLM could be taught? And many others.
Individual comments:
1. At the beginning, try to highlight the underlying problem of this research and explain why this research is essential for logistics and how it will contribute to the advancement of the profession.
a. The argument cited (line 24), "However, despite their widespread use in various studies, there is little research in the field of logistics, particularly in warehousing", is not a valid argument?
b. "Test be cheated with the right training data"; this is what students do; additional explanations are needed!
c. "Yet, little is known about how much ChatGPT knows or can answer about warehousing". However, how much information about warehousing is available online for learning it (e.g. a causal link would be a good argument!?) How much of your information and data (for your students) is available online from which an LLM could be taught?
"This contribution tests ChatGPT's knowledge of warehousing" - to conduct the pre-survey mentioned above could be helpful.
(Line 80) The first step is translating the respective exams into English - what about the relevant material? Is it accessible to your "new" student (ChatGPT) in English as well?
2. This raises several questions about using different languages, interpretations within them, the abbreviations used (translated), and several other fundamental problems.
3. "This contribution tests ChatGPT's knowledge of warehousing" - try to conduct the pre-survey/meta-analyses mentioned above.
4. LE: "Please put yourself in the role of a logistics expert and answer all the questions below." LS: "Please put yourself in the role of a logistics student and answer all the questions below." ???
Does ChatGPT know what the difference is between LE and LS?
Can the same test be valid for LE and LS?
5. It would be interesting to compare ChatGPT testing with student tests, with and without using ChatGPT, thus avoiding problems with misunderstanding. But it would also be appropriate for you to analyze the difference between human experts and students so that you can tell ChatGPT what to look out for.
6. Figure 2 needs refinement; I don't think it's clear enough, and it's hard to understand.
7. It is clear from Table 1 that there is no theoretical (significant) difference between LE and LS or that LS achieve even better results. This brings us back to the fundamental question of how ChatGPT can distinguish between LE and LS???; thus, we can conclude that ... what? Why do you have two LE and LS groups, and what is the difference between them??
8. Surely, we could use a slightly more elaborate methodology and statistics in Table 1, at least standard deviation, etc.
Conclusion
I had more questions after analyzing your paper than at the beginning. One of these is how the tests are pedagogically appropriate for students and whether they follow all the recommendations for creating tests/questionnaires adapted to at least Bloom's taxonomy.
Some inconsistencies arise from the conclusion, such as
"from logistics is answered positively" / "formulas from the relevant literature, however, cannot be solved well" / "ChatGPT quickly reaches its limits"/.
It will probably all need to be thought through and rearranged. As I mentioned, this paper has potential, e.g., if you are looking for additional tests for ChatGPT (passing exams...) with the help of this research, but it would need to be thoroughly refined. The fact that we are "playing" with ChatGPT a bit is unfortunately no longer enough for a scientific contribution today; a transparent research methodology is vital. I miss this in your contribution.
Author Response
Dear Reviewer 1,
thank you very much for the detailed feedback you have provided us with; we really appreciate it. Your comments have enabled us to make several improvements:
Comment 1: At the beginning, try to highlight the underlying problem of this research and explain why this research is essential for logistics and how it will contribute to the advancement of the profession.
a. The argument cited: (line 24) “However, despite their widespread use in various studies, there is little research in the field of logistics, particularly in warehousing”, is not a valid argument?
b. “Test be cheated with the right training data”; this is what students do; additional explanations are needed!
c. “Yet, little is known about how much ChatGPT knows or can answer about warehousing”. However, how much information about warehousing is available online for learning it (e.g. a causal link would be a good argument!?) How much of your information and data (for your students) is available online from which an LLM could be taught?
“This contribution tests ChatGPT's knowledge of warehousing” - to conduct the pre-survey mentioned above could be helpful.
(Line 80) The first step is translating the respective exams into English - what about the relevant material? Is it accessible to your ‘new’ student (ChatGPT) in English as well?
Answer: a. First, we emphasized the relevance of warehousing and highlighted its economic significance. We also added examples from the application. The application is not publicly accessible, so publications like ours are an important step towards visibility and transparency in teaching. Studies from the field of logistics have been added. Despite their relevance, these do not refer to warehousing per se.
b. We have added how cheating works on LLM benchmarks and that it is at least a known problem. However, we have also noted that students can cheat just as easily.
c. Much of the teaching material is not publicly available. Nevertheless, there is basic literature in English. We have added this. Our basic literature is available in German and can only be used to a limited extent. Nevertheless, much of the content is available in English in principle. We have addressed the problem of availability and transparency.
Comment 2: This raises several questions about using different languages, interpretations within them, the abbreviations used (translated), and several other fundamental problems.
Answer: Our translation was carried out professionally; the translations of the abbreviations are also established terms and not neologisms. We have now noted that different formulations can lead to different answers.
Comment 3: “This contribution tests ChatGPT's knowledge of warehousing” - try to conduct the pre-survey/meta-analyses mentioned above.
Answer: Further available and visible sources have been added and classified.
Comment 4: LE: "Please put yourself in the role of a logistics expert and answer all the questions below." LS: "Please put yourself in the role of a logistics student and answer all the questions below." ???
Does ChatGPT know what the difference is between LE and LS?
Can the same test be valid for LE and LS?
Answer: There seems to be a misunderstanding here, since we did not elaborate on this before. ChatGPT was not given the abbreviations LE and LS. We changed the list to bullet points. In the text, we introduced the abbreviation for easier readability. The aim of the different prompts was to point out the differences in the answers.
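For illustration, a minimal sketch of how the two role prompts could be issued programmatically, assuming the OpenAI Chat Completions API (the exact interface, model versions, and prompt wording used in the study may differ):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Role prompts as described in the paper; "none" corresponds to the no-prompt condition.
ROLE_PROMPTS = {
    "none": None,
    "expert": "Please put yourself in the role of a logistics expert and answer all the questions below.",
    "student": "Please put yourself in the role of a logistics student and answer all the questions below.",
}

def ask(question: str, role: str = "none", model: str = "gpt-4o") -> str:
    # Attach the role instruction as a system message, if a role is set.
    messages = []
    if ROLE_PROMPTS[role]:
        messages.append({"role": "system", "content": ROLE_PROMPTS[role]})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

The abbreviations LE and LS never appear in the prompts themselves; they are purely editorial shorthand in the text.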
Comment 5: It would be interesting to compare ChatGPT testing with student tests, with and without using ChatGPT, thus avoiding problems with misunderstanding. But it would also be appropriate for you to analyze the difference between human experts and students so that you can tell ChatGPT what to look out for.
Answer: This is an interesting aspect, which we have now included in the future work section. However, an investigation of this is currently beyond the scope of our project.
Comment 6: Figure 2 needs refinement; I don't think it's clear enough, and it's hard to understand.
Answer: Figure 2 has been revised and is now sharp.
Comment 7: It is clear from Table 1 that there is no theoretical (significant) difference between LE and LS or that LS achieve even better results. This brings us back to the fundamental question of how ChatGPT can distinguish between LE and LS???; thus, we can conclude that... what? Why do you have two LE and LS groups, and what is the difference between them??
Answer: We are aware of the slight differences between LE and LS and have addressed this. It was important to show that a role assignment in principle produces different results. How this comes about cannot be determined with certitude, since there is no insight into the model’s inner workings. We have noted this fact.
Comment 8: Surely, we could use a slightly more elaborate methodology and statistics in Table 1, at least standard deviation, etc.
Answer: We added the averages, the standard deviation and the coefficient of variation, made the table more readable and visually displayed the new key figures.
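For reference, the added key figures follow the standard sample statistics, with the coefficient of variation defined as the standard deviation normalized by the mean:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \mathrm{CV} = \frac{s}{\bar{x}}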
We have made extensive changes, in accordance with the feedback you provided. Likewise, the formulations from the conclusion have been improved and arranged, based on the valuable criticism given in this review. All remaining points of criticism will be addressed in future work. In addition, the reviews have created more transparency, and our paper is the first step in transparently publishing data from teaching, hoping that others will follow suit.
Thank you very much and best regards,
Sven Franke
Reviewer 2 Report
Comments and Suggestions for Authors
· The paper investigates ChatGPT's ability to solve undergraduate logistics exams at TU Dortmund University, comparing its performance to human students. The test was performed on three undergraduate exams: Warehouse Management Systems (WMS), Material Flow Systems I (MFS I), and Material Flow Systems II (MFS II), and three ChatGPT models are compared: GPT-4o mini, GPT-4o, and o1-preview. In addition, the test was conducted using three prompting techniques or roles: no prompt, the role of a logistics expert, and the role of a logistics student.
The study addresses a hot topic, namely assessing the capabilities of large language models in solving domain-specific academic tasks, particularly because LLMs have shown, in the literature, limited knowledge in multiple domains. However, there are some limitations and remarks:
· The first remark from the introduction is that the study claims to evaluate ChatGPT's domain knowledge, namely logistics knowledge, but it narrowly focuses on warehousing and specific exams. Logistics includes other fields such as supply chain management, transportation, and global trade. So, some parts of the paper should be reformulated to reflect this specific scope.
· The study is also limited in its experimental design, as it is restricted to only three exams from a single university and semester, which limits its generalizability. So, expanding the study to include more logistics courses, institutions, and datasets could make the findings more generalizable. Related to this, the paper presents median grades and frequencies but lacks robust statistical analysis (e.g., hypothesis testing or confidence intervals) to substantiate claims about model performance differences. The reliance on a single reviewer for evaluation introduces potential bias, but this is not a big issue in my opinion. For these remarks, I require only that more exams be added.
· The focus on whether ChatGPT "passes" the exams simplifies the analysis. Grades alone do not capture qualitative nuances, such as the depth or accuracy of ChatGPT's reasoning. So, it is better to detect the general biases or analysis patterns where the LLM succeeds or fails. The authors analyzed individual instances, which is good, but I would like to see general patterns or biases. The paper also mentions ChatGPT's struggles with domain-specific calculations and image-based questions, but it does not deeply analyze why these failures occur or propose strategies for improvement. To gain such insights, it is also necessary to have more exams and to categorize the questions into distinct categories. Questions included multiple-choice, free-text, and mathematical problems. The authors did not give accurate insight into them. So, I suggest a different categorization based on the conceptual or semantic level rather than the type. If the mathematical category is kept, then a general comparison with papers focusing on Math subjects can be done.
· Since the questions are translated from German to English, I suggest using different phrasings of the questions throughout the different runs. This will be useful to obtain more accurate statistical results, because LLMs are sensitive to the formulation of the questions and their embeddings in the early layers.
· The impact of prompts is tested, which is a good idea. However, I have two remarks about this test. First, the results from the logistics expert role are not meaningfully analyzed or contrasted with the logistics student role, and sometimes the student role for GPT-4o achieved better performance than the expert role. Why is this? This is not the expected behavior. Second, the study could delve deeper into how prompt engineering might improve domain-specific results, especially for complex questions.
· As noticed by the authors, the models may contradict themselves in their analysis. In one of the examples, “The model answers option c) first. After explaining the individual possibilities, it comes to the conclusion that answer d) is correct… The correct answer is d).” In fact, this highlights the need for a Chain of Thought (CoT or Coconut) reasoning mechanism to ensure logical consistency and prevent premature conclusions. Comparing this behavior to students, who can backtrack and revise their answers during exams, is unfair in my opinion. Not only o1 (or o3) models, which have reasoning power, but also other models can be fine-tuned to acquire some reasoning power and iterative reasoning capability. The authors should either discuss this in depth or try to fine-tune some open-source models.
· The paper could compare ChatGPT’s logistics performance to its performance in other fields (e.g., law, medicine, maths) to contextualize its strengths and limitations in logistics. Here, I suggest at least referring to similar studies in other fields where exams were used for evaluation... and making a general comparison in terms of shared metrics.
In general, the paper is well written. The main issue is that the dataset used for the test is too small to draw statistically significant conclusions. This limits the reliability of comparisons between GPT models and students.
Author Response
Dear Reviewer 2,
we appreciate your thorough feedback and acknowledgment. Your insights have enabled us to make various enhancements, which we explain below:
Comment 1: The first comment from the introduction is that the study claims to evaluate ChatGPT's expertise, namely logistics knowledge, but it focuses narrowly on warehousing and specific exams. Logistics also includes other areas such as supply chain management, transportation, and global trade. Therefore, some parts of the paper should be reformulated to reflect this specific scope.
Answer: We have emphasized the focus on warehousing more. The title has been adjusted and the relevance of the field explained. We have also added efforts in other fields of logistics.
Comment 2: The study is also limited in its experimental design, as it is restricted to only three exams from a single university and semester, which limits its generalizability. So, expanding the study to include more logistics courses, institutions, and datasets could make the findings more generalizable. Related to this, the paper presents median grades and frequencies but lacks robust statistical analysis (e.g., hypothesis testing or confidence intervals) to substantiate claims about model performance differences. The reliance on a single reviewer for evaluation introduces potential bias, but this is not a big issue in my opinion. For these remarks, I require only that more exams be added.
Answer: We have explained why we used the three exams and why there is only one reviewer for them. In general, exams are not published and we do not have access to further ones. With this, we hope to encourage others to make their teaching more transparent. Our project is the first public data set in teaching. It is common for a reviewer to correct the exams, and the sample solutions, which only allow for a range of specific solutions, ensure that the corrections are objective. We have also added all these points. In the results chapter, we have added further evaluations and their visualization. This has also allowed us to further address the differences between the models.
Comment 3: The focus on whether ChatGPT “passes” the exams simplifies the analysis. Grades alone do not capture qualitative nuances, such as the depth or accuracy of ChatGPT's reasoning. So, it is better to detect the general biases or analysis patterns where the LLM succeeds or fails. The authors analyzed individual instances, which is good, but I would like to see general patterns or biases. The paper also mentions ChatGPT's struggles with domain-specific calculations and image-based questions, but it does not deeply analyze why these failures occur or propose strategies for improvement. To gain such insights, it is also necessary to have more exams and to categorize the questions into distinct categories. Questions included multiple-choice, free-text, and mathematical problems. The authors did not give accurate insight into them. So, I suggest a different categorization based on the conceptual or semantic level rather than the type. If the mathematical category is kept, then a general comparison with papers focusing on Math subjects can be done.
Answer: To better understand the answers and limitations, we categorized the question types according to Bloom’s Taxonomy. This approach helps identify the cognitive categories in which ChatGPT performs particularly well or struggles. It also provides valuable insights that can guide further research and development. Additionally, we ensured that all the questions from the exams remain publicly available in the dataset, promoting transparency and accessibility. All of these points have been carefully incorporated into the text to enhance its clarity and depth.
Comment 4: Since the questions are translated from German to English, I suggest using different phrasings of the questions throughout the different runs. This will be useful to obtain more accurate statistical results, because LLMs are sensitive to the formulation of the questions and their embeddings in the early layers.
Answer: The translation was carried out professionally. Nevertheless, using several variants is a good idea, and we have included this point in the implications. However, it is important that the evaluation works with a pre-formulated question text and that the wording is not first adapted exploratively using LLMs, especially since the models are not transparent in this regard.
Comment 5: The impact of prompts is tested, which is a good idea. However, I have two remarks about this test. First, the results from the logistics expert role are not meaningfully analyzed or contrasted with the logistics student role, and sometimes the student role for GPT-4o achieved better performance than the expert role. Why is this? This is not the expected behavior. Second, the study could delve deeper into how prompt engineering might improve domain-specific results, especially for complex questions.
Answer: We have revised the Results section and have gone into more detail about the differences. Nevertheless, the differences in role assignments can only be noted but not explained, since the models have no transparency. We have noted this.
Comment 6: As noticed by the authors, the models may contradict themselves in their analysis. In one of the examples, “The model answers option c) first. After explaining the individual possibilities, it comes to the conclusion that answer d) is correct... The correct answer is d).” In fact, this highlights the need for a Chain of Thought (CoT or Coconut) reasoning mechanism to ensure logical consistency and prevent premature conclusions. Comparing this behavior to students, who can backtrack and revise their answers during exams, is unfair in my opinion. Not only o1 (or o3) models, which have reasoning power, but also other models can be fine-tuned to acquire some reasoning power and iterative reasoning capability. The authors should either discuss this in depth or try to fine-tune some open-source models.
Answer: Thank you for noting the chain of thought. We have included and discussed the phenomenon of chain-of-thought reasoning in the text. Nevertheless, different answers in one output are confusing for an unknowing questioner. Readers cannot know which is correct. We have also noted this fact.
Comment 7: The paper could compare ChatGPT's logistics performance to its performance in other fields (e.g., law, medicine, maths) to contextualize its strengths and limitations in logistics. Here, I suggest at least referring to similar studies in other fields where exams were used for evaluation... and making a general comparison in terms of shared metrics.
Answer: We included ChatGPT's performance in logistics as well as efforts in other disciplines. To create comparability, we referred to Bloom's Taxonomy as mentioned above.
Thank you very much and best regards,
Sven Franke
Reviewer 3 Report
Comments and Suggestions for Authors
I thank you for the opportunity to review the paper "Can ChatGPT Solve Undergraduate Exams from Logistics Studies? An Investigation." By comparing ChatGPT's performance with that of students, the manuscript explores the ability of ChatGPT to solve undergraduate logistics exams. The paper makes a valuable contribution to understanding the potential and limitations of LLMs, such as ChatGPT, in the domain of logistics. However, the current version of this work could benefit from significant improvements.
The introduction highlights the significance of evaluating ChatGPT in logistics education and links advancements in AI with their potential educational impact. However, the context could be strengthened by elaborating on specific gaps in logistics-related AI applications and explaining why this domain remains underexplored. Moreover, the rationale for the study should clearly align the identified gaps with the research objectives, framing a precise research question.
The literature review covers LLM developments and their applications in education but leans heavily towards general education rather than logistics. This creates a gap in domain-specific synthesis. I recommend the authors include more recent and relevant references on AI applications in warehousing or logistics education to better contextualize the study. Summarizing contrasting findings across different domains could also enhance the theoretical framing.
The study design appears robust, utilizing three exams and three GPT-4 variants. However, the presentation and description of the methodology, including the study design, sample, data collection, procedure, and data analysis, need improvement. While the structured repetition of tests strengthens reliability, the evaluation process for translations and scoring answers appears subjective and requires clarification. Additionally, the authors should elaborate on the rationale for selecting these specific exams and their relevance to logistics education.
The results section includes useful tables and visualizations. However, the descriptions and interpretations of the data could be more detailed. Improved readability and organization would enhance the section. I suggest the authors present the main findings more explicitly in relation to the study's research questions and objectives.
Finally, while the discussion synthesizes findings effectively and offers valuable insights, it could delve deeper into the implications for logistics education. Specifically, the discussion should address how educators might adapt their strategies to accommodate ChatGPT’s limitations and leverage its strengths. The conclusion could also be improved by proposing broader implications of the present study.
Author Response
Dear Reviewer 3,
thank you for the detailed feedback and for noting the relevance of our work. Your comments have allowed us to make several improvements, which we explain below:
Comment 1: The introduction highlights the significance of evaluating ChatGPT in logistics education and links advancements in AI with their potential educational impact. However, the context could be strengthened by elaborating on specific gaps in logistics-related AI applications and explaining why this domain remains underexplored. Moreover, the rationale for the study should clearly align the identified gaps with the research objectives, framing a precise research question.
Answer: We have considered your advice and strengthened the focus on warehousing, adapted the title and better emphasized its relevance. We have also highlighted previous solutions and applications and identified the gap in warehousing. The text also notes why this is the case: logistics is very application-oriented, and solutions often remain commercial. Therefore, many projects are individual solutions. Our approach is to share knowledge and insights in the hope that others will follow.
Comment 2: The literature review covers LLM developments and their applications in education but leans heavily towards general education rather than logistics. This creates a gap in domain-specific synthesis. I recommend the authors include more recent and relevant references on AI applications in warehousing or logistics education to better contextualize the study. Summarizing contrasting findings across different domains could also enhance the theoretical framing.
Answer: Previous studies in logistics were supplemented. Furthermore, we also pointed out the gap that otherwise no teaching content is shared publicly, although this is possible.
Comment 3: The study design appears robust, utilizing three exams and three GPT-4 variants. However, the presentation and description of the methodology, including the study design, sample, data collection, procedure, and data analysis, need improvement. While the structured repetition of tests strengthens reliability, the evaluation process for translations and scoring answers appears subjective and requires clarification. Additionally, the authors should elaborate on the rationale for selecting these specific exams and their relevance to logistics education.
Answer: We have revised the structure of the results section and added more evaluations. This makes new findings possible. We have also described the translation process and explained that it is a technically correct translation that can be interpreted by LLMs. The selection of exams has also been justified and contextualized: these exams are the only ones covering the basics of warehousing at the university, and other institutions do not make theirs available.
Comment 4: The results section includes useful tables and visualizations. However, the descriptions and interpretations of the data could be more detailed. Improved readability and organization would enhance the section. I suggest the authors present the main findings more explicitly in relation to the study's research questions and objectives.
Answer: We have comprehensively expanded and categorized the data analysis. The figures and tables have also been revised for better readability.
Comment 5: Finally, while the discussion synthesizes findings effectively and offers valuable insights, it could delve deeper into the implications for logistics education. Specifically, the discussion should address how educators might adapt their strategies to accommodate ChatGPT’s limitations and leverage its strengths. The conclusion could also be improved by proposing broader implications of the present study.
Answer: The summary and implications for the future have been extensively revised and now cover recommendations for educators, students and users.
Thank you very much and best regards,
Sven Franke
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Much effort has been put into preparing the paper and responding to the reviewers' comments. The improvements are apparent, but in my opinion, they are still insufficient. To comment on the concluding answers in general:
"We have made extensive changes based on the feedback you provided. Likewise, the formulations from the conclusion have been improved and arranged based on the valuable criticism given in this review". I fully agree with this; the contribution has been significantly improved. However:
" All remaining points of criticism will be addressed in future work. In addition, the reviews have created more transparency, and our paper is the first step in transparently publishing data from teaching, hoping that others will follow suit".
Perhaps this article is not yet in the publication phase and would need additional, more in-depth data, such as what the answer to my Comment 5 points out: "This is an interesting aspect, which we have now included in the future work section. However, an investigation of this is currently beyond the scope of our project."
A few specific comments:
Comment 1: Answer
The argument cited (line 24), “However, despite their widespread use in various studies, there is little research in the field of logistics, particularly in warehousing”, is not valid.
You can also consider ChatGPT as a partner and analyze its opinion (with correct citation, of course)
ChatGPT answer on topics of logistics and warehouse:
Providing an exact count without access to specific data is difficult. Still, I have likely been trained on many words related to logistics and warehousing, as these are common topics in discussions about supply chain management, transportation, and business operations. These words would likely appear frequently in the text data used to train me, helping me to generate accurate and relevant responses on these subjects (ChatGPT).
Google search: 858 billion results for the word logistic and 1268 billion results for the word warehouse. To train chatbots for reliable answers, we need approx. 40 billion words.
First, we emphasized the relevance of warehousing and presented the economic factor. We also added examples from the application. The application is inaccessible, so publications like ours are essential to visibility and transparency in teaching.
When we talk about arguments, these scientific, logical arguments must be supported by evidence and not just by assumptions that may be true.
C5, connected with C7:
Answer: "This is an interesting aspect we have now included in the future work section. However, an investigation of this is currently beyond the scope of our project." As I mentioned, perhaps the paper is not yet ready for publication, and the above argument is not a response to the comment.
C7:
Answer: "We know the slight differences between LE and LS and have addressed this. In principle, it was essential to show that a role assignment produces different results. How this comes about cannot be determined with certitude since there is no insight into the model’s inner workings. We have noted this fact."
As you mentioned in your answer, the abbreviations LE and LS were introduced only as abbreviations, but their meaning must still be considered. Does ChatGPT distinguish this meaning? Did you check with it?
Author Response
Dear Reviewer,
Thank you for acknowledging the improvements. We really appreciate it. Below we answer the remaining points:
Comment 1: (a) You can also consider ChatGPT as a partner and analyze its opinion (with correct citation, of course)
(b) When we talk about arguments, these scientific, logical arguments must be supported by evidence and not just by assumptions that may be true.
Answer: (a) That's a great hint. We have included this point. For this, we asked ChatGPT and put its answer in the appendix (Figure A1). The classification of ChatGPT's statement has been added in chapter 5.2.
(b) To make it clear that many solutions are not transparently and publicly available, exemplary sources have been added (introduction chapter).
Comment 2: As you mentioned in the comment, the abbreviations LE and LS are also used for abbreviation, but they mean that the meaning of these abbreviations must be considered. Does Chat distinguish this meaning? Did you check with him?
Answer: We would like to clarify that the abbreviations LS and LE were only introduced for the sake of readability. During the experiments, ChatGPT was not given these abbreviations. We have discussed whether ChatGPT can distinguish between a logistics student and a logistics expert.
General comments: Perhaps this article is not yet in the publication phase and would need additional, more in-depth data, such as what the answer to my Comment 5 points out: “This is an interesting aspect, which we have now included in the future work section. However, an investigation of this is currently beyond the scope of our project.”
Answer: Regarding the original comment 5 (It would be interesting to compare ChatGPT testing with student tests, with and without using ChatGPT, thus avoiding problems with misunderstanding. But it would also be appropriate for you to analyze the difference between human experts and students so that you can tell ChatGPT what to look out for.), we note that the idea is still a good one and will be investigated by us in the future. However, it is not easy to get this data quickly, as it would be necessary to check with the curricula or the relevant faculty. Ad-hoc access to students is not possible at this point. Unfortunately, we cannot do anything differently from an administrative point of view.
We submitted a more in-depth analysis of the data with the last revision; that criticism was completely justified.
Thank you very much and best regards,
Sven Franke
Reviewer 2 Report
Comments and Suggestions for Authors
The authors addressed most of my remarks, either fully or partially, in one way or another. I have only minor remarks:
- Figure 1, which shows the overview of the process and paper structure, is not well placed. A paragraph outlining the structure of the paper should be placed at the end of the introduction, and this figure should be adapted for the methodology.
- It could be good to effectively categorize the exam questions according to Bloom’s Taxonomy (remember, understand, and apply), quantify the performance of the LLMs according to these categories, and compare it with student performance.
Author Response
Dear Reviewer,
Thank you very much for your kind words of appreciation.
We have addressed your comments:
Figure 1 has been added to the introduction and the structure explained.
Furthermore, in chapter three, we have explained which exam questions are categorized in which part of Bloom's Taxonomy.
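For illustration, per-category performance could be quantified along the lines of the following minimal sketch (the scores and field names here are hypothetical placeholders, not the study's actual results):

from collections import defaultdict
from statistics import mean

# Hypothetical per-question results: Bloom category, model, and score as a
# fraction of achievable points. Values are illustrative only.
results = [
    {"bloom": "remember",   "model": "GPT-4o",     "score": 0.9},
    {"bloom": "understand", "model": "GPT-4o",     "score": 0.7},
    {"bloom": "apply",      "model": "GPT-4o",     "score": 0.4},
    {"bloom": "apply",      "model": "o1-preview", "score": 0.6},
]

# Average score per (model, Bloom category) pair.
grouped = defaultdict(list)
for entry in results:
    grouped[(entry["model"], entry["bloom"])].append(entry["score"])

for (model, bloom), scores in sorted(grouped.items()):
    print(f"{model:12s} {bloom:10s} mean={mean(scores):.2f} (n={len(scores)})")

A comparison with student performance would then only require placing the students' per-category averages alongside these figures.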
Thank you very much and best regards,
Sven Franke
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you to the authors for their efforts in addressing the suggested improvements and enhancing the paper. These revisions not only enhance the paper’s readability and coherence but also ensure its contribution to the discipline.
Author Response
Dear Reviewer,
Thank you for the kind words and appreciation. Also, thank you for your support throughout the submission.
Thank you very much and best regards,
Sven Franke
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The article has been significantly improved. Although I did not get precise answers to all the comments, I am aware that everything can be improved. I think that, perhaps with minor editorial corrections, the article can be published.
Author Response
Dear Reviewer,
We very much appreciate your assessment. We have looked at the paper again and made minor improvements throughout all sections, from the abstract to the conclusion. We hope that everything is now to your satisfaction.
Best regards,
Sven Franke