Harnessing the Power of Large Language Models for Automated Code Generation and Verification
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes a new approach to reduce the cost of programming complex robot behaviors by using generative AI and finite state machines.
It is an interesting research issue.
Two adversarial LLMs are used to improve the quality of the generated code.
One LLM is used as the code generator, and the other as the discriminator.
This is an innovative approach.
However, there are some drawbacks that should be addressed.
1. In the first sentence, "In the domain of advanced technology systems, a significant transformation is occurring within the cost landscape," what does the term "advanced technology systems" mean? What does the term "cost landscape" mean? These terms are not defined clearly.
2. The authors state "The transition toward software as the most pivotal aspect of systems has been propelled by several interconnected factors." After that, the authors only describe the "complexity" factor; the other "interconnected factors" are never presented.
3. "the software transforms into a dynamic, ever-evolving entity that demands continuous attention and refinement." is not clear.
4. In the abstract, the research goal is to reduce the "cost".
However, the goal is to reduce the "complexity" in section 1.2.
These are not consistent.
5. The research objective of this paper, whether "cost" or "complexity", is not defined clearly.
6. In Figure 2, the "Generator LLM" and "Discriminator LLM" should be labelled explicitly.
7. In the experiment, the layout in Figure 9 is too simple compared to Figure 7.
8. The experimental case is too simple to show that two adversarial LLMs can effectively ensure the quality of the generated code.
9. In Tables 1 and 2, the time comparison between the human programmer and the proposed approach is of limited value. There is no comparison of the generated code itself.
10. It is suggested to run the experiment applying only the "Generator LLM", without the "Discriminator LLM", in the same experimental environment.
11. In section 4.4, the authors state "Handling Multiple FSMs:...". However, there are no details about multiple FSMs in the paper.
This paper is easy to read and clear to understand.
Author Response
Please note: main changes in the document are marked in blue. As the reviewer comments indicated that there were relevant changes to be made to the document, several parts of the document have been updated.

Comments 1: In the first sentence, "In the domain of advanced technology systems, a significant transformation is occurring within the cost landscape," what does the term "advanced technology systems" mean? What does the term "cost landscape" mean? These terms are not defined clearly.
Response 1: The first sentence has been changed to indicate our goal more clearly: to contribute to moving robots into dynamic environments where they have to be reprogrammed frequently, by making their programming (and reprogramming) easier (and therefore cheaper).
Comments 2: The authors state "The transition toward software as the most pivotal aspect of systems has been propelled by several interconnected factors." After that, the authors only describe the "complexity" factor; the other "interconnected factors" are never presented.
Response 2: The paragraph has been rephrased to clarify that we consider that moving robots out of controlled environments with repetitive tasks is limited mainly by software, and that software for these new environments is inherently more complex.
Comments 3: "The software transforms into a dynamic, ever-evolving entity that demands continuous attention and refinement." is not clear.
Response 3: The paragraph has been rephrased to make it clearer, now simply stating: "As advanced robots and related systems incorporate these advanced technologies, their software becomes more complex to design, develop, configure and maintain".
Comments 4: In the abstract, the research goal is to reduce the "cost". However, the goal is to reduce the "complexity" in section 1.2. These are not consistent.
Response 4: The introduction has been modified to describe the link between software complexity and its associated cost: "The introduction of robots into these unstructured environments requires improving the intelligence and adaptability of the robots, which directly impacts the complexity of their software. Consequently, programming or configuring them demands more expert programmers and additional time for programming, testing, and debugging, which in turn raises costs".
Comments 5: The research objective of this paper, whether "cost" or "complexity", is not defined clearly.
Response 5: As we consider that cost and complexity are linked, we assume that both goals are achievable. The introduction has been modified to describe the link between software complexity and its associated cost: "The budget of a robotic project has to consider many costs (the robotic hardware itself, physical installation & integration, software & programming, maintenance & support, training, etc.). But in situations where the robot has to be frequently programmed and re-tasked, its programming and integration can account for up to 50-70% of the cost of a robot application".
Comments 6: In Figure 2, the "Generator LLM" and "Discriminator LLM" should be labelled explicitly.
Response 6: We updated the figure to clearly identify all elements.
Comments 7: In the experiment, the layout in Figure 9 is too simple compared to Figure 7.
Response 7: In the updated PDF, Figure 5 shows a more detailed layout.
Comments 8: The experimental case is too simple to show that two adversarial LLMs can effectively ensure the quality of the generated code.
Response 8: Several tests have been performed to probe the limits of the proposed approach. "4.1. Synthetic setup" details four of them, and 11 tests are summarized in Table 3.
Comments 9: In Tables 1 and 2, the time comparison between the human programmer and the proposed approach is of limited value. There is no comparison of the generated code itself.
Response 9: Agreed. This information is only relevant to programming "speed", so its relevance in the document is limited. Considering that the relevant aspect of the paper is the limit of the presented approach (how far can we go using LLMs?), we moved those tables out of the conclusions, and the emphasis is now put on Table 3 and the conclusions drawn from it.
Comments 10: It is suggested to run the experiment applying only the "Generator LLM", without the "Discriminator LLM", in the same experimental environment.
Response 10: The experiment information has been updated (especially in "4.1"), with more information about the interaction between the "Generator LLM" and the "Supervisory LLM".
Comments 11: In section 4.4, the authors state "Handling Multiple FSMs:...". However, there are no details about multiple FSMs in the paper.
Response 11: The several tests executed in the same environment now show how the system can generate multiple FSMs using the same context information.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper is rather incomplete. There is a huge lack of detail in a lot of places.
What computational resources did you use?
Did the LLM always understand the mission? How detailed did the mission specification need to be?
Did the "code generation" and the check from LLMs always succeed?
How do the resulting generated files from Table 1 and Table 2 compare to each other?
Have the humans produced better "code" than the LLMs? Smaller? Faster? Easier to debug?
3.3.1 and 3.3.2 seem to be missing.
Action: "move to warehouse and take box" seems rather large and very unclear.
It seems one would need to split it into several actions.
How is the robot deciding where to go in Figure 4?
Comments on the Quality of English Language
Pick a version and stick to it: ’ai2thor’, AI2THOR, AI2-THOR
Add a . at the end of each Figure or Table caption. Use Fig instead of fig
Line 155: arisen -> risen?
Line 260: in Tecnalia -> at Tecnalia
Line 351: fig -> Fig
Line 461: AI2THOR -> AI2-THOR
Line 469: fig -> Fig
Line 568: position1 -> position 1 (more consistent with position 2 in previous line)
Author Response
Please note: main changes in the document are marked in blue. As the reviewer comments indicated that there were relevant changes to be made to the document, several parts of the document have been updated.

Comments 1: The paper is rather incomplete. There is a huge lack of detail in a lot of places.
Response 1: Agreed. We introduced new tests in the synthetic setup and show:
- Input information
- Example output information
- Simulated results
We pushed the limits of the methodology and increased the complexity of the requests until the output started to fail. We defined a parameter to measure the complexity of the requests and used it to track the degradation of the output across eleven tests (see Table 3).
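To make the idea of such a parameter concrete, below is a minimal illustrative sketch (in Python, not taken from the paper) of how a request-complexity score of this kind could be computed; the function name, the specific formula, and the example mission are assumptions made purely for illustration.

def mission_complexity(elements: list[str], operations: list[str]) -> int:
    # Hypothetical score: elements the robot interacts with plus consecutive operations.
    return len(elements) + len(operations)

# Example mission: "move to the warehouse, pick up the box, and bring it to position 1"
score = mission_complexity(
    elements=["warehouse", "box", "position 1"],
    operations=["move", "pick", "move", "place"],
)
print(score)  # 7 -> in this sketch, a higher score stands for a harder request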
Comments 2: What computational resources did you use?
Response 2: In 2.2 we now indicate that the "Meta-Llama-3-70B-Instruct" model runs on Deep Infra computation resources.
Comments 3: Did the LLM always understand the mission? How detailed did the mission specification need to be?
Response 3: As now shown in detail in "4. Experimental setup and validation", the system understands the mission and creates a correct output, but this behavior degrades as the complexity of the mission increases (the number of elements that the robot has to interact with, and the number of consecutive operations that have to be performed). Table 3 shows this degradation, which seems to be aligned with what other researchers have found in LLMs.
Comments 4: Did the "code generation" and the check from the LLMs always succeed?
Response 4: No, there is an interaction loop between the LLM that generates the code and the one validating it. In some situations (described in section 4) this interaction loop is executed to re-plan. As indicated in the previous comment, once a certain limit is reached, the system cannot produce correct code.
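For readers unfamiliar with this kind of loop, the following minimal Python sketch illustrates the generate-then-validate interaction described above. It is an illustration under assumptions, not the authors' implementation: generate_fsm() and validate_fsm() are hypothetical stand-ins for the Generator LLM and Validation LLM calls, and MAX_ROUNDS is an assumed iteration cap.

MAX_ROUNDS = 5  # assumed cap on generate/validate iterations

def generate_fsm(mission: str, feedback: str = "") -> str:
    # Stand-in for the Generator LLM: returns candidate FSM code as text.
    return f"# FSM for: {mission}\n# feedback considered: {feedback or 'none'}"

def validate_fsm(fsm_code: str) -> tuple[bool, str]:
    # Stand-in for the Validation LLM: accepts the FSM or returns feedback for re-planning.
    return True, ""

def plan_mission(mission: str) -> str | None:
    # Iterate generation and validation until the FSM is accepted or the rounds run out.
    feedback = ""
    for _ in range(MAX_ROUNDS):
        candidate = generate_fsm(mission, feedback)
        accepted, feedback = validate_fsm(candidate)
        if accepted:
            return candidate
    return None  # past a certain mission complexity, no correct FSM is produced

print(plan_mission("move to the warehouse and pick up the box"))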
Comments 5: How do the resulting generated files from Table 1 and Table 2 compare to each other? Have the humans produced better "code" than the LLMs? Smaller? Faster? Easier to debug?
Response 5: Both show that a seasoned programmer is required even for trivial tasks (such as the ones requested in those two examples). In the (quite trivial) test presented, the "code" (just a set of consecutive states in a state machine) was equivalent in size and speed.
Comments 6: 3.3.1 and 3.3.2 seem to be missing.
Response 6: We updated the document to fix the numbering.
Comments 7: Action: "move to warehouse and take box" seems rather large and very unclear. It seems one would need to split it into several actions.
Response 7: Yes, we updated the examples in section 3 to align them with the tests detailed in section 4, so they can be better understood.
Comments 8: How is the robot deciding where to go in Figure 4?
Response 8: As now indicated in "4.1.1": "the Generator LLM and Validation LLM form a non-reactive offline programming system. It is assumed that the robot includes already a low-level latency system that will cope with perception, obstacle avoidance and navigation in its environment, so these tasks are out-of-scope of the approach proposed in this paper, and therefore, out-of-scope for these tests (e.g. location information of elements is provided to the robot as a predefined table)."
In other words: we assume a certain navigation system (out of the scope here) that is able to move the robot from "location A" to "location B".
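As a simple illustration of the "predefined table" assumption (again a hedged sketch rather than code from the paper: the table contents, navigate_to() and execute_move_state() are made up for this example), a "move" state in the generated FSM would only need to resolve a named location and delegate the motion to the robot's own navigation stack:

LOCATIONS = {  # assumed predefined table: named places -> coordinates known to the robot
    "warehouse": (12.0, 3.5),
    "position 1": (4.2, 7.8),
    "position 2": (4.2, 9.1),
}

def navigate_to(xy: tuple[float, float]) -> None:
    # Placeholder for the robot's low-level navigation system (out of scope here).
    print(f"navigating to {xy}")

def execute_move_state(target_name: str) -> None:
    # FSM "move" state: look up the target in the table and delegate navigation.
    navigate_to(LOCATIONS[target_name])

execute_move_state("warehouse")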
Comments 9 (Quality of English Language): Pick a version and stick to it: ’ai2thor’, AI2THOR, AI2-THOR. Add a . at the end of each Figure or Table caption. Use Fig instead of fig. Line 155: arisen -> risen? Line 260: in Tecnalia -> at Tecnalia. Line 351: fig -> Fig. Line 461: AI2THOR -> AI2-THOR. Line 469: fig -> Fig. Line 568: position1 -> position 1 (more consistent with position 2 in the previous line).
Response 9: Agreed. Modified in the updated document version.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents "Harnessing the Power of Large Language Models for Automated Code Generation and Verification", including methodological procedures, simulation scenarios, testing, and validation. Although the paper presents significant advances in the use of LLMs for code generation, some suggestions are made to improve the final version of the paper:
1. At the end of the introductory section, include what the main contributions of this research would be.
2. The authors could add a section II, called Related Works, to show the state of the art on the research topic, its challenges, and opportunities, which can be justifications to motivate the development of this paper.
3. In section 3, the authors present Fig. 2, which indicates the proposed steps of the methodology. This representation seems to be part of the methodology. The authors could provide a more complete diagram of the methodology, which allows the reader to have a macro view of the authors' proposal. This is important in order to make the article a scientific reference that can be reproduced.
4. In section 4, the authors presented the experimental setups and results. Although the results show success in the proposal, this setup is quite simple to program. What would be the limitations of this approach? This point is important because as the number of states increases, using FSM to automate processes becomes very complicated.
5. In section 4, did the authors evaluate the possibility of obstacles in the robot's trajectory?
6. The article presents an interesting procedure for generating code in general, but in terms of results, the presentation of the setup was quite conservative, which limited the proposal's potential.
Author Response
Please note: main changes in the document are marked in blue. As the reviewer comments indicated that there were relevant changes to be made to the document, several parts of the document have been updated.

Comments 1: At the end of the introductory section, include what the main contributions of this research would be.
Response 1: We introduced the goal of the research at the end of the introduction with: "This paper presents an approach to program and configure robotic applications using large language models as a tool to reduce the time required for the programming (and the costs associated to such time)".
Comments 2: The authors could add a section II, called Related Works, to show the state of the art on the research topic, its challenges, and opportunities, which can be justifications to motivate the development of this paper.
Response 2: The state of the research topic and its challenges are presented in the updated "4.2.1. Results" chapter, where we present the limitations of our approach and see that it is aligned with what other researchers find in related works.
Comments 3: In section 3, the authors present Fig. 2, which indicates the proposed steps of the methodology. This representation seems to be part of the methodology. The authors could provide a more complete diagram of the methodology, which allows the reader to have a macro view of the authors' proposal. This is important in order to make the article a scientific reference that can be reproduced.
Response 3: Figure 2 has been updated to include more detail.
Comments 4: In section 4, the authors presented the experimental setups and results. Although the results show success in the proposal, this setup is quite simple to program. What would be the limitations of this approach? This point is important because as the number of states increases, using FSM to automate processes becomes very complicated.
Response 4: Agreed. The information about the experiments in section 4 has been expanded to include different situations and push the limits of the methodology. Four detailed tests are presented in the synthetic setup and a simple experiment in the real setup. Additionally, up to 11 tests are summarized in Table 3.
Comments 5: In section 4, did the authors evaluate the possibility of obstacles in the robot's trajectory?
Response 5: As indicated now in 4.1.1, the Generator LLM and Validation LLM form a non-reactive offline programming system. It is assumed that the robot includes already a low-level latency system that will cope with perception, obstacle avoidance and navigation in its environment, so these tasks are out-of-scope of the approach proposed in this paper, and therefore, out-of-scope for these tests (e.g. location information of elements is provided to the robot as a predefined table).
Comments 6: The article presents an interesting procedure for generating code in general, but in terms of results, the presentation of the setup was quite conservative, which limited the proposal's potential.
Response 6: Agreed. Information about the new tests and their results (and limits) has been added in section 4.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Essentially, this paper has been modified in detail according to the reviewer's comments. I suggest this paper be accepted for publication.
Comments on the Quality of English Language
It is clear to read and understand.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have responded to the comments and suggestions that were made.