Automated Generation of Test Scenarios for Autonomous Driving Using LLMs

Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper introduces a method that uses large language models (LLMs) to automate the conversion of Operational Design Domain (ODD) descriptions into executable simulation scenarios for autonomous vehicles. The approach blends model-based and data-driven techniques, decomposing ODDs into three core components (environmental, scenery, and dynamic elements) and using prompt engineering to generate ScenarioRunner scripts for the CARLA simulator. The model-based aspect guides LLMs with structured prompts and a "Tree of Thoughts" (ToT) strategy to outline scenarios, while a data-driven refinement process inspired by red teaming iteratively enhances script accuracy and robustness.
- How does the paper integrate model-based and data-driven techniques specifically in converting Operational Design Domains (ODDs) into simulation scenarios? What is the application of the Tree of Thoughts (ToT) strategy and red-teaming techniques in this process?
- The experiments indicate that dynamic elements (e.g., vehicle and pedestrian behavior) require fine-tuning. What specific issues were observed, and how did these defects manifest in Scenario 1 and Scenario 2?
- The paper highlights that manual scenario creation is time-consuming and lacks scalability. What concrete improvements in efficiency and scenario diversity does the LLM-based automated approach offer? Can these be illustrated with experimental data?
- The future work mentions optimizing dynamic element generation. Beyond refining prompt engineering, are there plans to incorporate other technologies (e.g., reinforcement learning or multimodal models) to enhance the realism of dynamic behaviors?
Author Response
Comments 1: [How does the paper integrate model-based and data-driven techniques specifically in converting Operational Design Domains (ODDs) into simulation scenarios? What is the application of the Tree of Thoughts (ToT) strategy and red-teaming techniques in this process?]
Response 1: [Both the model-based and data-driven techniques are integrated by gradually guiding the LLM instance from an initial phase, in which it is introduced to a specific context (here, converting ODD descriptions into simulation scenarios), to a refinement phase, in which the instance is tuned to ultimately generate accurate and executable code. Although the two techniques take very different approaches, they share the common goal of evolving the LLM instance from its base state to an augmented state. The Tree of Thoughts (ToT) strategy primarily provided the nomenclature for the various instances the LLM undertook, highlighting the different phases of its evolution, and established the prompting technique being used, namely Chain of Thought (CoT) prompting. The red-teaming technique, on the other hand, provided a roadmap for the refinement stage within the pipeline: prompts were created to identify errors and mistakes, mitigate logical inconsistencies, and overall ensure higher code quality.] Please refer to Chapter 4; we have integrated this information into Sections 4.4 - 4.6.
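The red-teaming refinement stage described above can be sketched as an iterative feedback loop. This is a minimal illustration only, not the paper's actual implementation: `query_llm` and `run_static_checks` are hypothetical stand-ins for the real LLM call and the real validation of generated ScenarioRunner scripts.

```python
def run_static_checks(script: str) -> list[str]:
    """Return a list of detected issues; an empty list means the script passes.

    Stand-in for the real checks (syntax, simulator compatibility, logic).
    """
    issues = []
    if "def " not in script:
        issues.append("no scenario entry point defined")
    return issues


def query_llm(prompt: str) -> str:
    """Placeholder for the LLM call; here it simply returns a 'fixed' script."""
    return "def scenario(): pass  # refined script"


def refine_script(initial_script: str, max_rounds: int = 3) -> str:
    """Red-teaming loop: feed detected errors back to the LLM iteratively."""
    script = initial_script
    for _ in range(max_rounds):
        issues = run_static_checks(script)
        if not issues:
            break  # script passes all checks; stop refining
        feedback = "Fix the following issues:\n" + "\n".join(issues)
        script = query_llm(feedback + "\n---\n" + script)
    return script
```

In the actual pipeline, the feedback fed to the tree-LLM instance would include simulator error logs rather than simple static checks; the loop structure is what the sketch is meant to convey.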
Comments 2: [The experiments indicate that dynamic elements (e.g., vehicle and pedestrian behavior) require fine-tuning. What specific issues were observed, and how did these defects manifest in Scenario 1 and Scenario 2?]
Response 2: [The dynamic elements observed in both scenarios were concluded to require fine-tuning because they showed a higher rate of inconsistencies than any of the other elements. For example, NPCs/pedestrians were often captured too close to the highway and at times walking on the highway while vehicles moved along on all sides. As a result of low execution triggers, some vehicles did not necessarily adhere to any traffic rules. Motorbikes would often spawn on the sidewalks and execute actions there with pedestrians present. These were some of the contributing factors to low scenario realism during simulations, and they would often require manual tweaks of the generated scripts to achieve a somewhat realistic scenario.] Kindly refer to the "Observations to Scenario" paragraphs of both scenarios in Sections 5.1 and 5.2.
Comments 3: [The paper highlights that manual scenario creation is time-consuming and lacks scalability. What concrete improvements in efficiency and scenario diversity does the LLM-based automated approach offer? Can these be illustrated with experimental data?]
Response 3: [We have added a new section to Chapter 6, i.e., Section 6.4, "Improvements and Advantages of the LLM-Based Automated Scenario Generation Pipeline".]
Comments 4: [The future work mentions optimizing dynamic element generation. Beyond refining prompt engineering, are there plans to incorporate other technologies (e.g., reinforcement learning or multimodal models) to enhance the realism of dynamic behaviors?]
Response 4: We have included a new section which highlights [fine-tuning the LLM by integrating the data-driven approach, an approach that relies heavily on the model being trained on a dataset consisting of data such as the prompting techniques used in the model-based approach, error logs from dynamic elements during simulations, scenario executions, simulation results, etc. For models like Llama 3, "training" often involves either fine-tuning (adjusting weights on a subset of data) or inference-time adaptation (using prompt engineering or embeddings).] Please refer to Chapter 7, line 435.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a method that uses LLMs to convert detailed ODD descriptions into executable simulation scenarios for autonomous vehicles. In addition, the paper integrates model-based and data-driven techniques to enhance the robustness and accuracy of the generated scenarios. Honestly, the proposed idea is very necessary for simulating AVs.
- The idea employs LLMs for simulation scenario generation; however, what kind of LLM model was used? GPT-4, or another model? Different LLM models may yield different results and, in particular, may affect accuracy, but the paper does not clearly indicate the LLM model.
- Sections 5.1 and 5.2 give two scenarios as examples of the generated results, but the descriptions are too vague and hard for readers to reproduce. Specifically, some parameters are not presented clearly. If this generation did not need any parameters, were only prompt instructions given to the LLMs? Without knowing specific road characteristics, such as how many lanes there are and how long the road is, how can the AV properly plan its driving trajectory in the generated scenario?
- The data-driven technique is mentioned prominently in the abstract, but the text, especially the methodology section, does not clearly describe how it was used in scenario generation.
- In line 357, there is a typo: Figure 12 should be Figure 10.
Author Response
Comments 1: [The idea employs LLMs for simulation scenario generation; however, what kind of LLM model was used? GPT-4, or another model? Different LLM models may yield different results and, in particular, may affect accuracy, but the paper does not clearly indicate the LLM model.]
Response 1: [The model used for the experiments was Llama 3. We agree with this comment; we did not clearly state which model was being used because, within our scope, the single LLM is employed only for its generative capabilities. In future iterations of experiments based on our pipeline, we can employ and compare other powerful models and evaluate metrics such as the results and impacts on accuracy, as you have pointed out.] Please refer to Chapter 5, line 261, where we have integrated this information.
Comments 2: [Sections 5.1 and 5.2 give two scenarios as examples of the generated results, but the descriptions are too vague and hard for readers to reproduce. Specifically, some parameters are not presented clearly. If this generation did not need any parameters, were only prompt instructions given to the LLMs? Without knowing specific road characteristics, such as how many lanes there are and how long the road is, how can the AV properly plan its driving trajectory in the generated scenario?]
Response 2: Please refer to Chapter 5, as [both Sections 5.1 and 5.2 have been restructured to provide more information on Dynamic Actors, Events, and Scenario Triggers to help readers reproduce the scenarios from the initialization stage to the completion stage.]
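As a purely illustrative aid to how such a restructured description could be organized, the sketch below models a scenario in terms of dynamic actors, events, and triggers. All field names and example values are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicActor:
    role: str        # e.g. "ego_vehicle", "pedestrian", "npc_vehicle"
    blueprint: str   # CARLA blueprint id, e.g. "vehicle.tesla.model3"
    spawn_lane: int  # lane index on the road (illustrative)


@dataclass
class ScenarioTrigger:
    condition: str   # e.g. "ego_within_20m_of_crosswalk"
    event: str       # behaviour fired when the condition holds


@dataclass
class Scenario:
    town: str              # CARLA map name
    num_lanes: int         # road characteristic needed for reproduction
    road_length_m: float   # road characteristic needed for reproduction
    actors: list[DynamicActor] = field(default_factory=list)
    triggers: list[ScenarioTrigger] = field(default_factory=list)
```

Making such parameters explicit (map, lane count, road length, actor spawn points, trigger conditions) is what allows a reader to re-run the scenario from initialization to completion.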
Comments 3: [The data-driven technique is mentioned prominently in the abstract, but the text, especially the methodology section, does not clearly describe how it was used in scenario generation.]
Response 3: [The red-teaming technique was heavily influenced by the data-driven approach as the refinement process within the pipeline: results were fed back into the tree-LLM instance in an iterative loop to further improve it until it generated correct, simulation-compatible code. This is how we used and referenced the data-driven approach. The data-driven approach is also described as an alternative method of conducting these experiments, but it requires components similar to those found in the model-based approach. Hence, in a linear timeline, we chose to conduct experiments using the model-based approach first and to leave the data-driven approach for future experiments.] Please refer to Chapter 7, line 435.
Comments 4: [In line 357, there is a typo: Figure 12 should be Figure 10.]
Response 4: Thank you for pointing this out. [The mistake has been corrected.]
Reviewer 3 Report
Comments and Suggestions for Authors
- The authors introduced the large language models (LLMs) without clearly stating the research problem. It is essential to define the problem before presenting the proposed model.
- While experimental results are provided, the authors have not explained or justified these results adequately.
- The manuscript should include statistical analysis, a clear conclusion, and a discussion of future work to validate and support the research.
- The title should be revised to include relevant keywords related to vehicles to reflect the main focus of the study.
- A proper and valid link to reference [1] must be included to support the context of Figure 1.
- Figures 2 and 3, which appear to describe the research process, should be placed in the methodology section to maintain logical structure.
- Overall, the paper lacks proper organization and fails to present the research in a clear and coherent manner. Significant revision is needed to enhance clarity and scientific contribution.
Author Response
Comments 1: [The authors introduced the large language models (LLMs) without clearly stating the research problem. It is essential to define the problem before presenting the proposed model.]
Response 1: [Please refer to the paragraphs in lines 57-70, where we explain that "most current methods for creating these simulation scenarios rely on manual work or strict rules" and how our research primarily focuses on integrating AV scenario generation with generative AI to create scenarios based on ODD descriptions that also simulate real-world parameters.]
Comments 2: [While experimental results are provided, the authors have not explained or justified these results adequately.]
Response 2: Thank you for pointing this out. [We have included more information in the "Experiments" section to clarify our results.]
Comments 3: [The manuscript should include statistical analysis, a clear conclusion, and a discussion of future work to validate and support the research.]
Response 3: Thank you for pointing this out. [We have included a "Discussion and Future Work" section to provide more information on possible future iterations that apply the pipeline in further experiments.]
Comments 4: [The title should be revised to include relevant keywords related to vehicles to reflect the main focus of the study.]
Response 4: [The keywords have been revised so that they directly reflect our title and main focus.]
Comments 5: [A proper and valid link to reference [1] must be included to support the context of Figure 1.]
Response 5: Thank you for pointing this out. [The mistake has been corrected.]
Comment 6: [Figures 2 and 3, which appear to describe the research process, should be placed in the methodology section to maintain logical structure.]
Response 6: Thank you for pointing this out. [We have moved Figures 2 and 3, alongside their descriptions, from the Introduction and Related Works sections respectively to the Methodology section to maintain logical structure.]
Comment 7: [Overall, the paper lacks proper organization and fails to present the research in a clear and coherent manner. Significant revision is needed to enhance clarity and scientific contribution.]
Response 7: [The chapters have been revised to provide a methodical structure for our readers.]
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have revised all the parts according to the previous comments.
Author Response
References have been revised and reformatted. There have also been significant changes to the figure descriptions and the tables in the Results section in Chapter 6.
Reviewer 3 Report
Comments and Suggestions for Authors
- The authors must generate original figures for Figures 2 through 7, rather than reusing or reproducing existing visuals.
- The titles of Figures 8 and 9 are too lengthy and should be revised to be more concise and specific.
- All results presented in the paper must be directly supported by experimental data.
- The authors are also expected to provide research-based, evidence-driven results that reflect the outcomes of their experiments.
- Additionally, references [1–4], [20], [23], and [24] need to be updated and reformatted to comply with proper research citation standards.

The English could be improved to more clearly express the research.
Author Response
Comments 1: The authors must generate original figures for Figures 2 through 7, rather than reusing or reproducing existing visuals.
Response 1: We only had to regenerate Figures 2 and 3, as they are the only pre-existing figures in our paper (Figure 1 excluded). Attached is a file containing all of our diagrams.
Comments 2: The titles of Figures 8 and 9 are too lengthy and should be revised to be more concise and specific.
Response 2: The figure descriptions of Figures 8 and 9 have been revised to better articulate our point.
Comments 3: All results presented in the paper must be directly supported by experimental data.
Response 3: Thank you for pointing this out. We have uploaded our scripts for the scenarios described in the paper; the Jupyter notebook used for testing the CARLA environment/assets is also available.
Comments 4: The authors are also expected to provide research-based, evidence-driven results that reflect the outcomes of their experiments.
Response 4: We appreciate the emphasis on evidence-driven reporting. We have restructured Section 6 and Table 3 to present each experimental outcome with corresponding descriptive statistics. For example, execution success rates are now reported with mean ± SD and significance levels, directly reflecting our scenario runs under both baseline and pipeline conditions. Our results have been revised and presented with more context to our experiments.
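The mean ± SD figures reported in Table 3 can be computed with standard descriptive statistics, as in the sketch below. The success-rate values here are placeholders for illustration only, not the paper's measured data.

```python
from statistics import mean, stdev

# Placeholder per-run execution success rates (not the paper's data):
baseline_success = [0.62, 0.58, 0.65, 0.60, 0.55]
pipeline_success = [0.85, 0.88, 0.82, 0.90, 0.87]


def summarize(rates):
    """Return (mean, sample standard deviation) for a list of rates."""
    return mean(rates), stdev(rates)


b_mean, b_sd = summarize(baseline_success)
p_mean, p_sd = summarize(pipeline_success)
print(f"baseline: {b_mean:.2f} +/- {b_sd:.2f}")
print(f"pipeline: {p_mean:.2f} +/- {p_sd:.2f}")
```

A significance level for the baseline-versus-pipeline comparison would come from an appropriate hypothesis test on the per-run rates (e.g., via `scipy.stats`), applied to the actual experimental data.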
Comments 5: Additionally, references [1–4], [20], [23], and [24] need to be updated and reformatted to comply with proper research citation standards.
Response 5: Thank you for highlighting these mistakes. The references have been reformatted properly to comply with proper research citation standards.