Applied Sciences
  • This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

25 November 2025

A Comparative Study on Self-Driving Scenario Code Generation Through Prompt Engineering Based on LLM-Specific Characteristics

School of AI Convergence, Sungshin Women’s University, Seoul 02844, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence for Advancing Connected and Autonomous Vehicles

Abstract

Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their applicability to autonomous vehicle control has not been sufficiently explored. This study examines whether LLMs can generate executable MATLAB code for software-defined vehicle (SDV) scenarios, comparing five models: GPT-4, Gemini 2.5 Pro, Claude Sonnet 4.0, CodeLlama-13B-Instruct, and StarCoder2. Thirteen standardised prompts were applied across three scenario types: programming-based driving scenarios, inertial sensor-based simulations, and vehicle parking scenarios. Multiple automated evaluation metrics (BLEU, ROUGE-L, ChrF, Spec-Compliance, and Runtime-Sanity) were used to assess code executability, accuracy, and completeness. The results showed that GPT-4 achieved the highest score (0.54) in the parking scenario, with an overall average of 0.27, followed by Gemini 2.5 Pro at 0.26. Commercial models achieved execution success rates above 60% across all scenarios, whereas open-source models such as CodeLlama and StarCoder2 remained below 20%. Furthermore, the parking scenario yielded the lowest average score (0.19), confirming that complex tasks involving sensor synchronisation and trajectory control are a common limitation across all models. This study presents a new benchmark for quantitatively evaluating the quality of SDV control code generated by LLMs, and empirically demonstrates that prompt design and task complexity critically influence model reliability and real-world applicability.
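To illustrate the surface-similarity side of the evaluation, the sketch below computes a simplified sentence-level BLEU (clipped n-gram precision with a brevity penalty) for a generated code string against a reference. This is a minimal, self-contained illustration only: the whitespace tokenisation, smoothing constant, and function names are assumptions, not the paper's actual evaluation pipeline, which also uses ROUGE-L, ChrF, and execution-based checks.

```python
from collections import Counter
import math


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Tokenisation is plain whitespace splitting (an assumption)."""
    cand = candidate.split()
    ref = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c = ngram_counts(cand, n)
        r = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, r[g]) for g, count in c.items())
        total = max(sum(c.values()), 1)
        # Tiny floor avoids log(0) when there is no overlap at some order.
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# A candidate identical to the reference scores 1.0.
print(simple_bleu("v = 10; x = v * t;", "v = 10; x = v * t;"))  # → 1.0
```

Metrics of this kind reward token-level overlap with a reference solution, which is why the study pairs them with Spec-Compliance and Runtime-Sanity checks: code can be textually close to a reference yet fail to execute, and vice versa.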
