Abstract
Large Language Models (LLMs) have demonstrated strong code generation capabilities, yet their applicability to autonomous vehicle control remains insufficiently explored. This study investigates whether LLMs can generate executable MATLAB code for software-defined vehicle (SDV) scenarios, comparing five models: GPT-4, Gemini 2.5 Pro, Claude Sonnet 4.0, CodeLlama-13B-Instruct, and StarCoder2. Thirteen standardised prompts were applied across three scenario types: programming-based driving scenarios, inertial sensor-based simulations, and vehicle parking scenarios. Five automated evaluation metrics (BLEU, ROUGE-L, ChrF, Spec-Compliance, and Runtime-Sanity) were used to assess the executability, accuracy, and completeness of the generated code. GPT-4 achieved the highest single score (0.54, in the parking scenario) and the highest overall average (0.27), followed by Gemini 2.5 Pro at 0.26. Commercial models achieved execution success rates above 60% across all scenarios, whereas open-source models such as CodeLlama and StarCoder2 remained below 20%. The parking scenario also yielded the lowest average score (0.19), indicating that complex tasks involving sensor synchronisation and trajectory control are a common limitation across all models. This study presents a new benchmark for quantitatively evaluating the quality of LLM-generated SDV control code and empirically demonstrates that prompt design and task complexity critically influence model reliability and real-world applicability.