Article

Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs

College of Computer Science and Engineering, Taibah University, Medina 41411, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9223; https://doi.org/10.3390/app15169223
Submission received: 26 July 2025 / Revised: 18 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025

Abstract

As large language models (LLMs) are increasingly integrated into software development, there is a growing need to assess how effectively they address subtle programming errors in real-world environments. Accordingly, this study investigates the effectiveness of LLMs in identifying syntax errors within large Python code repositories. Building on the Bug In The Code Stack (BICS) benchmark, this research expands the evaluation to include additional models, such as DeepSeek and Grok, while assessing their ability to detect errors across varying code lengths and depths. Two prompting strategies—two-shot and role-based prompting—were employed to compare the performance of models including DeepSeek-Chat, DeepSeek-Reasoner, DeepSeek-Coder, and Grok-2-Latest, with GPT-4o serving as the baseline. The findings indicate that the DeepSeek models generally outperformed GPT-4o in terms of accuracy (Acc). Notably, DeepSeek-Reasoner exhibited the highest overall performance, achieving an Acc of 86.6% and surpassing all other models, particularly when the integrated prompting strategies were used. Nevertheless, all models showed decreased Acc with increasing input length and consistently struggled with certain types of errors, such as missing quotations (MQo). This work provides insight into the current strengths and weaknesses of LLMs within real-world debugging environments, thereby informing ongoing efforts to improve automated software tools.
Keywords: large language models (LLMs); syntax error detection; prompt engineering; automated debugging
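
To make the evaluated setup concrete, the sketch below shows how a two-shot, role-based prompt for Python syntax error detection might be assembled and sent to an OpenAI-compatible chat endpoint. This is a minimal illustration, not the authors' benchmark harness: the system-prompt wording, the two few-shot examples, the model name, and the base_url are illustrative assumptions.

```python
"""Minimal sketch of a two-shot + role-based prompting setup for syntax error
detection. Not the authors' exact harness: prompt wording, few-shot examples,
model name, and base_url are assumptions for an OpenAI-compatible endpoint."""

from openai import OpenAI

# Role-based instruction: frame the model as a Python syntax reviewer.
SYSTEM_PROMPT = (
    "You are an expert Python code reviewer. Given a code listing, report the "
    "line number and type of the single syntax error it contains."
)

# Two-shot examples: each pairs a buggy snippet with the expected diagnosis.
FEW_SHOT = [
    {"role": "user", "content": "def add(a, b)\n    return a + b"},
    {"role": "assistant", "content": "Line 1: missing colon after the function signature."},
    {"role": "user", "content": "print('hello)"},
    {"role": "assistant", "content": "Line 1: missing closing quotation mark."},
]

def detect_syntax_error(code: str, model: str = "deepseek-chat") -> str:
    """Send the buggy code to the model and return its diagnosis."""
    client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *FEW_SHOT,
        {"role": "user", "content": code},
    ]
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    buggy = "for i in range(10)\n    print(i)"  # missing colon
    print(detect_syntax_error(buggy))
```

The same message structure can be pointed at other providers (e.g., GPT-4o or Grok) by swapping the base_url and model name, which is how a cross-model comparison of identical prompts is typically run.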

Share and Cite

MDPI and ACS Style

Aloufi, N.; Aljuhani, A. Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs. Appl. Sci. 2025, 15, 9223. https://doi.org/10.3390/app15169223

AMA Style

Aloufi N, Aljuhani A. Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs. Applied Sciences. 2025; 15(16):9223. https://doi.org/10.3390/app15169223

Chicago/Turabian Style

Aloufi, Norah, and Abdulmajeed Aljuhani. 2025. "Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs" Applied Sciences 15, no. 16: 9223. https://doi.org/10.3390/app15169223

APA Style

Aloufi, N., & Aljuhani, A. (2025). Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs. Applied Sciences, 15(16), 9223. https://doi.org/10.3390/app15169223

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
