Open Access Article
Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs
by Norah Aloufi * and Abdulmajeed Aljuhani
College of Computer Science and Engineering, Taibah University, Medina 41411, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9223; https://doi.org/10.3390/app15169223
Submission received: 26 July 2025 / Revised: 18 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025
Abstract
As large language models (LLMs) are increasingly integrated into software development, there is a growing need to assess how effectively they address subtle programming errors in real-world environments. Accordingly, this study investigates the effectiveness of LLMs in identifying syntax errors within large Python code repositories. Building on the Bug In The Code Stack (BICS) benchmark, this research expands the evaluation to include additional models, such as DeepSeek and Grok, while assessing their ability to detect errors across varying code lengths and depths. Two prompting strategies, two-shot prompting and role-based prompting, were employed to compare the performance of models including DeepSeek-Chat, DeepSeek-Reasoner, DeepSeek-Coder, and Grok-2-Latest, with GPT-4o serving as the baseline. The findings indicate that the DeepSeek models generally outperformed GPT-4o in terms of accuracy (Acc). Notably, DeepSeek-Reasoner exhibited the highest overall performance, achieving an Acc of 86.6% and surpassing all other models, particularly when the two prompting strategies were combined. Nevertheless, all models demonstrated decreased Acc with increasing input length and consistently struggled with certain error types, such as missing quotations (MQo). This work provides insight into the current strengths and weaknesses of LLMs in real-world debugging environments, thereby informing ongoing efforts to improve automated software tools.
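As a concrete illustration of the two prompting strategies evaluated in the study, the sketch below combines a role-based system prompt with two in-context examples (two-shot prompting) and asks an OpenAI-compatible chat endpoint to locate a planted syntax error in a Python snippet. This is a minimal sketch under stated assumptions, not the authors' evaluation harness: the base URL, model identifier, prompt wording, and example snippets are illustrative assumptions.

# Minimal sketch of combined role-based + two-shot prompting for syntax error
# detection. Endpoint, model name, and prompt text are assumptions, not the
# paper's exact setup.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",               # placeholder credential
)

# Role-based prompting: cast the model as a Python syntax checker.
system_prompt = (
    "You are an expert Python code reviewer. Given a code snippet, report "
    "the line number and type of any syntax error, or reply 'no error'."
)

# Two-shot prompting: two worked examples in the planted-bug style used by
# BICS-like benchmarks (hypothetical examples for illustration).
shot_1_code = "def add(a, b)\n    return a + b\n"
shot_1_answer = "Line 1: missing colon after the function definition."
shot_2_code = "print('hello)\n"
shot_2_answer = "Line 1: missing closing quotation mark in the string literal."

# Target snippet containing a planted syntax error (unclosed bracket).
target_code = "values = [1, 2, 3\nprint(sum(values))\n"

response = client.chat.completions.create(
    model="deepseek-chat",  # assumed model identifier
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Find the syntax error:\n{shot_1_code}"},
        {"role": "assistant", "content": shot_1_answer},
        {"role": "user", "content": f"Find the syntax error:\n{shot_2_code}"},
        {"role": "assistant", "content": shot_2_answer},
        {"role": "user", "content": f"Find the syntax error:\n{target_code}"},
    ],
    temperature=0,  # deterministic output for benchmarking
)

print(response.choices[0].message.content)

In the evaluation described in the abstract, a prompt of this shape would presumably be repeated over inputs of increasing code length and bug depth, with the model's reported line number and error type compared against the planted bug.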