Applied Sciences
  • Article
  • Open Access

21 August 2025

Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs

College of Computer Science and Engineering, Taibah University, Medina 41411, Saudi Arabia
* Author to whom correspondence should be addressed.

Abstract

As large language models (LLMs) are increasingly integrated into software development, there is a growing need to assess how effectively they address subtle programming errors in real-world environments. Accordingly, this study investigates the effectiveness of LLMs in identifying syntax errors within large Python code repositories. Building on the bug in the code stack (BICS) benchmark, this research expands the evaluation to include additional models, such as DeepSeek and Grok, while assessing their ability to detect errors across varying code lengths and depths. Two prompting strategies—two-shot and role-based prompting—were employed to compare the performance of models including DeepSeek-Chat, DeepSeek-Reasoner, DeepSeek-Coder, and Grok-2-Latest with GPT-4o serving as the baseline. The findings indicate that the DeepSeek models generally outperformed GPT-4o in terms of accuracy (Acc). Notably, DeepSeek-Reasoner exhibited the highest overall performance, achieving an Acc of 86.6% and surpassing all other models, particularly when integrated prompting strategies were used. Nevertheless, all models demonstrated decreased Acc with increasing input length and consistently struggled with certain types of errors, such as missing quotations (MQo). This work provides insight into the current strengths and weaknesses of LLMs within real-world debugging environments, thereby informing ongoing efforts to improve automated software tools.

1. Introduction

Syntax errors rank among the most common types of programming errors encountered during software development. These errors occur when the written code does not conform to the grammatical rules of the programming language, such as a missing colon (MC), mismatched quotation (MQ), missing parenthesis (MP), or the use of a reserved word as a variable identifier [1]. Although such errors may seem minor, they can seriously disrupt the development process by preventing the code from being properly compiled or interpreted, ultimately hindering the execution of the program. In large-scale software projects that involve thousands of lines of code, syntax errors can impede developers’ productivity, as tracking and fixing them often takes a lot of time and can be frustrating [2]. With the advancement of integrated development environments (IDEs) and code analysis tools, new solutions—particularly those powered by artificial intelligence (AI) and machine learning (ML)—are emerging to detect and correct these errors more quickly and efficiently. These improvements help developers write higher-quality code and enhance the overall efficiency of the development process.
Over the past few years, ML has emerged as a key tool in many areas of software development. For example, in software refactoring, ML algorithms are trained on massive datasets containing millions of actual refactorings from real-world projects. By learning from classes and methods that have undergone refactoring in practice, the resulting models can provide more reliable refactoring recommendations to developers [3]. On the other hand, ML models can improve the detection and correction of syntax errors in software development. Compared to traditional methods, including compiler diagnostics that produce ambiguous or generic messages, ML-based solutions have the potential to provide more useful and automated suggestions. By being trained on massive amounts of syntactically correct code, these models can recognize typical error patterns and support the suggestion of precise fixes [4]. This type of automated assistance has the potential to save developers time and reduce the need for manual debugging.
Building on the foundations of ML, deep learning (DL) has significantly advanced the field of software development by enabling more intelligent and automated solutions to complex programming tasks. For example, in the field of bug detection and correction, neural networks can be trained on massive repositories of code to identify common and subtle programming errors. This approach often outperforms traditional static analysis tools. These DL models are ideally suited for capturing long-term dependencies, such as MP or invalid logical operators, which can be difficult for rule-based systems to detect and are cumbersome for developers to address in cases of very complex code [5]. In addition to bug detection, DL is also used in other areas of software development, such as code refactoring [6], test case generation [7], and code summarization [8]. These capabilities have revolutionized the development workflow, allowing developers to write more secure and efficient code with less manual effort.
As ML technologies advance, LLMs have demonstrated remarkable potential in code-related applications. Originally developed for human language processing, LLMs such as GPT [9] have proven viable in comprehending and generating programming code by virtue of their ability to model both syntax and semantics. Their strength lies in learning from massive, diverse datasets, which allows them to perform tasks such as code completion [10], bug detection [11], and program synthesis [12]. Recent benchmarks such as needle-in-a-haystack (NIAH) [13] and bug in the code stack (BICS) [14] evaluate how effectively these models can locate subtle bugs or retrieve relevant information across large codebases. As LLMs continue to improve, they are expected to play a more significant role in enhancing developer productivity through real-time suggestions and intelligent debugging assistance.
Several research studies have analyzed the process of extracting critical small data points from extensive text collections by using the NIAH benchmark to evaluate LLMs on their ability to locate such information within extensive texts. NIAH specializes in retrieving text, but researchers now utilize LLM technology for software-related projects by evaluating systems based on their capacity to detect errors in codebases. BICS provides a systematic analysis of LLM debugging performance with Python code as an evaluation platform. The BICS benchmark system creates syntax errors at different depths within extensive code stacks to evaluate GPT-4o, Claude 3 [15], Gemini 1.5 [16], and additional models regarding their error discovery competencies. Research results demonstrate how longer context affects system performance, since certain models experience a decline in performance when the code length increases.
The contribution of this paper is to investigate the effectiveness of various LLMs, such as DeepSeek [17] and Grok [18], in detecting syntax errors under varying code lengths, code depths, and prompt designs. It further explores the impact of prompt engineering on enhancing model performance. It also highlights the critical limitations in current LLMs for accurate syntax error detection and their effectiveness in a real-world setting. Finally, the paper provides insights and recommendations for future improvements in model capabilities.

3. Framework

This research investigates the ability of LLMs to detect syntax errors in large Python codebases of up to 16,000 tokens. Extending the BICS benchmark [14], we introduce Python scripts with intentional syntax errors, such as MC and mismatched brackets (MB), inserted at differing depths in the code. The paper analyzes the performance of three DeepSeek models (DeepSeek-Chat, DeepSeek-Reasoner, and DeepSeek-Coder) alongside Grok-2-Latest under two prompting paradigms: two-shot prompting and role-based prompting. We then compare the performance of these models with the GPT-4o baseline established by BICS. To facilitate an in-depth analysis, we compare the models on the following metrics: Acc, standard deviation (STD), and the time taken to run the complete evaluation.
The proposed framework is divided into several stages, starting with cleaning the dataset, followed by preprocessing steps such as generating the prompt and preparing the test parameters, and ending with predicting the results using the model, as depicted in Figure 1.
Figure 1. Proposed framework.

3.1. Dataset

The dataset used for this study is the Alpaca dataset [30], a collection of 18,000 Python code samples with accompanying text commands. It is primarily designed for code generation tasks based on text instructions, with each sample containing a textual description of the task and its corresponding code. The dataset encompasses a broad range of programming concepts, including functions, loops, conditionals, list comprehensions, dictionaries, classes, and imports. This diversity makes it a comprehensive resource for learning and identifying syntax errors in Python. Example tasks range from basic operations, such as sequence summation or palindrome checking, to more advanced ones, such as random password generation or making web requests, making the dataset valuable for both educational use and automated code analysis. In this paper, a modified version of the dataset was used: it was preprocessed to meet the requirements of the experiment before being passed to the LLMs, yielding Python code snippets paired with altered versions that each contain a specific syntax error. The samples are constructed to embody typical programming errors and their precise positions within the code. Each sample contains the correct code, along with variants that have errors, including but not limited to MC, MB, MP, MQo, MCo, and incorrect usage of reserved keywords as identifiers (KID). Line numbers at the corresponding positions directly indicate the locations of all errors. Null values in the dataset were filled, and a line numbering function was created; it is later used to identify the line on which the syntax error occurred.
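As a concrete illustration of the line numbering step, the sketch below shows one possible implementation; the function name and output format are assumptions for illustration rather than the exact code used in the preprocessing pipeline.

```python
def number_lines(code: str) -> str:
    """Prefix every line of a code snippet with its 1-based line number.

    Illustrative assumption: numbering the lines lets the evaluation state
    exactly which line holds the injected syntax error.
    """
    return "\n".join(f"{i}: {line}" for i, line in enumerate(code.splitlines(), start=1))


# Example: a snippet whose first line is missing its colon.
snippet = "def add(a, b)\n    return a + b"
print(number_lines(snippet))
# 1: def add(a, b)
# 2:     return a + b
```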

3.2. Preprocessing

Preprocessing is one of the most important stages of data preparation when working with LLMs. Since LLMs rely heavily on the quality and structure of their input data, effective preprocessing ensures that the model learns valuable patterns and generates coherent output. Without proper preprocessing, the model may struggle with inconsistencies or irrelevant content, leading to reduced performance and poor generalization. One essential preprocessing step applied in this study was tokenization, which splits text into smaller units called tokens. Tokenization improves the model's capacity to understand and process language effectively and makes the input data more structured and compatible with the model architecture [31]. Another important preprocessing step involves building a haystack and inserting the syntax error. This step builds a stack of Python code of the target length and inserts a syntax error at the target depth, as defined by the test parameters, using the syntax error types outlined in Section 3.2.1. The model parameters, on the other hand, include the API key, a unique authentication token that grants authorized access to LLM services via their API endpoints, and the context size of the model, which is used to ensure that the entire prompt does not exceed it.
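A minimal sketch of such a haystack builder is shown below. It assumes the tiktoken tokenizer for token counting; the helpers `corrupt` (which applies one of the error types in Section 3.2.1 and returns the corrupted snippet together with its error label) and `snippets` (a list of clean code samples) are hypothetical placeholders rather than the exact BICS implementation.

```python
import random
import tiktoken  # assumed tokenizer, used here only for counting tokens

enc = tiktoken.get_encoding("cl100k_base")

def build_haystack(snippets, target_tokens, depth_pct, corrupt):
    """Concatenate clean snippets up to roughly target_tokens and inject one error.

    depth_pct (0-100) controls where the corrupted snippet is placed relative to
    the total stack, mirroring the depth parameter defined by the test settings.
    """
    stack, used = [], 0
    while used < target_tokens:
        snippet = random.choice(snippets)
        stack.append(snippet)
        used += len(enc.encode(snippet))

    bad_snippet, error_label = corrupt(random.choice(snippets))
    insert_at = min(round(len(stack) * depth_pct / 100), len(stack))
    stack.insert(insert_at, bad_snippet)

    return "\n\n".join(stack), error_label
```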

3.2.1. Syntax Errors

In Python, several common syntax errors can disrupt code execution, each with distinct characteristics and causes. The syntax errors addressed in this experiment are listed in Table 1. These errors can be identified through careful code review, the use of tools, or IDE features that highlight syntax issues, such as ensuring proper punctuation, matching delimiters, and avoiding reserved words.
Table 1. Syntax error types.
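For illustration, the fragment below shows how a few of the error types in Table 1 can be produced by mutating an otherwise correct snippet; these simplified string mutations are assumptions for demonstration, not the exact generation rules used in the benchmark.

```python
# A correct snippet and a few assumed mutations corresponding to Table 1 labels.
CORRECT = 'def greet(name):\n    print("Hello, " + name)'

MUTATIONS = {
    "MC":  CORRECT.replace("):", ")", 1),                # missing colon after the def header
    "MP":  CORRECT.replace("name)", "name", 1),          # missing closing parenthesis
    "MQo": CORRECT.replace('"Hello, "', '"Hello, ', 1),  # missing closing quotation
    "KID": CORRECT.replace("name", "class", 1),          # reserved keyword used as identifier
}

for label, bad_code in MUTATIONS.items():
    print(label, repr(bad_code))
```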

3.2.2. Prompt Engineering

We assess each model under two prompting strategies: two-shot prompting and role-based prompting. Two-shot prompting is a technique in which the model is given two examples of solving a particular type of problem or question before being asked to solve a new example. The goal is to guide the model towards its solution by learning from the provided examples [32]. We provide two examples of syntax errors before the target task, as shown in Figure 2.
Figure 2. Example of a two-shot prompt used in model evaluation.
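A compact sketch of how such a two-shot prompt can be assembled is given below; the wording and the two worked examples are illustrative assumptions and do not reproduce the exact prompt shown in Figure 2.

```python
TWO_SHOT_EXAMPLES = """\
Example 1:
1: def add(a, b)
2:     return a + b
Answer: line 1, missing colon

Example 2:
1: print('total is", total)
Answer: line 1, mismatched quotation
"""

def two_shot_prompt(numbered_code: str) -> str:
    """Assumed template: two solved examples followed by the target code stack."""
    return (
        "Find the single syntax error in the following Python code. "
        "Report the line number and the error type.\n\n"
        f"{TWO_SHOT_EXAMPLES}\n"
        "Now analyze this code:\n"
        f"{numbered_code}\n"
        "Answer:"
    )
```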
Role-based prompting is a technique that leverages the inherent ability of LLMs to adopt various personas, providing them with a specific identity and context to improve their performance [33]. In this study, we assign a role before presenting the two-shot examples, as shown in Figure 3.
Figure 3. Example of a role-based prompt used in model evaluation. ** represents exponentiation in Python code.
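Under the same assumptions, role-based prompting can be sketched as a system message that assigns the persona before the two-shot examples are presented; the role text below is illustrative and does not reproduce the exact wording of Figure 3.

```python
ROLE = (
    "You are an expert Python code reviewer who specializes in spotting "
    "syntax errors in large codebases."
)

def role_based_messages(numbered_code: str) -> list[dict]:
    """Assumed chat layout: the persona goes into the system message and the
    two-shot examples plus the code stack go into the user message
    (two_shot_prompt is the helper sketched in Section 3.2.2 above)."""
    return [
        {"role": "system", "content": ROLE},
        {"role": "user", "content": two_shot_prompt(numbered_code)},
    ]
```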

3.3. LLM Models

Four LLMs were tested and compared against the BICS baseline: three from DeepSeek and one from xAI, as shown in Table 2. The LLMs were prompted not only to determine whether there was a syntax error, but also to predict the specific type of syntax error and the exact line number where the error occurred. This approach simulates real debugging scenarios, where programmers rely on their tools to provide both locational and descriptive information.
Table 2. LLMs evaluated in the experiment.
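Because each model must return both the line number and the error type, the evaluation also needs to parse its replies. The sketch below assumes a free-text answer such as "line 12, missing colon"; the regular expression and the label mapping are illustrative assumptions rather than the parser actually used.

```python
import re

# Assumed mapping from free-text labels to the abbreviations in Table 1.
LABELS = {
    "missing colon": "MC",
    "missing quotation": "MQo",
    "missing parenthesis": "MP",
    "missing comma": "MCo",
    "mismatched quotation": "MQ",
    "mismatched bracket": "MB",
    "keyword as identifier": "KID",
}

def parse_reply(reply: str):
    """Extract (line_number, error_code) from a reply like 'line 12, missing colon'."""
    line_match = re.search(r"line\s+(\d+)", reply, flags=re.IGNORECASE)
    line_no = int(line_match.group(1)) if line_match else None
    reply_lower = reply.lower()
    error = next((code for text, code in LABELS.items() if text in reply_lower), None)
    return line_no, error
```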

4. Experiment

In this experiment, we tested the models at different lengths and depths. Length refers to the desired size of the combined code snippets, specifically the total token count of the final concatenated string, and was tested at [500, 1k, 2k, 4k, 8k, 16k] tokens, measured with a tokenizer to ensure the output aligns with LLM input limits and the various testing scenarios. Depth, tested at [0%, 25%, 50%, 75%, 100%], indicates the proportion of the target length filled with correct code before the error is inserted, thus controlling the relative position of the error within the overall structure. This is how the haystack was built and the needle (syntax error) inserted. Together, these parameters allow us to investigate a range of conditions by adjusting the total code length and the error's placement. For each combination of code length (e.g., 8k tokens) and error depth (e.g., 50% depth), we conducted 25 test iterations, resulting in 750 total test runs per model and prompting strategy (6 lengths × 5 depths × 25 iterations). This procedure systematically combined these values and repeated each combination multiple times to gather thorough and reliable insights into their impact. For example, when the length is 1k and the depth is 50%, the combination was tested 25 times, each time with a different haystack and type of error. Each time, the model was fed a prompt containing the haystack and the error, ensuring that the entire prompt remained within the model's context window. All experiments were conducted using Google Colab and Python 3.12.11 to ensure consistent execution across models and test iterations.
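The test harness can therefore be sketched as a triple loop over lengths, depths, and iterations. The sketch below reuses the hypothetical helpers from Section 3 (build_haystack, number_lines, two_shot_prompt, parse_reply), and query_model stands in for the API call of whichever model is being evaluated; it is an assumed outline, not the exact experiment script.

```python
LENGTHS = [500, 1_000, 2_000, 4_000, 8_000, 16_000]   # target token counts
DEPTHS = [0, 25, 50, 75, 100]                          # error position as % of length
ITERATIONS = 25                                        # runs per (length, depth) pair

def run_experiment(snippets, corrupt, query_model):
    """Assumed harness: 6 lengths x 5 depths x 25 iterations = 750 runs per model."""
    results = []
    for length in LENGTHS:
        for depth in DEPTHS:
            for _ in range(ITERATIONS):
                haystack, expected = build_haystack(snippets, length, depth, corrupt)
                reply = query_model(two_shot_prompt(number_lines(haystack)))
                results.append((length, depth, expected, parse_reply(reply)))
    return results
```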
The performance of each model was evaluated based on three main metrics: Acc, STD, and execution time. Acc measures how often the model predicted correctly; it is the proportion of correct predictions (true positives (TP) and true negatives (TN)) to all predictions (TP, TN, false positives (FP), and false negatives (FN)), as presented in Equation (1).
$\mathrm{Acc} = \dfrac{TP + TN}{TP + FP + TN + FN}$ (1)
STD measures how consistently the model predicts correctly. It is a statistical metric that quantifies the degree to which individual predictions deviate from the average prediction. A lower STD indicates more consistent predictions, whereas a higher STD reflects greater variability. It is calculated as the square root of the average of the squared differences between each prediction and the mean, where $x_i$ represents each data point, $\mu$ is the mean of the data, and $N$ is the total number of data points, as presented in Equation (2).
$\mathrm{STD} = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ (2)
Execution time refers to the amount of time it takes for a model to complete a specific task or prediction. It is a crucial performance metric, particularly when comparing models in terms of efficiency and usability in real-time applications. A shorter execution time indicates a faster and more efficient model, which is beneficial for time-sensitive applications. In this evaluation, the time taken by each model during testing was measured and compared to assess their computational efficiency.
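A minimal sketch of how these three metrics can be computed from per-run outcomes is given below. It scores each run as 100 (correct) or 0 (incorrect) so that the mean matches Equation (1) expressed as a percentage and the population standard deviation matches Equation (2) on the same scale, and it uses time.perf_counter for wall-clock timing; this is an assumed scoring layout, not the exact evaluation code.

```python
import time
from statistics import pstdev

def evaluate(outcomes: list[int]) -> tuple[float, float]:
    """Compute Acc and STD from per-run scores of 100 (correct) or 0 (incorrect)."""
    acc = sum(outcomes) / len(outcomes)   # Equation (1) as a percentage
    std = pstdev(outcomes)                # Equation (2), population standard deviation
    return acc, std

start = time.perf_counter()
# ... run the 750 test iterations for one model here ...
elapsed_seconds = time.perf_counter() - start   # reported as the execution time
```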

5. Results

This section is divided into three subsections. The first presents the results obtained using two-shot prompting, while the second focuses on role-based prompting combined with two-shot prompting. For each setting, the evaluation is based on two Acc metrics: the Avg. Acc per input length, calculated by averaging the model’s performance across all error depths, and the overall Avg. Acc, which represents the Avg. Acc of the model across all lengths and error types. The last subsection highlights the computational time taken by each model to complete the experiment. These metrics allow for an overall comparison of model performance in terms of code size, complexity, STD, and the time required for each to complete all iterations.

5.1. Two-Shot Prompting

In this experiment using two-shot prompting, we evaluated four LLMs: DeepSeek-Chat, DeepSeek-Reasoner, DeepSeek-Coder, and Grok-2-Latest, across varying target code lengths and depths, with 25 iterations per combination, yielding a total of 750 test runs per model to assess their Acc in handling code with embedded errors. The results reveal distinct performance trends: the DeepSeek models consistently achieved high Avg. Acc, ranging from 83.3% to 83.6%, but exhibited a decline as token length increased, as shown in Figure 4. DeepSeek-Chat dropped from 85.6% at 500 tokens to 72.0% at 16k, and DeepSeek-Reasoner from 92.8% to 66.4%, suggesting sensitivity to longer contexts despite their large windows. This was accompanied by high STDs, ranging from ±37.0 to ±37.3, indicating variability across depths and lengths. DeepSeek-Coder maintained relatively stable performance, ranging from 88.8% to 72.8%, slightly outperforming its counterparts, likely due to its code-specific design. In contrast, Grok-2-Latest exhibited the lowest Avg. Acc of 50.2%, with a sharp decline from 64.0% at 500 tokens to 36.0% at 16k, and the highest STD of ±50.0, reflecting inconsistent performance and potential struggles with error detection in longer or more complex contexts. For benchmarking, GPT-4o demonstrated a balanced Avg. Acc of 81.2% with a low STD of ±5.5, maintaining steady performance from 82% to 74% across all lengths, which highlights its robustness and reliability. Overall, while the DeepSeek models excelled at shorter lengths, Grok-2-Latest’s weaker performance and GPT-4o’s consistent results underscore the varying capabilities of these models in processing extended code contexts with embedded bugs. Table 3 shows the results for the two-shot prompting setting.
Figure 4. Acc of models using two-shot prompting across different input lengths, *GPT-4o [14].
Table 3. Acc scores of models with two-shot prompting at varying input lengths.
Additionally, we analyzed model performance across error types using two-shot prompting. DeepSeek-Chat achieved an Avg. Acc of 83.3%, with strong performance on MC at 95.6% and MB at 94.0%, but weaker results on MQ at 67.4%. DeepSeek-Reasoner, also averaging 83.3%, excelled on KID at 89.8% and MCo at 88.7%, yet struggled with MQo at 51.3%. DeepSeek-Coder, with the highest Avg. Acc of 83.6%, performed best on MP at 95.8% and MC at 94.0%, though it dipped to 73.1% on MQ, demonstrating robustness across most error types. Grok-2-Latest, averaging 50.2%, showed inconsistent results, with a peak on MC at 82.1% but notably poor performance on MP at 29.1% and MQ at 30.9%, reflecting challenges with certain syntax errors. For comparison, the GPT-4o benchmark from prior research averaged 81.2%, with high marks on MC at 93%, MP at 93%, and MB at 96%, but a significant drop on MQo at 18%, suggesting variability in error-handling capabilities. These findings highlight that while DeepSeek models generally handle diverse errors well, Grok-2-Latest struggles with specific syntax issues, and GPT-4o provides a reliable baseline—each model’s performance varies by error type under two-shot prompting conditions. Table 4 presents the results for each error type under the two-shot prompting setup.
Table 4. Acc of each model across error types under two-shot prompting.

5.2. Role-Based and Two-Shot Prompting

In a separate evaluation using role-based and two-shot prompting, we tested DeepSeek-Chat, DeepSeek-Reasoner, DeepSeek-Coder, Grok-2-Latest, and GPT-4o across the same target lengths and depths, again with 750 test runs per model. The results with this combined prompting strategy showed improved performance trends: DeepSeek-Chat achieved an Avg. Acc of 83.8%, with scores ranging from 87.2% at 500 tokens to 68.8% at 16k; STD was ±36.8. DeepSeek-Reasoner reached the highest average of 86.6%, with a peak of 92.0% at 1000 tokens and a low of 78.4% at 16k; STD was ±34.0. DeepSeek-Coder averaged 81.4%, ranging from 88.0% at 1000 tokens to 67.2% at 16k; STD was ±38.9, as shown in Figure 5. Compared to two-shot prompting alone, role-based prompting enhanced overall accuracies, particularly for DeepSeek-Reasoner, whose 16k score rose from 66.4% to 78.4%, although DeepSeek-Chat still declined slightly at 16k (from 72.0% to 68.8%) and variability remained high. This suggests that adding role-based context to two-shot prompting bolsters error detection across diverse code lengths, with DeepSeek-Reasoner showing the most consistent gains, while still reflecting challenges with extended contexts, as indicated by the persistent drop-off at 16k tokens across all models. Grok-2-Latest achieved an Avg. Acc of 59.3%, ranging from 80.8% at 500 tokens to 36.8% at 16k, with an STD of ±49.1. GPT-4o reached an Avg. Acc of 73.8%, with scores from 80.8% at 500 tokens to 64.0% at 16k, and an STD of ±43.9. Table 5 shows the results with role-based and two-shot prompting.
Figure 5. Acc variation across models and input lengths under role-based and two-shot prompting.
Table 5. Acc scores for each model using role-based and two-shot prompting.
When examining performance across specific error types with role-based and two-shot prompting, DeepSeek-Chat achieved an Avg. Acc of 83.8%, with strong results on MP at 93.4% and MB at 93.4%, but a moderate dip on MQ at 72.8%. DeepSeek-Reasoner topped the group at 86.6%, excelling on MB at 99.1% and MC at 98.2%, while notably struggling with MQo at 40.5%. DeepSeek-Coder averaged 81.4%, performing best on MP at 95.9% and MB at 94.0%, but showing weakness on MQ at 70.1%. Compared to two-shot prompting alone, role-based enhancements improved accuracies across most error types, such as DeepSeek-Chat’s MCo rising from 80.0% to 86.1% and DeepSeek-Reasoner’s KID from 89.8% to 92.0%; however, challenges persisted with MQo, e.g., DeepSeek-Reasoner dropped from 51.3% to 40.5%. These results indicate that the combined prompting strategy enhances error detection for most syntax errors, particularly for DeepSeek-Reasoner, which consistently outperformed others, while highlighting persistent difficulties with MQo across models. This suggests error-type-specific limitations, even with advanced prompting techniques. Grok-2-Latest achieved an Avg. Acc of 59.3%, with individual error type scores ranging from 88.7% on MC to 45.6% on MP. GPT-4o reached an Avg. Acc of 73.8%, performing best on MC at 90.3% and MB at 87.7%, while showing weakness on MQo at 7.0%. Table 6 presents the results for each error type with role-based and two-shot prompting.
Table 6. Acc for each error type using role-based and two-shot prompting.

5.3. Computational Time

We also measured the computational time for the 750 test runs under both prompting strategies. With two-shot prompting alone, DeepSeek-Chat completed in 01:41:20, DeepSeek-Reasoner in 05:49:58, DeepSeek-Coder in 01:35:16, and Grok-2-Latest in a notably faster 00:42:47, reflecting varying processing efficiencies, with Grok-2-Latest being the quickest despite its lower Acc. For the benchmark GPT-4o, computational time is not reported. Using role-based and two-shot prompting, DeepSeek-Chat improved to 01:32:40, suggesting a slight efficiency gain, while DeepSeek-Coder took 01:39:27, a minor increase from 01:35:16, and DeepSeek-Reasoner increased significantly to 10:19:10 from 05:49:58, indicating substantial computational overhead, possibly due to its higher Acc and the added complexity in handling role-based context. For Grok-2-Latest, the computational time decreased slightly from 00:42:47 with two-shot prompting to 00:41:51 with role-based and two-shot prompting, further underscoring its relative efficiency. Table 7 shows the time taken for each model.
Table 7. Computation time taken by each model.
Figure 6 presents a series of confusion matrix heatmaps that illustrate the performance of each model across varying context lengths and error depths. Panels (a), (c), (e), (g), and (i) reflect results under two-shot prompting only, while panels (b), (d), (f), (h), and (j) incorporate role-based and two-shot prompting, allowing for direct comparison of strategy effectiveness. The visualizations clearly show that all models experienced a gradual decline in Acc as context length increased, with deeper target depths compounding this effect, most notably in Grok-2-Latest (g), where Acc dropped sharply to as low as 16% at 16k tokens. DeepSeek-Reasoner with role-based and two-shot prompting (d) consistently maintained high Acc across all lengths and depths, especially in mid-range contexts (e.g., 2k–4k tokens), confirming its robustness. Furthermore, the benefit of role-based prompting is evident when comparing (a) to (b), (c) to (d), (e) to (f), and (g) to (h), where each model showed increased stability and Acc, especially at deeper target depths. For example, DeepSeek-Chat improved from 68% to 84% at 16k tokens and 0% depth, and DeepSeek-Coder showed better retention at deeper levels with role-based support. These heatmaps visually reinforce the findings discussed earlier, highlighting DeepSeek-Reasoner’s superior adaptability and Grok-2-Latest’s limitations, while showcasing how prompt engineering plays a critical role in preserving model Acc in increasingly complex coding scenarios.
Figure 6. Confusion matrix of the models. (a) DeepSeek-Chat with two-shot prompting. (b) DeepSeek-Chat with role-based and two-shot prompting. (c) DeepSeek-Reasoner with two-shot prompting. (d) DeepSeek-Reasoner with role-based and two-shot prompting. (e) DeepSeek-Coder with two-shot prompting. (f) DeepSeek-Coder with role-based and two-shot prompting. (g) Grok-2-Latest with two-shot prompting. (h) Grok-2-Latest with role-based and two-shot prompting. (i) GPT-4o with two-shot prompting [14]. (j) GPT-4o with role-based and two-shot prompting.

6. Discussion

The experiment demonstrated that the DeepSeek models consistently outperformed Grok-2-Latest at detecting syntax errors in code snippets of various lengths and depths. With two-shot prompting, the DeepSeek models achieved between 83.3% and 83.6% Avg. Acc, and the addition of role-based prompting brought further improvement, with DeepSeek-Reasoner reaching the highest value at 86.6%. Grok-2-Latest, however, lagged far behind with an Avg. Acc of only 50.2%, which suggests that it may not be well optimized for structured code understanding or syntax-level operations.
Though all tested models support extended context lengths of 64k to 128k tokens, the results reveal a dominant trend: performance decreases as token length increases. For example, DeepSeek-Chat showed a drop in Acc from 85.6% at 500 tokens to 72.0% at 16k tokens under two-shot prompting, demonstrating potential limits in attention span or memory. An exception was GPT-4o [14], which maintained relatively stable performance across the different lengths, scoring an Avg. Acc of 81.2% with a low variance of ±5.5, thus making it a strong baseline.
The use of role-based prompting significantly improved detection rates for the various models, particularly DeepSeek-Reasoner, but at the expense of much longer runtimes. As indicated in Table 7, DeepSeek-Reasoner runtime doubled from 05:49:58 to 10:19:10, and there was evidently a trade-off between Acc and computational efficiency. This overhead can be attributed partly to the longer prompt length introduced by role-based instructions and partly to the model’s internal reasoning process, which may expand during role-based prompting. These factors together highlight the trade-off between richer guidance and computational efficiency. Such a trade-off is an important factor in real deployment settings, where runtime limitations tend to restrict practical deployments.
Despite the gain, even top-performing models were afflicted with some types of errors such as MQo, which shows that some patterns of syntax are still difficult for LLMs to capture regardless of the prompting approach. It is likely that subtle syntax errors often do not strongly disrupt the surrounding context, making them less noticeable to the model. These persistent vulnerabilities also illustrate the value of additional targeted training or fine-tuning on rare or complex syntax errors.
While Avg. Acc indicates overall model performance, high STD values reveal variability and potential instability. For example, the DeepSeek models achieve high Avg. Acc but show large STDs, suggesting that their performance is less predictable across scenarios. In contrast, GPT-4o demonstrates a lower STD, reflecting more stable and reliable behavior. Considering both Acc and STD allows a more comprehensive evaluation, highlighting the trade-off between peak performance and consistency in model behavior.
Overall, the findings demonstrate that the DeepSeek models possess a strong capacity for syntax error detection, particularly for shorter and comparatively less complex inputs, while GPT-4o offers balanced performance and stability. However, practical considerations such as runtime, input length sensitivity, and vulnerability to specific error types must be addressed to enable successful application in real-time or large-scale coding environments.
To address potential data contamination and the limited diversity of error types in the original test set, we constructed an additional self-organized test set that includes novel error types. This dataset was deliberately kept simple in structure, yet distinct from the Alpaca-based evaluation, ensuring that the errors tested were not trivially memorized during pre-training. Specifically, we focused on three new types of syntax errors: indentation error, which arises when code blocks are not properly indented according to Python’s syntax rules; invalid assignment, which occurs when assignment is attempted to an invalid target such as a number; and missing as, a frequent error in Python exception handling or context management statements when the as keyword is omitted. With this dataset, we evaluated the two top-performing models under each prompting strategy: DeepSeek-Coder and DeepSeek-Reasoner. For this additional evaluation, the setup was intentionally lightweight: for each combination of code length and error depth, only three iterations were generated and tested, ensuring a manageable yet illustrative assessment. The resulting analysis provides further evidence of the models’ robustness when facing error types beyond the commonly tested ones, thus offering a more rigorous and diversified assessment of their syntax error detection capabilities.
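For clarity, the fragment below illustrates the three additional error types; the specific snippets are illustrative assumptions and are not items from the self-organized test set.

```python
# Indentation error: the function body is not indented.
bad_indentation = "def square(x):\nreturn x * x"

# Invalid assignment: a literal cannot be an assignment target.
bad_assignment = "5 = total"

# Missing `as`: the alias keyword is omitted in an exception handler.
bad_missing_as = "try:\n    risky()\nexcept ValueError e:\n    print(e)"
```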
The results on the self-organized test set indicate that DeepSeek-Reasoner consistently outperformed DeepSeek-Coder across most code lengths and error types. In terms of overall Acc across token lengths, DeepSeek-Reasoner achieved an Avg. Acc of 81.1%, with a peak of 100% at 8k tokens and a low of 60% at 16k tokens, whereas DeepSeek-Coder averaged 76.6%, ranging from 53.3% at 4k tokens to 86.6% at 16k tokens. Examining performance by error type, DeepSeek-Reasoner performed best on missing as at 90.0%, followed by indentation error at 83.2%, and struggled most with invalid assignment at 70.1%. DeepSeek-Coder achieved perfect Acc on missing as at 100%, moderate performance on invalid assignment at 75.5%, and was weakest on indentation error at 54.3%. These results highlight that while both models can handle novel syntax errors, DeepSeek-Reasoner demonstrates more consistent robustness, particularly across error types and varying code lengths. Table 8 and Table 9 present the Acc of the models for each length and error type.
Table 8. Acc scores of models on the self-organized test set.
Table 9. Acc for each error type on the self-organized test set.

7. Conclusions

The goal of this research is to assess how effectively LLMs can identify and categorize syntax errors in large Python codebases, with the aim of measuring their performance under varying input lengths, depths, error types, and prompting techniques. The results highlight both the strengths and limitations of existing LLMs in identifying syntax errors in large-scale Python codebases. DeepSeek-Reasoner reported the best Avg. Acc among the models tested, especially when utilizing role-based prompting combined with two-shot prompting, indicating the usefulness of contextual augmentation for model guidance. While the DeepSeek models performed well in syntax error detection, several limitations were observed. Firstly, all models were vulnerable to longer input sequences, with Acc decreasing as the context length increased despite their large window capacities. This suggests that effective utilization of longer contexts remains an open problem. In addition, the accuracy gains achieved with role-based prompting came at the expense of a notable runtime increase, especially for DeepSeek-Reasoner, which can restrict its real-world usage to settings where time and resources are not limited. Another notable limitation is the inconsistent performance across certain error types, specifically MQo, which persisted across models and prompting strategies. This indicates that current LLMs may still lack robust internal representations for handling rare or structurally complex syntax patterns. Therefore, while the models are effective in controlled settings, their deployment in real-world coding environments requires further optimization to address issues related to efficiency, generalization, and error-specific handling. These findings open up avenues for additional research in AI-driven software maintenance and for building more robust models that can handle large-scale, complicated programming environments. Future research could expand the analysis to multi-language codebases beyond Python, or increase the input size to evaluate performance at the scale of complete production-level repositories. Future work may also involve testing additional LLMs to explore how newer or alternative models perform in syntax error detection tasks. These directions would provide a fuller picture of model robustness and generalizability across a variety of software development scenarios.

Author Contributions

Conceptualization, N.A. and A.A.; investigation, N.A. and A.A.; methodology, N.A. and A.A.; project administration, N.A.; resources, N.A. and A.A.; supervision, A.A.; validation, N.A. and A.A.; writing—original draft, N.A. and A.A.; writing—review and editing, N.A. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available and can be accessed at the following Hugging Face repository: https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca (accessed on 20 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMs  Large Language Models
MC  Missing Colon
MQo  Missing Quotation
MP  Missing Parenthesis
MCo  Missing Comma
MQ  Mismatched Quotation
MB  Mismatched Brackets
KID  Keyword as Identifier
Acc  Accuracy
STD  Standard Deviation

References

  1. Gore, D.V.; Binoj, M.; Borate, S.; Devnani, R.; Gopale, S. Syntax Error Detection and Correction in Python Code using ML. Grenze Int. J. Eng. Technol. (GIJET) 2023, 9, 296. [Google Scholar]
  2. Alasmari, O.A.; Singer, J.; Bikanga Ada, M. Do current online coding tutorial systems address novice programmer difficulties? In Proceedings of the 15th International Conference on Education Technology and Computers, Barcelona, Spain, 26–28 September 2023; pp. 242–248. [Google Scholar]
  3. Aniche, M.; Maziero, E.; Durelli, R.; Durelli, V.H. The effectiveness of supervised machine learning algorithms in predicting software refactoring. IEEE Trans. Softw. Eng. 2020, 48, 1432–1450. [Google Scholar] [CrossRef]
  4. Zhu, Q.; Sun, Z.; Xiao, Y.a.; Zhang, W.; Yuan, K.; Xiong, Y.; Zhang, L. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 341–353. [Google Scholar]
  5. Wu, L.; Li, F.; Wu, Y.; Zheng, T. Ggf: A graph-based method for programming language syntax error correction. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Republic of Korea, 13–15 July 2020; pp. 139–148. [Google Scholar]
  6. Li, T.; Zhang, Y. Multilingual code refactoring detection based on deep learning. Expert Syst. Appl. 2024, 258, 125164. [Google Scholar] [CrossRef]
  7. Pandhare, H.V. From Test Case Design to Test Data Generation: How AI is Redefining QA Processes. Int. J. Eng. Comput. Sci. 2024, 13, 26737–26757. [Google Scholar] [CrossRef]
  8. Gao, S.; Gao, C.; He, Y.; Zeng, J.; Nie, L.; Xia, X.; Lyu, M. Code structure–guided transformer for source code summarization. ACM Trans. Softw. Eng. Methodol. 2023, 32, 1–32. [Google Scholar] [CrossRef]
  9. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  10. Liu, J.; Chen, Y.; Liu, M.; Peng, X.; Lou, Y. Stall+: Boosting llm-based repository-level code completion with static analysis. arXiv 2024, arXiv:2406.10018. [Google Scholar]
  11. Campos, V. Bug Detection and Localization using Pre-trained Code Language Models. In Proceedings of the INFORMATIK 2024, Wiesbaden, Germany, 24–26 September 2024; Gesellschaft für Informatik eV: Bonn, Germany, 2024; pp. 1419–1429. [Google Scholar]
  12. Li, Y.; Parsert, J.; Polgreen, E. Guiding enumerative program synthesis with large language models. In Proceedings of the International Conference on Computer Aided Verification, Montreal, QC, Canada, 24–27 July 2024; Springer: Cham, Switzerland, 2024; pp. 280–301. [Google Scholar]
  13. Kamradt, G. Needle In A Haystack—Pressure Testing LLMs. 2023. Available online: https://github.com/gkamradt/LLMTest_NeedleInAHaystack (accessed on 20 July 2025).
  14. Lee, H.; Sharma, S.; Hu, B. Bug in the code stack: Can llms find bugs in large python code stacks. arXiv 2024, arXiv:2406.15325. [Google Scholar] [CrossRef]
  15. Anthropic, A. Introducing the Next Generation of Claude. 2024. Available online: https://www.anthropic.com/news/claude-3-family (accessed on 1 July 2025).
  16. Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  17. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  18. xAI. Grok-2 Beta Release; xAI: San Francisco Bay, CA, USA, 2024. [Google Scholar]
  19. Rahman, M.M.; Watanobe, Y.; Nakamura, K. Source code assessment and classification based on estimated error probability using attentive LSTM language model and its application in programming education. Appl. Sci. 2020, 10, 2973. [Google Scholar] [CrossRef]
  20. Kanoutas, T.; Karanikiotis, T.; Symeonidis, A.L. Enhancing code readability through automated consistent formatting. Electronics 2024, 13, 2073. [Google Scholar] [CrossRef]
  21. Wang, J.; Li, L.; Liu, K.; Du, X. Detecting and Explaining Python Name Errors. Inf. Softw. Technol. 2025, 178, 107592. [Google Scholar] [CrossRef]
  22. Yasunaga, M.; Liang, P. Break-it-fix-it: Unsupervised learning for program repair. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 11941–11952. [Google Scholar]
  23. Ahmed, T.; Ledesma, N.R.; Devanbu, P. Synshine: Improved fixing of syntax errors. IEEE Trans. Softw. Eng. 2022, 49, 2169–2181. [Google Scholar] [CrossRef]
  24. Lutellier, T.; Pham, H.V.; Pang, L.; Li, Y.; Wei, M.; Tan, L. Coconut: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, 18–22 July 2020; pp. 101–114. [Google Scholar]
  25. Dinella, E.; Dai, H.; Li, Z.; Naik, M.; Song, L.; Wang, K. Hoppity: Learning graph transformations to detect and fix bugs in programs. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 26 April–1 May 2020. [Google Scholar]
  26. Berabi, B.; He, J.; Raychev, V.; Vechev, M. Tfix: Learning to fix coding errors with a text-to-text transformer. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 780–791. [Google Scholar]
  27. Phung, T.; Cambronero, J.; Gulwani, S.; Kohn, T.; Majumdar, R.; Singla, A.; Soares, G. Generating high-precision feedback for programming syntax errors using large language models. arXiv 2023, arXiv:2302.04662. [Google Scholar] [CrossRef]
  28. Yang, Z.; Wang, S.; Yan, Y.; Deng, Y. Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors. arXiv 2025, arXiv:2503.22388. [Google Scholar]
  29. Prenner, J.A.; Robbes, R. Automatic program repair with openai’s codex: Evaluating quixbugs. arXiv 2021, arXiv:2111.03922. [Google Scholar]
  30. Bisht, T. python-code-instructions-18k-alpaca. 2023. Available online: https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca (accessed on 20 July 2025).
  31. Rajaraman, N.; Jiao, J.; Ramchandran, K. Toward a theory of tokenization in llms. arXiv 2024, arXiv:2404.08335. [Google Scholar] [CrossRef]
  32. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  33. Bansal, P. Prompt engineering importance and applicability with generative AI. J. Comput. Commun. 2024, 12, 14–23. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
