1. Introduction
Syntax errors rank among the most common types of programming errors encountered during software development. These errors occur when the written code does not conform to the grammatical rules of the programming language, such as a missing colon (MC), mismatched quotation (MQ), missing parenthesis (MP), or the use of a reserved word as a variable identifier [
1]. Although such errors may seem minor, they can seriously disrupt the development process by preventing the code from being properly compiled or interpreted, ultimately hindering the execution of the program. In large-scale software projects that involve thousands of lines of code, syntax errors can impede developers’ productivity, as tracking and fixing them often takes a lot of time and can be frustrating [
2]. With the advancement of integrated development environments (IDEs) and code analysis tools, new solutions—particularly those powered by artificial intelligence (AI) and machine learning (ML)—are emerging to detect and correct these errors more quickly and efficiently. These improvements help developers write higher-quality code and enhance the overall efficiency of the development process.
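For illustration, the following minimal Python sketch (the snippets are hypothetical examples, not drawn from any dataset used later in this paper) shows how a parser reports the three error types named above:

```python
import ast

# Hypothetical minimal examples of the error types mentioned above:
# missing colon (MC), mismatched quotation (MQ), and missing parenthesis (MP).
snippets = {
    "MC": "if x > 0\n    print(x)",   # the 'if' header lacks the required ':'
    "MQ": "message = \"hello'",       # the string opens with " but closes with '
    "MP": "print((1 + 2)",            # the opening parenthesis is never closed
}

for label, code in snippets.items():
    try:
        ast.parse(code)
        print(f"{label}: parsed without error")
    except SyntaxError as err:
        # The parser reports the offending line number and a short message.
        print(f"{label}: SyntaxError at line {err.lineno}: {err.msg}")
```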
Over the past few years, ML has emerged as a key tool in many areas of software development. For example, in software refactoring, ML algorithms are trained on massive datasets containing millions of actual refactorings from real-world projects. By learning from classes and methods that have undergone refactoring in practice, the resulting models can provide more reliable refactoring recommendations to developers [
3]. Likewise, ML models can improve the detection and correction of syntax errors in software development. Compared with traditional methods, such as compiler diagnostics that produce ambiguous or generic messages, ML-based solutions have the potential to provide more useful and automated suggestions. Trained on massive amounts of syntactically correct code, these models can recognize typical error patterns and suggest precise fixes [
4]. This type of automated assistance has the potential to save developers time and reduce the need for manual debugging.
Building on the foundations of ML, deep learning (DL) has significantly advanced the field of software development by enabling more intelligent and automated solutions to complex programming tasks. For example, in the field of bug detection and correction, neural networks can be trained on massive repositories of code to identify common and subtle programming errors. This approach often outperforms traditional static analysis tools. These DL models are well suited to capturing long-range dependencies, such as an MP or invalid logical operators, which can be difficult for rule-based systems to detect and are cumbersome for developers to address in very complex code [
5]. In addition to bug detection, DL is also used in other areas of software development, such as code refactoring [
6], test case generation [
7], and code summarization [
8]. These capabilities have revolutionized the development workflow, allowing developers to write more secure and efficient code with less manual effort.
As ML technologies advance, large language models (LLMs) have demonstrated remarkable potential in code-related applications. Originally developed for natural language processing, LLMs such as GPT [
9] have proven capable of comprehending and generating programming code by virtue of their ability to model both syntax and semantics. Their strength lies in learning from massive, diverse datasets, which allows them to perform tasks such as code completion [
10], bug detection [
11], and program synthesis [
12]. Recent benchmarks such as needle-in-a-haystack (NIAH) [
13] and bug in the code stack (BICS) [
14] evaluate how effectively these models can locate subtle bugs or retrieve relevant information across large codebases. As LLMs continue to improve, they are expected to play a more significant role in enhancing developer productivity through real-time suggestions and intelligent debugging assistance.
Several studies have examined the retrieval of small but critical pieces of information from very long texts, using the NIAH benchmark to evaluate LLMs on their ability to locate such information. While NIAH focuses on text retrieval, researchers have since adapted the idea to software by evaluating models on their capacity to detect errors in codebases. BICS provides a systematic analysis of LLM debugging performance, using Python code as the evaluation platform. The BICS benchmark injects syntax errors at different depths within long code stacks to evaluate GPT-4o, Claude 3 [
15], Gemini 1.5 [
16], and additional models on their ability to discover the error. The results demonstrate how longer contexts affect performance, with certain models declining noticeably as the code length increases.
The contribution of this paper is to investigate the effectiveness of various LLMs, such as DeepSeek [
17] and Grok [
18], in detecting syntax errors under varying code lengths, code depths, and prompt designs. It further explores the impact of prompt engineering on enhancing model performance. It also highlights the critical limitations in current LLMs for accurate syntax error detection and their effectiveness in a real-world setting. Finally, the paper provides insights and recommendations for future improvements in model capabilities.
2. Related Work
Several studies have addressed the use of ML techniques to identify code errors—particularly syntax-related issues [
19,
20,
21]. For example, through automatic code bug fixing with unsupervised learning, Yasunaga and Liang [
22] proposed break-it-fix-it (BIFI), a method that trains a fixer to correct bad code (e.g., code with syntax errors) and a breaker to generate realistic bad code, thereby iteratively improving both models without the need for labeled data. The authors highlight the limitations of existing approaches that rely on synthetic perturbations, which often fail to match real error distributions, and demonstrate BIFI’s effectiveness on two datasets: GitHub-Python and DeepFix. With a Transformer-based model, BIFI attained state-of-the-art results with 90.5% accuracy (Acc) on GitHub-Python and 71.7% on DeepFix, surpassing baseline methods such as back-translation. The paper also discusses how BIFI adapts to real-world error patterns by using a critic (e.g., a compiler) to verify corrections, relating the approach to unsupervised machine translation (MT) and domain adaptation.
To help beginners fix Java syntax errors, Ahmed et al. [
23] presented SynShine, a tool that combines compiler error messages (javac), a pre-trained language model (RoBERTa), and smart editing suggestions. The authors show how existing tools struggle with long code and unclear errors, while SynShine achieves 75% Acc on single-error fixes, beating prior tools like BF + FF by 18%. The key steps are: (1) use javac to narrow down the location of the bug, (2) use RoBERTa to learn a representation of the code, and (3) generate fixes through simple “insert/delete/replace” operations instead of complicated code rewriting. Packaged as a VSCode extension, SynShine runs quickly even on modest hardware.
Lutellier et al. [
24] presented CoCoNuT (combining context-aware neural translation), an automatic program repair (APR) technique that fixes bugs in several programming languages by combining context-aware neural machine translation (NMT) with ensemble learning. By pairing convolutional neural networks (CNNs) for hierarchical feature extraction with a dual-encoder NMT architecture that evaluates the buggy code and its context separately, the authors overcome drawbacks of conventional APR techniques, such as their dependence on hard-coded rules. Their method correctly fixes 509 bugs (including 309 that were previously unfixed) in Java, C, Python, and JavaScript benchmarks, outperforming 27 state-of-the-art techniques. Although attention maps aid interpretability, limitations include limited generalizability beyond the tested benchmarks and the computational cost of training.
Hoppity, a learning-based tool for detecting and fixing JavaScript bugs using graph transformations, is presented by Dinella et al. [
25]. The authors’ novel contribution is to model bug patches as a sequence of graph edits (such as adding, removing, or altering nodes in a program’s abstract syntax tree (AST)). Unlike prior work, Hoppity operates end-to-end, locating bugs and producing fixes without human intervention, and it handles complicated bugs that require structural modifications. The model, trained on 290,715 JavaScript code-change commits from GitHub, combines an LSTM-based controller that predicts edits with a graph neural network (GNN) for program representation. According to the results, Hoppity outperforms baselines such as gated graph neural networks (GGNN) and SequenceR, correctly fixing 9490 out of 36,361 programs. Limitations include scalability to large ASTs and reliance on beam search for inference.
In the field of bug fixing, tools struggle to accurately cover a wide range of bugs due to the complexity and size of modern codebases. To address this challenge, Berabi et al. [
26] introduced TFix, a learning-based system that frames bug fixing as a text-to-text task, enabling it to operate directly on the program text and avoid complex code representations. They leveraged a powerful T5-based Transformer model, pre-trained on natural language and then fine-tuned to generate code fixes using a large, high-quality dataset of 5.5 million commits from GitHub. The model was fine-tuned jointly on all 52 bug types to enhance knowledge transfer. The findings demonstrate that TFix works well in practice, producing fixes in roughly 67% of instances and surpassing other learning-based techniques such as SequenceR, CoCoNuT, and Hoppity by a significant margin. Limitations include the fact that the exact-match metric represents a lower bound on Acc, since there may be multiple valid ways to fix an error; future work is proposed to extend TFix to other languages and to more complex error types where error localization is difficult.
Recently, the focus has started to shift from traditional ML approaches to LLMs, which offer broader capabilities in detecting and fixing bugs in code. For example, Phung et al. [
27] explore how LLMs, in particular OpenAI’s Codex, might improve programming education by automating the generation of feedback for syntax errors in Python scripts. The authors developed a method called PyFiXV, which aims to support educators in monitoring the quality of the feedback that learners receive. They point out the shortcomings of error messages produced by conventional programming environments and highlight the difficulties that inexperienced programmers face when they encounter syntax errors.
Applying the naturalness hypothesis to pre-trained code LLMs, Campos et al. [
11] introduced a lightweight method for bug detection and localization in source code based on token-level likelihood scores. The authors hypothesize that LLMs, trained predominantly on correct code, assign lower likelihoods to buggy code, which they formalize through two metrics: the likelihood score (mean token probability) and the likelihood ratio (a comparison to the model’s preferred token). Evaluating on the public HumanEvalFix benchmark as well as the private Subato benchmark of student submissions, the authors show that smaller models such as Code Llama 7B can reach 78% Acc on bug detection, with token-level error localization at 79% (top three tokens). However, the study points out important shortcomings, such as performance drops due to overfitting, data leakage, unfamiliar coding styles, and other peculiarities.
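As a rough illustration of this idea (a sketch only; the exact formalization in [11] may differ), the two metrics can be computed from per-token probabilities, which are hypothetical values here:

```python
# Hypothetical per-token data for one line of code: the probability the model
# assigned to the observed token, and the probability of the model's single
# most likely (preferred) token at the same position.
observed_probs = [0.91, 0.88, 0.05, 0.93]    # the low value at index 2 hints at a bug
preferred_probs = [0.92, 0.90, 0.85, 0.95]

# Likelihood score: mean probability of the observed tokens.
likelihood_score = sum(observed_probs) / len(observed_probs)

# Likelihood ratio: average ratio of observed to preferred token probability
# (values near 1 mean the code looks "natural" to the model).
likelihood_ratio = sum(o / p for o, p in zip(observed_probs, preferred_probs)) / len(observed_probs)

print(f"likelihood score: {likelihood_score:.3f}")
print(f"likelihood ratio: {likelihood_ratio:.3f}")
```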
Addressing runtime and logical errors in data science code, Yang et al. [
28] introduced the data science debugging benchmark (DSDBench), which is specifically designed to evaluate how well LLMs debug data science code containing multiple bugs and multi-hop logical errors. The authors built DSDBench using code samples from existing benchmarks such as DABench and MatPlotBench, injecting both single- and multi-error scenarios through LLM-based methods and carefully annotating cause-effect lines and error messages. They tested several state-of-the-art models, including GPT-4o, Claude, and DeepSeek-R1, and found that while models perform reasonably well on single-bug detection, their Acc drops significantly in multi-bug cases, especially when tracing the root cause across multiple lines of code. The benchmark highlights how existing LLMs struggle with real-world data science debugging tasks and emphasizes that more work is needed to enhance LLMs’ reasoning skills for intricate, multi-file codebases.
Prenner and Robbes [
29] take a closer look at whether Codex can fix bugs in software without needing specific training for that task. They run Codex on 40 buggy programs from the QuixBugs benchmark, written in Python and Java, using different prompt styles like code-only, code-with-hint, code-with-docstring, input-output examples, and others. They show that Codex is able to repair 23 Python bugs (outperforming CoCoNuT and DeepDebug) and 14 Java bugs (though with lower precision in Java), even though the model has not been trained explicitly for APR. Major limitations include possible data leakage (Codex may have seen QuixBugs during training) and the reliance on manual evaluation. The paper demonstrates the potential of Codex for APR, but suggests that further exploration is needed through automation, parameter fine-tuning, and wider benchmarking.
Lee et al. [
14] evaluated the ability of LLMs to detect simple syntactic errors in large Python code structures, where information retrieval in software environments faces significantly greater challenges than in text-based environments. To address this gap, the researchers introduce a new benchmark called BICS, the first of its kind to specifically target long-context code debugging capabilities. BICS is constructed by compiling clean Python code snippets from the Alpaca dataset, after removing existing errors with a static code analyzer, to form “code stacks” of varying token lengths ranging from 500 to 16,000 tokens. One syntactic error, drawn from seven specific types, is then injected into each stack at one of five target depths (0%, 25%, 50%, 75%, and 100%). Eleven leading LLMs (such as GPT-4o and Claude 3.5 Sonnet) were evaluated by asking them to accurately identify the line number and error type, which requires a deep understanding of the code structure. The results revealed that code environments pose significantly greater challenges for retrieval tasks than text-based environments, with significant variation in performance between models. GPT-4o and GPT-4-Turbo performed the best, with GPT-4o achieving an average accuracy (Avg. Acc) of 81.2%. A significant degradation in performance was also observed with increasing context length, and performance varied with error depth and type, with the MC and MP errors being the most easily retrievable. The benchmark’s current shortcomings include its emphasis on basic syntactic errors; the authors suggest broadening its scope in future work to encompass runtime errors, warnings, and a variety of programming languages, including C++ and Rust.
The innovation of this study lies in its systematic exploration of how different LLMs handle syntax error detection in long and complex code structures. While earlier research typically evaluated model performance using a single prompting strategy, this study broadens the scope by examining multiple LLMs across varied prompt designs, code lengths, and error depths. This approach enables a more comprehensive assessment of model capabilities and robustness, better reflecting real-world programming scenarios. By combining model-level comparisons with analyses of prompt strategies, the work not only identifies performance gaps but also uncovers critical limitations in stability and consistency, offering a nuanced understanding of LLM capabilities in software debugging contexts.
4. Experiment
In this experiment, we tested the models at different lengths and depths. Length refers to the desired size of the combined code snippets, specifically the total token count of the final concatenated string, and was tested with values [500, 1k, 2k, 4k, 8k, 16k], measured using a tokenizer to ensure the output aligns with LLM input limits and the various testing scenarios. Depth, tested at [0%, 25%, 50%, 75%, 100%], indicates the proportion of that target length filled with initial correct code before a specific error is inserted, thus controlling the relative position of the error within the overall structure. This is how the haystack was built and the needle (syntax error) inserted. Together, these parameters enable us to investigate a range of conditions by adjusting the total code length and the error’s placement. For each prompting strategy, we conducted 25 test iterations for each combination of code length (e.g., 8k tokens) and error depth (e.g., 50% depth), resulting in 750 total test runs per model and prompt strategy (6 lengths × 5 depths × 25 iterations). This process systematically combined these values and repeated the procedure multiple times per combination to gather thorough and reliable insights into their impact. For example, when the length is 1k and the depth is 50%, the combination was tested 25 times, each time with a different haystack and error type. Each time, the model was fed a prompt containing the haystack with the injected error, ensuring that the entire prompt remained within the model’s context window. All experiments were conducted using Google Colab and Python 3.12.11 to ensure consistent execution across models and test iterations.
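A minimal sketch of this construction is shown below. It assumes a pool of clean snippets and a tiktoken-style tokenizer; the function and variable names are illustrative rather than the exact scripts used in this study:

```python
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer; any tokenizer with encode() works

def build_haystack(clean_snippets, target_tokens, depth_pct, inject_error):
    """Concatenate clean snippets until target_tokens is reached, then insert
    one syntax error at roughly depth_pct of the resulting code."""
    parts, n_tokens = [], 0
    while n_tokens < target_tokens:
        snippet = random.choice(clean_snippets)
        parts.append(snippet)
        n_tokens += len(enc.encode(snippet))
    lines = "\n".join(parts).splitlines()
    # Line index corresponding to the requested depth (0% = start, 100% = end).
    target_line = min(int(len(lines) * depth_pct / 100), len(lines) - 1)
    lines[target_line] = inject_error(lines[target_line])
    return "\n".join(lines), target_line

# Example: a 1k-token haystack with a missing-colon (MC) error at 50% depth.
# haystack, error_line = build_haystack(clean_snippets, 1000, 50,
#                                       lambda line: line.rstrip(":"))
```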
The performance of the models was evaluated based on three main metrics: Acc, standard deviation (STD), and execution time. Acc measures how often the model predicted correctly; it is the proportion of correct predictions (true positives (TP) and true negatives (TN)) out of all predictions (TP, TN, false positives (FP), and false negatives (FN)), as presented in Equation (
1).
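In symbols, this corresponds to the standard accuracy formula:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$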
STD measures how consistently the model predicts correctly. It is a statistical metric that quantifies the degree to which individual predictions deviate from the average prediction. A lower STD indicates more consistent predictions, whereas a higher STD reflects greater variability. It is calculated as the square root of the average of the squared differences between each prediction and the mean, where $x_i$ represents each data point, $\bar{x}$ is the mean of the data, and $N$ is the total number of data points, as presented in Equation (2).
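In symbols, this corresponds to the standard formula for the standard deviation:

$$\mathrm{STD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$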
Execution time refers to the amount of time it takes for a model to complete a specific task or prediction. It is a crucial performance metric, particularly when comparing models in terms of efficiency and usability in real-time applications. A shorter execution time indicates a faster and more efficient model, which is beneficial for time-sensitive applications. In this evaluation, the time taken by each model during testing was measured and compared to assess their computational efficiency.
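As a minimal sketch of how per-run execution time might be recorded (query_model is a placeholder for the actual API call, not a function from this study):

```python
import time

def timed_run(query_model, prompt):
    """Call the model once and return its response and the elapsed time in seconds."""
    start = time.perf_counter()
    response = query_model(prompt)  # placeholder for the actual model/API call
    elapsed = time.perf_counter() - start
    return response, elapsed

# Summing the elapsed times over all test iterations yields the total runtime
# per model, comparable to the HH:MM:SS figures reported in Table 7.
```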
6. Discussion
The experiment demonstrated that DeepSeek models consistently outperformed Grok-2-Latest at detecting syntax errors in code snippets of various lengths and depths. With two-shot prompting, DeepSeek models achieved between 83.3% and 83.6% Avg. Acc, and adding role-based prompting on top of two-shot prompting brought further improvement, with DeepSeek-Reasoner reaching the highest value at 86.6%. Grok-2-Latest, however, lagged far behind with an Avg. Acc of only 50.2%, which suggests that it may not be well optimized for structured code understanding or syntax-level tasks.
Though all tested models support extended context lengths of 64k to 128k tokens, the results reveal a dominant trend: performance decreases as token length increases. For example, DeepSeek-Chat showed a drop in Acc from 85.6% at 500 tokens to 72.0% at 16k tokens under two-shot prompting, demonstrating potential limits in attention span or memory. An exception was GPT-4o [
14], which maintained a relatively stable performance across the different lengths, scoring an Avg. Acc of 81.2% with a low variation of ±5.5, thus making it a strong baseline.
The use of role-based prompting significantly improved detection rates for the various models, particularly DeepSeek-Reasoner, but at the expense of much longer runtimes. As indicated in
Table 7, DeepSeek-Reasoner’s runtime nearly doubled, from 05:49:58 to 10:19:10, making the trade-off between Acc and computational efficiency evident. This overhead can be attributed partly to the longer prompts introduced by role-based instructions and partly to the model’s internal reasoning process, which may expand under role-based prompting. Such a trade-off is an important consideration in real-world settings, where runtime constraints tend to limit practical deployment.
Despite these gains, even the top-performing models struggled with certain error types, such as MQo, which shows that some syntax patterns remain difficult for LLMs to capture regardless of the prompting approach. Subtle syntax errors likely do not strongly disrupt the surrounding context, making them less noticeable to the model. These persistent weaknesses also illustrate the value of additional targeted training or fine-tuning on rare or structurally complex syntax errors.
While Avg. Acc indicates peak model performance, high STD values reveal variability and potential instability. For example, the DeepSeek models achieve high Avg. Acc but show large STD, suggesting that their performance is less predictable across scenarios. In contrast, GPT-4o demonstrates lower STD, reflecting more stable and reliable behavior. Considering both Acc and STD allows a more comprehensive evaluation, highlighting the trade-off between peak performance and consistency in model behavior.
Overall, the findings demonstrate that DeepSeek models possess a great capacity for syntax error detection, particularly for short or comparatively complex inputs, and GPT-4o has a balanced performance and stability. However, practical considerations such as runtime, input length sensitivity, and vulnerability to specific errors must be addressed to enable successful application in real-time or large-scale coding environments.
To address potential data contamination and the limited diversity of error types in the original test set, we constructed an additional self-organized test set that includes novel error types. This dataset was deliberately kept simple in structure, yet distinct from the Alpaca-based evaluation; in this way, we ensured that the tested errors were not trivially memorized during pre-training. Specifically, we focused on three new types of syntax errors: indentation error, which arises when code blocks are not properly indented according to Python’s syntax rules; invalid assignment, which occurs when assignment is attempted to an invalid target such as a number; and missing as, a frequent error in Python exception handling or context management statements in which the as keyword is omitted. With this dataset, we evaluated the two top-performing models under each prompting strategy: DeepSeek-Coder and DeepSeek-Reasoner. For this additional evaluation, the setup was intentionally lightweight: for each combination of code length and error depth, only three iterations were generated and tested, ensuring a manageable yet illustrative assessment. The resulting analysis provides further evidence of the models’ robustness when facing error types beyond the commonly tested ones, thus offering a more rigorous and diversified assessment of their syntax error detection capabilities.
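For illustration, the following runnable check (using hypothetical minimal snippets, not items from the actual self-organized test set) shows how Python’s parser reacts to the three new error types:

```python
import ast

# Hypothetical minimal examples of the three new error types.
new_errors = {
    "indentation error": "def f():\nreturn 1",                       # body not indented
    "invalid assignment": "5 = x",                                   # assigning to a literal
    "missing as": "try:\n    pass\nexcept ValueError e:\n    pass",  # 'as' keyword omitted
}

for label, code in new_errors.items():
    try:
        ast.parse(code)
        print(f"{label}: parsed without error")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        print(f"{label}: {type(err).__name__} at line {err.lineno}: {err.msg}")
```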
The results on the self-organized test set indicate that DeepSeek-Reasoner consistently outperformed DeepSeek-Coder across most code lengths and error types. In terms of overall Acc across token lengths, DeepSeek-Reasoner achieved an Avg. Acc of 81.1%, with a peak of 100% at 8k tokens and a low of 60% at 16k tokens, whereas DeepSeek-Coder averaged 76.6%, ranging from 53.3% at 4k tokens to 86.6% at 16k tokens. Examining performance by error type, DeepSeek-Reasoner performed best on missing as at 90.0%, followed by indentation error at 83.2%, and struggled most with invalid assignment at 70.1%. DeepSeek-Coder achieved perfect Acc on missing as at 100%, moderate performance on invalid assignment at 75.5%, and was weakest on indentation error at 54.3%. These results highlight that while both models can handle novel syntax errors, DeepSeek-Reasoner demonstrates more consistent robustness, particularly across error types and varying code lengths.
Table 8 and
Table 9 present the Acc of the models for each code length and error type, respectively.
7. Conclusions
The goal of this research is to assess how effectively LLMs can identify and categorize syntax errors in large Python codebases, measuring their performance under varying input lengths, error depths, error types, and prompting techniques. The results highlight both the strengths and limitations of existing LLMs in identifying syntax errors in large-scale Python code. DeepSeek-Reasoner reported the best Avg. Acc among the models tested, especially when role-based prompting was combined with two-shot prompting, indicating the usefulness of contextual augmentation for model guidance.

While the DeepSeek models performed well in syntax error detection, several limitations were observed. Firstly, all models were vulnerable to longer input sequences, with Acc declining as context length increased despite their large context window capacities. This suggests that effective utilization of longer contexts remains an open problem. In addition, the accuracy gains achieved with role-based prompting came at the expense of a notable runtime increase, especially for DeepSeek-Reasoner, which can restrict its real-world usage to settings where time and resources are not limited. Another notable limitation is the inconsistent performance on certain error types, specifically MQo, which persisted across models and prompting strategies. This indicates that current LLMs may still lack robust internal representations for handling rare or structurally complex syntax patterns. Therefore, while the models are effective in controlled settings, their deployment in real-world coding environments requires further optimization to address issues related to efficiency, generalization, and error-specific handling.

These findings open up avenues for additional research in AI-driven software maintenance and in building more robust models that can deal with large-scale, complicated programming environments. Future research could expand the analysis to codebases in languages other than Python, or increase the input size to evaluate performance at the scale of complete production-level repositories. Future work may also involve testing additional LLMs to explore how newer or alternative models perform in syntax error detection tasks. These directions would provide a fuller picture of model robustness and generalizability across a variety of software development scenarios.