1. Introduction
Programming is among the most critical skills in computing and software engineering. As a consequence, programming education has received ever-increasing attention. Many educational institutions (universities, colleges, and professional schools) offer extensive programming education to enhance the programming skills of their students. Indeed, programming has become recognized as a core literacy [1]. Programming skills are developed primarily through repetitive practice, and many universities [2,3,4,5] have created their own programming learning platforms to facilitate such practice by their students. These platforms are often used for programming competitions and serve as automated assessment tools for programming courses [6].
Novice programmers tend to have difficulty developing and debugging source code due to errors of various types (especially logical errors) and the inability of conventional compilers to detect them [7,8].
Example 1. Consider a simple program that takes an integer input n from the keyboard and outputs the sum s of the integers from 1 through n. The solution code below implements this procedure in the C programming language and is compiled with a conventional compiler. After compiling, input n = 6 correctly produces the sum s = 21; similarly, input n = 7 produces s = 28.
#include <stdio.h>
int main(){
    int j, l, totalsum = 0;
    printf("Give a number: ");
    scanf("%d", &l);
    for (j = 1; j <= l; j++){
        totalsum = totalsum + j;
    }
    printf("Total sum of 1 to %d is: %d\n", l, totalsum);
    return (0);
}
Now consider the code below, in which a novice programmer has made a small logic error; the compiler nevertheless compiles and executes the program normally and generates output, which in this case is incorrect. Specifically, input n = 6 produces the sum s = 15, and input n = 7 produces s = 21.
#include <stdio.h>
int main(){
    int j, l, totalsum = 0;
    printf("Give a number: ");
    scanf("%d", &l);
    for (j = 1; j < l; j++){
        totalsum = totalsum + j;
    }
    printf("Total sum of 1 to %d is: %d\n", l, totalsum);
    return (0);
}
No compiler can detect the coding error here: the loop condition j < l stops one iteration early, so the program computes the sum of 1 through n - 1 (15 rather than 21 for n = 6). In more complex programs, such logic errors can be difficult to resolve. Logic errors of this kind, such as forgetting the "= 0" initialization of totalsum in the above example, are not uncommon, and even experienced programmers make errors in source code [9]. It is widely accepted that many known and unknown errors go unrecognized by conventional compilers, which means that programmers often spend valuable time identifying and fixing these errors. To help programmers, especially novice programmers, deal with such source code errors quickly and efficiently, research on the issue is being actively conducted in programming education [10,11].
A variety of methods for source code analysis and software engineering have been proposed, such as source code classification [12,13], code clone detection [14,15], defect prediction [16], program repair [17,18], and code completion [19,20]. Recently, natural language processing (NLP) has been used in a number of domains, including speech recognition, language understanding, and machine translation. Commonly used NLP language models include bi-gram, tri-gram, skip-gram, and GloVe [21]. However, while these models may be useful for relatively short, simple code, they are considerably less effective for long, complex code. Today, deep neural network models are used for language modeling due to their ability to consider long input sequences, and deep neural network-based language models are being developed for source code bug detection, logic error detection, and code completion [20,22,23,24,25]. Recurrent neural networks (RNNs) have been used but are limited by vanishing or exploding gradients [26]. Long short-term memory (LSTM) networks overcome this problem.
LSTM neural networks consider only previous input sequences when producing a prediction or output. However, the functions, classes, methods, and variables of a source code may depend on both previous and subsequent code sections or lines, and in such cases LSTM may not produce optimal results. To fill this gap, we propose a bidirectional LSTM (hereafter BiLSTM) language model to evaluate and repair source codes. A BiLSTM neural network can combine both past and future code sequences to produce output [27]. In constructing and applying our model, we first perform a series of pre-processing tasks on the source code and encode the code as a sequence of IDs. Next, we train the BiLSTM neural network on the encoded source codes. Finally, the trained BiLSTM model is used for source code evaluation and repair. Our proposed model can be used in different systems where problems (questions), submission forms (editors), and automatic assessment are involved (e.g., online judges, or program/software development where specifications and input/output are well defined). We plan to use the proposed model in an intelligent coding environment (ICE) [28] via an API (application programming interface); ICE is one example of many such services. Although many powerful IDEs with intelligent (e.g., grammatical) support are available, our model, which can be applied to online judge-type systems, can provide much smarter feedback than conventional IDEs by identifying logical errors.
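The pre-processing and ID-encoding steps described above can be sketched as follows. This is a simplified illustration in Python; the regular-expression tokenizer and the growing integer vocabulary are assumptions made here for illustration, not the exact scheme used in the paper.

```python
import re

def tokenize(source):
    """Split C source code into a flat token list (simplified:
    identifiers/keywords, integer literals, and single symbols)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def encode(tokens, vocab):
    """Map each token to an integer ID, growing the vocabulary as
    new tokens are seen; repeated tokens share the same ID."""
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1  # IDs start at 1
        ids.append(vocab[tok])
    return ids

code = "for (j = 1; j <= l; j++) totalsum = totalsum + j;"
vocab = {}
tokens = tokenize(code)
ids = encode(tokens, vocab)
```

The resulting ID sequence is what a sequence model would consume; each occurrence of the same identifier (e.g., j) maps to the same integer.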
The main contributions of our work are summarized below:
The proposed BiLSTM language model for code evaluation and repair can effectively detect errors (including logical errors) and suggest corrections for incorrect code.
Application of the proposed model to real-world solution codes collected from the Aizu Online Judge (AOJ) system produced experimental results that indicate superior performance in comparison to other approaches.
The BiLSTM model can be helpful to students, programmers (especially novice programmers), and professionals, who often struggle to resolve code errors.
The model accelerates the code evaluation process.
The proposed model can be used for different real-world programming learning and software engineering related systems and services.
The remainder of the article is organized as follows:
Section 2 presents related works,
Section 3 describes the approach,
Section 4 presents experimental results,
Section 5 points out limitations of the model, and
Section 6 offers conclusions and suggestions for future development.
2. Related Works
The wide range of application domains and the functionality of deep neural networks make them powerful and appealing. Recently, machine learning (ML) techniques have been used to solve complex programming-related problems. Accordingly, researchers have begun to focus on the development and application of deep neural network-based models in programming education and software engineering.
As noted in Reference [7], logic errors (LEs) are a type of error that persists after compilation, whereas typical compilers can only detect syntax and semantic errors. That paper proposed a practical approach to identifying logic errors in object-oriented environments (i.e., the C# .NET Framework). The proposed Object Behavior Environment (OBEnvironment) helps programmers avoid logical errors based on predefined behaviors by using Alsing, Xceed, and Mind Fusion components. This approach differs from our proposed BiLSTM model in that it was developed specifically for the C# language in the .NET Framework. Al-Ashwal et al. [8] introduced a CASE (computer-aided software engineering) tool to identify logical errors in Java programs using both dynamic and static methods. Programmers often have difficulty identifying logical errors during testing; sometimes it is necessary to manually check the whole code, which takes a large amount of time and effort. The authors used the PMD and JUnit tools to identify logical errors on the basis of a list of common Java-related logic errors. This approach is effective only for identifying logical errors in Java programs and does not extend to other programming languages.
In article [29], an automated logical error detection technique is proposed for functional programming assignments, since substantial manual effort is otherwise required to identify logical errors through test cases. The technique uses a reference solution for each assignment (written in the OCaml programming language) to create a counter-example that captures the semantic differences between a student's program and the reference. The method identified 88 logical errors that mature test cases had missed, and the technique can also be effective for automatic code repair. Its disadvantage is that a reference program is needed to identify the logical errors in each incorrect code. In [30], the authors surveyed a large number of research papers on probabilistic models of programming languages and natural languages and described how researchers have adapted these models to various application domains. Raychev et al. [19] addressed code completion by adopting an n-gram language model and an RNN; their model was quick and effective in code completion tasks. Allamanis et al. [31] proposed a neural stochastic language model to suggest method and class names in source codes. The model analyzes the meaning of code tokens before making its suggestions and achieved notable success in method, class, and variable naming tasks.
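To illustrate the n-gram approach to code completion mentioned above, a minimal bigram model over code tokens can be sketched as follows. This is a toy sketch, not the cited authors' implementation; the token stream and the completion rule (most frequent successor) are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(token_stream):
    """Count bigram frequencies: how often each token follows another."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(token_stream, token_stream[1:]):
        counts[prev][nxt] += 1
    return counts

def complete(counts, prev_token):
    """Suggest the most frequent continuation of prev_token,
    or None if the token was never seen with a successor."""
    if prev_token not in counts:
        return None
    return counts[prev_token].most_common(1)[0][0]

# Toy training corpus of code tokens (invented for illustration).
tokens = ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";",
          "i", "++", ")", "{", "sum", "=", "sum", "+", "i", ";", "}"]
model = train_bigram(tokens)
```

With this model, `complete(model, "for")` suggests `"("`, reflecting the most common continuation seen during training; real systems use higher-order n-grams and smoothing over much larger corpora.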
In article [32], the authors proposed a model for predicting defective regions of source code on the basis of the code's semantics. The proposed deep belief network (DBN) was trained to learn semantic features of the code from token vectors derived from its abstract syntax tree (AST), since every source code contains method, class, and variable names that carry important information. Building on such semantic information, Pradel et al. [33] introduced a name-based bug detection model for code.
Song et al. [34] proposed a bidirectional LSTM model to detect malicious JavaScript. To obtain semantic information from the code, the authors first constructed a program dependency graph (PDG) for generating semantic slices; the PDG stores semantic information that is later used to create vectors. The approach achieved 97.71% accuracy, with an F1-score of 98.29%. In articles [22,24], the authors proposed LSTM-based models for source code bug detection, code completion, and classification, both aimed at developing the programming skills of novice programmers. Experimental results obtained by tuning various hyperparameters and network settings showed that both models achieved better bug detection and code completion results than other related models. In [20,23], the authors proposed error detection, logic error detection, and source code classification on the basis of an LSTM model; both approaches used an attention mechanism that enhanced model scalability, and both achieved significant success on various performance measures compared with more sophisticated models. As noted earlier, however, an LSTM-based model considers only previous input sequences for prediction and cannot consider future sequences. The proposed BiLSTM model can consider both past and future sequences for output prediction.
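The unidirectional/bidirectional distinction can be illustrated with a toy sketch in plain Python (a single-weight recurrence rather than an actual LSTM, with invented inputs): a unidirectional scan summarizes only the prefix at each position, while a bidirectional scan concatenates a forward state and a backward state so that every position is informed by both its left and right context.

```python
import math

def rnn_pass(seq, w=0.5):
    """Simple recurrent scan: each state summarizes the inputs seen so far."""
    states, h = [], 0.0
    for x in seq:
        h = math.tanh(w * h + x)  # toy recurrence with one scalar weight
        states.append(h)
    return states

def bidirectional_pass(seq):
    """Pair a left-to-right state with a right-to-left state at every
    position, so each output reflects the full sequence context."""
    fwd = rnn_pass(seq)
    bwd = rnn_pass(seq[::-1])[::-1]  # scan in reverse, then realign
    return list(zip(fwd, bwd))

states = bidirectional_pass([0.1, 0.9, -0.4])
```

In a real BiLSTM the scalar recurrence is replaced by gated LSTM cells and learned weight matrices, but the combination of the two directional states at each position is the same idea.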
In brief, researchers have proposed a number of novel and effective neural network and probabilistic models to solve problems related to source codes. The proposed BiLSTM model is unlike these models in that it considers both the previous and subsequent context of the code to detect errors and offer suggestions, enabling programmers and professionals to make the needed repairs efficiently.
6. Conclusions
It is generally recognized that conventional compilers and other code evaluation systems are unable to reliably detect logic errors and provide proper suggestions for code repair. While neural network-based language models can be effective in identifying errors, standard feedforward neural networks or unidirectional recurrent neural networks (RNNs) have proven insufficient for effective source code evaluation. There are many reasons for this, including code length and the fact that some errors depend on both previous and subsequent code lines. In this paper, we presented an efficient bidirectional LSTM (BiLSTM) neural network model for code evaluation and repair. Importantly, the BiLSTM model has the ability to consider both the previous and subsequent context of the code under evaluation. In developing the model, we first trained the BiLSTM model as a sequence-to-sequence language model using a large number of source codes. We then used the trained BiLSTM model for error detection and to provide suggestions for code repair. Experimental results showed that the BiLSTM model outperformed existing unidirectional LSTM and RNN neural network-based models. The CoM value of the BiLSTM model was approximately 50.88%, with an F-score of approximately 97%. The proposed BiLSTM model thus appears to be effective for detecting errors and providing relevant suggestions for code repair.
In the future, we plan to evaluate our model using larger datasets and different programming languages. We will also seek to optimize various model parameters to improve model performance. We will present real-world performances and experiences obtained from ICE, as well as case studies based on users’ feedback.