An Empirical Study on Software Defect Prediction Using CodeBERT Model

Abstract: Deep learning-based software defect prediction has become popular in recent years. Recently, the release of the CodeBERT model has made it possible to perform many software engineering tasks with a pre-trained programming language model. We propose various CodeBERT models targeting software defect prediction, including CodeBERT-NT, CodeBERT-PT, CodeBERT-PS, and CodeBERT-PK.


Introduction
As modern software grows more complex, it is of great importance to ensure software reliability. Up to now, the most practical way of building highly reliable software has been extensive testing and debugging. Therefore, software defect prediction, a technique to predict defects in software artifacts, has gained popularity because it lessens the burden on developers by helping them prioritize their testing and debugging efforts [1].
For decades, hand-crafted metrics have been used in software defect prediction. Since AlexNet [2], deep learning has been growing rapidly in image recognition, speech recognition, and natural language processing [3]. The same trend also appears in software defect prediction because deep learning models are more capable of extracting information from long texts, i.e., source code. Instead of using hand-crafted metrics that are designed top-down, deep learning models are able to generate code features bottom-up from source code and can describe both syntactic and semantic information. Many researchers use various kinds of deep learning models, e.g., Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) models, and Transformers, for software defect prediction and achieve promising results. However, due to the limited dataset size in software defect prediction (compared to the massive amount of source code directly available in open source repositories), it is hard to believe that a deep learning model trained for software defect prediction can really "understand" the source code itself. Therefore, two problems arise. Can we use a model that "understands" source code for software defect prediction? How should we use such a model for software defect prediction?
The language model, a core concept in natural language processing, is gaining popularity. The reason is that, given the abundance of natural language corpora and the difficulty of obtaining labels for prediction tasks, it is more economically efficient to fully leverage the original corpora in an unsupervised fashion before the language model is later used in other prediction tasks. There are many popular language models, such as Word2Vec [26] and GloVe [27]. In 2019, BERT [28] improved natural language pre-training using mask-based objectives and a Transformer-based architecture. BERT has successfully improved many state-of-the-art results for various natural language tasks.
Along with the development of natural language models, recent research also shows great success in applying natural language models to artificial languages, especially programming languages. Such programming language models mainly embed local and global context [29], abstract syntax trees (ASTs) [30,31], AST paths [32], memory heap graphs [33], and the combination of ASTs and data flows [34,35]. CodeBERT [36] is one of the latest released powerful language models; it is a Transformer pre-trained with a BERT-like architecture on open source repositories. It can support paired natural language and multi-lingual programming language tasks, such as code search and code documentation generation. Despite the great success of fine-tuning CodeBERT on downstream tasks, it remains unclear whether CodeBERT could improve results in software defect prediction.
In this paper, we investigate the feasibility of using the CodeBERT model for software defect prediction. Specifically, we propose four models based on CodeBERT: CodeBERT-NT, CodeBERT-PT, CodeBERT-PS, and CodeBERT-PK. We perform experiments to investigate the performance of CodeBERT-based models that "understand" source code syntax and semantics, and we design new prediction patterns that are specially tailored to pre-trained language models like CodeBERT. The experimental data are available at https://gitee.com/penguinc/applsci-code-bert-defect-prediciton (accessed on 23 May 2021).
Our contributions are as follows:
• We are the first to introduce pre-trained programming language models for software defect prediction.
• We propose new forms of prediction patterns specially designed for pre-trained language models in software defect prediction.
• We discuss the reason why the new forms of prediction patterns work in software defect prediction.

Deep Learning Based Software Defect Prediction
Figure 1 shows a typical workflow of deep learning-based software defect prediction. The first step is to extract software modules from open-source repositories. A software module could be a method, a class, a file, a code change, etc. The second step is to mark software modules as buggy/clean. The bug information is extracted from post-release defects recorded in bug tracking systems, e.g., Bugzilla. If a software module contains bugs found in later releases, the module is marked as buggy. The third step is to extract code features from software modules. In deep learning-based software defect prediction, typical code features include character-based, token-based, AST-node-based, AST-tree-based, AST-path-based, and AST-graph-based features. The fourth step is to build a deep learning model to generate features and train on the instances. Frequently used deep learning models in software defect prediction include CNN, LSTM, Transformers, etc. The last step is to use the trained deep learning model for inference, i.e., to predict whether a software module is buggy or clean.
Deep learning-based software defect prediction can be organized into two categories.
Software defect prediction based on deep learning and hand-crafted features. Deep learning excels at combining hand-crafted features. Therefore, hand-crafted features can be fed into deep learning models to improve prediction performance. Yang et al. [37] are the first to use fully connected neural networks for hand-crafted feature combinations. Other deep learning models include autoencoders [38,39] and ladder networks [39]. The results show that deep representations of hand-crafted features could further improve prediction performance compared to experiments using traditional machine learning models and hand-crafted features.
Software defect prediction based on deep learning and generated features. With the rapid progress in the natural language processing domain, deep learning models are capable of processing long texts, e.g., source code. Unlike hand-crafted features designed top-down, these features are generated bottom-up from the source code and can represent structural and semantic information of the source code. Recently, many deep learning models, including Deep Belief Networks [4,15], CNN [5-7,10,17,19,24,25], LSTM [11,12,14,16,18], Transformers [8], and other deep learning models [13,22], have been used in software defect prediction.
Researchers also explore more efficient source code representations. Most researchers use AST sequences to represent source code [4,5,7,17], which balance the represented information density and the training difficulty of deep learning models given insufficient datasets. Some researchers also present representations based on AST paths, e.g., PathPair2Vec [9] and Code2Vec [32]. AST paths can extract node-path information that is not captured in AST sequences, at the cost of an explosion in the number of paths and training effort. Other researchers explore context information based on AST sequences to focus on defect characteristics [19].

Deep Transfer Learning
In some domains, including software defect prediction, it is challenging to construct a large-scale, high-quality dataset due to the high cost of labeling data samples. Transfer learning, which assumes that training data do not have to be independent and identically distributed (i.i.d.) with the test data, can mitigate the problem of insufficient training data [40].
The definition of transfer learning is given by Tan et al. [41] as follows. Given a learning task T_t based on a target domain D_t, we can get help from a source domain D_s and its learning task T_s. Transfer learning aims to improve the performance of the predictive function f_T(·) for the learning task T_t by discovering and transferring latent knowledge from D_s and T_s, where D_s ≠ D_t and/or T_s ≠ T_t. In addition, in most cases, the size of D_s is much larger than the size of D_t, i.e., N_s ≫ N_t. Deep transfer learning is a special form of transfer learning that uses a non-linear deep learning model [41]. Deep transfer learning can be organized into four categories: instance-based deep transfer learning, which reweights instances in the source domain; mapping-based deep transfer learning, which maps instances from the two domains into a new data space with better similarity; network-based deep transfer learning, which reuses parts of the network pre-trained in the source domain; and adversarial-based deep transfer learning, which uses adversarial learning to find transferable features that are suitable for both domains. Among the four methods, network-based deep transfer learning is the most widely adopted by researchers and has been practically used in many domains. For example, researchers can download pre-trained models published by large IT companies such as Google and Facebook and fine-tune them via deep transfer learning to adapt to downstream tasks. More recently, pre-training language models with large amounts of unlabeled data and fine-tuning them on downstream tasks has led to breakthroughs in the natural language processing domain, such as OpenAI GPT and BERT [28].
A typical process of network-based deep transfer learning is shown in Figure 2. First, a pre-trained deep learning model is downloaded and made available for fine-tuning. The weights are preserved in the pre-trained model. Second, the last layers of the pre-trained deep learning model are identified, and the existing weights of these layers are removed. It is assumed that a pre-trained model contains two parts: general knowledge embedding layers and classification layers that target a specific upstream task, e.g., a masked language model for text processing. It is generally accepted to reset the weights of the classification layers so that the model retains common knowledge while it can be generalized to downstream tasks, e.g., text classification. Third, the downstream training data, which are usually smaller than the dataset used to train the pre-trained model by up to several orders of magnitude, are fed to the model; this is called the fine-tuning process. During this process, the model learns how to perform downstream prediction tasks based on knowledge from both the pre-trained model and the downstream training data. Last, the fine-tuned model is available for downstream prediction tasks.
Deep transfer learning requires that the upstream and downstream tasks are transferable. For example, suppose we use a model pre-trained on text generation (e.g., text corpora from Wikipedia) for text sentiment classification (e.g., deciding whether a sentence is positive or negative). In that case, the tasks are transferable because such a model will classify a sentence as positive or negative based on both keywords such as fantastic and awful and the whole meaning of the sentence; e.g., a rhetorical question could entirely divert the meaning of a sentence.

BERT and CodeBERT
Language models can be roughly categorized into N-gram language models and neural language models. Classical neural language models include Word2Vec [26] and GloVe [27], which are still popular in software defect prediction [4,5,7]. BERT [28] improves natural language pre-training by using mask-based objectives and a Transformer-based architecture, which has successfully improved many state-of-the-art results for various natural language tasks. It is one of the best pre-trained language models for downstream tasks, considering that more powerful models (e.g., GPT-3 [42]) are not open source and not easily accessible. RoBERTa [43] is a replication of the BERT paper that follows BERT's structure and proposes an improved pre-training procedure. CodeBERT [36] follows the architecture of BERT and RoBERTa, i.e., the RoBERTa-large architecture. Unlike BERT and RoBERTa, which target natural languages, CodeBERT takes both natural language and source code as its input.
The overall architecture of BERT is shown in Figure 3. The core blocks in BERT are Transformers [44], which follow an encoder-decoder architecture. Based on the attention mechanism, Transformers effectively reduce the distance between any two words in a sentence to 1, which mitigates the long-term dependency problems in natural language processing. The input corpora are first encoded into feature vectors via multi-head attention and fully connected layers. Then, the feature vectors are fed to the decoder, which includes masked multi-head attention, multi-head attention, and fully connected layers, and are finally converted to the conditional probabilities used for prediction. Unlike the OpenAI GPT model, BERT uses a bidirectional Transformer that enables extracting context from both directions.


Proving the Naturalness Assumption via Language Models in Software Defect Prediction
The use of language models greatly speeds up advancements in natural language processing. In the software engineering domain, there is usually a gap of a few years between the time a new language model is proposed and the time it is adopted in this domain. For example, the use of Word2Vec and GloVe is still popular in software defect prediction. Similarly, although the breakthrough of the natural language processing domain (the BERT model) was proposed in 2019, it was not until last year that CodeBERT, a pre-trained programming language model using the BERT architecture, was proposed and made open to the public.
It is yet unclear whether using a pre-trained language model will improve the prediction performance of software defect prediction. Since naturalness has been regarded as an essential part of code characteristics [45,46], researchers would assume that a large-scale programming language model would be more competent in identifying "unnatural" code, i.e., code that does not follow the universal patterns identified from large code corpora. Recently, the naturalness assumption has been proven by IBM researchers by tagging AST nodes via pre-trained language models [47]. The naturalness assumption indicates that "unnatural" code is more likely to be buggy when it comes to software defect prediction. However, it remains unclear whether the naturalness assumption holds in software defect prediction.

The Influence of Prediction Patterns in Software Defect Prediction
The traditional software defect prediction pattern is to predict whether there are defects in the source code. Specifically, given the features generated by machine learning models, software defect prediction aims to predict 0 for clean code and 1 for buggy code. During this process, a prediction model is not aware that it is performing defect prediction. Assume that we tell the model, by some means, that it is performing a software defect prediction task; the model is then expected to directly link the textual concept of defects to the semantic parts expressed in natural language in the source code. Unlike the traditional prediction pattern that takes only code semantics into account, the new prediction pattern also considers textual semantics. For example, if a function is named "Workaround" and the function is unnaturally complex, a model could judge from textual semantics that a historical bug may not have been fully solved and may reoccur in the future, and judge from code semantics that a function with very high complexity is more likely to be buggy. It is worth noting that the new prediction patterns require the model to be a programming language model that understands both natural language and programming language, e.g., CodeBERT.

Workflow
The workflow of our approach is shown in Figure 4. The approach consists of six steps. In step A, a prediction pattern is chosen, which determines the components of the input data. In step B, the source code is tokenized into tokens, guided by the grammar rules in Backus Normal Form. In step C, the tokens are mapped to integer indexes, and some special tokens are added. In step D, a simple class balancing method is applied to the indexed tokens. In step E, we feed the balanced data to a CodeBERT model based on the model data available on the HuggingFace website [48]. Finally, in step F, we predict whether a new source file is buggy or clean using the trained CodeBERT model.

Choosing Prediction Pattern
This paper uses three prediction patterns for software defect prediction, which originate from motivation 2 in the previous section, as shown in Figure 5. The first prediction pattern is the most commonly used pattern, which takes the source code as input and predicts 0 for clean code and 1 for buggy code. The second prediction pattern takes both the source code and a declarative sentence (e.g., "The code is buggy") as inputs and predicts 0 if the declarative sentence does not match the source code (i.e., the code is clean) or 1 otherwise. In this case, the declarative sentence is more like a question answered with 0 or 1. The third prediction pattern takes both the source code and a list of keywords (e.g., bug, defect, error, fail, patch) as inputs and predicts 0 if the keyword list does not match the source code (i.e., the code is clean) or 1 otherwise. The second and the third prediction patterns are very similar, except that the second prediction pattern requires more textual comprehension of the declarative sentence, while the third prediction pattern lessens the burden of textual comprehension by leveraging keyword search and mapping. In the following sections, we name the first prediction pattern "traditional", the second prediction pattern "sentence", and the third prediction pattern "keyword" for simplicity.
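To make the three patterns concrete, the following sketch assembles the corresponding model inputs as plain strings before tokenization; the helper names are our own illustration, while the separator marks, the declarative sentence, and the keyword list follow the description above.

```python
# Sketch: build model inputs for the three prediction patterns.
# The <s> and </s> marks follow the preprocessing description in this paper;
# a standard tokenizer would normally insert them automatically.

BUGGY_SENTENCE = "The code is buggy"
KEYWORDS = ["bug", "defect", "error", "fail", "patch"]

def traditional_input(code: str) -> str:
    # Pattern 1: source code only.
    return f"<s> {code} </s>"

def sentence_input(code: str) -> str:
    # Pattern 2: source code paired with a declarative sentence.
    return f"<s> {code} </s> {BUGGY_SENTENCE} </s>"

def keyword_input(code: str) -> str:
    # Pattern 3: source code paired with a keyword list.
    return f"<s> {code} </s> {' '.join(KEYWORDS)} </s>"

if __name__ == "__main__":
    snippet = "public int add(int a, int b) { return a + b; }"
    print(keyword_input(snippet))
```

Since the marks are usually added by the tokenizer itself, the sketch mainly illustrates where each component sits in the final input.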


Tokenizing Source Code
During the compilation of source code, a normal tokenizer that separates tokens following grammar rules is sufficient. However, the semantics hidden in the text, e.g., function names and variable names, are usually neglected during compilation. Therefore, a tokenizer that can extract the textual semantics in source code should be used.
We follow the settings of CodeBERT, which uses a word piece tokenizer. Before the original source code is fed into the word piece tokenizer, code comments are removed, the white spaces at the head and the tail of the source code are removed, the tokens are separated by splitting on white space, and punctuation marks are also separated into independent tokens. This paper uses three preprocessing patterns for software defect prediction, which correspond to the three prediction patterns described in Section 4.2. The first preprocessing pattern (i.e., the traditional pattern) starts with a <s> mark at the beginning and ends with </s>. The second and the third preprocessing patterns (i.e., the sentence and keyword patterns) add an extra </s> mark to separate the source code from the declarative sentence or the keywords. The preprocessed code is then ready for tokenization.
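A minimal sketch of this preprocessing, assuming Java-style comments and a simple regular-expression implementation; the function name and the exact punctuation set are illustrative rather than the authors' code.

```python
import re

# Sketch of the preprocessing described above (assumed regex rules, Java-style comments).
def preprocess(code: str) -> str:
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)       # remove block comments
    code = re.sub(r"//[^\n]*", " ", code)                          # remove line comments
    code = code.strip()                                            # strip leading/trailing whitespace
    code = re.sub(r"([{}()\[\];,.=+\-*/<>!&|])", r" \1 ", code)    # split punctuation into tokens
    return " ".join(code.split())                                  # collapse whitespace between tokens

print(preprocess("int sum = a + b; // accumulate"))
```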
During the tokenization process, the source code is tokenized using a pre-trained vocabulary file. Uncommon words are separated into several sub-words with higher occurrences in the vocabulary file using a greedy longest-match-first algorithm. For example, the word "TestCase" will be separated into "Test" and "##Case". The "##" mark indicates that the token is a sub-word token. If a token is not found in the vocabulary file, an unknown token, <UNK>, is used.

Mapping Tokens
The CodeBERT model provides a standardized tokenizer, BertTokenizer, that takes the preprocessed source code as input and outputs a list of integers. The BertTokenizer maps each token to an integer specified in the vocabulary file, including sub-word tokens, out-of-vocabulary tokens (<UNK>), and other special tokens (<s> and </s>). If padding is enabled, 0 is padded before or after the generated integers to ensure that all generated integer lists have the same length.
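In practice, this tokenize-and-map step can be reproduced with the HuggingFace tokenizer, as sketched below; the checkpoint name microsoft/codebert-base is our assumption for the published CodeBERT model, and the concrete special tokens and padding value are whatever the loaded tokenizer defines.

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the paper loads the published CodeBERT model from HuggingFace.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

code = "public int add(int a, int b) { return a + b; }"
sentence = "The code is buggy"

# The tokenizer inserts the special tokens itself, so the raw code and sentence
# are passed directly; a second sequence adds the extra separator between them.
encoded = tokenizer(code, sentence, padding="max_length", truncation=True, max_length=64)

print(encoded["input_ids"][:20])                                    # integer indexes, padded to a fixed length
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][:20]))   # corresponding sub-word tokens
```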

Handling Class Imbalance
Class imbalance is common in software defect prediction because the buggy rate of a project is usually below 50%. We choose a simple class balancing method, random oversampling, so that the training set is balanced, i.e., 50% buggy samples and 50% clean samples.
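A minimal sketch of random oversampling on a binary-labeled training set; the function and variable names are illustrative and this is not necessarily the authors' exact implementation.

```python
import random

# Sketch: randomly duplicate minority-class samples until both classes are equal.
def random_oversample(samples, labels, seed=0):
    rng = random.Random(seed)
    buggy = [(s, l) for s, l in zip(samples, labels) if l == 1]
    clean = [(s, l) for s, l in zip(samples, labels) if l == 0]
    minority, majority = (buggy, clean) if len(buggy) < len(clean) else (clean, buggy)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]

X, y = random_oversample(["a.java", "b.java", "c.java"], [0, 0, 1])
print(y.count(0), y.count(1))  # equal counts after oversampling
```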

Loading Pre-Trained Model
The pre-trained CodeBERT model is published by researchers and is available online. There are two ways of using the CodeBERT model: reusing the model weights or reusing only the model architecture. We experiment with both methods to investigate whether using a pre-trained programming language model outperforms training a model with the same architecture from scratch. In the first method, we reuse the weights of the encoder and decoder and reset the weights of the classification layers, because the pre-training and fine-tuning phases target different tasks (software defect prediction vs. neural code search). In the second method, we just reuse the RoBERTa-large architecture, which is adopted by CodeBERT.
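The two usage modes can be sketched with the HuggingFace transformers API as follows; the checkpoint id microsoft/codebert-base is our assumption, and in both cases the sequence-classification head on top is newly initialized by the library.

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

CHECKPOINT = "microsoft/codebert-base"  # assumed HuggingFace model id for CodeBERT

# Method 1: reuse the pre-trained encoder weights; the classification head
# is freshly initialized, matching the "reset the classification layers" step.
model_pretrained = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Method 2: reuse only the architecture and train from scratch (CodeBERT-NT-style).
config = AutoConfig.from_pretrained(CHECKPOINT, num_labels=2)
model_scratch = AutoModelForSequenceClassification.from_config(config)
```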
Since we use pre-trained models, we do not aim to change most of the hyperparameters. Specifically, we use a Transformer with six layers, 768-dimensional hidden states, and 12 attention heads as our decoder. We set the learning rate to 10⁻⁵, the batch size to 32, and the maximum number of epochs to 20. We tune hyperparameters and perform early stopping on the test set. The only difference, the batch size, is due to computational limitations.
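A sketch of a fine-tuning loop with the stated hyperparameters; the optimizer choice (AdamW) and the exact early-stopping rule (stop when F1 no longer improves) are our assumptions.

```python
from torch.optim import AdamW

# Stated hyperparameters; the optimizer and stopping rule are assumptions.
LEARNING_RATE = 1e-5
BATCH_SIZE = 32
MAX_EPOCHS = 20

def fine_tune(model, train_loader, eval_f1):
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    best_f1 = 0.0
    for epoch in range(MAX_EPOCHS):
        model.train()
        for batch in train_loader:            # batches of size BATCH_SIZE with labels included
            outputs = model(**batch)          # HuggingFace models return the loss when labels are given
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        f1 = eval_f1(model)                   # evaluate after each epoch
        if f1 <= best_f1:                     # simple early stopping on F1
            break
        best_f1 = f1
    return model
```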

Predicting Software Defects
After we feed the training set to the CodeBERT model, all parameters, including weights and biases, are fixed. Then, we run the trained or fine-tuned CodeBERT model on each file in the test set to obtain prediction results. The result is a float number between zero and one, based on which we predict a source file as buggy or clean. If the result is above 0.5, the prediction is regarded as buggy; otherwise, it is clean.
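A sketch of this inference step; obtaining the float score as the softmax probability of the buggy class is our assumption of how the 0-to-1 output is produced.

```python
import torch

# Sketch of step F: score each test file and apply the 0.5 threshold described above.
@torch.no_grad()
def predict_file(model, tokenizer, code: str) -> int:
    model.eval()
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    logits = model(**inputs).logits                    # shape: (1, 2)
    buggy_prob = torch.softmax(logits, dim=-1)[0, 1]   # probability of the "buggy" class
    return 1 if buggy_prob > 0.5 else 0                # 1 = buggy, 0 = clean
```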

Experimental Setup
All of our experiments are run on a laptop, and each experiment is repeated with five holdout runs.

Evaluation Metrics
We use four evaluation metrics to evaluate prediction performance, namely F-measure (F1), G-measure, area under curve (AUC), and Matthews correlation coefficient (MCC).These evaluation metrics are popular in software defect prediction and can comprehensively evaluate model capabilities.
F-measure is the most frequently used evaluation metric in software defect prediction. It highlights the prediction results on buggy samples and balances precision and recall. Specifically, F1 is the most common form of F-measure, which takes the harmonic mean of precision and recall (also TPR, true positive rate). F1 is calculated as follows: F1 = 2 × Precision × Recall / (Precision + Recall). G-measure is the harmonic mean of recall and the true negative rate (TNR). G-measure targets the false alarm effect, which is of great importance in software defect prediction. G-measure is calculated as follows: G-measure = 2 × Recall × TNR / (Recall + TNR). AUC is a very important evaluation metric in many machine learning tasks. Unlike most evaluation metrics, AUC is not sensitive to the threshold used to label an output ranging from 0 to 1 as buggy or clean. AUC is also not sensitive to class imbalance, which is often the case in software defect prediction.
In this paper, we regard F1 as the main evaluation metric. We perform multiple experiments and report the results with the best performance, i.e., the best F1 values. The best values of the other metrics, including G-measure, MCC, and AUC, are also recorded and provided online as supplementary data to our experimental results.
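The four metrics can be computed from the predictions as sketched below, using scikit-learn for F1, AUC, and MCC and the confusion matrix for G-measure; the example inputs are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef, roc_auc_score

# Sketch: compute the four evaluation metrics from predicted labels and scores.
def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = tp / (tp + fn)                          # true positive rate
    tnr = tn / (tn + fp)                             # true negative rate
    g_measure = 2 * recall * tnr / (recall + tnr)    # harmonic mean of recall and TNR
    return {
        "F1": f1_score(y_true, y_pred),
        "G-measure": g_measure,
        "AUC": roc_auc_score(y_true, y_score),       # threshold-independent
        "MCC": matthews_corrcoef(y_true, y_pred),
    }

print(evaluate([1, 0, 1, 0], [1, 0, 0, 0], [0.9, 0.2, 0.4, 0.1]))
```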

Evaluated Projects and Datasets
We use the PROMISE dataset [49], which targets open-source software defect prediction. Specifically, we use the cross-version PROMISE source code (CVPSC) dataset for cross-version software defect prediction and the cross-project PROMISE source code (CPPSC) dataset for cross-project software defect prediction. Both the CVPSC and the CPPSC dataset include bug labels of specified versions of open source projects and the corresponding source code. Because the time cost of fine-tuning CodeBERT is not trivial, we choose seven project version pairs and nine project version pairs for CVPSC and CPPSC, respectively. Precisely, we follow Pan et al.'s paper [7] to set up the CVPSC dataset and take the intersection of Shi et al.'s paper [17] and the project versions in the CVPSC dataset to set up the CPPSC dataset. Detailed information on the CVPSC and CPPSC datasets is shown in Tables 1 and 2.

Baseline Models
Our paper aims to explore empirical findings on neural programming language models, i.e., CodeBERT models. As the CodeBERT model uses the BertTokenizer to tokenize raw source code, which is very rare in software defect prediction, it is inappropriate to compare its results with those of other software defect prediction models based on ASTs, CFGs, and so on, because their inputs vary. Since it is extremely time-consuming to train a new programming language model from scratch and published large-scale programming language models are very scarce, we do not take other programming language models into account when choosing baseline models. We choose the following baseline models:
• Pre-trained CodeBERT sentence model (CodeBERT-PS). CodeBERT-PS uses the pre-trained CodeBERT model available on HuggingFace and predicts the answer to a declarative sentence, "The code is buggy", regarding a specific piece of code. If the output is 1, the code is buggy. Otherwise, the code is clean.
We decompose RQ1 into two sub-research questions, RQ1a and RQ1b, which use the CVPSC and the CPPSC dataset for cross-version and cross-project defect prediction, respectively.

The Relationship between Prediction Pattern and Buggy Rate
The empirical results of RQ2 indicate that CodeBERT-PK and CodeBERT-PS
As we use generated features for software defect prediction, and because the product-based and process-based features are mostly hand-crafted features, their combination does not seem to improve prediction performance that much. For example, Wang et al. [4] tried to combine generated features and hand-crafted features and gained only tiny improvements. Therefore, we infer that effective combinations of generated features should come from the machine learning-based assumption. Since the naturalness assumption has been proven in software defect prediction via neural language models, and the code pattern assumption has also been proven by various researchers in deep learning-based software defect prediction, it is worth investigating the relationship between the two assumptions. From our perspective, these two assumptions are demonstrated independently in the existing experimental design. For example, the use of AST nodes and AST paths proves that simplifying source code and highlighting components that are more related to defects can improve prediction performance in software defect prediction. The reason behind this may be that adding redundant information introduced by grammar definitions does not help improve prediction performance for software defect prediction. However, the CodeBERT model does not take advantage of ASTs and performs tokenization directly on the source code, which is generally accepted in text processing but can be improved for programming language processing. If a programming language model could be pre-trained using AST sequences or AST paths as input, it is more likely that the code pattern assumption and the naturalness assumption can be combined in the experimental design to further improve prediction performance.

Figure 1. Workflow of deep learning-based software defect prediction.


Figure 2. A typical process of network-based deep transfer learning.


Figure 3. The overall architecture of BERT. Since we do not change the architecture of CodeBERT, the formulas of the CodeBERT model, BERT model, RoBERTa model, and Transformers model are omitted for simplicity.


Figure 4. The overall workflow of our approach for software defect prediction.

Figure 5. Traditional, sentence-based, and keyword-based prediction patterns. The classification model is omitted for simplicity.


MCC is a good evaluation metric for imbalanced datasets. It focuses on buggy and clean samples equally and describes the correlation between them. MCC is calculated as follows: MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

RQ1a: Does the Pre-Trained CodeBERT Model Outperform a Newly-Trained CodeBERT Model in Cross-Version Defect Prediction?
To answer RQ1a, we perform cross-version defect prediction experiments on the CVPSC dataset to compare the prediction performance of the CodeBERT-NT, CodeBERT-PT, and RANDOM models in terms of F-measure, as shown in Table 3. For example, we get an F-measure of 0.509 and 0.493 with the CodeBERT-NT and CodeBERT-PT models, respectively, for the Camel project. Compared to the F-measure of 0.287 of the RANDOM model, the CodeBERT-NT and CodeBERT-PT models outperform the trivial RANDOM baseline. Considering the average F-measure, the CodeBERT-PT model outperforms the CodeBERT-NT model by 2.1%. Both models significantly outperform the RANDOM baseline at 0.397. Therefore, the CodeBERT-PT model outperforms the CodeBERT-NT model on the CVPSC dataset for cross-version defect prediction.
(1) Product-based assumptions. Related metrics include CC (cyclomatic complexity) for source code, WMC (weighted methods per class) and DIT (depth of inheritance) for object-oriented software, and Degree and Betweenness Centrality for software complex networks. (2) Process-based assumptions. Process-based assumptions focus on history and experience during the development process of software. If the historical version of the source code is buggy, or the source code is developed by inexperienced programmers, it is more likely to be buggy. Related metrics include code churn, fix, developer experience, developer interaction, ownership between software modules and developers, and so on. (3) Machine learning-based assumptions. Machine learning-based assumptions focus on code patterns and naturalness. If the source code includes frequent patterns found in buggy source code, or the source code is unnatural, i.e., does not follow the patterns followed by the majority of developers, it is more likely to be buggy. Related metrics include semantic features generated from source code ASTs and neural language models.

Figure 8. Assumptions that influence software defect prediction.


Table 1. CVPSC dataset description. The number of files, number of defects, and average buggy rate are listed separately for the training set and the test set.

Table 2. CPPSC dataset description. The number of files, number of defects, and average buggy rate are listed separately for the training set and the test set.

• Pre-trained CodeBERT keyword model (CodeBERT-PK). CodeBERT-PK uses the pre-trained CodeBERT model available on HuggingFace and predicts the relationship between a specific piece of code and a set of keywords, including bug, defect, error, fail, and patch. If the output is 1, the code is buggy. Otherwise, the code is clean.
• Pre-trained CodeBERT traditional model (CodeBERT-PT). CodeBERT-PT uses the pre-trained CodeBERT model available on HuggingFace and predicts whether the source code is buggy, which is the pattern adopted by most researchers. If the output is 1, the code is buggy. Otherwise, the code is clean.
• Newly-trained CodeBERT traditional model (CodeBERT-NT). CodeBERT-NT uses the architecture of the CodeBERT model but discards the existing weights and trains from scratch. The model predicts whether the source code is buggy, which is the pattern adopted by most researchers. If the output is 1, the code is buggy. Otherwise, the code is clean.
• RANDOM. The RANDOM model randomly predicts whether a source file is buggy. A model that performs worse than RANDOM is no better than random guessing and, therefore, has no practical value.