A Source Code Similarity Based on Siamese Neural Network

Finding similar code snippets is a fundamental task in the field of software engineering. Several approaches have been proposed for this task using statistical language models, which focus on the syntax and structure of code rather than the deeper semantic information underlying it. In this paper, a Siamese Neural Network is proposed that maps code snippets into a continuous vector space and tries to capture their semantic meaning. First, an unsupervised pre-training method models each code snippet as a weighted series of word vectors, where the weights are fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves performance over a single word-embedding method.


Introduction
Code similarity measures the degree of similarity between code snippets at the textual, syntactic and semantic levels. It is a fundamental activity for applications such as code clone detection [1,2], code plagiarism detection [3][4][5], code recommendation [6] and information retrieval [7]. Indeed, several efforts have been made to find similar code for a given snippet. Specifically, manually defined or hand-crafted features, e.g., the overlap among identifiers, operators, operands, lines of code, functions, types, constants and other attributes, or comparisons of the abstract syntax trees of two code snippets, have been used for small code snippets. However, such measurement is coarse-grained and has low accuracy [8]. Additionally, because programs have a well-defined syntax, they can be represented as a series of tokens, ASTs (Abstract Syntax Trees), PDGs (Program Dependency Graphs) or CFGs (Control Flow Graphs), which can successfully capture code patterns. The similarity of code is then measured by string matching [9], suffix tree matching [10], graph matching [11] and other algorithms.
With the rapid growth of valuable, widely used, open-source software, the scale of available data, such as billions of tokens of code and millions of instances of metadata, is massive. It is hard to extract ASTs, PDGs and CFGs from such large bodies of code, so a new approach to measuring similarity is required. Software engineering and programming language researchers have largely turned to machine learning techniques to extract features, because these have shown high performance in natural language processing. Hindle et al. first proposed the code naturalness assumption [12] and showed that source code has characteristics similar to natural language, which provided a new idea for source code representation and analysis [13][14][15]. However, there are many differences between natural language and source code; for example, code has formal syntax and semantics. Understanding code therefore remains an open question.
In summary, designing an effective approach for code poses several unique challenges. First, words in code have specific meanings, e.g., 'while' and 'for' denote loops, while 'if' and 'switch' denote choices. How to understand the semantics of code from its content and represent those semantics automatically is a nontrivial problem. Second, code has a rich vocabulary and a higher neologism rate than natural text. Most tokens are identifiers, and a developer must name new things with new identifiers, which makes code corpora change drastically. Programs consist of identifiers, statements and structures, some of which are user-defined words and some reserved words, and their contributions to a code representation differ. Thus, it is necessary to capture the statistical behavior of a word appearing alongside other words.
To address the challenges mentioned above, in this paper we first build a pre-training model that embeds each word of a code snippet into a real vector based on Word2Vec, and we design another model to learn word frequencies in code snippets. We model a code snippet as a matrix, rather than averaging or summing word embeddings to represent sentences; the weight of each word is its learned word frequency. Then a neural network is proposed to automatically learn the semantic features underlying the code. Finally, the cosine similarity scores of source code pairs are calculated from their representations. We call this approach Word Information for Code Embedding-Siamese Neural Networks (WICE-SNN). This method applies deep learning to the measurement of code similarity, which can uncover deeper semantic information and higher-level abstract features than traditional machine learning methods [16][17][18][19]. We implemented our method and evaluated its recall and precision on a dataset. Experimental results reveal that WICE-SNN significantly outperforms several baselines on the similar-code task.
The main contributions of this paper are as follows:
• We construct a method that incorporates word statistics from code snippets. This approach regulates the weight each word contributes to its sentence representation according to its contribution.
• We propose a Siamese neural network that extracts semantic features by exploiting the similarity among source codes, mapping code snippets with similar functions into similar vectors.
The rest of this paper is organized as follows: Section 2 introduces the current situation of related research, summarizes and analyzes it. In Section 3, we present our similarity computing framework and explain the constitution of our model. Section 4 demonstrates an experiment to evaluate the proposed method and presents the obtained results. Section 5 concludes our work.

Source Code Similarity
The research on source code similarity originated in the 1970s. Assessing source code similarity means measuring similar implementations of functions, which is a fundamental activity in software engineering with many applications. Many approaches to code similarity have been proposed in the literature, most of which focus on syntactic rather than semantic similarity. These techniques can be classified into two categories: attribute-based and structure-based. An early approach proposed by Halstead measures similarity based on properties of the software, such as the numbers of different operators, operands and variable types and other attributes of the program [20]; programs with similar attributes have higher similarity. Attribute-based methods are simple but are easily defeated by variable substitution. Structure-based approaches include text-based, token-based, tree-based and graph-based techniques. Text-based approaches treat two code snippets as string sequences and compare their similarity; one widely used string-similarity method, proposed by Roy and Cordy, finds the longest common subsequence. This technique is independent of the programming language but ignores the syntactic and structural information of the program [9]. In token-based techniques, source code is transformed into tokens: a stream of tokens is an abstract representation of a program in which useless characters, spaces, comments and the like are filtered out. This method is not affected by textual differences, but it does not consider the order of the code and ignores the structural information in code snippets [1]. Tree-based and graph-based similarity measurements focus on the structural information of two programs and can avoid lexical differences [10,21], but their computational cost is very high; graph-based algorithms, in particular, are mostly NP-complete.
Text-based, token-based, tree-based and graph-based approaches perform well on small-scale programs. However, with the growth of open source on the Internet, more and more open-source programs with similar functions but different forms have become available, and the above methods are no longer appropriate. Researchers have therefore tried to introduce machine learning techniques into the similarity task. Machine learning methods can be divided into keyword-based, vector-space-based and deep-learning-based approaches. N-gram similarity and Jaccard similarity are the two main keyword-based algorithms: the N-gram method measures similarity by the number of common substrings in two code fragments, while the Jaccard method measures similarity as the ratio of the intersection to the union of the word sets of two texts. Word2vec is the most common vector-based method, in which words are mapped into vectors and similarity is the distance between the corresponding vectors [22]. As deep learning techniques have shown excellent performance in natural language processing and computer vision [23], researchers have begun to apply them to software engineering, for example to program analysis [24], code plagiarism [25], code clones [26], code summary generation, fault localization and other applications [27,28]. These new deep-learning approaches use representation learning to automatically extract useful information from large amounts of unlabeled code. They reduce the cost of modeling code and improve performance, because they are more efficient than humans at discovering hidden features in high-dimensional spaces.

Source Code Representation
Source code representation plays a key role in code similarity. In this section, we review some recent and representative work on code similarity tasks. There are two common code representation methods: statistics-based and vector-based. Statistics-based methods introduce probabilistic models to represent source code. TF-IDF is a popular statistical representation usually used in information retrieval applications: the representation of a text is a term-document matrix that describes the word frequencies in the documents, and the TF-IDF model assigns relatively low weights to very frequent words and high weights to rare words. A source code snippet c consisting of N words is thus represented by a TF-IDF weight set F = {f_1, f_2, . . . , f_N}. Word embedding is a vector-based representation that produces a low-dimensional continuous distributed vector for each word. The word2vec model proposed by Mikolov et al. is the most common word embedding method; it constructs a three-layer shallow neural network to predict the distributed vector of each word [29]. It has also been introduced into source code representation. Ye et al. trained low-dimensional vectors for code snippets extracted from API documents to measure document similarity; this model was applied to software search and fault localization tasks based on distances in the trained vector space [30,31]. Chen et al. extracted a large code corpus from Stack Overflow and other programming Q&A communities and handled synonym problems based on the word vectors, providing a basis for code conversion and related work [32]. Hao et al. parsed the abstract syntax trees of source code and mapped them to corresponding real-valued vector representations, so that similar AST nodes have similar vectors [33]. Mou extracted abstract syntax trees from code and proposed a tree-based CNN to learn word vectors of code, which was used for code classification and searching [34].
Nguyen proposed a new DNN that extracts lexical and grammatical features of code in each layer [24]. White proposed an auto-encoder deep-learning framework for code clone detection that learns lexical and syntactic features of code [35]. Wang used a deep belief network to learn semantic features from token vectors extracted from ASTs for defect prediction [36].
Most of the aforementioned works are developed for a specific task rather than pure code representation. Unlike these approaches, our method uses deep learning only to train word embeddings that capture the more complex semantic and syntactic structure underlying source code. We then model a code snippet as a weighted series of word vectors whose weights are learned from their TF-IDF values.

WICE-SNN Framework
We first give the formal definition of source code similarity in this section and introduce the code embedding method based on statistics information proposed in this paper.

Definition of Source Code Similarity
Similar code snippets should share the same functional objective, which is related to the semantics of the code. For any two code snippets c_1, c_2, we measure the similarity score between c_1 and c_2. This score is a real number in [0, 1]; the higher the score, the more similar c_1 and c_2 are. Without loss of generality, the problem of similar code snippets can be formulated as follows. Definition: Given a set of code snippet pairs C = {(c_1^a, c_1^b), . . . , (c_n^a, c_n^b)} and a set of real numbers Y = {y_1, y_2, . . . , y_n}, where Y is the set of standard similarity scores manually annotated for the pairs (a pair scored 0 is not similar), the objective of the task is to learn a similarity model f(c^a, c^b, θ) → R, where c^a, c^b ∈ C and θ is a parameter of f to be trained in the model. Using this model, we can measure the similarity score of arbitrary code snippets. As shown in Figure 1, to tackle the above problem we propose a Siamese CNN model that captures the similarity of semantic features between two code snippets. The first step is word embedding and word frequency extraction: we split the source code into variables, function names, operators, reserved words, constant values and others, and each word is mapped to its corresponding vector together with a frequency. The second step is source code representation: the inputs are the code matrices obtained in the previous step, and the hidden features of the code snippets are trained and mined in the model. In the last step, the similarity value is calculated from the hidden features. We describe our model in detail as follows.

Details of WICE-SNN
In this section, we introduce the technical details of the WICE-SNN framework, as shown in Figure 2. The model mainly contains three parts: a Word Information for Code Embedding (WICE) layer, a Siamese Neural Network (SNN) layer and a similarity score layer. The WICE layer maps discrete words to real-valued vectors and extracts the weight of each word in a code snippet. The SNN layer learns semantic representations of code snippets, using the weighted word embeddings as initial values. The similarity layer calculates the similarity score between two code snippets from their semantic representations, which can be used to rank candidate code snippets and find similar ones.

WICE Layer
Word embedding: Word2Vec is a widely used word embedding model in NLP. Word2vec learns a vector representation for each word using a simple neural network. There are two network architectures: the continuous bag-of-words (CBoW) model and the skip-gram model. The CBoW model uses the average of the context vectors, also known as input vectors, to predict a word with softmax weights, whereas the skip-gram model uses a single word to predict the surrounding words. Both consist of an input layer, a projection layer (hidden layer) and an output layer. Formally, a code snippet c is split into a word sequence c = v_1, v_2, . . . , v_N, where N is the number of words in c. The input of the CBoW model is x_1, x_2, . . . , x_k, where x_i ∈ R^{1×N} is the one-hot vector of v_i, k is the window size and N is the vocabulary size. The converted vector u_i for the one-hot vector of v_i is expressed as u_i = x_i W, where the weight matrix W ∈ R^{N×d} holds the parameters of the embedding layer, u_i ∈ R^{1×d} and d is the dimension of the hidden layer. The output layer is a log-probability vector: each word vector is trained to maximize the probability of its neighboring words, and the word with the largest probability in the vector is the output word. As a result, the representation of a code snippet is transformed into a matrix U, which is a part of the weight matrix W. Composition with statistical information: When word vectors are used to represent a code snippet, all word vectors are weighted equally, but we know that the contribution of each word differs. For example, there are three main types of vocabulary in source code: reserved words, which represent data types, control structures, library files and others; user-defined names, such as variable and function names; and the various operator symbols. For source code snippets, the weights should reflect the structure of the source code.
To improve the code representation, the words of control structures should carry different coefficients, multiplied by their word frequencies. The vector of a code snippet is therefore represented as f_1·w_1, f_2·w_2, . . . , f_m·w_m, where w_i ∈ W is the i-th row of the embedding matrix W, f_1, f_2, . . . , f_m are the TF-IDF values of the words, and m is the number of words in the code snippet.
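As a minimal numpy sketch of this weighting step (the vocabulary, embedding matrix and TF-IDF values below are toy stand-ins, not the trained values described in the text):

```python
import numpy as np

def weighted_code_matrix(tokens, vocab, W, tfidf):
    """Stack f_i * w_i for each token: w_i is the embedding row of the token
    in W, and f_i is its TF-IDF weight."""
    return np.stack([tfidf[t] * W[vocab[t]] for t in tokens])

# Toy example: a 4-word vocabulary with 3-dimensional embeddings.
vocab = {"for": 0, "if": 1, "i": 2, "return": 3}
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), 3))          # stand-in embedding matrix
tfidf = {"for": 0.2, "if": 0.3, "i": 0.9, "return": 0.4}

U = weighted_code_matrix(["for", "i", "return"], vocab, W, tfidf)
print(U.shape)  # one weighted row per token: (3, 3)
```

Each row of the resulting matrix is a word vector scaled by its learned frequency, so rare identifiers contribute more strongly than ubiquitous keywords.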

SNN Layer
In the previous layer, we obtained the word embedding matrix U. One simple representation of a code snippet is a single vector obtained as a weighted sum over U. But different words contribute differently to a code snippet, and such a representation also loses program structure information. A CNN model is therefore proposed to learn a representation of an input source code by integrating its semantic and structural materials. As shown in Figure 2, the CNN model contains five layers: an input layer, a convolution layer, a pooling layer, a connection layer and an output layer. In the following, we describe them one by one.

Input Layer:
We take a pair of pre-trained word embedding matrices U_A, U_B from the previous part as input to the WICE-SNN model, U_A ∈ R^{N_1×d}, U_B ∈ R^{N_2×d}, where N_1 and N_2 are the numbers of words in source codes A and B, respectively, and d is the vector dimension. We pad the two code snippets with 0 to the same length N = max{N_1, N_2}. After filling with 0, each initialized matrix is U ∈ R^{N×d}.
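The zero-padding step can be sketched as follows (the sizes N_1 = 5, N_2 = 8 and d = 4 are illustrative):

```python
import numpy as np

def pad_pair(U_a, U_b):
    """Zero-pad two embedding matrices to the common length N = max(N1, N2)."""
    n = max(U_a.shape[0], U_b.shape[0])
    pad = lambda U: np.vstack([U, np.zeros((n - U.shape[0], U.shape[1]))])
    return pad(U_a), pad(U_b)

P_a, P_b = pad_pair(np.ones((5, 4)), np.ones((8, 4)))
print(P_a.shape, P_b.shape)  # (8, 4) (8, 4)
```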
Convolution Layer: Each kernel K ∈ R^{s×d} performs the convolution operation over the word sequence v_1, v_2, . . . , v_N:
p_i = U_i * K.
Here, * is the convolution operator and U_i = [u_i, u_{i+1}, . . . , u_{i+s−1}] is the embedding matrix of the word sequence v_i, v_{i+1}, . . . , v_{i+s−1}, 1 ≤ i ≤ N − s + 1. Each p_i is a real number, because the kernel and the word vectors have the same dimension.
Pooling Layer: Pooling (including min, max and average pooling) is commonly used to extract robust features from the convolution output and to reduce its dimension. Each convolution kernel transforms the input feature map U with d columns into a new feature map P with one column. We take the maximum after max-pooling over each vector P, which can be expressed as
x_j = max(P_j), 1 ≤ j ≤ M.
Here, M is the number of filters we set in the convolution layer. Connection Layer: In the connection layer, we concatenate the values x_j obtained from the pooling layer into one vector each for source codes A and B:
X = x_1 ⊕ x_2 ⊕ . . . ⊕ x_M,
where ⊕ is the operation that merges two vectors into a longer vector and X is the new feature map of the source code.
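The convolution, max-pooling and concatenation steps above can be sketched in numpy as one encoder branch (the two Siamese branches share the same kernels; the sizes here are illustrative, not the paper's 32 filters):

```python
import numpy as np

def encode(U, kernels):
    """One branch of the Siamese encoder: each kernel K (s x d) slides over
    the padded matrix U (N x d) giving p_i = sum(U[i:i+s] * K); max pooling
    keeps the largest p_i, and the pooled values are concatenated into X."""
    N, _ = U.shape
    feats = []
    for K in kernels:
        s = K.shape[0]
        p = [float(np.sum(U[i:i + s] * K)) for i in range(N - s + 1)]
        feats.append(max(p))            # max pooling over positions
    return np.array(feats)              # X: one entry per kernel

rng = np.random.default_rng(1)
U = rng.normal(size=(10, 4))                            # padded snippet matrix
kernels = [rng.normal(size=(s, 4)) for s in (2, 3, 5)]  # window sizes 2, 3, 5
X = encode(U, kernels)
print(X.shape)  # (3,)
```

Because each kernel spans the full embedding dimension d, one convolution position yields a single real number p_i, matching the description above.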

Similarity Score Layer
Similarity Score: This layer calculates the similarity score of each source code pair, which can be used to rank candidate code snippets and find similar ones for any source code. As shown in Figure 2, the similarity score of the input pair is computed by applying the cosine function to the new feature vectors X, which leverage the semantic and structural representations:
sim(A, B) = (X_A · X_B) / (‖X_A‖_2 ‖X_B‖_2),
where · is the inner product of the vectors X_A and X_B, and ‖X_A‖_2, ‖X_B‖_2 are their 2-norms. The code snippets with the largest similarity scores are returned as the similar codes of the given one.
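The cosine score is straightforward to compute from the two feature vectors:

```python
import numpy as np

def cosine_similarity(x_a, x_b):
    """sim = (X_A . X_B) / (||X_A||_2 * ||X_B||_2), a value in [-1, 1]."""
    return float(np.dot(x_a, x_b) / (np.linalg.norm(x_a) * np.linalg.norm(x_b)))

a = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(a, a))                           # identical vectors: ~1.0
print(cosine_similarity(a, np.array([0.0, 0.0, 3.0])))   # orthogonal vectors: 0.0
```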

Experiments
In this section, we present our empirical evaluation. In our experiments, we aim (1) to evaluate our method's accuracy in a dataset; (2) to compare it with the state-of-the-art.

Dataset
We evaluate our approach on a dataset collected from a programming online judge (OJ) platform (http://cstlab.jsnu.edu.cn). There are a large number of programming problems on the OJ system: students submit their source code as solutions to given problems, and the OJ system automatically judges the validity of the submissions. We selected 35 C++ problems on the website, including greatest common divisor, narcissistic numbers, string matching, insertion sort, linked-list insertion, breadth-first search and others. We downloaded the source codes with their corresponding problem IDs as our dataset and extracted functions from the source codes. When downloading the training samples, we kept only code that passed the OJ tests, to ensure the solutions are correct. Source codes for the same problem are considered functionally similar, and source codes for different problems are considered dissimilar. As a result, 940 functions were extracted from the dataset, yielding 817,216 pairs in total after pruning: 38,460 similar pairs and 778,756 dissimilar pairs.
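The labeling rule (same problem ID implies functionally similar) can be sketched as follows; the function names are illustrative:

```python
from itertools import combinations

def label_pairs(functions):
    """Label every unordered pair of functions: same problem ID -> similar (1),
    different problem IDs -> dissimilar (0).
    `functions` is a list of (source_code, problem_id) tuples."""
    return [(a, b, 1 if pa == pb else 0)
            for (a, pa), (b, pb) in combinations(functions, 2)]

# Toy example with two problems.
pairs = label_pairs([("gcd_v1", 1), ("gcd_v2", 1), ("sort_v1", 2)])
print(pairs)
# [('gcd_v1', 'gcd_v2', 1), ('gcd_v1', 'sort_v1', 0), ('gcd_v2', 'sort_v1', 0)]
```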

Implementation and Comparisons
For the WICE-SNN, word2vec and N-gram [12] models, we use TensorFlow to implement the neural networks. We also compare WICE-SNN with a state-of-the-art approach, code2vec [37]. Code2vec is a neural model for code embedding that represents a code snippet as a collection of paths and aggregates these paths into a single fixed-length code vector. In our experiment, we obtained the available implementation from GitHub [38]. Because the pre-trained model on GitHub is for Java, we trained a new model on our preprocessed dataset.
To validate the effectiveness of WICE-SNN, we employ hold-out evaluation on the labeled source codes, splitting the dataset into three parts: 80% for training, 10% for validation and 10% for testing. We repeat each training 10 times and report the average over the 10 runs. We implemented our programs in Python 3.7, using the popular frameworks TensorFlow 1.4.0 and Keras 2.2.0 for the deep learning modules.
Training process: In our experiments, we use the Word2Vec tool to map words into real vectors and count word frequencies with the TF-IDF algorithm. The weighted matrix is the input of the WICE-SNN model, which is used to train a classifier. We use the cross-entropy loss to evaluate the difference between the prediction ŷ and the labeled value y as follows:
L(y, ŷ) = − Σ_{i=1}^{d} y_i log ŷ_i,
where d is the size of the output layer, i.e., the dimension of the vector y. We use the Adam optimization algorithm to update the parameters. We train on a computer with an Nvidia GPU; a single training epoch takes about 40 min, and it takes about 7 h to completely train a model. Parameter settings: For the WICE layer, we set the vocabulary size N to 1024, the hidden size d to 64 and the window size to 10. In the CNN model we use three kinds of filters, with window sizes 2, 3 and 5; the number of filters is 32. We set the dropout rate to 0.5, the learning rate to 0.001 and the mini-batch size to 32. The model is trained on the training set, and the hyper-parameters are tuned to maximize the F1 score on the validation set.
Evaluation Measures: We adopt Precision, Recall and the F1 measure, which are widely used in many fields. The expressions are as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall),
where TP is the number of true positive candidates, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.
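These measures can be computed directly from the candidate counts (the numbers below are toy values, not our experimental results):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```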

Experimental Results
To investigate the performance of WICE-SNN and the baselines on this task, we ran the experiments about 10 times on the dataset and took the average values as the experimental results. The performance of all models is shown in Table 1. We also used word2vec to obtain each word vector and computed the cosine value based on average weights (Word2Vec) or TF-IDF weights (WICE) as the similarity value. Our proposed method achieves the best overall performance, reaching up to 0.67, 0.83 and 0.74 in Precision, Recall and F1. Our method improves recall by 0.41 and 0.18 over the word embedding and WICE methods, respectively; however, its precision is lower than that of those two methods. For N-gram, we set n from 1 to 5 and took the average values as the result. From the results, the N-gram model has almost the same performance as Code2Vec on this dataset; in our experiment, N-gram performed best when n is 2. Because Code2Vec is designed only for functions, during data preprocessing the main function, or the function that implements the main algorithm, was retained and the other code was deleted. In addition, we deleted all comments in the code snippets, as the astminer library cannot process comments when extracting code blocks into abstract syntax trees. To test the best performance of each method, we also varied the similarity threshold from 0 to 1 in steps of 0.1. The test results are shown in Figure 3, whose abscissa shows the threshold value. The figure shows that the first two methods achieve their best F1 performance at a threshold of about 0.9, while the third method performs best at a threshold of about 0.6.
N-gram is a statistical method that needs few computing resources; it takes about 11.96 s to produce results. Word2Vec and WICE are shallow neural networks and take about 57 min to complete the process. Our model and code2vec are deep neural networks, whose computational cost is determined by the size of the networks; they take more time than the shallow neural networks. In particular, code2vec needs to parse all source codes into abstract syntax trees (ASTs) and extract the paths from the ASTs, so preprocessing is also time-consuming. For our model, no such preprocessing is required and the kernels are small, so the computational cost is mainly spent on model training: it takes about 7 h to completely train our model, and about 10.5 h to train a code2vec model.


Conclusions and Future Works
In this paper, we presented a neural network framework to measure similarities of code snippets. This framework contains three parts: embedding layer, representation layer and similarity layer.
The embedding layer integrates word frequencies into the word embeddings and maps a code snippet into a real-valued matrix. The representation layer provides a Siamese CNN model, initialized by the pre-trained matrix, that mines the deep features of similar codes by training a classifier. The similarity layer computes the cosine similarity value. We evaluated our approach on a dataset; in contrast with previous approaches, our proposed method performs better than general word embedding and than some statistics-based methods.
This work can serve as a foundation for improving many other code applications, such as bug detection, code recommendation and code search. To serve this purpose, we will design more downstream tasks to optimize our model. Moreover, the different types of words in source code, such as reserved words, user-defined words and operators, are not distinguished in our method. Therefore, in the future, we will further refine the word types and verify the experimental results.

Conflicts of Interest:
The authors declare no conflict of interest.
