Two-Pass Technique for Clone Detection and Type Classification Using a Tree-Based Convolutional Neural Network

Abstract: Appropriate reliance on code clones significantly reduces development costs and hastens the development process. Reckless cloning, in contrast, reduces code quality and ultimately adds costs and time. To avoid this scenario, many researchers have proposed methods for clone detection and refactoring. Existing techniques, however, can only reliably detect clones that are either entirely identical or that differ only in their identifiers, and they do not provide clone-type information. This paper proposes a two-pass clone classification technique that uses a tree-based convolutional neural network (TBCNN) to detect multiple clone types, including clones that are not wholly identical or that contain small changes, and to automatically classify them by type. Our method was validated with BigCloneBench, a well-known and widely used dataset of cloned code. Our experimental results show that our technique detected clones with an average of 96% recall and precision, and classified clones with an average of 78% recall and precision.


Introduction
Developers often build software by copying and modifying existing code fragments, referred to as "code clones" or simply "clones" [1,2]. These code fragments can ultimately make the code larger, more complex, and more challenging to maintain. According to some research, between 20% and 59% of all source code is duplicated [3][4][5].
The 'clone-and-own' approach, in which preexisting code fragments are used to fulfill new requirements, is widely employed when a new software system is similar but not identical to an existing system. While this method does allow for the rapid and cost-effective development of software systems, code clones can make a system more difficult to maintain, and they result in the copying of subtle errors across systems [6,7]. Once a bug is found in a clone, it has to be checked everywhere that clone was deployed, and any modifications have to be repeated across all systems. This can increase the costs associated with the development and maintenance of large-scale software systems. When cloned code fragments are not searched for, identified, and possibly refactored, the same bug (and its adverse effects) will be repeated across software systems.
According to S. Bellon et al. [1] and K. Roy et al. [2], code clones can be divided into four types: "an exact copy without modifications (Type 1, T1)"; "structurally and syntactically identical fragments except for variable, type, or function identifiers (Type 2, T2)"; "copied fragments with further modifications in which statements were changed, added, or removed (Type 3, T3)"; and "code fragments that perform the same computation but are implemented by different syntactic variants (Type 4, T4)". Known clone detection techniques are text-based, token-based, tree-based, Program Dependence Graph (PDG) based, or metric-based [8,9]. Existing clone detection techniques perform well when tested against types T1 and T2, but the detection rate is relatively low for types T3 and T4.

The main contributions of this paper are as follows:

• We present a novel approach for classifying clone types, exceptionally well suited to T3 and T4 clones.

• Our technique uses the AST to capture characteristics that reflect common code patterns; ASTs are both scalable and easily generated from code, which reduces preprocessing effort.

• We present a two-pass technique consisting of a clone detection step followed by a clone classification step, which improves the performance of clone-type classification.
The remainder of this paper is organized as follows. Section 2 introduces the AST and its vector representation as background knowledge. Section 3 presents our two-pass clone detection and type classification technique. The experimental setup and results are presented in Section 4, and the results are discussed in detail in Section 5. Section 6 reviews related work on clone detection. Finally, Section 7 concludes the paper and suggests directions for future research.

Background
Typical code clone detection approaches represent source code using different abstractions, such as the AST, Control Flow Graph (CFG), and PDG. In this section, we briefly discuss the AST and the embedding technique used in this paper to transform ASTs into feature vectors.

Abstract Syntax Tree
An AST is a tree representation of a code fragment. Since this tree represents the actual structure of a code fragment, studies have used ASTs for code clone detection [16][17][18]. To extract the syntactic representation of a code fragment, the code is converted into a set of tokens, and the list of tokens is turned into an AST. Figure 1a shows a code fragment containing a method signature, and Figure 1b shows the AST for that method signature. Each node of the AST has a type specifying what it represents. For example, a type could be "MethodDeclaration", representing a method definition, or "FormalParameter", representing a parameter of a method declaration. There are two "FormalParameter" subtrees, each with a "ReferenceType" of "str", that is, String.
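As a concrete illustration of the parse-to-tree pipeline, Python's standard `ast` module exposes the same idea: source text is tokenized and parsed into a tree whose nodes carry a type. The paper's examples are Java, so this is an analogy, not the authors' toolchain:

```python
import ast

# A small function with two typed formal parameters, loosely mirroring
# the Java method signature of Figure 1a.
code = "def concat(a: str, b: str) -> str:\n    return a + b\n"
tree = ast.parse(code)

func = tree.body[0]                      # a FunctionDef node
print(type(func).__name__)               # FunctionDef
print([p.arg for p in func.args.args])   # the two formal parameters
```

Here `FunctionDef` plays the role of "MethodDeclaration" and the `arg` nodes play the role of "FormalParameter" subtrees.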
A typical clone detection approach represents the code structure using different abstractions, such as the AST, CFG, and PDG, and then compares the similarity between two code fragments. Graph-based techniques that generate CFGs or PDGs from code perform better at detecting code clones; however, PDG-based techniques are not scalable due to the complexity of graph isomorphism [19]. K. Chen et al. [19] improve accuracy and scalability simultaneously in detecting clones, but they do not validate their technique for Type 4 clones. M. Gabel et al. [17] improve scalability by mapping the subgraphs in a PDG back to an AST forest and comparing syntactic feature vectors extracted from the AST, but the results are imprecise due to the approximations involved [20]. In this paper, we use the AST as our code representation because it can represent code patterns with significantly lower effort than the CFG and PDG, and it scales to large amounts of code.

Vector Representation of the AST
To facilitate data mining on code, as well as the interpretation of the mining results, syntax trees should be transformed into continuous vectors representing the code. Vector representation of the code enables a much more comprehensive range of analysis. Figure 1c shows the code vectors assigned to the AST of Figure 1b. Since machine learning algorithms take vectors as their inputs, we use vector embedding techniques [21,22] to transform the AST into vectors. The code vectors capture properties of code fragments, such that similar code fragments have similar vectors.
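The "similar code, similar vectors" property can be illustrated with a toy embedding table. The node types below come from Figure 1, but the vector values are invented for illustration; real embeddings are learned, not hand-picked:

```python
import math

# Toy stand-in for an embedding table: each AST node type maps to a
# fixed-size vector. The values here are illustrative only.
embedding = {
    "MethodDeclaration": [0.9, 0.1, 0.0],
    "FormalParameter":   [0.1, 0.8, 0.2],
    "ReferenceType":     [0.2, 0.7, 0.3],   # similar role, similar vector
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Nodes with similar syntactic roles end up closer in vector space
# than unrelated nodes.
print(cosine(embedding["FormalParameter"], embedding["ReferenceType"]))
print(cosine(embedding["FormalParameter"], embedding["MethodDeclaration"]))
```

The first similarity comes out much higher than the second, which is the property the embedding step is meant to provide.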

TBCNN-Based Two-Pass Clone Type Classification
Our technique consists of two steps performed in order: the first detects code clones, and the second classifies their clone types. Clone-type classification is performed only on the code pairs detected as clones in the first step. The two classification passes are as follows:

• The 1st-pass classification (clone detection) determines whether a given piece of code is a clone. This pass detects structural features of a code fragment represented as an AST via TBCNN and applies max pooling to gather information over different parts of the tree. After pooling, the features, aggregated into a fixed-size vector, are fed to a fully-connected hidden layer before the final output layer. For supervised classification, softmax is used as the output layer (see clone detection in Figure 2). The output layer consists of two neurons: clone or non-clone.

• The 2nd-pass classification (clone type classification) classifies the clone code. This pass targets only the code fragments classified as clones during the first pass. It again uses TBCNN and max pooling to detect and aggregate features of a clone code fragment. The features are fully connected to a hidden layer and then fed to the output layer (see clone type classification in Figure 2). In this step, the number of neurons in the output layer equals the number of clone types.
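The two-pass control flow can be sketched as follows. `detect_clone` and `classify_type` are hypothetical stand-ins for the two trained TBCNN models, and the threshold and scores are illustrative, not the paper's values:

```python
CLONE_TYPES = ["T1", "T2", "T3", "T4"]

def detect_clone(pair):
    # 1st pass: two-way decision (clone / non-clone); stubbed here as a
    # threshold on a precomputed similarity score.
    return pair["similarity"] > 0.5

def classify_type(pair):
    # 2nd pass: four-way decision over clone types; stubbed here as the
    # argmax of precomputed per-type scores.
    scores = pair["type_scores"]
    return CLONE_TYPES[scores.index(max(scores))]

def two_pass(pair):
    if not detect_clone(pair):        # pass 1: filter out non-clones
        return "non-clone"
    return classify_type(pair)        # pass 2: runs on clones only

pair = {"similarity": 0.92, "type_scores": [0.1, 0.7, 0.15, 0.05]}
print(two_pass(pair))   # T2
```

The key structural point is that `classify_type` never sees pairs rejected by the first pass.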
In the remainder of Section 3 we describe the details of the AST embedding process associated with the code preprocessing step, as well as the clone detection and clone type classification steps.

Preprocessing
The preprocessing step consists of the two sub-steps denoted with the "Preprocessing" label in Figure 2. The first sub-step transforms code fragments into ASTs to focus on the structure of the code. The second sub-step converts the nodes of each AST into vectors, producing a vector-represented AST. We used ast2vec [22] for this conversion; ast2vec encodes AST nodes much as word2vec encodes words in natural language processing, so AST nodes with similar functions have similar feature vectors.

Encoding Using TBCNN
The proposed technique applies TBCNN [10,13], which uses tree-based convolution, max pooling, and a fixed-size neural layer during both the 1st- and 2nd-pass classifications. As with a CNN's convolution layer, TBCNN has a tree-based convolution layer that explores the entire AST to generate a new tree encoding the structural information of a code fragment. The nodes of each subtree window are combined into one vector via the following formula:

$y = \tanh\left(\sum_{i=1}^{n} W_{\mathrm{conv},i}\, x_i + b_{\mathrm{conv}}\right)$

in which $x_i$ is the vector representation of the $i$-th node in the window, $W_{\mathrm{conv},i}$ is the weight matrix of that node, and $b_{\mathrm{conv}}$ is the bias. The size of $W_{\mathrm{conv},i}$ cannot be fixed in advance because the number of children a parent node may have is not fixed. To tackle this problem, TBCNN uses a "continuous binary tree", a model that treats any subtree as a binary tree regardless of its depth and degree. In this model, the weight matrix $W_{\mathrm{conv},i}$ is defined as a linear combination of three weight matrices, $W_{\mathrm{conv}}^{t}$, $W_{\mathrm{conv}}^{l}$, and $W_{\mathrm{conv}}^{r}$, each multiplied by a coefficient that encodes the position of the actual node (see [13] for details):

$W_{\mathrm{conv},i} = \eta_i^{t} W_{\mathrm{conv}}^{t} + \eta_i^{l} W_{\mathrm{conv}}^{l} + \eta_i^{r} W_{\mathrm{conv}}^{r}$

TBCNN explores the AST through such calculations and, upon reaching a leaf node, creates hypothetical child nodes with vectors set to zero. Generating these virtual child nodes equalizes the shape of the tree at the input and output of the tree-based convolution layer. The data is then down-sampled through the pooling layer. We used max pooling, which takes the maximum value of each dimension over the features produced by the tree-based convolution layer; with max pooling, all features are pooled into one fixed-size vector. This vector is then aggregated through a fully-connected neural network and converted into a one-dimensional vector.
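A minimal numpy sketch of the convolution-and-pooling step is given below, assuming a toy 4-dimensional embedding and a simplified depth-2 window (parent plus children). The position coefficients follow the spirit of the continuous binary tree model; the exact coefficients in [13] also account for window depth:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # embedding dimension (illustrative)

# The three learned weight matrices of the continuous binary tree model
# (randomly initialized here, since this is an untrained sketch).
W_t, W_l, W_r = (rng.standard_normal((DIM, DIM)) for _ in range(3))
b_conv = np.zeros(DIM)

def conv_window(parent, children):
    """One tree-based convolution step over a parent and its children.

    Each node's weight matrix W_conv_i is a linear combination of W_t,
    W_l, W_r with position-dependent coefficients (simplified: the
    parent gets W_t, children interpolate between W_l and W_r)."""
    total = W_t @ parent
    n = len(children)
    for i, child in enumerate(children):
        eta_r = 0.5 if n == 1 else i / (n - 1)   # rightmost child -> W_r
        eta_l = 1.0 - eta_r                      # leftmost child  -> W_l
        total += (eta_l * W_l + eta_r * W_r) @ child
    return np.tanh(total + b_conv)

def max_pool(feature_vectors):
    # Per-dimension maximum over all window outputs: one fixed-size
    # vector regardless of tree size.
    return np.max(np.stack(feature_vectors), axis=0)

parent = rng.standard_normal(DIM)
kids = [rng.standard_normal(DIM) for _ in range(3)]
pooled = max_pool([conv_window(parent, kids), conv_window(parent, kids[:1])])
print(pooled.shape)   # (4,)
```

Because the activation is tanh, every pooled component lies in [-1, 1], and the pooled vector keeps the embedding dimension no matter how many windows the tree produces.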

The 1st-Pass Classification: Clone Detection
Clone detection during the first pass classifies whether a given code fragment is a clone. Looking at the structure of the three code fragments in Figure 3, the code fragments in Figure 3a,b are similar in their functionality, while that of Figure 3c performs a different function. The first two code fragments compute the factorial of a given integer, while the last swaps the positions of two elements in an integer array. Structural similarity is low across all three fragments, as no structurally identical syntax is found. Yet the code pair of Figure 3a,b should be classified as a T4 clone because the two fragments perform the same function (although their structural similarity is low). The remaining two code pairs are not clones, due to their low structural similarity and their different functions. Our technique performs this clone detection prior to clone classification because, without this first pass, non-clone code pairs are sometimes assigned clone types.
Our experimental results confirmed that the accuracy of our clone-type classification varied significantly depending on whether this clone detection pass was performed. The main components of our technique include the AST vectors obtained from AST embedding, tree-based convolution, max pooling, a fully-connected neural network, and an output layer. The output node value (the probability that the code pair given as input is a clone) is between 0 and 1.

The 2nd-Pass Classification: Clone Type Classification
In our system, clone-type classification is performed only on code fragments detected as clones, with this pass using the same TBCNN model used during the 1st pass. However, unlike the clone detection model's output layer with two output nodes (clone or non-clone), the output of the classification model consists of four output nodes corresponding to the four possible clone types (lower right of Figure 2). Each output node has a value between 0 and 1, with the specific value indicating the probability that the assessed code is of a particular clone type.

Experimental Evaluation
We conducted experiments to evaluate the performance of our technique. This section describes the dataset used for evaluation and the parameter values set in our training and experimentation models, as well as the experimental results.

Dataset
We used BigCloneBench for training and testing our model. BigCloneBench [15] is a real-world benchmark that contains over 6,000,000 tagged clone pairs across 43 functionalities, as well as 260,000 false clone pairs. BigCloneBench specifies each clone pair as a triple: a pair of similar code fragments and their clone type. It contains both intra-project and inter-project clones of all four clone types, all of which have been manually evaluated. Based on similarity ratio, BigCloneBench divides T3 and T4 clones into four categories: Very-Strongly Type 3 (VST3), Strongly Type 3 (ST3), Moderately Type 3 (MT3), and Weakly Type 3+4 (WT3/4).
BigCloneBench's T1 clones are syntactically identical code fragments, excluding differences in white space, layout, and comments, identified after applying T1 normalization. T2 clones are syntactically identical code fragments, excluding differences in identifier names and literal values in addition to the Type-1 differences, identified after applying T2 normalization. T1 normalization refers to annotation removal and pretty-printing, while T2 normalization adds the normalization of identifier names and literal values. BigCloneBench's ST3, MT3, and WT3/4 clones were classified by measuring similarity as the minimum ratio of lines or tokens the code fragments share after the T1 and T2 normalizations were applied. As there is no consensus on the classification criteria for T3 and T4 clones, BigCloneBench defines a similarity range for each clone type to separate clones of these types. ST3 clones are those with a similarity in the range of 70% (inclusive) to 90% (exclusive), and most syntactic clone detection tools use this criterion. MT3 clones are those with a syntactic similarity between 50% and 70%; conventional syntactic clone detection tools classify MT3 clones as either T3 or T4 clones. Finally, WT3/4 clones are those with a syntactic similarity of less than 50%, and most clone detection tools do not detect this type.
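These similarity bands can be expressed as a small helper. The boundaries below restate the ranges above, with the >= 90% band corresponding to BigCloneBench's VST3 category:

```python
def bcb_category(similarity):
    """Map a syntactic similarity ratio (0.0-1.0) to a BigCloneBench
    T3/T4 band: VST3 [0.9, 1.0], ST3 [0.7, 0.9), MT3 [0.5, 0.7),
    WT3/4 [0.0, 0.5)."""
    if similarity >= 0.9:
        return "VST3"
    if similarity >= 0.7:
        return "ST3"
    if similarity >= 0.5:
        return "MT3"
    return "WT3/4"

print(bcb_category(0.75))   # ST3
print(bcb_category(0.40))   # WT3/4
```

Note that only the band boundaries distinguish ST3 from MT3, a point that matters for the misclassification analysis in the Discussion.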
To compare the performance of our method with existing tools, we used the BigCloneBench dataset (Table 1). The selected dataset consists of 59,618 individual code fragments, paired to form 97,535 code pairs, including 20,000 false clone pairs. The clone pairs comprise 15,555 T1 clones, 3663 T2 clones, 18,317 ST3 clones, 20,000 MT3 clones, and 20,000 WT3/4 clones. The number of clone pairs of each clone type is not balanced, however; in particular, the number of T2 clone pairs is significantly smaller than that of the other types. Because imbalanced datasets bias training toward the classes with more instances, we sampled from the BigCloneBench dataset to obtain an appropriate dataset for training and testing our technique. To cope with this class imbalance, we used undersampling, which randomly removes instances from over-represented classes until the class sizes are balanced. We performed undersampling based on the number of T2 clone pairs to build the training and test datasets. For the first-pass clone detector in particular, the total number of clone pairs and the number of false clone pairs were adjusted to be similar. Table 2 shows the training and test datasets that resulted from our undersampling; we performed the undersampling twice, once to generate a training dataset and again to generate a test dataset.

Dataset  | T1   | T2   | ST3  | MT3  | WT3/4 | False Clone Pairs
Training | 1099 | 1099 | 1099 | 1098 | 1099  | 5492
Test     | 2564 | 2564 | 2564 | 2563 | 2564  | 12,823
Total    | 3663 | 3663 | 3663 | 3661 | 3663  | 18,315
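The undersampling step can be sketched as follows. The class counts come from the totals above, but the sampler itself is a generic random undersampler, not the authors' exact script:

```python
import random

def undersample(pairs_by_class, seed=42):
    """Randomly drop instances from larger classes until every class
    matches the size of the smallest one (T2 here)."""
    rng = random.Random(seed)
    target = min(len(v) for v in pairs_by_class.values())
    return {cls: rng.sample(v, target) for cls, v in pairs_by_class.items()}

# Pair counts from the BigCloneBench selection (pair IDs are placeholders).
dataset = {
    "T1":  list(range(15_555)),
    "T2":  list(range(3_663)),
    "ST3": list(range(18_317)),
}
balanced = undersample(dataset)
print({cls: len(v) for cls, v in balanced.items()})
# every class is reduced to the T2 count, 3663
```

Running this twice with different seeds would yield disjoint-by-chance training and test samples, mirroring the two sampling rounds described above.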

Parameter Settings
We tuned the parameters and chose the set that achieved the highest F1 score on the dataset. Parameter tuning was applied to the second pass's classification model. To determine the optimal parameters, tuning proceeded in order of (1) AST embedding, (2) fully-connected neural network, and (3) both AST embedding and fully-connected neural network. We identified the optimal parameter settings for our technique (Table 3). We set the clone classification model parameters higher for the second pass than the first, considering the difficulty of the clone-type classification performed during the second pass.

Experimental Results
Table 4 summarizes the recall and precision of our technique on the BigCloneBench dataset. Recall refers to the ratio of clone pairs or clone types within a dataset that our technique detects or classifies, while precision is the ratio of clone pairs or clone types reported by our technique that are true clone pairs or true clone types (not false positives). Overall, our technique achieved 95% precision in the first pass for clone detection. More specifically, it achieved 94% recall and 96% precision for true clones, and 97% recall and 95% precision for false clones. In the second pass for clone-type classification, our technique achieved 78% recall and precision on average. It achieved high precision (94%) for T1 clones and low precision (67%) for MT3 clones, while recall was highest for T2 clones (93%) and lowest for MT3 clones (58%). In our second experiment, we classified clone types with and without the first pass, building a model with the same architecture as our technique but without the separated passes. Table 5 shows the comparative results: overall, the proposed two-pass technique achieved better recall and precision than the model without the first pass, and the classification of T2 clones in particular improved significantly (see the F1 scores in Table 5).
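Recall, precision, and F1 as used in these tables can be computed directly from raw counts. The counts below are invented for illustration and are not the paper's data:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP),
    recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion counts for the clone class.
p, r, f1 = precision_recall_f1(tp=940, fp=40, fn=60)
print(round(p, 2), round(r, 2), round(f1, 2))   # 0.96 0.94 0.95
```

With these definitions, "96% recall and precision" means both ratios land at 0.96 over the test pairs.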
Table 6 compares the recall and precision of our technique against other approaches. Overall, our technique achieved 96% recall and precision for clone detection, a result significantly better than those of the other approaches, except DeepSim, against which our technique performed slightly worse. These results demonstrate that our technique can detect clones at a level similar to DeepSim, despite its much simpler preprocessing. For clone classification, our technique achieved 78% recall and precision. In our final test, we investigated misclassified clones (Table 7). Many WT3/4 clones were incorrectly classified as false clones; the reason is that the similarity ratio of WT3/4 clones is less than 50%. Clones tend to be incorrectly classified as adjacent types, e.g., ST3 clones are misclassified as T2 or MT3 clones.

Discussion
We manually converted clone pairs to ASTs to analyze the reason that clones are misclassified as adjacent clone-types. We identified a common feature in clones classified as T2, ST3, and MT3 clones. For clones classified as T2, more than 95% of original code fragments were reused by cloned code fragments without any modifications. Clones classified as ST3 included several identical subtrees at the same level. For clones classified as MT3, there was a difference in the level at which the same subtree appeared, unlike clone pairs classified as ST3 clones. The first three subsections discuss each of the misclassified clones, and the final subsection discusses the threats that can affect the validity of the experiment results.

Clone Pairs Misclassified into T2 Clones
As reported in Table 7, most clone pairs misclassified as T2 clones were either T1 or ST3 clones. As for T1 clones misclassified as T2 clones, their ASTs were entirely identical; this is because a T2 clone modifies only the identifiers or literals of an original code fragment. In the case of the ST3 clones misclassified as T2 clones, more than 95% of the original code fragment was completely identical. In the ASTs of these clones, some nodes were rearranged or a subtree grew by one level due to the newly added code. Our technique tended to classify clone pairs as T2 clones when the difference between the two ASTs was slight, i.e., when the similarity between the codes was high. Figure 4 presents an example clone pair that best illustrates one of these cases. The clone pair in Figure 4a was classified as an ST3 clone in BigCloneBench but as a T2 clone by our technique. The two code fragments are similar but use different object names: the code on the left uses names such as srcChannel and dstChannel, while the code on the right uses in and out instead (see the fourth line, in italics). The order of the arguments in a method call also differs between the two fragments, i.e., "srcChannel, 0, srcChannel.size()" versus "0, in.size(), out". The two method calls are similar but have a semantic difference, making this clone pair an ST3 clone. In the AST, this semantic difference is represented by two identical subtrees that differ only in their order (see the boxed subtrees of Figure 4b). Our technique treated this difference as a small change, so the pair was classified as a T2 clone. The clone pair in Figure 5a was likewise classified as an ST3 clone by BigCloneBench but as a T2 clone by our technique; the two fragments differ in their method signatures (see the lines in italics).
According to its ASTs, the signature difference between the two code fragments is expressed as the addition or deletion of subtrees such as the 'set' subtree of Figure 5b. Our technique did not detect this slight distinction. Changes to a small number of nodes in AST have a small impact on type classification because they cause slight changes in the feature vectors used as inputs for training and testing in the model. Clones of the type in Figures 4 and 5 are close to T2 clones because they are highly similar codes that are functionally comparable. In particular, the code pair of Figure 5 can be refactored in the same way as T1 and T2 clones because all code lines, except for the method signature, are identical.

Clone Pairs Misclassified into ST3 Clones
The total number of MT3 clones in BigCloneBench was 20,000, of which 8771 were misclassified (as described in Tables 1 and 7). Almost 78% of these misclassified MT3 clones were classified as ST3 clones. Examination of a sample code pair (Figure 6a) reveals that the difference between the two code fragments is that lines 2 and 3 of the right-side code are newly added, and line 5 of the left-side code has been changed into line 7 of the right-side code (see the lines in italics). The ASTs of this clone pair are identical except for the boxed portion of Figure 6b, a piece of the AST of the right-side code fragment. In most cases where MT3 clones were misclassified as ST3 clones, 70-80% of the AST nodes had identical structures. However, the classification performed by the proposed technique can be regarded as reasonable, recalling that the ST3 clones (similarity of 70-90%) and MT3 clones (similarity of 50-70%) in BigCloneBench are distinguished only by their degree of similarity. Had we not divided T3 clones into ST3 and MT3 categories, our technique would be regarded as performing well on T3 clones.

Clone Pairs Misclassified into MT3 Clones
The clones misclassified as MT3 clones were ST3 and WT3/4 clones. Analysis of these clones revealed that their ASTs had identical subtrees, but some of the subtrees were located at different levels. The sample clone pair in Figure 7a was classified as an ST3 clone in BigCloneBench but as an MT3 clone by our technique. This clone pair is identical in its statements, the only difference being the use of a 'try-catch' statement, i.e., lines 4-9. When we investigated the ASTs of this clone pair, we found that subtrees located at level 1 in the AST of the left-side code fragment were located at level 2 in the AST of the right-side code fragment due to the 'TryStatement' node. We tested whether the number of statements within a try-catch statement affected clone-type classification, finding that only the level of the subtrees in the AST had an effect, regardless of the number of statements within the try-catch statement. This demonstrates that the subtree's level is an influential feature for distinguishing MT3 clones from others in our clone-type classification technique.
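The one-level shift described above can be demonstrated with Python's stdlib `ast`. The paper's examples are Java try-catch blocks; Python's try/except produces the same effect of pushing wrapped statements one level deeper:

```python
import ast

def depth_of_calls(code):
    """Return the depth (relative to the module root) of every Call
    node in the parsed code."""
    depths = []
    def walk(node, depth):
        if isinstance(node, ast.Call):
            depths.append(depth)
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)
    walk(ast.parse(code), 0)
    return depths

plain = "f()\n"
wrapped = "try:\n    f()\nexcept Exception:\n    pass\n"

# Wrapping the same call in try/except pushes its subtree one level
# deeper, even though the statement itself is unchanged.
print(depth_of_calls(plain), depth_of_calls(wrapped))
```

The call in the wrapped version sits exactly one level deeper than in the plain version, which is the kind of structural difference the 2nd-pass model picks up.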

Threats to Validity
First, we did not apply cross-validation techniques to train and evaluate our model. Our model thus might overfit the training data and fail to work accurately on unseen data. Second, we performed undersampling to overcome the imbalance problem, but we did not compare the results with those of oversampling. We did not perform oversampling because we could not guarantee that we could apply the same classification criteria as originally used in the BigCloneBench dataset. Nevertheless, this remains a threat to the internal validity of the results.

Related Work
There are, at present, six types of clone detection techniques: text-based, token-based, tree-based, PDG-based, metric-based, and machine learning-based. A text-based clone detection technique treats the code as text and detects clones through direct comparison of code fragments. For example, Dup, a text-based and line-based clone detection tool, uses a parameterized matching algorithm [3], and J. Johnson [25] uses a Karp-Rabin string matching approach. These text-based techniques can find clones simply and quickly but cannot detect clones that are not identical or nearly identical to the original code.
Token-based clone detection transforms the entire code into a sequence of tokens and then compares the sequences. CCFinder [26], for example, lexes, parses, and transforms identifiers related to variables, constants, and user-defined types into a sequence of tokens. CCFinder detects clones, even if only identifiers have been changed. CCFinder cannot detect clones, however, where the order of sentences has been modified or new code has been inserted. To overcome this limitation, a different token-based clone detection tool, CP-Miner [7], uses frequent pattern mining. CP-Miner can detect one or two code lines that have been inserted, deleted, or modified, but it is still unable to detect cloned codes where more extensive code modifications have been made.
Tree-based clone detection techniques seek to overcome the limitations of text-based techniques by converting code to AST and then using tree matching algorithms to detect clones. CloneDR [27] transforms AST to hash values via hash functions and then detects clones by comparing subtrees with the same hash value. V. Wahler et al. [28] have proposed a technique for clone detection using frequent pattern mining techniques based on AST expressed in XML. W. Evans and C. Fraser [29] have suggested a technique to detect clones using AST's subtree where a particular pattern appears. While researchers have worked hard to refine these tree-based clone detection techniques, clone detection remains complicated in situations where two code fragments have different orders.
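The CloneDR-style idea of hashing subtrees and bucketing equal hashes can be sketched as follows. This is a toy structural hash over Python ASTs, not CloneDR's actual AST or hash scheme:

```python
import ast
from collections import defaultdict

def structure_hash(node):
    """Hash a subtree by its node types only, so identifier and literal
    differences (T1/T2-style edits) do not change the hash."""
    children = tuple(structure_hash(c) for c in ast.iter_child_nodes(node))
    return hash((type(node).__name__, children))

def clone_buckets(code):
    """Group function definitions whose subtrees hash identically."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            buckets[structure_hash(node)].append(node.name)
    return [names for names in buckets.values() if len(names) > 1]

code = """
def f(x): return x + 1
def g(y): return y + 2
def h(z): return z * z - 1
"""
print(clone_buckets(code))   # f and g share structure; h differs
```

Only `f` and `g` land in the same bucket: their trees differ solely in identifiers and literal values, which this structural hash deliberately ignores, while `h` has a different expression shape.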
An alternative class of techniques for detecting clones, including clones with differently ordered, newly inserted, or removed statements, are the PDG-based techniques. These detect clones by matching PDGs that contain the data flow and control flow information of a code fragment. The most well-known PDG-based clone detection technique is PDG-DUP, proposed by R. Komondoor and S. Horwitz [30]. PDG-DUP finds identical PDG subgraphs using program slicing. J. Krinke [31] uses k-limited path matching to identify maximally similar subgraphs in PDGs. While PDG-based clone detection techniques overcome the limitations of the text-based, token-based, and tree-based techniques, they do not scale well to large-scale systems, as matching PDG subgraphs is computationally expensive.
Metric-based clone detection techniques use various metric values derived from code for clone detection. Metrics are typically calculated from classes, methods, and statements using fingerprinting algorithms. J. Mayrand et al. [32] transformed each function of the source code into an Intermediate Representation Language (IRL) to calculate metric values from the names, layout, expressions, and control flow of functions. A pair of functions with similar metric values is detected as a clone. Another metric-based clone detection technique, proposed by K. Kontogiannis et al. [4], computes metric values that capture the data- and control-flow properties of source code. The technique annotates the corresponding nodes of an AST constructed from the source code with these metric values and detects clones by comparing the distances between the annotated metric values of code fragment pairs using a dynamic programming approach. Since metric-based techniques do not utilize the code itself, they may be less precise than others, as two different code fragments with similar metric values may be identified as clones.
Recently, researchers have focused on extracting code features and detecting clones using machine learning [18,20,23,24,[33][34][35][36]. CCLearner [33] combines token analysis with deep learning: tokens are used to generate inputs for a deep learning model, which then determines whether the given inputs are clones. DeepSim [20] encodes both control flows and data flows into a semantic matrix and uses a deep learning method to detect clones based on that matrix. N. Bui et al. [34] take an AST-based approach, using a bilateral tree-based convolutional neural network for automated program classification. RtvNN [23], CDLH [24], and the approach of D. Perez et al. [35] are AST-based techniques that transform ASTs into feature vectors for deep learning models. Whereas RtvNN uses recursive neural networks to detect both textual and functional similarity, CDLH and D. Perez et al. apply LSTMs to ASTs to detect functional clones. C. Fang et al. [18] use both ASTs and CFGs to extract the syntactic and semantic information of code and train a DNN classifier on the fused syntactic and semantic feature vectors. CCLearner, RtvNN, and CDLH perform tolerably well in experiments on BigCloneBench but perform particularly poorly at detecting T3- and T4-type clones. DeepSim performs well at detecting all clone types, but its preprocessing procedure is unwieldy. The experiments of D. Perez et al. concern clone detection between two different programming languages. A. Sheneamer et al. [36] use both ASTs and PDGs to extract the semantic information of code and compare the T3- and T4-type clone detection performance of machine learning algorithms such as Rotation Forest, Random Forest, and XGBoost; they conclude that XGBoost is an excellent classification algorithm for detecting T3- and T4-type clones.

Conclusions and Future Work
This paper proposed a TBCNN-based two-pass clone-type classification technique that detects clones and then classifies them by clone-type. We confirmed that the proposed technique performs well in terms of recall and precision on the first, clone detection pass, which uses only simple preprocessing. Experimentation revealed that the second pass classified clone-types with 78% precision and recall.
This work makes three contributions to the field of machine-learning-based clone detection and classification: (1) The proposed technique performs clone detection with high accuracy using only a straightforward preprocessing process. DeepSim [20], which performs similarly well to our technique's clone detection phase, encodes the data-flow and control-flow information of a program into a compact semantic feature matrix to capture the code's semantics, which makes DeepSim's preprocessing unwieldy. Our technique, in contrast, uses only ASTs, which can be generated easily from code. Moreover, our technique encodes each AST's nodes into a vector via AST2vec, a much simpler and more direct process than that applied by DeepSim. (2) Unlike conventional machine learning techniques, our technique both detects and classifies clones with relatively high accuracy. Clone-type information can be used to provide automated refactoring support in development environments. Specifically, T1 and T2 clones can be easily refactored, without further analysis, by consolidating the clones into a single code fragment and invoking that fragment where necessary. As for T3 and T4 clones, developers can decide whether or not to refactor them by considering the context in which the clones appear. (3) Our two-pass technique can be divided into two phases, each of which can be used alone or together as needed. Since the technique performs the first classification and the second classification sequentially, clone detection results can be obtained without the associated classification data.
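The node-encoding step can be pictured with a hypothetical sketch (this is not the authors' AST2vec; the index scheme and histogram summary here are illustrative assumptions): every AST node type is assigned a fixed index, and a fragment is summarized by its normalized node-type histogram, the kind of fixed-length vector a TBCNN-style model could consume.

```python
import ast

# Hypothetical stand-in for AST2vec (NOT the paper's implementation):
# each AST node type gets a fixed index, and a code fragment becomes
# the normalized histogram of its node types.
NODE_TYPES = sorted(name for name in dir(ast)
                    if isinstance(getattr(ast, name), type))
INDEX = {name: i for i, name in enumerate(NODE_TYPES)}

def encode(source):
    """Encode a fragment as a normalized node-type frequency vector."""
    vec = [0.0] * len(INDEX)
    for node in ast.walk(ast.parse(source)):
        vec[INDEX[type(node).__name__]] += 1.0
    total = sum(vec)
    return [v / total for v in vec]

# Structurally identical fragments map to the same vector even though
# their identifiers and literals differ (a T2-style change).
v1 = encode("x = 1")
v2 = encode("y = 2")
```

Because such a histogram discards node order and identity, it also illustrates the limitation discussed below: subtle differences between fragments are invisible to the representation.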
Our technique does have some deficiencies, specifically its lower recall on MT3 clones than on other clone-types, and the fact that it does not consider small changes in the AST as features for clone-type classification. We can improve the current code-vector representation so that it captures the subtle differences between code fragments. Alternatively, additional code information, such as control flows and data flows, could be incorporated to address the fact that an AST provides only structural information. Our future work includes these directions.

Acknowledgments:
The authors would like to thank the Writing Center at Jeonbuk National University for its skilled proofreading service.

Conflicts of Interest:
The authors declare no conflict of interest.