Identifying High-Level Concept Clones in Software Programs Using Method’s Descriptive Documentation

Abstract: Software clones are code fragments with similar or nearly similar functionality or structures. These clones are introduced in a project either accidentally or deliberately during the software development or maintenance process. The presence of clones poses a significant threat to the maintenance of software systems and is on the top of the list of code smell types. Clones can be simple (fine-grained) or high-level (coarse-grained), depending on the chosen granularity of code for the clone detection. Simple clones are generally viewed at the lines/statements level, whereas high-level clones have granularity as a block, method, class, or file. High-level clones are said to be composed of multiple simple clones. This study aims to detect high-level conceptual code clones (having granularity as Java methods) in Java-based projects, which is extendable to the projects developed in other languages as well. Conceptual code clones are the ones implementing a similar higher-level abstraction such as an Abstract Data Type (ADT) list. Based on the assumption that "similar documentation implies similar methods", the proposed mechanism uses "documentation" associated with methods to identify method-level concept clones. As complete documentation does not contribute to the method's semantics, we extracted only the description part of the method's documentation, which led to two benefits: increased efficiency and reduced text corpus size. Further, we used Latent Semantic Indexing (LSI) with different combinations of weight and similarity measures to identify similar descriptions in the text corpus. To show the efficacy of the proposed approach, we validated it using three Java open source systems of sufficient length. The findings suggest that the proposed mechanism can detect methods implementing similar high-level concepts with improved recall values.


Introduction
Cloning of program fragments in any form has been identified both as a boon and a bane, depending upon the purpose for which it is used [1][2][3][4][5][6][7]. Writing code from scratch often hinders code understandability and evolvability. In such situations, reusability achieved via the use of already existing libraries, design patterns, frameworks, and templates helps structure code well. On the other hand, code clones formed as a result of the programmer's laziness can turn out to be harmful in the long term. These clones adversely affect software maintenance, as even the smallest change made in a code fragment may have to be propagated to all the clones of that fragment; failing to do so may lead to inconsistent code, which may result in software bugs [8][9][10][11]. Clones can also arise out of language limitations or external business forces.
Depending on the extent of similarity between code clones, they can be classified as exact or type-1 clones. They are the result of the copy-paste activity of programmers when they find a piece of code providing an exact solution to their problem. To suit their needs, programmers may further modify the pasted code, leading to Type-2 or parameterized [...]

[...] to further examine the clones, which have already been detected by some clone detection tools. The findings suggest that clone detection and LSI complement each other and give better quality results [18,55]. Table 1 summarizes some of the relevant studies related to the use of information retrieval techniques in clone detection.

Table 1. Relevant studies using information retrieval techniques in clone detection.

| Study | Year | Technique | Text used | Remarks |
|---|---|---|---|---|
| Bauer et al. [18] | 2016 | LSI + L(TF) + CS | Identifiers | LSI improves the results obtained by using a clone detection tool (here Conqat). |
| Ghosh and Kuttal [58] | 2018 | LDA + SWR | Comments | Achieved better precision and recall compared to PDG-based techniques. |
| Reddivari and Khan [59] | 2019 | LDA + SWR + ST | Source code after identifier splitting | Achieved better results when compared to CCFinder [60] and CloneDr [61]. |
| Abid [62] | 2019 | Lucene (IR engine) | Comments, keywords from function names, and API names used in function | Based on the user query, the technique finds similar functions that implement identical features using the extracted text elements. |
| Kuttal and Ghosh [63] | 2020 | LDA + SWR + TF | Comments excluding task and copyright comments | Emphasized the usage of comments for clone detection. A hybrid of LDA (on comments) and a PDG-based approach (on source code) is presented with improved accuracy metric values. |
| Xie et al. [53] | 2020 | TF-IDF + Cosine | Source code text converted into tokens | The technique combines TF-IDF and deep learning-based (word embeddings) approaches to identify semantic clones. |
| Fu et al. [54] | 2020 | CBOW + SG | Source code text converted into tokens | The technique combines SG, the CBOW word embedding model, and ensemble learning to find semantic clones. It is shown to perform well compared to other token-based and deep learning-based clone detectors. |

Kuhn et al. [56] also illustrated the use of LSI for topic modeling of source code vocabulary. Later, Maskeri et al. [64] used a more efficient approach called Latent Dirichlet Allocation (LDA) to identify similar topics. Ghosh and Kuttal applied LDA on a text corpus formed from a combination of comments and associated source code to identify semantic clones [58]. It gave better precision and recall values as compared to PDG-based approaches. In their recent study, they emphasized the usage of source code comments to detect clones. They performed clone detection by applying a PDG-based approach to source code and LDA to source code comments (excluding task and copyright comments) to find file-level clones [63]. Their study empirically shows the relevance of good quality comments in scaling such IR-based techniques to larger projects and in detecting inter-project clones. Reddivari and Khan [59] also developed CloneTM, which uses LDA to identify code clones (up to type-3) using source code (with identifier splitting, stop-word removal, and stemming). CloneTM gave better results when compared to CCFinder [60] and CloneDr [61]. Abid [62] developed the Feature-driven API usage-based Code Examples Recommender (FACER) to detect and report similar functions based on the user's query. The tool uses comments, keywords from the function name, API class names, and API method names used in functions to find similar functions using an IR-based engine called Lucene.

At the data level, data clones (replicas) can help recover lost data at runtime for quality assurance of data analysis tasks in big data analysis [65].
The literature also provides other similar techniques for clone detection. Grant and Cordy [57] replaced LSI with a feature extraction technique, namely, Independent Component Analysis (ICA), and used tokens in the program to build a vector space model instead of terms in the comments.
Several previously published studies [12,48,49,58,63] apply IR-based techniques using source code comments (including other programming constructs) to identify clones, thereby resulting in a bulky text corpus. This study, however, omits all the unnecessary comments and uses only comments present in the method's documentation. Blending suitable NLP techniques with descriptive documentation only (excluding notes comment, explanatory comment, contextual comment, evolutionary comment, and conditional comment) makes the proposed mechanism lightweight and efficient.
In this study, high-level concept clones at method-level granularity are detected by extracting descriptive documentation of each method to form the text corpus. The descriptive part of documentation generally describes the semantics of code, thus fulfilling the criteria for detection of concept clones as stated above [13]. Latent Semantic Indexing (LSI) coupled with Part-of-Speech (POS) tagging strategy and lemmatization is then applied on the formed text corpus to identify high-level concept clones. After obtaining similar descriptions, the methods corresponding to these descriptions are regarded as concept clones.
The following points summarize the procedural flow and the factors contributing to the efficacy of the proposed approach:

1. The text corpus is formed only from the description of methods (other documentation details are removed). This is because the semantics of a particular method are mostly contained in its description. No other comments attached to methods are used. Furthermore, no additional text element of the source code is used for text corpus preparation. All this reduces the size of the text corpus and makes the comparison process efficient.

2. A POS tagging strategy has been employed to assign higher weights to certain parts of speech (generally nouns and verbs) and to remove stop words by filtering out all terms not tagged as nouns, adverbs, verbs, or adjectives. The use of POS tagging also reduces the size of the text corpus and increases efficiency by combining two tasks in one step. Other works carry out stop-word removal as a separate task.

3. Different combinations of weight and similarity measures have been examined for clone detection. These combinations are compared based on several parameters to enable the end user to choose a suitable variant. In Section 6.3, each combination is ranked based on its efficiency in finding contextual clones.

4. The proposed mechanism is able to detect methods implementing similar high-level concepts with good accuracy values. The mechanism is empirically validated using 3 open source Java projects (JGrapht, Collections, and JHotDraw) as subject systems.
The organization of the rest of the paper is as follows. Section 2 explains the relevant terminologies used in this study. Section 3 lays down the research questions relevant to this study. Section 4 lists the key features of documentation in java projects. Section 5 delineates the clone detection process. Section 6 outlines the empirical validation process used and addresses the listed research questions. Section 7 outlines threats to the validity of results derived in this work. The conclusion is presented in Section 8 of this exposition.

Terminology
This section briefly outlines the relevant terms used in this study.

High-Level Concept Clones
High-level concept clones [12,13] are defined as implementations of a similar higher-level abstraction like an ADT list. They arise when developers try to re-invent the wheel, i.e., implement the same or a similar concept using different algorithms. These clones address the same or a similar problem while having very dissimilar structures. Their detection depends on the user's understanding of the system. Using the code's internal documentation and program semantics can greatly improve the detection accuracy for this type of clone.
In this research work, we have used descriptive documentation associated with methods (as it contributes to the method's semantics) to detect high-level concept clones having granularity as java methods.

Latent Semantic Indexing (LSI)
LSI [51] is an information retrieval technique used for automating the process of semantic understanding of sentences written in a particular language. It uses Singular Value Decomposition (SVD) of the term-document matrix developed using the Vector Space Model (VSM) to find the relationship among documents.
To perform LSI on a given set of documents, a term-document matrix is built by extracting terms from these documents. Every entry of the matrix indicates the weight assigned to a particular term in the corresponding document. This term-document matrix is decomposed using SVD. It decomposes the matrix (say A, with dimension (m,n)) into three matrices as follows:

$$A = U S V^{T} \tag{1}$$

where the dimension of U is (m,m), S (a diagonal matrix) is (m,n), and V is (n,n). A value K is chosen to form the matrix $A_k$ (the kth approximation of matrix A). $A_k$ is derived as

$$A_k = U_k S_k V_k^{T}$$

where $U_k$ and $V_k$ keep only the first K columns of U and V, and $S_k$ keeps the K largest singular values of S. After $A_k$ is successfully computed, given a document, its similarity with other documents is calculated. The measure of similarity can be cosine, Jaccard, Dice, etc. The result of applying LSI is a list of similar documents. A schematic representation of LSI is given in Figure 1.
LSI usually finds its usage in applications like search engines to extract the most relevant documents for a given query. It can also be used for document clustering in text analysis, recommendation systems, building user profiles, etc.
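The decomposition and rank-K truncation described above can be illustrated with a minimal sketch. It assumes NumPy is available; the function names `lsi_document_vectors` and `cosine`, and the toy matrix, are hypothetical and are not taken from the study's implementation.

```python
import numpy as np

def lsi_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    a = np.asarray(term_doc, dtype=float)            # shape: (m terms, n documents)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Keep the k largest singular values; each column of diag(s_k) @ vt_k
    # is the reduced representation of one document.
    return (np.diag(s[:k]) @ vt[:k, :]).T            # shape: (n documents, k)

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy term-document matrix: rows are terms, columns are method descriptions.
# Documents 0 and 1 use the same terms; document 2 shares none with them.
A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]

docs = lsi_document_vectors(A, k=2)
sim_01 = cosine(docs[0], docs[1])   # identical descriptions
sim_02 = cosine(docs[0], docs[2])   # unrelated descriptions
```

In this toy case the two identical descriptions remain maximally similar after truncation, while the unrelated one stays orthogonal; on a real corpus the choice of K controls how much term co-occurrence structure is retained.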

Part-of-Speech (POS) Tagging
POS tagging [66] is defined as the process of mapping words in a text to their corresponding Part-of-Speech tag. The tag is assigned based on both the word's definition and its context (relationship with other words in the sentence). For example, consider two sentences, say "S1 = Sameera likes to eat ice cream" and "S2 = There are many varieties of ice cream like butterscotch, Kesar-Pista, etc.". Look at the word "like" in both sentences. The tag assigned to "like" changes according to the sentence: in S1 "like" is a verb, while in S2 "like" is a preposition.
POS tagging uses a predefined tag-set, such as the Penn Treebank tagset; transition probabilities; and word distribution. The tag-set contains a list of tags to be assigned to different words in the corpus. Transition probabilities define the probabilities of transition from one tag to another in a sentence. The word distribution denotes the probability that a particular word may be assigned to a specific tag.

[Figure 1. Schematic representation of LSI: SVD is performed with K = p (say); the term-document matrix is broken down into three matrices, and new matrices are derived by extracting only certain columns and rows from them.]

Applications of POS tagging include named entity recognition (recognition of the longest chain of proper nouns), sentiment analysis, word sense disambiguation, etc.
While comparing two sentences, it is sometimes required that different parts of speech be weighted according to the semantic information they convey in a particular context, which can be done using POS tagging. In this paper, we have used POS tagging to give extra weight to nouns and verbs.
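The combined filtering and weighting role of POS tags can be sketched as follows. This is only an illustration under stated assumptions: the tokens are taken to be already tagged with Penn Treebank tags by some tagger, the boost factor of 2.0 for nouns and verbs is an arbitrary illustrative value (the weight actually used is not stated here), and `weighted_terms` is a hypothetical helper name.

```python
from collections import Counter

# Penn Treebank tag prefixes: NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs.
KEPT_PREFIXES = ("NN", "VB", "JJ", "RB")
BOOSTED_PREFIXES = ("NN", "VB")

def weighted_terms(tagged_tokens, boost=2.0):
    """Drop words outside the kept tags and weight nouns/verbs more heavily.

    `boost` is an assumed, illustrative factor, not a value from the study.
    """
    weights = Counter()
    for word, tag in tagged_tokens:
        if not tag.startswith(KEPT_PREFIXES):
            continue  # acts as stop-word removal (determiners, prepositions, ...)
        weights[word.lower()] += boost if tag.startswith(BOOSTED_PREFIXES) else 1.0
    return dict(weights)

# A pre-tagged method description (tags supplied by hand for the example).
tagged = [("Returns", "VBZ"), ("the", "DT"), ("shortest", "JJS"),
          ("path", "NN"), ("between", "IN"), ("two", "CD"), ("vertices", "NNS")]
terms = weighted_terms(tagged)
```

Because filtering and weighting both depend only on the tag, a single pass over the tagged tokens handles stop-word removal and selective weighting together.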
In prior studies, D. Falessi et al. [67] used this technique to find equivalent requirements. By comparing different combinations of Natural Language Processing (NLP) techniques required for building algebraic model, doing term extraction, assigning weights to each term in documents, and measuring similarity among documents, they concluded that the best outcome is observed for a combination of VSM, raw weights, extraction of relevant terms (verbs and nouns only), and cosine similarity measure. They also did an empirical validation of their results using a case study [68]. Johan [69,70] used similar NLP techniques to find related or duplicate requirements coming from different stakeholders to avoid reworking on a similar requirement.

Lemmatization
Lemmatization refers to the process of reducing different forms of a word to its root form [71]. For example, "builds", "building", and "built" are all reduced to the lemma "build".
The purpose of lemmatization is to gain homogeneity among different documents, although another mechanism, stemming, can also be used. Stemming works by removing suffixes from different forms of a word, e.g., "studies" becomes "studi", "studying" becomes "study", etc. As is evident from this example, lemmatization reduces words to a word present in the dictionary, while stemming removes suffixes without caring whether the reduced form is available in the dictionary or not.
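The contrast between the two mechanisms can be made concrete with a toy sketch. Both functions are deliberately simplistic and hypothetical: `naive_stem` strips a fixed list of suffixes and is not a real stemming algorithm, and `lemmatize` is a lookup in a tiny hand-written dictionary rather than a real lemmatizer.

```python
# Suffixes are tried in this order; purely illustrative, not a real stemmer.
SUFFIXES = ("ing", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, ignoring whether the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma dictionary (an assumption made for the example).
LEMMAS = {
    "studies": "study", "studying": "study",
    "builds": "build", "building": "build", "built": "build",
}

def lemmatize(word):
    """Map a word form to its dictionary lemma, or return it unchanged."""
    return LEMMAS.get(word, word)

stems = [naive_stem(w) for w in ("studies", "studying", "built")]
lemmas = [lemmatize(w) for w in ("studies", "studying", "built")]
```

The stemmer yields a non-dictionary form for "studies" and leaves the irregular form "built" untouched, whereas the dictionary lookup maps every form to a valid lemma.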
Stemming/lemmatization strategies have several applications in natural language processing. In the software engineering field, these are generally used to understand and analyze textual artifacts of the software development process, such as documentation provided at each level, e.g., requirement document, design document, code documentation, etc. Falessi et al. [67,68] employed stemming along with POS tagging as a word extraction strategy to find equivalent requirements.

Term Weighting
While generating a term-document matrix, each entry carries some weight according to its occurrence in the documents. These weights can be assigned in several ways as discussed below.
Raw frequency: Each term is weighted by the number of times it appears in a particular document.
Binary: Weight value is one if a term occurs in the document; otherwise, its value would be zero.
Term Frequency (TF): Raw frequency divided by the length of the document. This helps prevent giving more preference to longer text because longer text tends to contain a particular term at a higher frequency than the shorter ones.
Inverse Document Frequency (IDF): IDF is computed as the inverse of the number of documents in which a term occurs. The IDF score gives less importance to terms that occur frequently across all documents, as such terms are semantically less useful for the document under review.
TF-IDF: Weight for a particular term in a document is computed by combining its TF and IDF values. This mechanism leverages the benefits of both TF and IDF scores.
Word Embeddings: Word embeddings are a way to represent words in the text corpus using distributed representations of words instead of a one-hot encoding vector. Using these vector representations, one can explore the semantic and syntactic similarity of words, the context of words, etc. in the document. Various word embedding techniques exist:
(i) Word2Vec: Word2Vec is a feedforward neural network with two variants. It either accepts two or more context words as input and returns the target word as output, or vice versa. The former is called the Continuous Bag of Words (CBOW) model and the latter is the Skip-Gram (SG) model.
(ii) GloVe: GloVe works by generating a high-dimensional vector representation of each word. The training is based on the surrounding words. This high-dimensional matrix (context matrix) is then reduced to a lower-dimension matrix such that maximum variance is preserved. GloVe uses pretrained models that fail for text containing new words. The documentation of methods may contain many new words (such as method names); therefore, GloVe was found unsuitable for this study.
(iii) fastText: fastText is similar to GloVe, with a feature that makes it suitable for words not in the pretrained model. It breaks the word into n-grams and then generates the required context matrix. For example, given the word "fruits" and n = 3, the n-grams generated are fru, rui, uit, and its. Each row in the matrix represents a generated n-gram instead of a whole word.
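The frequency-based weighting schemes listed above can be sketched in a few lines. This is a minimal illustration, assuming non-empty tokenized documents and the common logarithmic IDF variant log(N / df); the exact IDF formula used in the study is not given here, and `term_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def term_weights(docs):
    """Return per-document raw, binary, TF and TF-IDF weights, plus the IDF table.

    Assumes every document is a non-empty list of terms and uses
    idf(t) = log(N / df(t)), one common variant among several.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    idf = {term: math.log(n_docs / df[term]) for term in df}
    rows = []
    for doc in docs:
        raw = Counter(doc)
        tf = {term: count / len(doc) for term, count in raw.items()}
        rows.append({
            "raw": dict(raw),
            "binary": {term: 1 for term in raw},
            "tf": tf,
            "tfidf": {term: tf[term] * idf[term] for term in raw},
        })
    return rows, idf

docs = [
    ["add", "edge", "graph"],
    ["remove", "edge", "graph"],
    ["graph", "graph", "vertex", "count"],
]
rows, idf = term_weights(docs)
```

Here "graph" occurs in every description, so its IDF (and hence its TF-IDF weight) is zero, while "edge", which occurs in two of the three descriptions, keeps a positive weight.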

Similarity Measures
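Two text documents are compared for their similarity using established similarity metrics. As this study uses the term-document matrix, we are only interested in vector similarity measures. There are three commonly used vector similarity metrics: cosine, Jaccard, and Dice. For two vectors, say X and Y, the three metrics are calculated in their standard forms as follows:

$$\mathrm{cosine}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} \tag{2}$$

$$\mathrm{dice}(X, Y) = \frac{2\,\lvert X \cap Y \rvert}{\lvert X \rvert + \lvert Y \rvert} \tag{3}$$

$$\mathrm{jaccard}(X, Y) = \frac{\lvert X \cap Y \rvert}{\lvert X \cup Y \rvert} \tag{4}$$

where, for the Dice and Jaccard measures, $\lvert X \rvert$ is the number of terms present in X, $\lvert X \cap Y \rvert$ is the number of terms present in both X and Y, and $\lvert X \cup Y \rvert$ is the number of terms present in at least one of them.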
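As a rough illustration of the three measures, the sketch below implements their standard textbook forms; it is not taken from the study's implementation. The Jaccard and Dice functions assume binary term vectors (a non-zero entry means the term is present), and the vectors `x` and `y` are made up for the example.

```python
import math

def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def jaccard(x, y):
    """Terms present in both vectors over terms present in either."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def dice(x, y):
    """Twice the shared terms over the total number of terms in each vector."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    total = sum(1 for a in x if a) + sum(1 for b in y if b)
    return 2 * both / total if total else 0.0

# Binary term vectors for two hypothetical method descriptions.
x = [1, 1, 0, 1]
y = [1, 0, 1, 1]
scores = {"cosine": cosine(x, y), "jaccard": jaccard(x, y), "dice": dice(x, y)}
```

For these vectors, two of the four terms are shared, giving a Jaccard value of 0.5 and cosine and Dice values of about 0.67.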

Thresholds Used
Certain thresholds are used for the proper execution of the proposed mechanism:
(i) #Method_Tokens: This threshold is used to filter out small methods. Only methods with a token count greater than this threshold are considered in the comparison process. This helps in getting more relevant results.
(ii) #Simterms: #Simterms defines the count of similar terms obtained by textual comparison of each pair of documentation, to keep track of the ordering of terms. This threshold allows only those documents to be compared whose similarity in term ordering is greater than the threshold value.
(iii) Simmeasure: Simmeasure defines the similarity value obtained by using the similarity measures given in Section 2.6 (i.e., cosine, Jaccard, or Dice similarity) between a pair of documentations. This threshold ensures that the resulting clone pairs have similarity greater than the threshold value.
(iv) K: The value used for arriving at the kth approximation of the term-document matrix. It is used when performing LSI to ensure that maximum variance is preserved when reducing the high-dimensional matrix to a lower dimension.

Precision and Recall
In this study, recall refers to the ratio of the number of items that are known to be true clones and are detected successfully (the number of true positives, say x) to the total number of true clones, say y:

$$\mathrm{Recall} = \frac{x}{y}$$

Here, precision refers to the ratio of the number of true positives in the detected clone set, say p, to the total number of clones in the detected clone set, say q:

$$\mathrm{Precision} = \frac{p}{q}$$
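The two ratios can be computed directly from sets of clone pairs, as in the sketch below. It assumes each clone pair is represented as an unordered pair of method names; the method names and the sets themselves are made up for the example.

```python
def recall(detected, benchmark):
    """True clones that were detected (x) over all true clones (y)."""
    return len(detected & benchmark) / len(benchmark) if benchmark else 0.0

def precision(detected, confirmed):
    """Detected pairs confirmed as true clones (p) over all detected pairs (q)."""
    return len(detected & confirmed) / len(detected) if detected else 0.0

def pair(a, b):
    """Unordered clone pair, so (a, b) and (b, a) compare equal."""
    return frozenset((a, b))

# Hypothetical data: pairs reported by the detector, pairs in a benchmark,
# and pairs confirmed as true clones by manual inspection.
detected = {pair("addEdge", "insertEdge"), pair("size", "count"), pair("sort", "close")}
benchmark = {pair("insertEdge", "addEdge"), pair("removeEdge", "deleteEdge")}
confirmed = {pair("addEdge", "insertEdge"), pair("size", "count")}

r = recall(detected, benchmark)      # 1 of 2 benchmark pairs found
p = precision(detected, confirmed)   # 2 of 3 detected pairs confirmed
```

Using unordered pairs avoids counting the same clone pair twice when the two methods are reported in a different order.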

Research Questions
The different aspects of clone detection, which have been analyzed in this study, are abstracted into the following research questions.
• RQ1: How is documentation present along with source code in a software system useful for finding concept clones? Through this question, we aim to investigate the effectiveness of using documentation in finding method-level concept clone pairs. This study revolves around the assumption that "similar documentation implies similar methods", and the use of internal documentation is also recommended for high-level concept clone detection. Therefore, it is necessary to examine the structure of documentation followed in software projects. Here, we do an in-depth study of documentation structures followed in Java-based projects in general and validate that similar structures can also be seen in the case studies used for the empirical validation of the proposed mechanism. We then examined the ideal documentation style of methods for the technique to work credibly without manual checking, both for Java-based projects and for projects implemented in some other language. Further, we investigated the use of information retrieval techniques, particularly Latent Semantic Indexing (LSI), on the text corpus formed from extracted documentation.
• RQ2: What are the implications of applying different combinations of weight and similarity measures for performing LSI on the extracted method's documentation? While applying LSI to the text corpus formed from the extracted documentation, we observed the result sets obtained using various combinations of weight and similarity measures. This question investigates which combinations gave better results and assigns a ranking to each of the combinations.
• RQ3: Is the technique scalable, i.e., applicable to large-sized projects? This research question aims to examine the scalability of the proposed mechanism. As the proposed mechanism is centered around analyzing documentation of methods, its scalability also correlates to the uniformity of style and availability of documentation.

Key Features of Java Documentation
The documentation accompanying the source code in a software system describes the functionality of code fragments; this motivated us to use it in the clone detection process. This is based on the assumption that "similar documentation implies similar code fragments". In this study, the code fragment is taken to be the method in Java projects.
In a java-based project, documentation contains the following parts: (i) Description: Contains description of the code fragment.
(ii) Descriptive Tags: Several other details are presented in this part of documentation prefixed with certain tags such as @param, @see, @author, @throws, @override, @return, etc.
Here, we extract only the documentation associated with methods to be used as input to LSI. The following are the relevant considerations in using the documentation of methods for clone detection.
(i) It is not compulsory for the documentation to contain all its parts (description and descriptive tags). It can contain either or both of the parts depending upon requirements. The description of methods is extracted using both parts as described in Section 5.
(ii) Not all methods are documented in Java projects. The absence of documentation can be attributed to one of the following reasons:
(a) Two or more related methods may be documented collectively. This leads to documentation being associated with only one method, while all other methods are seen as undocumented by our technique.
(b) Some methods perform a sub-function of a task. The programmer may only document the method performing the whole task, leaving the methods performing sub-functions undocumented.
(c) Some methods are undocumented for unknown reasons, possibly the laziness of the programmer.
We manually injected documentation for undocumented methods, based on a rough understanding of the methods, so that they can be used as clone candidates for our technique.
(iii) The description part may also contain other types of comments, which are as follows (the explanation of each type is derived from Howard [72]):
(a) Notes Comment: Notes are used to communicate pending tasks to either the developer or the team, and are sometimes used to refer to other parts of the code.
(b) Explanatory Comment: Explanatory comments explain the rationale behind some part of the code.
(c) Contextual Comment: These comments provide information about the context of the method, e.g., the method is "called from X", "calls Y", or "called when."
(d) Evolutionary Comment: Evolutionary comments describe information about the history of the method.
(e) Conditional Comment: Conditional comments specify conditions under which this method should be called.
These comments do not contribute towards the semantics of methods and need to be filtered out. Here, we filtered out these comments manually. This helps to increase the efficiency of the algorithm, as the size of the text corpus used as input to LSI is reduced.
(iv) It is possible that the whole description of a method does not contain any descriptive comments. Such methods may result in false negatives. They can be considered similar to undocumented methods, as they do not appear in the results to form clone pairs with other methods.

Clone Detection Process
This section briefly describes the clone detection process used in this paper. We have taken three open source software systems written in Java to validate our findings empirically. The steps carried out in the process are as follows.
Step 1: Filtering out Interfaces, Abstract Methods, Small Methods, and Constructors: Programs written in Java contain a large number of interfaces and methods that need to be filtered out because they do not contribute to relevant clones. This helps in increasing the performance of the proposed mechanism by reducing the number of spurious clones and makes the overall analysis of resulting clone pairs less time-consuming (see Algorithms 1 and 2). The following programming constructs are filtered out:
(i) Interfaces: An interface contains only the names of methods without any implementation. Classes that implement the interface need to implement the required functionality for each of the methods. As interfaces do not contain any implementation, they do not contribute much to the detection of relevant clones and are filtered out.
(ii) Abstract Methods: Abstract methods are similar to the methods in interfaces and do not contain any implementation. Thus, they are also filtered out.
(iii) Small Methods: Methods that are smaller than a particular threshold, i.e., a specific number of tokens, are also filtered out. This threshold value is a hyperparameter. This filtering helps to remove various setter and getter methods and methods that only call other methods to perform the required functionality. These are generally single-line methods whose detection as clones serves little purpose.
(iv) Constructors: In this study, constructors are not taken into account for clone detection; thus, they are also filtered out.
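The filtering in Step 1 can be sketched as a single predicate over method metadata. This is not a reproduction of Algorithms 1 and 2; it is a minimal illustration that assumes the relevant facts about each method (token count, whether it is declared in an interface, abstract, or a constructor) have already been obtained from a parser. `MethodInfo`, `filter_candidates`, and the sample methods are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MethodInfo:
    """Metadata assumed to be produced by some Java parser (hypothetical type)."""
    name: str
    token_count: int
    in_interface: bool = False
    is_abstract: bool = False
    is_constructor: bool = False

def filter_candidates(methods, min_tokens):
    """Keep concrete, non-constructor methods with more than `min_tokens` tokens."""
    return [
        m for m in methods
        if not (m.in_interface or m.is_abstract or m.is_constructor)
        and m.token_count > min_tokens
    ]

methods = [
    MethodInfo("getName", token_count=8),                        # small getter
    MethodInfo("Graph", token_count=70, is_constructor=True),    # constructor
    MethodInfo("visit", token_count=0, in_interface=True),       # interface method
    MethodInfo("draw", token_count=0, is_abstract=True),         # abstract method
    MethodInfo("shortestPath", token_count=120),                 # kept
]
candidates = [m.name for m in filter_candidates(methods, min_tokens=50)]
```

Only methods that survive this filter need to have their documentation extracted, which keeps the text corpus small.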

Step 2: Extraction of Documentation: In this work, only documentation related to methods is used, as described in Section 4. Furthermore, we need to extract only the description associated with methods (see Algorithm 3). If the description is not present, the proposed method uses the descriptive tags @return and @see to extract a suitable description for the method.
(i) The @return tag describes what is returned from the function. The contents of the @return tag are used when the description is absent.
(ii) The @see tag is used when a lookup to another method/class is required (no description for the method and no @return tag is present).
Overridden methods are usually annotated with the @Override annotation. If documentation is not available for these methods, it is extracted from the method being overridden.
Further, to extract Java documentation in a processable file format, we used the built-in Java library doclet (by default, javadocs are generated as HTML pages). Steps 1 and 2 collectively describe the key features our generated javadoc should have.
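The fallback order described in Step 2 (description first, then @return, then @see) can be illustrated with a small text-processing sketch. It is not the doclet-based extraction used in the study and not Algorithm 3: it works on the raw text of a single Javadoc comment, ignores continuation lines of tags, and returns the @see target as-is instead of resolving it to the referenced method. `extract_description` and the sample comments are hypothetical.

```python
import re

TAG_LINE = re.compile(r"^@(\w+)\s*(.*)$")

def extract_description(javadoc):
    """Return the free-text description, falling back to @return, then @see."""
    # Strip the /** and */ delimiters, then the leading '*' of each line.
    body = re.sub(r"^\s*/\*\*|\*/\s*$", "", javadoc.strip())
    lines = [re.sub(r"^\s*\*\s?", "", ln).strip() for ln in body.splitlines()]

    description, tags, in_tags = [], {}, False
    for line in lines:
        match = TAG_LINE.match(line)
        if match:
            in_tags = True
            tags.setdefault(match.group(1), match.group(2))  # first occurrence wins
        elif not in_tags and line:
            description.append(line)   # text before the first tag
        # Continuation lines of a tag are ignored in this sketch.

    text = " ".join(description)
    return text or tags.get("return") or tags.get("see") or ""

full = """/**
 * Returns the shortest path between two vertices.
 *
 * @param g the graph
 * @return the path
 */"""
only_return = "/**\n * @return the number of vertices\n */"
only_see = "/**\n * @see Graph#addEdge\n */"

results = [extract_description(c) for c in (full, only_return, only_see)]
```

Keeping only the text before the first tag is what removes @param, @author, @throws, and similar details from the corpus.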
Step 3: Part-of-Speech (POS) Tagging of Documentation and Lemmatization: For each method's extracted documentation, POS tagging is performed (see Algorithm 5 (Step 5)). The purpose of this step is twofold: (i) To give more weight to nouns and verbs in a sentence. (ii) To remove stop words by extracting only words tagged as verbs, adjectives, nouns, and adverbs (see Algorithm 5 (Step 7)).
Lemmatization is done to reduce each term in documentation to its base form, to curb differences among documentation arising out of different forms of words (see Algorithm 5 (Step 6)).
Step 4: Latent Semantic Indexing (LSI): LSI (see Algorithm 5 (Steps 10-15)) is performed as follows: (i) Extract those terms tagged as verbs, nouns, adjectives, or adverbs. (ii) Perform LSI on these extracted terms. The VSM built contains weights (matrix entries), as described in Section 2.5. Weights can also be assigned by giving extra weight to certain parts of speech (nouns and verbs). (iii) For each pair of method descriptions, use the similarity measures described in Section 2.6, also taking into account similarity based on term ordering, to find clone pairs in a software project (see Algorithm 4).
For the complete algorithm, refer to Algorithms 1 to 5. The flowchart representation of this algorithm is given in Figure 2.

Empirical Validation
To empirically validate the proposed mechanism, it has been applied to three open source Java projects, namely, JGrapht [73], Collections [74], and JHotDraw [75]. To present the efficacy of the proposed approach, the results obtained were validated against the benchmark Qualitas Corpus Clone Collection (QCCC) [76], which was provided by Ewan Tempero [77]. The researcher prepared a corpus of clones for 111 open source Java projects. However, the corpus only presents type-1, type-2, and type-3 clones, while our mechanism reports high-level concept clones. The validation is justified because type-1, type-2, and type-3 clones are very similar in structure and mostly implement similar concepts. The recall values obtained on validation also indicate that concept clones generally cover most clone pairs falling in these three categories. Nevertheless, the manual validation of resulting clone pairs is necessary for accuracy calculations.

Table 2 gives details of the projects used for the empirical validation. The details include #Files analyzed, #Methods (excluding constructors, abstract methods, methods of interfaces, and small methods), Estimated Lines of Code (ELOC; excluding constructors, abstract methods, methods of interfaces, and small methods), and % of Documented Methods (i.e., % of methods documented before manual modification, excluding constructors, abstract methods, and methods of interfaces), which shows that the proportion of documented methods decreases as project size increases.

After applying POS tagging, lemmatization, and stop-word removal to the description of methods, different weighting measures were used to generate the term-document matrix. Thereafter, to obtain a list of clone pairs, different similarity measures (to measure similarity between term-document vectors) were used after performing LSI. Table 3 summarizes all the combinations of weight and similarity measures used in this study.
Furthermore, with each combination, POS tagging is used in two ways, i.e., with or without selective weight assignment (giving priority to certain parts of speech). Selective weight assignment cannot be used with binary weight measures because a binary weight can only hold the values 1 and 0. Owing to its small size (ELOC is approx. 6K), the JGrapht project has been used to illustrate the underlying working of the proposed mechanism. Recall is measured by comparing the results obtained using our mechanism with those provided in QCCC. Precision calculations are made for individual methods through manual validation of the resulting clone pairs.

Relevance of Jaccard and Dice Similarity Measures with LSI
The reason for using only binary weight measures with the Jaccard and Dice similarity measures can be explained with the help of a counterexample.
The Jaccard similarity measure is calculated as given in Equation (4). For two vectors with raw frequency weights, say x = {1,2,1,0,1,2} and y = {1,2,2,1,0,1}, Jaccard (x,y) = 2/5 = 0.4. The measure treats (1,2) and (0,1) as different numbers and considers these documents dissimilar with respect to these terms. In reality, however, methods with vector representations x and y are highly similar, as these numbers indicate the frequency of occurrence of terms in the method descriptions being compared. A similar example can be given for the TF weight measure and for the non-applicability of LSI. Furthermore, by a similar argument, the Dice similarity measure, as given in Equation (3), can be used only with the binary weight measure and cannot be combined with LSI.
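As a rough illustration of why binary weights suit these two measures, the sketch below binarizes the example vectors x and y and applies the standard set-based Jaccard and Dice definitions. Equations (3) and (4) are not reproduced in this section, so the definitions used here are an assumption and may differ in detail from the paper's formulation; the function names are hypothetical.

```python
def binarize(vec):
    """Map raw frequencies to presence (1) or absence (0) of each term."""
    return [1 if v > 0 else 0 for v in vec]


def jaccard(a, b):
    """Shared terms divided by terms present in at least one vector."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return both / either if either else 0.0


def dice(a, b):
    """Twice the shared terms divided by the total terms in both vectors."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    size = sum(1 for p in a if p) + sum(1 for q in b if q)
    return 2 * both / size if size else 0.0


x = [1, 2, 1, 0, 1, 2]
y = [1, 2, 2, 1, 0, 1]
bx, by = binarize(x), binarize(y)

print(jaccard(bx, by), dice(bx, by))
```

After binarization, the two descriptions share four of the six terms, so differing frequencies such as (1,2) no longer count against the pair.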

Results and Discussion
The results of empirical validation of each of the case studies are presented here.

Empirical Validation Using JGrapht
JGraphT is a free Java class library that provides mathematical graph-theory objects and algorithms. It runs on the Java 2 platform (JDK 1.8 or later is required starting with JGraphT 1.0.0).
This subsection provides execution details of the proposed mechanism on JGrapht. The thresholds (see Section 2.7) used for arriving at the results are as follows: (i) #Method_Tokens: Small methods are filtered out using a minimum threshold of 50 on the number of tokens in a method (Evans used the same threshold on the number of Abstract Syntax Tree (AST) nodes in a method [77]). (ii) K: For LSI, the value of K is taken as half of the total number of terms in the text corpus. (iii) Simmeasure is set to 0.5, i.e., the documentation of resulting method-level clone pairs should be at least 50% similar. (iv) #Simterms is set to 30%, i.e., the documentation of resulting method-level clone pairs should be at least 30% similar in term ordering. Table 4 shows the results obtained after applying each of the 12 combinations of weight and similarity measures to JGrapht. Column 1 gives the type of combination applied (see Table 3). Column 2 lists the number of clone pairs observed for each applied combination. Column 3 gives the number of clone pairs present in both the benchmark QCCC (provided by Evans [77]) and the output of the proposed mechanism. The number of clone pairs reported for JGrapht in QCCC was 85 methods with 6 constructors (constructors have already been filtered out in this study because of their limited relevance). Column 4 gives recall, which is measured by comparing the obtained results against the benchmark. Finally, column 5 gives precision values, which are obtained by manual analysis of the resulting clone pairs.
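The thresholding and the recall and precision columns can be sketched as follows. This is only an illustration under stated assumptions: the candidate tuples, method names, and scores are made up, and the similarity and ordering scores are taken as already computed rather than derived by the paper's procedure.

```python
SIM_MEASURE = 0.5   # minimum similarity score between two descriptions
SIM_TERMS = 0.30    # minimum similarity in term ordering


def normalize(pairs):
    """Treat (a, b) and (b, a) as the same clone pair."""
    return {frozenset(p) for p in pairs}


def recall(reported, benchmark):
    """Share of benchmark clone pairs that the mechanism also reports."""
    reported, benchmark = normalize(reported), normalize(benchmark)
    return len(reported & benchmark) / len(benchmark) if benchmark else 0.0


def precision(reported, confirmed):
    """Share of reported clone pairs confirmed by manual analysis."""
    reported, confirmed = normalize(reported), normalize(confirmed)
    return len(reported & confirmed) / len(reported) if reported else 0.0


# Hypothetical candidates: (method A, method B, similarity, ordering score).
candidates = [
    ("A.add", "B.insert", 0.82, 0.60),
    ("A.size", "C.count", 0.55, 0.20),   # fails the ordering threshold
    ("B.insert", "D.put", 0.45, 0.70),   # fails the similarity threshold
    ("C.clear", "D.reset", 0.71, 0.35),
]

reported = [
    (a, b) for a, b, sim, order in candidates
    if sim >= SIM_MEASURE and order >= SIM_TERMS
]

benchmark = [("B.insert", "A.add"), ("A.size", "C.count")]   # hypothetical
confirmed = [("A.add", "B.insert"), ("C.clear", "D.reset")]  # hypothetical

print(len(reported), recall(reported, benchmark), precision(reported, confirmed))
```

Normalizing each pair to an unordered set avoids undercounting the intersection when the benchmark lists the two methods in the opposite order.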
The results are then compared (see Figure 3) to analyze the relative variation of the number of true positive clone pairs with similarity scores. Similarity scores can be Cosine, Dice, or Jaccard. An ideal plot should exhibit a large proportion of clone pairs for higher similarity values (say for similarity score > 0.7). The following observations can be made while analyzing clone pairs identified for JGrapht.
(i) The large number of clone pairs observed for each combination is attributed to the basic nature of LSI. LSI finds clone pairs by measuring the similarity between terms. Two sentences can share similar terms while differing in overall meaning; for example, when a query is run on a search engine, a large number of results are returned, each with a rank. This is also the reason for the low precision values in the result set (see Column 5 of Table 4).
(ii) Maximum recall is achieved using combination number 4 (binary weight measure and Jaccard similarity measure), and maximum precision is achieved using combination number 12 (TF-IDF weight measure and cosine similarity measure). Both combinations gave comparatively good values for the precision and recall metrics. Refer to Table 3 for all the combination types listed here.
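The "ideal plot" criterion described before these observations (a large proportion of true positive clone pairs at similarity scores above 0.7) can be checked numerically. The sketch below is a hypothetical helper, not the plotting procedure used for Figure 3, and the scores are invented for illustration.

```python
def share_above(scores, cutoff=0.7):
    """Fraction of true positive pairs whose similarity exceeds the cutoff."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > cutoff) / len(scores)


def histogram(scores, bins=10):
    """Count pairs per equal-width similarity bin over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        index = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[index] += 1
    return counts


# Hypothetical similarity scores of true positive clone pairs.
scores = [0.95, 0.85, 0.72, 0.55, 0.31]

print(share_above(scores))
print(histogram(scores))
```

A combination whose true positives pile up in the low bins would score poorly on this criterion even if its recall is high.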

Empirical Validation Using Collections and JHotDraw
This section gives complete details of the execution of the proposed mechanism on the Collections and JHotDraw open source software. Each of the 12 combinations of weight and similarity measures is applied to each open source Java project. Small methods are filtered out using a minimum threshold of 50 on the number of tokens in a method for both projects.
The results of the application of each of the 12 combinations on the two open source projects are given in Table 5 (for Collections case study) and Table 6 (for JHotDraw case study).
Only recall is measured for the results obtained for Collections and JHotDraw due to time constraints. Because calculating precision is a manual and time-intensive process for such a large result set of clone pairs, it was performed only for JGrapht to show the relevance of the proposed mechanism. The number of clone pairs reported in the benchmark for Collections is 1037 methods and 3 constructors, and for JHotDraw is 2550 methods and 199 constructors. Refer to Table 3 for all the combination types listed here.
A comparison similar to that for JGrapht (between similarity scores and the count of true positive clone pairs) is also made for the Collections and JHotDraw case studies (see Figures 4 and 5). The following observations can be made from these graphs and tables.

Exploring the Research Questions
RQ1: How is documentation present along with source code in a software system useful for finding concept clones?
The three case studies exhibited good recall values with the proposed mechanism, which shows that considering a method's documentation is effective for clone detection. It also shows the strength of LSI in linking semantically related documents. Therefore, using the documentation present along with the source code is beneficial for high-level concept clone detection.
Java projects follow a certain structure and style while building the documentation as given in Section 4. We saw a similar structure in the case studies used. Table 7 gives examples of each documentation characteristic that is seen in the case studies and also the manual modifications that can be applied in each situation. Looking at the availability of the method's documentation in these software projects, we observed that not all methods are documented. Table 2 shows the percentage of documented methods in JGrapht (80%), Collections (70%), and JHotDraw (53%), therefore mandating the manual modifications as described in Section 4.
For this technique to work credibly for Java-based projects without the requirement of manual checking, the following points need to be taken care of:
1. The description part of the documentation must only contain comments related to the semantics of the method.
2. If the description part is absent, the documentation should contain a description associated with the @return tag, or a reference to a similar method's documentation using the @see tag (so that its documentation can be used here).
3. Only overriding methods (methods overriding some other method present in the project) with @Override annotations are allowed to be undocumented.
4. All the methods in a project should follow a uniform style of documentation.
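Points 1 and 2 above suggest a simple fallback order when extracting a description. The sketch below is one possible reading of that order, not the extraction tool used in this study: it takes the text before the first block tag as the description and otherwise falls back to a non-empty @return text. The function names are hypothetical, and @see lookups are left to the caller.

```python
import re


def parse_javadoc(javadoc):
    """Split a Javadoc comment into its description and its block tags."""
    description, tags, current = [], {}, None
    for raw in javadoc.splitlines():
        line = raw.strip()
        # Drop the comment delimiters and the leading asterisk.
        line = re.sub(r"^/\*\*|\*/$", "", line).strip()
        line = re.sub(r"^\*\s?", "", line).strip()
        if not line:
            continue
        match = re.match(r"@(\w+)\s*(.*)", line)
        if match:
            current = match.group(1)
            tags.setdefault(current, []).append(match.group(2))
        elif current is None:
            description.append(line)          # text before the first tag
        else:
            tags[current][-1] += " " + line   # continuation of a tag's text
    return " ".join(description), tags


def extract_description(javadoc):
    """Return the description, else a meaningful @return text, else None."""
    description, tags = parse_javadoc(javadoc)
    if description:
        return description
    for text in tags.get("return", []):
        if re.search(r"\w", text):  # ignore placeholders such as "."
            return text.strip()
    return None  # needs an @see lookup or manual handling


tags_only = """/**
 * @param vertex the vertex to look up
 * @return a set containing previously seen vertices
 */"""

print(extract_description(tags_only))
```

Returning None instead of an empty string makes undocumented or placeholder-only methods easy to flag for manual review.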
For projects in other languages, the documentation should follow a uniform style and should only contain descriptions related to the semantics of methods. These two requirements are necessary for the proposed technique to give useful results. Most importantly, methods should not be left undocumented.

Table 7. The key features of documentation as seen in the case studies. The table illustrates how we manually dealt with documentation of methods that were not in the required format.

Category: Incomplete comments [78,79]
Characteristics: The documentation of code fragments (here, methods) can contain all or any part of the documentation (description, descriptive tags, or both).
Case study example: The method alg.BellmanFordIterator.putPrevSeenData contains only @param and @return descriptive tags. The @return tag is also accompanied by only a ".".
Manual modification applied: The contents associated with the @return tag are modified to contain a meaningful return description, which is later extracted to form the description of the method. The description can be "@return a set containing previously seen vertices".

Category: Not documented methods [80]
Characteristics: Two or more related methods may be documented collectively. This leads to documentation being associated with only one method, while all other methods appear undocumented.

We examined the results obtained using all the combinations listed in Table 3. We observed that all of them gave similar results with minor variation in recall values. In all the case studies, combination 3 (TF weights and cosine similarity without POS selective weight assignment) outperformed the others, while combination 4 (binary weights and Jaccard similarity measure) performed worst. The ranking of the 12 combinations is given in Table 8. This ranking is based on the recall values and on whether the combination exhibits the ideal property (a large proportion of clone pairs concentrated at higher similarity values). It can be seen that while combination 4 gave good recall values, it is ranked lowest because it shows a large proportion of clone pairs for similarity values as low as 0.3.

Table 8. Ranking of each combination of weight and similarity measures used in this study. Ranking is based on recall values and on whether the combination exhibits the ideal property (a large proportion of clone pairs concentrated at higher similarity values) (see Table 3 for the full forms of acronyms used in column 2).

In the case studies, it can be observed that as project size increases, the share of documented methods decreases. The style of documentation also becomes nonuniform because larger projects involve more developers, who follow different styles of documentation. Section 7 explains how the documentation of methods becomes nonuniform. However, if proper documentation structures are followed, the proposed mechanism should scale well even to large projects.

Threats to Validity
Software documentation is written in English, and the same method can be described in many different sentence styles. For example, consider a method that checks for the equality of two integers, a and b. This method can be described in the following different styles: 1. "Returns true iff a = b". 2. "Checks whether a equals b". 3. "Checks for the equality of two integers". 4. "Checks for the equality of a and b" and so on.
For LSI to find cloned methods, a uniform style of documentation should be followed. It is also sometimes the case that a method is not documented with a descriptive comment or is not documented at all. To deal with this issue, a manual analysis of all the method documentation is necessary. This variation in documentation style, together with missing documentation, poses a significant threat to the validity of the results. Further, the absence of Javadoc comments may result in missing clone pairs, and manual injection of documentation is subjective and depends on the analyst's understanding of the code.
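The effect of style variation can be made concrete by measuring raw term overlap between the four example descriptions. This is only a sketch with a hypothetical helper: it uses plain word sets without lemmatization, stop-word removal, or LSI, so it shows the problem before any of the preprocessing used in this study is applied.

```python
import re
from itertools import combinations


def word_set(text):
    """Lowercased alphabetic tokens of a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between the word sets of two descriptions."""
    sa, sb = word_set(a), word_set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


styles = [
    "Returns true iff a = b",
    "Checks whether a equals b",
    "Checks for the equality of two integers",
    "Checks for the equality of a and b",
]

for (i, first), (j, second) in combinations(enumerate(styles, start=1), 2):
    print(i, j, round(overlap(first, second), 2))
```

Styles 1 and 3 describe the same method yet share no words at all, which is the kind of pair a term-based technique can miss without a uniform documentation style.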

Conclusions and Future Work
The documentation of a software system's methods is highly expressive in describing "what the method does". In this study, applying information retrieval techniques to this documentation to extract functionally similar or nearly functionally similar methods gave encouraging results. Recall values are found to be between 68% and 89% for the three case studies, i.e., JGrapht, JHotDraw, and Collections. However, the number of clone pairs identified is large, which is attributed to LSI's capability to match two contextually similar terms in different documents. The proposed technique not only considers similar terms in the documentation, but also considers their ordering, e.g., the two strings "Are you there?" and "there you are" are processed differently. For certain combinations, we used a selective POS weight assignment strategy, which did not exhibit a significant improvement in results when compared to the counterparts that do not use selective POS weight assignment. Different weight measures (Raw, Binary, TF, and TF-IDF) and models (SG, CBOW, and FASTTEXT) are used in the study. The study shows that the combination using TF weights and cosine similarity, without POS selective weight assignment, gives better results in all three case studies for predicting similar methods from their documentation. The relatively low values of the "Precision" metric could be justified by the simplicity and lower implementation cost of the proposed mechanism.
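The ordering example can be reproduced with a small sketch. The ordering measure actually used in this study (#Simterms) is defined in Section 2.7 and is not reproduced here; Python's difflib.SequenceMatcher is used below purely as a stand-in for an order-sensitive comparison, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercased alphabetic tokens, in their original order."""
    return re.findall(r"[a-z]+", text.lower())


def term_overlap(a, b):
    """Order-insensitive overlap between the term sets of two strings."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def order_similarity(a, b):
    """Order-sensitive similarity between the two token sequences."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


first, second = "Are you there?", "there you are"

print(term_overlap(first, second))
print(order_similarity(first, second))
```

The two strings contain exactly the same terms, so a purely set-based comparison cannot tell them apart, whereas the sequence-based score drops below 1 because the terms appear in a different order.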
Future work may consider tag-based comparison, i.e., the content of each tag (@param, @see, @return, etc.) may be compared separately, especially for Java-based projects. Precision could also be greatly improved if, along with the documentation of the code, the source code itself took part in the clone detection process. Various researchers also recommend combining IR techniques with other clone detection tools to improve their results. Advanced variations of latent semantic indexing can also be used in further experiments to analyze and compare their ability to improve clone detection results.
Author Contributions: All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.