Article
Peer-Review Record

Code Comments: A Way of Identifying Similarities in the Source Code

Mathematics 2024, 12(7), 1073; https://doi.org/10.3390/math12071073
by Rares Folea * and Emil Slusanschi
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 20 February 2024 / Revised: 17 March 2024 / Accepted: 26 March 2024 / Published: 2 April 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In most research, comments are removed when similarity is checked, so this paper takes the opposite approach. Applications of this could be, as the authors pointed out, in the domain of plagiarism detection. I am glad that the authors point out that plagiarism and similarity are not the same and that extra investigation is needed to be able to talk about similarity. While using comments for similarity is not new, the use of the different techniques and their comparison is.

The description of the techniques used and their formalization is good. What I miss is proof of effectiveness and a table of comparison. I do not like that the authors state "Overall, the quality of the matches is high, with a well balanced mix between the true- and false-positives." What does "high" mean? What exactly is the number of false positives?

The authors should therefore provide a table with the number of false positives. Also, the authors should provide the precision and recall for each model so one can objectively compare them; the F1 measure could also be calculated. False negatives are an issue as well. I know they might be impossible to know, but the reader should be informed of their "existence", and, in your opinion, would they change your conclusions if they exist?

Overall, the paper is relevant and the applications are good, especially since the tool is open source, which in my opinion is a must for research software. The MIT license is fine, but the authors should consider switching to GPLv2 or GPLv3, since that would ensure that any derivatives are also open source. That is, however, the authors' free choice. No forks have been made at this point in time, so the license can still be changed.

Author Response

Thanks for the reviews. Based on the comments that we have received, we are uploading a revised version of the manuscript, with the following major changes:

  • We have rewritten the Introduction section and created a new dedicated section for comment analysis. We have also expanded the presentation of other methods from the literature for computing code similarities, to give better context on where our contributions are targeted.
  • We are introducing a comparative performance study (accuracy, precision, recall, F1-scores) of the presented methods, along with their computational overheads (see the summary in Table 3).
  • We are adding a performance study of the Universal Sentence Encoder when tested against different values of r.
  • We have rewritten the Implementation Details section and added a new section on reproducibility.
  • We have expanded the ideas discussed in the article in the Conclusions section and scoped future work.

 

> In most research, comments are removed when similarity is checked, so this paper takes the opposite approach. Applications of this could be, as the authors pointed out, in the domain of plagiarism detection. I am glad that the authors point out that plagiarism and similarity are not the same and that extra investigation is needed to be able to talk about similarity. While using comments for similarity is not new, the use of the different techniques and their comparison is.

We appreciate the feedback. Based on feedback received from multiple reviewers, we are adding a dedicated paragraph in the Introduction section to address this: “Our research aims to expand automated methods for detecting software code similarities, designing tools that facilitate collaboration between human experts and the software. This will support informed plagiarism decisions through an interactive process where the broader context of the code, including comments and coding style, is automatically analyzed.”

> The description of the techniques used and their formalization is good. What I miss is proof of effectiveness and a table of comparison. I do not like that the authors state "Overall, the quality of the matches is high, with a well balanced mix between the true- and false-positives." What does "high" mean? What exactly is the number of false positives? The authors should therefore provide a table with the number of false positives. Also, the authors should provide the precision and recall for each model so one can objectively compare them; the F1 measure could also be calculated. False negatives are an issue as well. I know they might be impossible to know, but the reader should be informed of their "existence", and, in your opinion, would they change your conclusions if they exist?

Understood. We are adding Table 3 to cover this quantitative analysis, presenting a performance-based comparison of the five analyzed models, tested against around 35 thousand comments from three versions of the Kubernetes source code. The table evaluates the models using precision, recall, and F1-score as metrics; it also reports the macro and weighted averages, as well as the accuracy. The model based on the Universal Sentence Encoder seems to achieve the best quantitative results, maintaining a good balance between precision and recall on the analyzed Kubernetes-based test dataset described above.
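For transparency, the kind of per-model evaluation summarized in Table 3 can be sketched as follows; this is a minimal illustrative sketch assuming binary similar/not-similar labels and scikit-learn, not the exact evaluation code used for the paper:

```python
# Minimal sketch of the per-model evaluation behind Table 3 (illustrative only).
# Assumes binary ground-truth labels (1 = similar comment pair, 0 = not similar)
# and predictions collected beforehand; scikit-learn is an assumption here.
from sklearn.metrics import classification_report, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model output

# Precision, recall, and F1 per class, plus macro and weighted averages.
print(classification_report(y_true, y_pred, digits=3))
print("accuracy:", accuracy_score(y_true, y_pred))
```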

In addition, we have added a few examples to the qualitative section that illustrate the long tail of errors encountered by the models.

> Overall, the paper is relevant and the applications are good, especially since the tool is open source, which in my opinion is a must for research software. The MIT license is fine, but the authors should consider switching to GPLv2 or GPLv3, since that would ensure that any derivatives are also open source. That is, however, the authors' free choice. No forks have been made at this point in time, so the license can still be changed.

We have acknowledged the suggestion, thank you.

Reviewer 2 Report

Comments and Suggestions for Authors

Figures 1 and 2 need explanation. The problem needs to be clearly defined in the introduction section. The authors need to rewrite the introduction section to describe the work's problem, activities, and innovation. Other methods can be used to display the code; conventionally, including code listings in a scientific article is not acceptable. You can place the code on external sites and refer to it. The simulation results need detailed explanations. The implementation part needs to be explained in the format of a scientific article; what is currently written is mainly in the format of a technical report. Discussion and conclusion sections should be added to the article.

Comments on the Quality of English Language

Figures 1 and 2 need explanation. The problem needs to be clearly defined in the introduction section. The authors need to rewrite the introduction section to describe the work's problem, activities, and innovation. Other methods can be used to display the code; conventionally, including code listings in a scientific article is not acceptable. You can place the code on external sites and refer to it. The simulation results need detailed explanations. The implementation part needs to be explained in the format of a scientific article; what is currently written is mainly in the format of a technical report. Discussion and conclusion sections should be added to the article.

Author Response

Thanks for the reviews. Based on the comments that we have received, we are uploading a revised version of the manuscript, with the following major changes:

  • We have rewritten the Introduction section and created a new dedicated section for comment analysis. We have also expanded the presentation of other methods from the literature for computing code similarities, to give better context on where our contributions are targeted.
  • We are introducing a comparative performance study (accuracy, precision, recall, F1-scores) of the presented methods, along with their computational overheads (see the summary in Table 3).
  • We are adding a performance study of the Universal Sentence Encoder when tested against different values of r.
  • We have rewritten the Implementation Details section and added a new section on reproducibility.
  • We have expanded the ideas discussed in the article in the Conclusions section and scoped future work.

 

> Figures 1 and 2 need explanation. 

 

We have expanded the explanations for both figures, not only where they are referenced in the article but also in their captions. For Figure 1, we now make it clear that even though the proportion of comments to total lines of code fluctuates between consecutive versions, the overall long-term trend is upward, which could indicate an increased emphasis on code readability, maintainability, or collaboration within the Kubernetes development community. We also explicitly state that the data in Figure 2 offers insights into how commenting practices evolve within large-scale open-source software projects, showing an increasing share of comment words in the total corpus across releases, as well as a steadily increasing trend in the absolute number of comment words.
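As an illustration of how such a comment-to-code ratio could be computed, the following minimal sketch counts comment lines naively over a Go source tree; it is illustrative only and not the extraction pipeline used for Figures 1 and 2:

```python
# Illustrative sketch: naive per-line comment counting over a Go source tree.
# This is NOT the authors' pipeline; it ignores comment markers inside strings.
from pathlib import Path

def comment_ratio(root: str) -> float:
    comment_lines = total_lines = 0
    for path in Path(root).rglob("*.go"):
        in_block = False
        for line in path.read_text(errors="ignore").splitlines():
            stripped = line.strip()
            if not stripped:
                continue
            total_lines += 1
            if in_block:
                comment_lines += 1
                if "*/" in stripped:
                    in_block = False
            elif stripped.startswith("//"):
                comment_lines += 1
            elif stripped.startswith("/*"):
                comment_lines += 1
                in_block = "*/" not in stripped
    return comment_lines / total_lines if total_lines else 0.0

# Example (hypothetical path to a Kubernetes checkout):
# print(comment_ratio("kubernetes-1.28.0"))
```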

 

> The problem needs to be clearly defined in the introduction section. 

 

Thanks for the feedback. We are adding a dedicated paragraph in the introduction section to address this. “Our research aims to expand automated methods for detecting software code similarities, designing tools that facilitate collaboration between human experts and the software. This will support informed plagiarism decisions through an interactive process where the broader context of the code, including comments and coding style, is automatically analyzed.”

 

> The authors need to rewrite the introduction section to describe the work's problem, activities, and innovation.

 

We are adding a statement that describes our novel contributions: “This paper introduces two novel models for identifying code similarities based solely on source code comments: one targeting machine-readable comments and the other focusing on human-readable comments.”

Also, we are expanding the coverage of other work on code similarity in the literature with one additional paragraph.

Moreover, we are adding a statement regarding the problem of plagiarism: “ultimately, while these techniques offer varying degrees of automation and explainability, human judgment remains essential for determining true plagiarism”.

 

> Other methods can be used to display the code; conventionally, including code listings in a scientific article is not acceptable. You can place the code on external sites and refer to it.

 

While we agree that code snippets in the middle of the text may disrupt the flow of the paper, we believe that, because the paper heavily relies on analysing code, some degree of qualitative examples is required. In this version, we propose building a new appendix containing relevant code listings that offer detailed examples illustrating the context in which certain statements in the paper are made. We are open to further feedback on this topic.

Regarding the qualitative analysis of the models, we see an opportunity to apply the same procedure. We have now introduced “Appendix A.2 Qualitative analysis”, with one subsubsection for the qualitative analysis of the similarities identified by each proposed model.

 

> The simulation results need detailed explanations. 

 

In this revision, we are adding Table 3 to cover this quantitative analysis, presenting a performance-based comparison of the five analyzed models, tested against around 35 thousand comments from three versions of the Kubernetes source code. The table evaluates the models using precision, recall, and F1-score as metrics; it also reports the macro and weighted averages, as well as the accuracy. The model based on the Universal Sentence Encoder seems to achieve the best quantitative results, maintaining a good balance between precision and recall on the analyzed Kubernetes-based test dataset described above.

 

> The implementation part needs to be explained in the format of a scientific article; what is currently written is mainly in the format of a technical report.

 

We are rewriting the entire Implementation Details section in this version in a more scientific form. We now describe how the data behind Figures 1-5 and Table 1 can be reproduced from the code structure. We are also specifying the tools used to generate the images. In particular, “for the plots, … have been built using Datawrapper, an online data visualization tool. Figures … have been generated using the matplotlib library.”

 

> Discussion and conclusion sections should be added to the article.

 

We are also expanding Table 1 to cover a more diverse range of projects and languages to support our discussion on the importance of comments.

We are rewriting the entire Conclusions and Future Work section to capture more discussion of the performance of the models. We are adding three dedicated paragraphs to the conclusion of the work to address this:

 

“For the proposed solutions, the research observed that Levenshtein-based models offer excellent precision but may miss broader semantic similarities, while contextualized word embeddings such as Word2Vec or Universal Sentence Encoders perform better by capturing some of these complexities. However, these models might produce somewhat more false positives than the Levenshtein-based one, which keeps this metric close to zero.

 

The dataset used for evaluation was extracted from a subset of three Kubernetes versions, accumulating over 12 thousand lines of code and around 50 thousand identified similarities in comments. The best\footnote{Based on the F1-score achieved on the dataset presented in this paper, built from comments collected from three Kubernetes versions.} model was based on the Universal Sentence Encoder~\cite{cer2018universal}, achieving a strong balance between precision and recall, but showed some limitations when operating with sentences containing numerical values.

 

Also, the Universal Sentence Encoder-based model stands out with efficient embedding generation, though processing of the embeddings is slower, due to their increased size. 

As expected, simpler models such as the Levenshtein- and Word2Vec-based ones are significantly faster (see Table~\ref{tab:time}). Due to their still good performance, Word2Vec-based models can provide a feasible alternative to Universal Sentence Encoders, as they only use a fraction of the computational power required to run the analysis.”
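To make the contrast above concrete, the following minimal sketch computes the two similarity signals for a pair of comments; the TF-Hub module URL and the use of difflib as a Levenshtein-style ratio are assumptions for illustration, not the exact configuration used in the paper:

```python
# Illustrative sketch of the two similarity signals discussed above.
# The TF-Hub URL and difflib-based ratio are assumptions for demonstration,
# not the paper's exact setup.
import numpy as np
import tensorflow_hub as hub
from difflib import SequenceMatcher

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine_similarity(a: str, b: str) -> float:
    va, vb = np.asarray(use([a, b]))
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def edit_ratio(a: str, b: str) -> float:
    # Levenshtein-like similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

c1 = "returns the number of pods scheduled on this node"
c2 = "return how many pods are scheduled on the node"
print(cosine_similarity(c1, c2), edit_ratio(c1, c2))
```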

Reviewer 3 Report

Comments and Suggestions for Authors

Author proposed "Code Comments: A Way Of Identifying Similarities In The
Source Code"

However, the manuscript must be improved before it can be further processed, based on the following:

1. The results of other models, such as the Levenshtein-based and Word2Vec-based models, used to justify the proposed model are not enough. Additional experiments will help justify the comparison with the proposed work.

Other concerns are as follows:

1. The abstract has some English grammatical errors; the introduction has errors in punctuation.
2. What tool was used to analyze the results and to draw the graphs in Figure 2? The validity of the results must be justified.

Comments on the Quality of English Language

Please see the above recommendations for the authors.

Author Response

Thanks for the reviews. Based on the comments that we have received, we are uploading a revised version of the manuscript, with the following major changes:

  • We have rewritten the Introduction section and created a new dedicated section for comment analysis. We have also expanded the presentation of other methods from the literature for computing code similarities, to give better context on where our contributions are targeted.
  • We are introducing a comparative performance study (accuracy, precision, recall, F1-scores) of the presented methods, along with their computational overheads (see the summary in Table 3).
  • We are adding a performance study of the Universal Sentence Encoder when tested against different values of r.
  • We have rewritten the Implementation Details section and added a new section on reproducibility.
  • We have expanded the ideas discussed in the article in the Conclusions section and scoped future work.


Author proposed "Code Comments: A Way Of Identifying Similarities In The
Source Code"

> However, the manuscript must be improved before it can be further processed, based on the following:

 >1.  Other models like  Levenshtein-based models, Word2Vec-based models, etc results  used in justification of the proposed work model are not enough . Additional  experiments will help justify comparison with the proposed work.

In this revision, we are adding Table 3 to cover this quantitative analysis, presenting a performance-based comparison of the five analyzed models, tested against around 35 thousand comments from three versions of the Kubernetes source code. The table evaluates the models using precision, recall, and F1-score as metrics; it also reports the macro and weighted averages, as well as the accuracy. The model based on the Universal Sentence Encoder seems to achieve the best quantitative results, maintaining a good balance between precision and recall on the analyzed Kubernetes-based test dataset described above.

> Other concerns are as follows:

We address all the concerns mentioned by the reviewer in-line below:

> 1. The abstract has some English grammatical errors; the introduction has errors in punctuation.

We have acknowledged these errors and tried, to the best of our ability, to address them. The abstract has been revisited, and the introduction has been reviewed, with parts of it rewritten or rephrased. We are adding a dedicated paragraph in the Introduction section to address this: “Our research aims to expand automated methods for detecting software code similarities, designing tools that facilitate collaboration between human experts and the software. This will support informed plagiarism decisions through an interactive process where the broader context of the code, including comments and coding style, is automatically analyzed.”

> 2. What tool was used to analyze the results and to draw the graphs in Figure 2? The validity of the results must be justified.

We are expanding the description of tooling in the “Implementation details” section. We now describe how the data behind Figures 1-5 and Table 1 can be reproduced from the code structure. We are also specifying the tools used to generate the images. In particular, “for the plots, Figures~\ref{fig-comments-lines-k8s},~\ref{fig-comments-words-k8s},~\ref{ESlint-table-similarity} and~\ref{fig-alpha-comments-words-k8s} are built using Datawrapper, an online data visualization tool. Figure~\ref{ESlint-table-similarity} is generated using the matplotlib library.”
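As an illustration of the matplotlib-based figure generation mentioned above, a minimal sketch with made-up data (the actual scripts live with the project code) could look like this:

```python
# Minimal sketch of a matplotlib plot of the comment-line ratio across releases.
# The version labels and ratios below are made up for illustration only.
import matplotlib.pyplot as plt

versions = ["v1.0", "v1.10", "v1.20", "v1.28"]
comment_ratio = [0.16, 0.17, 0.18, 0.19]   # hypothetical values

plt.figure(figsize=(6, 3))
plt.plot(versions, comment_ratio, marker="o")
plt.xlabel("Kubernetes release")
plt.ylabel("comment lines / total lines")
plt.title("Share of comment lines across releases (illustrative)")
plt.tight_layout()
plt.savefig("comment_ratio.png", dpi=150)
```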

Reviewer 4 Report

Comments and Suggestions for Authors

The paper investigated code similarity using techniques such as Word2Vec, ELMo, RoBERTa, and USE models.

1. A comparative performance study (accuracy, precision, recall, F1-scores) of these methods, based on True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), along with computational overheads (training and testing times, etc.), may be studied.

2. Results can be compared with MOSS, a public-domain software tool that is in the references.

3. In the case of measuring software similarities across multiple programs, instead of the two programs in Figure 4, where we have seen two matrices (the first one with low resolution and the second one concentrating on the blocks of interest with higher resolution), the authors could mention how the choice of r = 6 would be affected. Also, when making comparisons for each pair, could the efficiency be improved by choosing one of the pair with cosine similarity near 1, or with low Levenshtein distance, for further similarity comparisons?


Comments for author File: Comments.pdf

Author Response

Thanks for the reviews. Based on the comments that we have received, we are uploading a revised version of the manuscript, with the following major changes:

  • We have rewritten the Introduction section and created a new dedicated section for comment analysis. We have also expanded the presentation of other methods from the literature for computing code similarities, to give better context on where our contributions are targeted.
  • We are introducing a comparative performance study (accuracy, precision, recall, F1-scores) of the presented methods, along with their computational overheads (see the summary in Table 3).
  • We are adding a performance study of the Universal Sentence Encoder when tested against different values of r.
  • We have rewritten the Implementation Details section and added a new section on reproducibility.
  • We have expanded the ideas discussed in the article in the Conclusions section and scoped future work.

> The paper investigated code similarity using techniques such as Word2Vec, ELMo, RoBERTa, and USE models.

> 1. A comparative performance study (accuracy, precision, recall, F1-scores) of these methods, based on True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), along with computational overheads (training and testing times, etc.), may be studied.

In this revision, we are adding Table 3 to cover this quantitative analysis, presenting a performance-based comparison of the five analyzed models, tested against around 35 thousand comments from three versions of the Kubernetes source code. The table evaluates the models using precision, recall, and F1-score as metrics; it also reports the macro and weighted averages, as well as the accuracy. The model based on the Universal Sentence Encoder seems to achieve the best quantitative results, maintaining a good balance between precision and recall on the analyzed Kubernetes-based test dataset described above.

> 2. Results can be compared with MOSS, a public-domain software tool that is in the references.

Thank you for the excellent suggestion. We are scoping this towards potential future work with these two models.

> 3. In the case of measuring software similarities across multiple programs, instead of the two programs in Figure 4, where we have seen two matrices (the first one with low resolution and the second one concentrating on the blocks of interest with higher resolution), the authors could mention how the choice of r = 6 would be affected.

Figure 4 presents results for the machine-readable model, which is not parameterized by the r = 6 value. The parameter r only applies to the human-readable comments model. That is why we omitted it from the captions.

> Also, when making comparisons for each pair, could the efficiency be improved by choosing one of the pair with cosine similarity near 1, or with low Levenshtein distance, for further similarity comparisons?

Project Martial implements both techniques in an ensemble fashion. We are also including this remark: “the two models are used in an ensemble, meaning their predictions are combined to achieve a more accurate and robust outcome.”
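A minimal sketch of such a combination is shown below; the thresholds and the OR-style vote are illustrative assumptions, not Project Martial's exact rule:

```python
# Illustrative ensemble rule combining the two precomputed similarity signals.
# Thresholds and the OR-style vote are assumptions, not Project Martial's exact logic.
def ensemble_similar(cosine_sim: float, edit_ratio: float,
                     cos_threshold: float = 0.85,
                     edit_threshold: float = 0.90) -> bool:
    """Flag a comment pair as similar if either signal is confident enough."""
    return cosine_sim >= cos_threshold or edit_ratio >= edit_threshold

# Example: a pair that the embedding model finds close but the edit distance does not.
print(ensemble_similar(cosine_sim=0.91, edit_ratio=0.42))  # True
```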

 

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Figure 1 needs a more detailed explanation so that the reader can understand the purpose of this figure in the text. Second, what results does it display? The purpose of mentioning Table 1 in the article, and what parts or sections the author has used, should be determined precisely. The rewriting section needs fundamental rewriting according to the writing style of the article. The article lacks a discussion section, which should be added.

Comments on the Quality of English Language

Figure 1 needs a more detailed explanation so that the reader can understand the purpose of this figure in the text. Second, what results does it display? The purpose of mentioning Table 1 in the article, and what parts or sections the author has used, should be determined precisely. The rewriting section needs fundamental rewriting according to the writing style of the article. The article lacks a discussion section, which should be added.

Author Response

 

> Figure 1 needs a more detailed explanation so that the reader can understand the purpose of this figure in the text.

We rewrote the caption of Figure 1: “The evolution of the total number of lines of comments and lines of code across different versions of Kubernetes. In parentheses, we capture the proportion of comment lines within the total source code corpus. This analysis spans roughly 10 years of continuous development, between the first release in 2013 and the first available release from 2023. Kubernetes releases currently happen approximately three times per year. Because commenting practices within the Kubernetes project might have changed over time, potentially offering insights into maintainability, collaboration dynamics, and code readability within this complex open-source system, we analyse the ratio trend over time.”

> Second, what results does it display? 

The figure displays the evolution of the total number of comment lines and lines of code across different versions of Kubernetes.

> The purpose of mentioning Table 1 in the article, and what parts or sections the author has used, should be determined precisely.

We are reinforcing the emphasis on this table in the Conclusions section, as well as in its caption.

> The rewriting section 

We’d like to hear more on what section the “rewriting section” refers to. Our manuscript does not have a “rewriting” section.

 

> The article lacks a discussion section, which should be added to it.

We are introducing a dedicated section for “Discussion”.

For the proposed solutions, the research observed that Levenshtein-based models offer excellent precision but may miss broader semantic similarities, while contextualized word embeddings such as Word2Vec or Universal Sentence Encoders perform better by capturing some of these complexities. However, these models might produce somewhat more false positives than the Levenshtein-based one, which keeps this metric close to zero.

 

The dataset used for evaluation was extracted from a subset of three Kubernetes versions, accumulating over 12 thousand lines of code and around 50 thousand identified similarities in comments. The best model was based on the Universal Sentence Encoder, achieving a strong balance between precision and recall, but showed some limitations when operating with sentences containing numerical values.

Also, the Universal Sentence Encoder-based model stands out with efficient embedding generation, though processing of the embeddings is slower, due to their increased size. 

As expected, simpler models such as the Levenshtein- and Word2Vec-based ones are significantly faster. Due to their still good performance, Word2Vec-based models can provide a feasible alternative to Universal Sentence Encoders, as they only use a fraction of the computational power required to run the analysis.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors proposed: "Code Comments: A Way Of Identifying Similarities In The Source Code"

The manuscript has been improved by the authors. The authors have clarified their proposed model; they also moved the unneeded parts to the appendix, which makes the reading clearer.

The manuscript can be accepted, except for minor corrections in line 276, which has confusing mixed brackets; the authors should rectify that before publication of the manuscript is considered.

Comments on the Quality of English Language

A little proofreading for grammatical sentence construction would be good.

Author Response

The authors proposed: "Code Comments: A Way Of Identifying Similarities In The Source Code"

> The manuscript has been improved by the authors. The authors have clarified their proposed model; they also moved the unneeded parts to the appendix, which makes the reading clearer.

Acknowledged.

> The manuscript can be accepted, except for minor corrections in line 276, which has confusing mixed brackets; the authors should rectify that before publication of the manuscript is considered.

By t ∈ (0, 1] we denote a half-open (open-closed) interval (a, b], that is, an interval that includes b but does not include a; in this case, 0 is excluded and 1 is included.
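For clarity, the intended reading of the notation can be restated in display form (a clarifying restatement, not the exact wording used at line 276):

```latex
% Intended reading of the half-open interval notation.
t \in (0, 1] \iff 0 < t \le 1,
\qquad
(a, b] = \{\, x \in \mathbb{R} : a < x \le b \,\}
```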
