Article
Peer-Review Record

Distributed Representation for Assembly Code

Computers 2023, 12(11), 222; https://doi.org/10.3390/computers12110222
by Kazuki Yoshida *, Kaiyu Suzuki and Tomofumi Matsuzawa *
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 11 August 2023 / Revised: 25 October 2023 / Accepted: 25 October 2023 / Published: 1 November 2023
(This article belongs to the Special Issue Best Practices, Challenges and Opportunities in Software Engineering)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a representation learning model that extracts feature representations from assembly code obtained through static analysis in order to determine software similarity. As a method for discriminating similar software and programs, the Authors applied distributed representations of assembly code obtained by statically analyzing the executable code. Experiments were carried out in which the Authors generated expression vectors from multiple programs and verified the accuracy of clustering them so that similar programs are classified into the same cluster. According to the Authors, the proposed method performs better in terms of accuracy and execution time than existing methods that only take semantics into account. The topic is interesting, and the paper corresponds well with the journal's aim and scope.


The paper is well structured. The Authors clearly presented the problem; however, the Authors' contribution should be emphasized more in the Introduction section. Highlights are also missing.

The method and experiments section seems to be complete.

As in the Introduction section, the Conclusions section lacks a clear emphasis on the Authors' contributions, future work, and limitations.


The list of references needs to be extended.


Overall, the paper looks good but some changes are required.


Minor typos:


Figure 1 should be provided in better quality – it is a bit blurry.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The article is devoted to generating program representations that can be used to detect potential malware by combining the existing methods Asm2Vec and FEATHER. This is an important topic given the growing amount of new software being developed.

The article is well written and reports interesting methods and results. However, there are methodological problems that make the authors' claims diverge from their experimental results. Namely,

1. The authors write that "This is due to the increase in the number of malware variants that have been modified from existing malware and the increase in the number of similar software that share common functions and control structures through the reuse and theft of source codes". However, for their experiment they use a different kind of data: "we perform K-means clustering with cos similarity as the distance function on the set of expression vectors obtained from the experimental dataset, and measure the accuracy with which the solutions of the same creator are classified into the same cluster" (see the first sketch after this list for an illustration of this evaluation setup). There are no data supporting that solutions-by-the-same-author (what is actually measured) are the same as "similar software that share common functions and control structures through the reuse and theft of source codes" (the goal). Please either provide references to studies showing that methods that help with one problem also help with the other with the same efficiency, or consider using a software dataset related to malware to support your original claims. Why don't you use datasets of malware programs?

2. The authors write that the supposed use of their method is "by automatically identifying similar software from static analysis results, the load of manual work involved in reverse engineering can be greatly reduced" - i.e., they intend their method to be used as pre-screening for manual analysis. In that kind of usage, the overall accuracy measure they report ("correct response rate") is not informative, because precision and recall must be analyzed separately. When doing pre-screening, you are normally interested in a better recall rate even at the cost of some precision, because the cost of missing malware (a false negative) is significantly higher than the cost of manually analysing a non-malware program (a false positive). Accuracy and F1 measures cannot be used naively in applications like that. However, there is no discussion of the desired properties of the representation and classification algorithm, and no analysis of the resulting algorithms against those properties. Please consider carefully the requirements stemming from your potential application, and measure and discuss the precision and recall of your method separately (see the second sketch after this list).
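For reference, a minimal sketch of the clustering-based evaluation setup discussed in point 1, assuming the expression vectors and creator labels are available as NumPy arrays (all names and data below are illustrative placeholders, not taken from the paper); cosine similarity is approximated by L2-normalizing the vectors and running standard K-means:

    # Illustrative sketch only: K-means with cosine similarity via L2 normalization.
    # `vectors` and `creators` are placeholder data, not the paper's dataset.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    from sklearn.preprocessing import normalize

    rng = np.random.default_rng(0)
    vectors = rng.random((200, 128))            # expression vectors (n_programs x dim)
    creators = rng.integers(0, 10, size=200)    # ground-truth creator labels

    # After L2 normalization, Euclidean distance is monotonically related to cosine similarity
    unit_vectors = normalize(vectors)
    clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(unit_vectors)

    # One way to score how well same-creator solutions fall into the same cluster
    print("Adjusted Rand Index:", adjusted_rand_score(creators, clusters))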
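And a minimal sketch of reporting precision and recall separately, as requested in point 2, computed here over program pairs: a pair counts as predicted similar if both programs land in the same cluster, and as truly similar if they share a creator. This pairwise framing is an assumption made for illustration, not the paper's metric:

    # Illustrative pairwise precision/recall for a clustering, assuming
    # `labels_true` (e.g., creator IDs) and `labels_pred` (cluster IDs) are given.
    from itertools import combinations

    def pairwise_precision_recall(labels_true, labels_pred):
        tp = fp = fn = 0
        for i, j in combinations(range(len(labels_true)), 2):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            if same_pred and same_true:
                tp += 1          # correctly grouped together
            elif same_pred:
                fp += 1          # grouped together but not actually similar
            elif same_true:
                fn += 1          # actually similar but split apart
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    print(pairwise_precision_recall([0, 0, 1, 1], [0, 1, 1, 1]))  # (0.33..., 0.5)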

The article can also be improved in other ways:

1. The Related Works section is rather short and is mostly devoted to describing the models that the authors combined; it does not give a comprehensive review of the state of the art in program similarity detection, which is a very advanced field (e.g., see software plagiarism detection algorithms).

2. Linked to that, the reference list is rather small for a journal article, with 5 references out of 18 placed together simply after naming a term: "distributed representation of graph data, graph embedding [9] [10] [11] [12] [13]". Please expand your reference list to credit other works in your field and the people who developed the methods you use (e.g., you use Levenshtein's distance without referencing Levenshtein's original work). Also, there are no DOIs in your reference list; please add them.

3. Section 4.3 has a long list with sub-lists of the same structure which can be better shown as a table.

Nevertheless, the research is performed well and tackles an important problem, and the method could be used in other domains, e.g., plagiarism detection (where experiments on classifying code by its author make more sense). The main problem here is making a proper connection between the intended usage (and hence the requirements for the method) and the experimental methodology (what is really measured).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

During the revisions, the article was improved; however, there was no attempt to address the most serious weakness (number 2 in the original review): the authors did not calculate and discuss the precision and recall of their method. The authors argued in their reply: "We used the metrics as evaluation index to evaluate whether the proposed method is able to embed vectors generated from similar programs in such a way that they are close in vector space".

The problem is, using just one measure to evaluate the method is wrong; two measures are needed:

* precision measures how many of the programs your method determines as close are really close;

* recall measures how many of the programs that are really close your method determines as close.

The same accuracy can be achieved with high precision and low recall, or with low precision and high recall. For different practical applications, the requirements for precision and recall are different. For example, if you are generating images and can easily generate them by the thousands, but your goal is to produce a few very realistic images, your preferred method should have high precision (all selected images are realistic) even with low recall (you can discard any dubious image and generate more). On the contrary, when selecting programs for manual evaluation for malware or plagiarism detection, you may want high recall even with relatively low precision, because you do not want to give a pass to malware code even if that requires manually verifying a few more innocent programs.
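As a purely numerical illustration of this point (the figures below are invented, not taken from the paper), two confusion matrices can have identical accuracy while their precision and recall differ sharply:

    # Invented numbers, for illustration only: equal accuracy, opposite trade-offs.
    def metrics(tp, fp, fn, tn):
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return accuracy, precision, recall

    print(metrics(tp=10, fp=0, fn=20, tn=70))   # accuracy 0.80, precision 1.00, recall 0.33
    print(metrics(tp=30, fp=20, fn=0, tn=50))   # accuracy 0.80, precision 0.60, recall 1.00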

So by calculating, reporting and discussing precision and recall of your method (compared to other methods), you will give the readers (and yourself) a better picture of its strengths, weaknesses and possible applications.

Calculating precision and recall is common in evaluating program code embedding methods: consider, for example, this article in a Q1 journal: https://dl.acm.org/doi/pdf/10.1145/3428301.


Also, the long list of models in lines 225-275 can be formatted as a table.

Comments on the Quality of English Language

The article has minor English flaws in the new text, for example: "If the semantic and control structures between the software are similar" (structures between the software?) and "can embed vectors generated from similar programs in close in the vector space" (in close in ... space?).

Sometimes, sections end with a colon that is followed by a title rather than by a list (see lines 283 and 292).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

The problems raised in the previous review stage were adequately addressed.

I consider the article publishable in the current state, but it can be improved by adding the discussion of the method's precision and recall to the Conclusions section.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
