Code reuse is widespread in software development, including software for Internet of Things (IoT) devices. However, code reuse introduces many problems, e.g., software plagiarism and the propagation of known vulnerabilities. Solving these problems requires extensive manual reverse engineering. Fortunately, binary clone detection can reduce this manual work by matching reused code against known code. However, many binary clone detection methods are not robust to different compiler optimization options or different architectures. While some clone detection methods can be applied across architectures, they rely on manually engineered features based on human prior knowledge to generate feature vectors for assembly functions, and they fail to consider the semantic associations between features. To address this problem, we propose and implement a prototype, GeneDiff, a binary clone detection approach based on semantic representation that works across architectures. GeneDiff uses a representation model based on natural language processing (NLP) to generate a high-dimensional numeric vector for each function from its Valgrind intermediate representation (VEX). This is the first work that translates assembly instructions into an intermediate representation and uses a semantic representation model to implement cross-architecture clone detection. GeneDiff is robust to various compiler optimization options and different architectures. Compared to approaches using symbolic execution, GeneDiff is significantly more efficient and accurate. The area under the curve (AUC) of the receiver operating characteristic (ROC) of GeneDiff reaches 92.35%, which is considerably higher than that of approaches that use symbolic execution. Extensive experiments indicate that GeneDiff can detect similarity with high accuracy even when the code has been compiled with different optimization options and for different target architectures.
We also evaluate GeneDiff on real-world IoT firmware built for different architectures, demonstrating its practicality for detecting known vulnerabilities.
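As a rough illustration of the matching step, the sketch below ranks known functions by their similarity to a query function once each function has been reduced to a numeric vector. It is a minimal sketch under stated assumptions, not GeneDiff's implementation: the abstract does not name the similarity measure, so cosine similarity is an assumption here, and the function names, the three-dimensional vectors, and the helper `rank_clones` are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)


def rank_clones(query_vec, repository):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in repository.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vectors for known functions compiled for other architectures.
repository = {
    "known_vuln_arm": [0.90, 0.10, 0.30],
    "parse_header_mips": [0.10, 0.80, 0.50],
}

# Hypothetical vector for a function extracted from a firmware image.
query = [0.85, 0.15, 0.35]

ranking = rank_clones(query, repository)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the vectors would have far more dimensions and would come from the trained representation model rather than being written by hand; the ranking logic stays the same.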
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.