Automatic Malicious Code Classification System through Static Analysis Using Machine Learning

Abstract: The development of information and communication technology (ICT) is making daily life more convenient by allowing access to information anytime and anywhere and by improving the efficiency of organizations. Unfortunately, malicious code is also proliferating and becoming increasingly complex and sophisticated. In fact, even novices can now easily create it using hacking tools, which is causing it to increase and spread exponentially. It has become difficult for humans to respond to such a surge. As a result, many studies have pursued methods to automatically analyze and classify malicious code. There are currently two methods for analyzing it: a dynamic analysis method that executes the program directly and confirms the execution result, and a static analysis method that analyzes the program without executing it. This paper proposes a static analysis automation technique for malicious code that uses machine learning. The classification system was designed by combining a method for classifying malicious code using the portable executable (PE) structure and a method for classifying it using images generated from the file. The system has 98.77% accuracy when classifying normal and malicious files. The proposed system can be used to classify various types of malware, from PE files to shell code.


Introduction
The development of information and communication technology (ICT) is making daily life more convenient by allowing access to information at anytime and anywhere and by improving the efficiency of organizations [1]. However, various cyber attacks such as information leakage and ransomware have also been increasing. Most of these cyber attacks are caused by malicious code.
Malware is becoming increasingly complex and sophisticated as a result of evolving computer technology. When the open source concept emerged, various types were generated, and now even novices can easily create malicious code using hacking tools. Such code is increasing and spreading exponentially [2][3][4]. Approximately 20% of all malicious code in circulation is classified as variants of existing code [5]. Thus, various types have been developed, further increasing malware attacks and making it difficult for a person to manually respond to such malicious code. As a result, a considerable amount of research has been done to automatically analyze and classify it.
There are currently two methods for analyzing malicious code: a dynamic analysis method that executes the program directly and confirms the execution result, and a static analysis method that analyzes the program without executing it. The dynamic analysis method monitors changes that occur as the file is executed and checks what function it performs. Although it has the advantage of observing changes during actual execution, it is limited in its ability to cover all execution paths [6]. A significant limitation is that it cannot analyze malicious code whose behavior is triggered only under certain conditions, such as code that works only at a specific point in time. Thus, it is difficult to analyze and classify malicious code effectively using only a dynamic analysis automation system. Therefore, it is necessary to pursue research and systems that can automate the static analysis of harmful code.
This paper proposes a static classification system for malicious code that combines machine learning and deep learning. Section 2 describes malicious code classification research, and Section 3 proposes a system architecture. Section 4 presents the results of experimental and performance analyses, and Section 5 presents the implications and conclusions related to the system.
Symmetry 2021, 13, 35

Related Works
Various studies have been conducted on the analysis and classification of malicious code. These include extracting the application programming interfaces (APIs) used in a program and calculating weights based on the probability that each API occurs in malicious code [7]; applying the N-gram method, in which the N-element sequences of the entire binary file are extracted as strings to create a general file that is then applied to malicious code classification [7]; and using static signs of packing and obfuscation to calculate the level of risk, together with an examination of the structure of the binary files, to investigate the extent of the malicious code [7].
Nataraj proposed a method to visualize and classify malicious code files, using Gabor-filter-based image features as the feature extractor together with a k-nearest neighbors (kNN) classifier [8]. Chen suggested DroidVecDeep, a malicious code detection method that uses deep learning technology to effectively detect unknown code on the Android platform. DroidVecDeep first extracts various features and ranks them using the mean decrease impurity. Then, it transforms the features into compact vectors based on word2vec. Finally, it trains the classifier based on a deep learning model [9]. Naeem proposed the cross-platform malware variant classification system (CP-MVCS), which converts binary malicious code into a grayscale image and extracts malware features from the images using combined SIFT-GIST malware (CSGM) [10].
Other studies have classified malicious code based on information extracted from the portable executable (PE) header and section information [11,12].

Design and Implementation of Malicious Code Classification System
A malicious code classification system is proposed to automate a static analysis to distinguish and classify the nature of the file itself without running it. The designed classification system receives all of the files as input data, including the malicious code, normal file, and source file. Figure 1 shows the structure of the whole system.
Progress in the system is largely divided into a preprocessing step, a malicious code classification step, and a step that reflects the results in a database (DB). In the preprocessing step, the PE data extraction module and the image generation module generate the input data for each model used in the classification step. In the classification step, each model individually judges whether the input is malicious. The random forest, gradient boosting, and decision tree algorithms classify malicious code from the data generated from the PE data, and the CNN classifies malicious code from the images generated by the image generation module. The final verdict is determined by integrating the classification results of the models. In the last step, the classification results are reflected in the DB, whose records consist of data information and a value that indicates whether the data is malicious. The contents of each step are as follows.

Preprocessing Step
The system has learning models that have been trained with various kinds of algorithms. To prepare an input file for these models, the system extracts the hash value of the file, extracts PE data, and performs image conversion.

Hash Extraction
This step extracts the hash value of an input file, which serves as a unique identifier. This is done to determine whether the input data is a duplicate. The extracted hash value is used as the primary key: in the DB update step, the classification result of newly entered data is added to the DB, and the record of duplicated data is updated.
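The duplicate check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not name the hash function, so SHA-256 is assumed here, a plain dictionary stands in for the DB, and the names `file_hash` and `upsert_result` are hypothetical.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the file contents (assumed hash function)."""
    return hashlib.sha256(data).hexdigest()


def upsert_result(db: dict, data: bytes, label: str) -> bool:
    """Store the classification result keyed by the file hash.

    Returns True if the hash was already present (duplicate input),
    in which case the existing record is overwritten.
    """
    key = file_hash(data)
    duplicate = key in db
    db[key] = label
    return duplicate
```

Keying on the content hash means that two files with identical bytes map to the same record regardless of their file names.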

PE Data Extraction
The information needed for PE files to run in Windows exists in the header and sections of the PE structure. Therefore, information related to maliciousness can be obtained from the PE structure without executing the malicious code, and the import address table (IAT) within the PE header can be used to determine which dynamic link libraries (DLLs) are loaded and which functions are used in each DLL [13]. If the data has a PE structure, a total of 55 features, including the entropy and packers, are extracted from the header and section parts of the file. At this time, YARA rules are used to find the packing information of the file in the binary. YARA is a tool for identifying and classifying malicious code types using their signatures. In a traditional malicious code classification system, a file is judged to be malicious when its patterns match; in the proposed system, a match is used only as one factor to be considered. Figure 2 shows an example of the attribute information extracted from the data by the YARA rule.
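Entropy is one of the features named above. The paper does not state how it is computed, so the sketch below assumes the usual Shannon entropy over byte values, in bits per byte; the function name `shannon_entropy` is hypothetical. Values close to 8 often suggest compressed or packed content.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum -p * log2(p) over the byte values that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In a feature extractor, such a function would typically be applied to the raw bytes of each section as well as to the whole file.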

Image Create
Image Create is a module that visualizes the entered file for the CNN and converts it into image data. The input data are treated as a one-dimensional vector.
First, this one-dimensional vector is converted to a two-dimensional vector using Equation (1), which calculates the length of one side of the image. The vector is then reshaped into a 2D vector based on the size found.
Second, the 2D vector is used to generate a color image, represented as a 3D vector, by mapping each byte value of 0-255 to the corresponding color in the basic palette of Deluxe Paint. Figure 3 shows the results of converting the file to an 8-bit color image.
Finally, to apply the images to the CNN, all generated images must have the same size. OpenCV is used to resize each generated image to 256 × 256. Figure 4 shows the pseudo code for the image creation.
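The first two conversion steps can be sketched as follows. Equation (1) is not reproduced in this text, so the sketch assumes the side length is the ceiling of the square root of the file length, with zero padding at the tail. The Deluxe Paint palette values are not given either, so a grayscale ramp is used as a placeholder. The names `bytes_to_square`, `apply_palette`, and `PLACEHOLDER_PALETTE` are hypothetical, and the final resize to 256 × 256 is omitted.

```python
import math
from typing import List, Tuple


def bytes_to_square(data: bytes) -> List[List[int]]:
    """Reshape a byte sequence into a square 2D grid of values in 0-255.

    Assumption: side length is ceil(sqrt(n)); the tail is zero-padded.
    """
    n = len(data)
    side = math.isqrt(n - 1) + 1 if n > 0 else 0
    padded = data + bytes(side * side - n)
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


def apply_palette(grid: List[List[int]],
                  palette: List[Tuple[int, int, int]]) -> List[List[Tuple[int, int, int]]]:
    """Map each byte value to an (R, G, B) tuple via a 256-entry palette."""
    return [[palette[v] for v in row] for row in grid]


# Placeholder palette (grayscale ramp), not the actual Deluxe Paint colors.
PLACEHOLDER_PALETTE = [(v, v, v) for v in range(256)]
```

With a real 256-color palette in place of the placeholder, the output of `apply_palette` is the height × width × 3 structure that the text refers to as a 3D vector.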

Classification Step
The malicious code classification proceeds by using the pre-processed data. Various experiments have been done to find a model suitable for classification. Details of these experiments are described in Section 4.

Classification Using PE Structure
For modules that use PE structures, the classification of the malicious code is performed using the decision tree, random forest, and gradient boosting algorithms, which show excellent performance. Rather than using all the properties of the PE structure, an automated feature selection method is used to score the importance of each feature, and classification is then performed using the features with the highest importance scores. Table 1 lists the 12 attributes selected from 54 attributes.

Classification Using Image
The image module uses AlexNet [14] to proceed with the classification. The detailed layers and parameters of the classifier are given in Table 2.

Final Classification Result
The final result is selected by majority among the classification results of the four models, and a determination is made about whether the file is malicious code. If the counts are tied, the classification is conducted again. If the reclassification is also tied, the file is considered malicious code, because mistaking malicious code for a normal file is more dangerous than the reverse. Figure 5 shows the pseudo code of the decision module. In Figure 5, good and bad indicate the number of normal and malicious codes.
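The decision rule described above can be sketched as follows. This is not the pseudo code of Figure 5, only a minimal illustration of the same rule: `classify` is a hypothetical callable that runs the models and returns one vote per model, with "good" for normal and "bad" for malicious.

```python
from typing import Callable, List


def tally(votes: List[str]) -> str:
    """Return 'bad', 'good', or 'tie' by majority over the model votes."""
    bad = sum(1 for v in votes if v == "bad")
    good = len(votes) - bad
    if bad > good:
        return "bad"
    if good > bad:
        return "good"
    return "tie"


def decide(classify: Callable[[], List[str]]) -> str:
    """Majority vote with one reclassification on a tie.

    A second tie is resolved as 'bad' (malicious), the safer outcome.
    """
    result = tally(classify())
    if result == "tie":
        result = tally(classify())
    return "bad" if result == "tie" else result
```

Because four models vote, a 2-2 split is possible, which is why the tie-breaking rule is needed at all.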

Database Application Step
This module updates the weight value stored in the database according to the result of classifying whether or not the code is malicious. Once the weight value for a file reaches a specific value, the next classification of that file can be decided directly from the DB value without passing through the models, which increases the classification speed. If the file is not yet registered in the DB, it is registered before the operation starts. Figure 6 shows the operational process of the DB update module.
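A minimal sketch of such a weight-based shortcut is shown below. The paper does not give the update rule or the cutoff, so both are assumptions: the weight moves by one toward the verdict on each classification, and `THRESHOLD` is a hypothetical cutoff. A dictionary keyed by file hash stands in for the DB.

```python
THRESHOLD = 3  # hypothetical cutoff; the actual value is not stated


def update_weight(db: dict, key: str, is_malicious: bool) -> int:
    """Register the hash if missing, then move its weight toward the verdict."""
    weight = db.setdefault(key, 0)
    weight += 1 if is_malicious else -1
    db[key] = weight
    return weight


def cached_verdict(db: dict, key: str):
    """Return 'bad' or 'good' once the weight is decisive, else None."""
    weight = db.get(key, 0)
    if weight >= THRESHOLD:
        return "bad"
    if weight <= -THRESHOLD:
        return "good"
    return None  # not decisive yet: run the models
```

A caller would check `cached_verdict` first and run the models only when it returns `None`.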

Experiment
In order to select an algorithm for the classification of malicious code, experiments were performed to classify various malicious code samples for each module. The three experiments listed in the following subsections were performed.

Module Using PE Structure
The binary classification used random malware from VX Heaven [15]. This included data from a typical Windows portable program with approximately 1000 files extracted from 10,000 files. To find an algorithm suitable for the classification of malicious code, experiments were done with five algorithms, which were implemented using the sklearn [16] library. The first algorithm was AdaBoost, which was repeated 50 times based on a decision tree. The second algorithm was random forest, with no 10 depth limit using bootstrap gradient boosting without prior probability. The third algorithm was a decision tree with the maximum depth limited to 10. The fourth algorithm was logistic regression, and the last algorithm was Gaussian Naive Bayes (GNB) with 50 booster repeats. In addition, experiments were conducted to classify groups of malicious code. The models were trained with 80% of the data and validated with the remaining 20%. Figure 7 shows the detailed results of the malicious code classification experiment on a module using PE information. The blue bar is the accuracy of the binary classification, and the orange bar is the accuracy of the malicious code classification.

Module Using Image
The performance of the classifier was tested using the Microsoft Malware Classification Challenge (Big 2015) dataset [17]. From the training data, which contain 10,868 items in nine classes, the bytecode files with the PE header removed were used, randomly split into 80% for training and 20% for validation. To check the performance of various preprocessing methods, the data were organized into three kinds of images: black and white images, color images produced through Deluxe Paint mapping, and color images in which three bytes were grouped together into a single pixel. The experiment was conducted using TensorFlow [18]. The learning results are shown in Figure 8.

The three experiments showed similar accuracy in tests with the training data, but re-verification using untrained test data showed that the accuracy with black and white images was approximately 3% lower than with the two kinds of color images. In terms of the loss rate, however, black and white images behaved more appropriately than color images, and their overall learning speed was higher.
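The third encoding, in which three bytes are grouped into a single pixel, can be sketched as follows. The handling of a tail that is not a multiple of three is not described, so zero padding is assumed; the function name `group_rgb` is hypothetical.

```python
from typing import List, Tuple


def group_rgb(data: bytes) -> List[Tuple[int, ...]]:
    """Group consecutive bytes into (R, G, B) pixels, zero-padding the tail."""
    padded = data + bytes(-len(data) % 3)
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
```

Compared with palette mapping, this encoding yields one third as many pixels for the same file, since each pixel carries three bytes instead of one.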

Proposed Classification System Performance Experiment
In order to evaluate the performance of the designed malicious code classification system, normal and malicious file classification experiments were conducted. Malicious files were randomly selected from approximately 23,000 files from VX Heaven data, with the normal files consisting of approximately 1100 DLLs and executables from Windows. The experimental results showed an accuracy of approximately 98.77%. The detailed accuracy results are listed in Table 3. As the learning progressed, the classification time decreased, while maintaining the accuracy of the DB.

Conclusions
As the proliferation of malicious code increases and it becomes ever more intelligent, there is insufficient manpower to analyze all of it and respond manually. To overcome this problem, this paper proposed a system that automatically and statically analyzes the code to determine if it is malicious. Various characteristic factors are extracted, such as the hash value, PE metadata, and packer information, and classified using various machine learning algorithms. This is similar to existing automatic signature-based classification tools, but differs in that a signature match is used as one consideration rather than as a pattern that by itself decides whether the code is malicious. In addition, the file itself is visualized and entered into a CNN model. Thus, both PE files and shell code can be classified.
In the future, this method can be improved through experiments and research to classify various types of malicious code information, instead of just determining the existence of malicious code. In addition, using the designed system, we plan to develop a classification system, as well as a system that is capable of detecting the reception and transmission of a file in real time during network transmission.