FastText-Based Local Feature Visualization Algorithm for Merged Image-Based Malware Classification Framework for Cyber Security and Cyber Defense

Abstract: The importance of cybersecurity has recently been increasing. Malware authors embed malware in otherwise normal executable files, and a computer is more likely to be infected when users have easy access to various executables. Malware is considered the starting point for cyber-attacks; thus, the timely detection, classification and blocking of malware are important. Malware visualization is a method for detecting or classifying malware. A global image is visualized through binaries extracted from malware. The overall structure and behavior of malware are considered when global images are utilized. However, the visualization of obfuscated malware is difficult, owing to the difficulties encountered when extracting local features. This paper proposes a merged image-based malware classification framework that includes local feature visualization, global image-based local feature visualization, and global and local image merging methods. This study introduces a fastText-based local feature visualization method: First, local features such as opcodes and API function names are extracted from the malware; second, important local features in each malware family are selected via the term frequency inverse document frequency algorithm; third, the fastText model embeds the selected local features; finally, the embedded local features are visualized through a normalization process. Malware classification based on the proposed method was experimentally verified using the Microsoft Malware Classification Challenge dataset. The accuracy of the proposed method was approximately 99.65%, which is 2.18% higher than that of another contemporary global image-based approach.


Introduction
Technologies in various fields, such as autonomous control [1], music [2] and multimedia content, are rapidly advancing [3][4][5]. Through these advancements, access to information has become progressively easier, which also exposes users to cyber threats. Malware is any malicious software designed to harm computers or computer networks. Accessing a file that contains malware poses a direct threat to personal information; therefore, malware files should be blocked before execution. Each type of malware behaves differently depending on the family it belongs to, and the countermeasures differ accordingly; this necessitates the classification of malware into families.
Malware detection methods fall into two main types, signature-based and heuristic-based; the latter addresses the shortcomings of the former [6,7]. Heuristic-based malware detection involves scanning malware to detect features suspected of malicious behavior. To this end, many dynamic and static analysis methods have been developed [8][9][10][11]. A dynamic analysis method detects malicious behavior by executing the malware itself in an isolated virtual environment [12], whereas a static analysis method detects malicious behavior by identifying the overall structure without executing the malware [13,14].
Previously, methods for visualizing malware through static analysis have been proposed [15,16]. A global image can detect malware variants because the overall structure is maintained while small changes in the malware are still captured. Global images generated from malware belonging to the same family are similar, making them suitable for classifying malware. However, a global image cannot capture the actual behavior of obfuscated malware. A method for combining the global image and local features was proposed to increase classification accuracy by considering actual malware behavior [17]. To extract local features, that method uses application programming interface (API) and dynamic link library (DLL) information from the text section of the bytes file. The global image of malware combined with local features can be used for accurate malware classification. However, a method is still needed for extracting local features from obfuscated malware, which is considered a difficult task.
This paper proposes a merged image-based malware classification framework (MMCF) to classify malware. MMCF includes local feature visualization, global image-based local feature visualization, and global and local image merging methods. This paper describes the local feature visualization technique for MMCF. The local feature visualization method embeds the local features extracted from malware, and generates local images based on the embedding results. To the best of our knowledge, there is no known method for generating local images based on the embedding results of local features in malware detection and classification. By creating a local image based on the embedding results, the relationships among all the local features can be represented in a single image. The generated local image has unique features for each family, because important local features are selected for each family of malware. The contributions of this study are as follows:
• FastText model extensibility: The fastText model is used to embed local features, and it aids malware classification.
• New local feature visualization: The local features of malware are visualized based on the embedding results. Both the relationships and the order among local features can be considered in a single image by visualizing the extracted local features based on embedding.
• Generating a local image for MMCF: A local image with a simple pattern is generated for MMCF. The generated local image helps in applying the global image-based local feature visualization method proposed for MMCF.
The rest of the paper is structured as follows: Section 2 covers related works. Section 3 introduces the local feature visualization method for MMCF. Section 4 describes the experimental procedures and results. Section 5 discusses the results. Section 6 presents the conclusion.

Portable Executable
Portable executable (PE) is a file format for executable files used by the Windows operating system [18]. PE is the data structure that encapsulates the information required by the Windows loader to manage the code, and it includes dynamic library references for linking and API import tables. A PE file can be of different types, such as an *.exe file or a *.dll file. A PE file consists of several headers and sections that describe how the file is mapped into memory. Specifically, the text section contains the program code, and the data section contains the global variables. Each section is mapped to a different memory region.

Global Image-Based Malware Detection or Classification
Nataraj et al. proposed a method for visualizing malware to address the shortcomings of static and dynamic analyses [19]. They divided the binary information extracted from malware into 8-bit vectors and used each vector as one pixel. Because an 8-bit vector represents values between 0 and 255, it is suitable for grayscale images. They extracted texture features from the generated image and used KNN (K-Nearest Neighbors) as a classifier, achieving good performance.
Kesav Kancherla et al. converted the binary value extracted from executable files into an 8-bit vector, and used it as the pixel intensity [20]. They detected malware using a support vector machine (SVM) as a classifier by extracting the intensity, wavelet and three Gabor features from the generated image. They detected and classified malware based on various features using different feature extraction algorithms.
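The byte-to-pixel mapping shared by these global image methods can be illustrated with a minimal sketch. The function name bytes_to_grayscale, the fixed image width, and the zero padding of the last row are assumptions made for illustration only; they are not taken from [19] or [20].

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 16) -> List[List[int]]:
    """Arrange raw bytes row by row into a 2-D grid of 8-bit pixel values.

    Each byte (0..255) becomes one grayscale pixel. The image width and the
    zero padding of the final row are illustrative assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        # Pad the final row with zeros (black) so every row has the same width.
        row += [0] * (width - len(row))
        rows.append(row)
    return rows


# Toy byte sequence used only to show the mapping; it is not real malware.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0xFF])
image = bytes_to_grayscale(sample, width=4)
print(image)  # [[77, 90, 144, 0], [3, 255, 0, 0]]
```

Because every byte maps directly to one pixel, the resulting image reflects the overall layout of the binary rather than its behavior.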

Local Feature-Based Malware Detection or Classification
Sang Ni et al. proposed a malware classification method using SimHash and Convolutional neural network (MCSC), which combined malware visualization and deep learning [21]. They extracted an opcode sequence from malware files and encoded it using SimHash. They then converted the SimHash result of the extracted opcodes into a grayscale image using the pixel, and verified it through a convolutional neural network (CNN) using 10,805 malware samples.
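A generic SimHash fingerprint over an opcode sequence can be sketched as follows. This is a textbook bag-of-tokens SimHash, not the exact MCSC procedure of [21]; the use of MD5 as the per-token hash, the 64-bit fingerprint length, and the mapping of bits to pixel values 0 and 255 are assumptions made for illustration.

```python
import hashlib
from typing import List


def simhash(tokens: List[str], bits: int = 64) -> int:
    """Compute a SimHash fingerprint by summing per-token bit votes."""
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a cleared bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def fingerprint_to_pixels(fp: int, bits: int = 64) -> List[int]:
    """Map each fingerprint bit to a black (0) or white (255) pixel."""
    return [255 if (fp >> i) & 1 else 0 for i in range(bits)]


opcodes = ["push", "mov", "call", "mov", "xor"]  # toy opcode sequence
pixels = fingerprint_to_pixels(simhash(opcodes))
print(len(pixels))  # 64
```

In this sketch the fingerprint depends only on which tokens occur and how often, so reordering the opcodes leaves the result unchanged.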
Jianwen Fu et al. generated a global image using the global features of malware along with local features [17]. For each section of the malware-infected PE files, they used the entropy values, byte values and relative section sizes to generate the global image; they extracted texture and color features from this global image using the gray-level co-occurrence matrix (GLCM) and color moments. They extracted local features from the code and data sections of the malware, and classified malware through the random forest (RF) method by combining global and local features.

Comparison between Prior Works and Our Proposed Method
Table 1 summarizes the differences between conventional malware detection and classification methods and MMCF. Nataraj et al. and Kancherla et al. generated a global image using the binary information of malware; they detected and classified malware based on the features of the generated global image [19,20]. When obfuscated malware is detected and classified with global images only, high accuracy is difficult to attain, as the actual malware behavior is not considered. We classify non-obfuscated and obfuscated malware by visualizing the opcodes and API function names that represent malware behavior.
Sang Ni et al. encoded the opcodes extracted from malware with SimHash and visualized the result [21]. However, there is a problem with their method: the relationships among opcodes cannot be represented when the extracted opcodes are encoded with SimHash. Our method embeds opcodes and API function names through the fastText model and visualizes them, thereby overcoming this shortcoming to a significant extent.
Jianwen Fu et al. classified malware by generating a global image with the global features of malware and extracting local features from the code and data sections of malware [17]. Such extraction of local features from obfuscated malware through static analysis remains a difficult task. We therefore propose MMCF, which visualizes local features based on the global images generated from the binary information of non-obfuscated and obfuscated malware; it classifies malware by merging global and local images.

Overview
The proposed method includes the input, preprocessing, and training and classification phases, as illustrated in Figure 1. The input phase extracts ASM and bytes files from a database with a disassembler. The preprocessing phase includes global image generation and local feature visualization. A global image is generated by using the binary information extracted from the bytes file through a binary extractor as pixels. The local features are extracted from ASM files through a local feature extractor, and are visualized. The extracted local features are input into an obfuscation checker to determine whether they are obfuscated. If the malware is obfuscated, the local features are entered into a GAN executor. If the malware is not obfuscated, the local features are visualized through the local feature visualizer and a local image is generated. The phases proceed as follows: global and local images are input into a GAN trainer, and a global image of the obfuscated malware is input into a GAN executor that outputs a local image of the obfuscated malware.
Each malware sample is input into a binary extractor that outputs the ASM file and the bytes file of that sample.

Preprocessing Phase
The local image set consists of the local images of the individual malware samples. Each local image is generated by the processes detailed in Figure 2 using the local feature extracted from the ASM file of the corresponding sample. The text section refers to the section with the program code in the PE file. A feature extractor receives an ASM file and extracts a local feature from the text section of the ASM file based on a predefined list. The local feature is composed of opcodes and API function names. A feature selector receives the local feature and outputs the selected local feature based on the term frequency inverse document frequency (TFIDF) algorithm [15].
The top local features of each family are derived in descending order of their TFIDF values. The selected local feature is obtained after removing duplicate local features and the local features that belong to all families. The fastText model uses distributed representation to map words with similar meanings, among the input words, to similar vector values [22]. A fastText trainer learns from the local feature and outputs the trained fastText model. The fastText executor receives the local feature and the trained fastText model, and outputs the embedded local feature. Each element of the embedded local feature is normalized from the range of −1 to 1 to the range of 0 to 255. Because the normalized local feature consists of values ranging from 0 to 255, the pixel range of a grayscale image, a local image is generated by using these values as pixels. The size of the normalized local feature is the same as the size of the local feature embedded through the fastText model. The row size of the 2-D matrix Ω is the size of the local feature extracted from the non-obfuscated malware, and the column size of Ω is the size of the embedded local feature.
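The normalization step can be sketched as follows, assuming a linear mapping from the range of −1 to 1 onto 0 to 255 with clipping of out-of-range values. The function names, the rounding rule, and the toy embedding vectors (which stand in for the output of a trained fastText model) are assumptions; the source does not specify them.

```python
from typing import List


def normalize_to_pixel(value: float) -> int:
    """Map an embedding element to an 8-bit pixel value.

    Values below -1 become 0 and values above 1 become 255; values in
    between are scaled linearly. The rounding rule is an assumption.
    """
    clipped = max(-1.0, min(1.0, value))
    return int(round((clipped + 1.0) / 2.0 * 255))


def local_image(embedded: List[List[float]]) -> List[List[int]]:
    """Build the 2-D pixel matrix: one row per local feature, one column
    per embedding dimension."""
    return [[normalize_to_pixel(v) for v in vec] for vec in embedded]


# Hypothetical 3-dimensional embeddings of two local features; real vectors
# would come from the trained fastText model.
embedded = [
    [-1.5, -1.0, 0.0],
    [0.5, 1.0, 2.0],
]
image = local_image(embedded)
print(image)  # [[0, 0, 128], [191, 255, 255]]
```

The output matrix has the same shape as the embedded local feature, which matches the statement that normalization does not change its size.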

Experiments
An experiment was conducted to verify the proposed method: the local feature visualization process was performed, and its results and the malware classification results obtained through MMCF were derived.

Dataset and Experimental Environment
The dataset used to verify the proposed method is the Microsoft Malware Classification Challenge (BIG 2015) dataset [23]. The BIG 2015 dataset is divided into: (1) training data with label information; and (2) test data without label information. The training and test data consist of ASM files and bytes files extracted from malicious samples through IDA Pro. The dataset, composed of 9 families, includes 10,868 malware samples with a size of 500 GB. Table 2 details the names and numbers of malware used in the experiment. Ramnit is a worm-type malware, and its total count was 1541, of which 28 were obfuscated. The total number of Lollipop malware was 2478, of which 8 were obfuscated. Vundo, Tracur, Obfuscator.ACY, and Gatak are Trojan-type malware, and their total count was 3467, of which 544 were obfuscated. Kelihos_ver3 and Kelihos_ver1 are botnet-type malware, and their total count was 3340, of which 17 were obfuscated. Simda is backdoor-type malware, and its total count was 42, of which 8 were obfuscated. In the experiment, the 10,868 ASM files and 10,868 bytes files of the training data with label information were used, because the proposed method could not be verified on data without label information. A total of 90% of the training data was used for training and 10% for testing.
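The group totals quoted above can be cross-checked against the per-family counts reported in Table 2 with a short tally. The dictionary layout and variable names are illustrative; only the counts and the type groupings come from the text.

```python
# Per-family counts from Table 2: (non-obfuscated, obfuscated).
table2 = {
    "Ramnit": (1513, 28),
    "Lollipop": (2470, 8),
    "Kelihos_ver3": (2936, 6),
    "Vundo": (447, 28),
    "Simda": (34, 8),
    "Tracur": (294, 457),
    "Kelihos_ver1": (387, 11),
    "Obfuscator.ACY": (1170, 58),
    "Gatak": (1012, 1),
}

# Malware types as grouped in the text.
groups = {
    "trojan": ["Vundo", "Tracur", "Obfuscator.ACY", "Gatak"],
    "botnet": ["Kelihos_ver3", "Kelihos_ver1"],
}


def group_totals(families):
    """Return (total samples, obfuscated samples) for a list of families."""
    total = sum(sum(table2[f]) for f in families)
    obfuscated = sum(table2[f][1] for f in families)
    return total, obfuscated


trojan = group_totals(groups["trojan"])
botnet = group_totals(groups["botnet"])
overall = group_totals(list(table2))
print(trojan, botnet, overall)  # (3467, 544) (3340, 17) (10868, 605)
```

The tally reproduces the Trojan and botnet totals given in the text and the overall count of 10,868 samples.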

Table 2. Names and numbers of malware used in the experiment.
Family Index | Family Name | Non-Obfuscated Malware | Obfuscated Malware | Total Number
1 | Ramnit | 1513 | 28 | 1541
2 | Lollipop | 2470 | 8 | 2478
3 | Kelihos_ver3 | 2936 | 6 | 2942
4 | Vundo | 447 | 28 | 475
5 | Simda | 34 | 8 | 42
6 | Tracur | 294 | 457 | 751
7 | Kelihos_ver1 | 387 | 11 | 398
8 | Obfuscator.ACY | 1170 | 58 | 1228
9 | Gatak | 1012 | 1 | 1013
Table 3 lists the parameters used in the experiment; batchsize is the number of images input at a time, and imageshape is the size of an image. The 32 × 32 local images output from the GAN model are reshaped into 256 × 128 images. The learning rate is represented by learningrate, and epoch is the number of training epochs. Similarly, filter_size is the size of the filter, G_h0 is the size of the first layer of the generator, G_h1 is the size of the first CNN layer of the generator, G_h2 is the size of the second CNN layer of the generator, G_h3 is the size of the output layer of the generator, D_h0 is the size of the first CNN layer of the discriminator, D_h1 is the size of the second CNN layer of the discriminator, D_h2 is the size of the third CNN layer of the discriminator, D_h3 is the size of the output layer of the discriminator, conv1 is the size of the first CNN layer, conv2 is the size of the second CNN layer, conv3 is the size of the third CNN layer, fc1 is the size of the first FC layer and fc2 is the size of the second FC layer.
Table 4 lists the selected local features in descending order of the TFIDF values, together with the results of embedding and normalization and of local feature visualization for non-obfuscated and obfuscated malware. The top three selected local features were the same for every family, but the selected local features differed from the fourth onward. The results are derived by extracting the top X opcodes and API function names for each family of malware, and removing the duplicates. These results indicate that different families of malware act differently.
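The per-family selection of local features can be sketched with a small TFIDF computation. This sketch treats each family as one document and uses the plain term frequency and logarithmic inverse document frequency; the exact TFIDF variant, the function names, and the toy opcode lists are assumptions not given in the source.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_per_family(family_tokens: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each local feature per family, treating a family as one document."""
    n_families = len(family_tokens)
    df = Counter()
    for tokens in family_tokens.values():
        df.update(set(tokens))  # number of families containing each feature
    scores = {}
    for family, tokens in family_tokens.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[family] = {
            t: (c / total) * math.log(n_families / df[t]) for t, c in counts.items()
        }
    return scores


def select_features(family_tokens: Dict[str, List[str]], top_k: int) -> List[str]:
    """Take the top_k features per family in descending TFIDF order, then
    drop duplicates and features shared by all families (score 0)."""
    selected: List[str] = []
    for fam_scores in tfidf_per_family(family_tokens).values():
        ranked = sorted(fam_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for token, score in ranked[:top_k]:
            if score > 0 and token not in selected:
                selected.append(token)
    return selected


# Toy opcode and API name lists for three hypothetical families.
families = {
    "A": ["mov", "mov", "push", "CreateFileA"],
    "B": ["mov", "xor", "xor", "push"],
    "C": ["mov", "call", "call", "RegOpenKeyExA"],
}
selected = select_features(families, top_k=2)
print(selected)  # ['CreateFileA', 'push', 'xor', 'call', 'RegOpenKeyExA']
```

With this weighting, a feature that occurs in every family (here mov) receives a score of zero and is never selected, which mirrors the removal of local features belonging to all families.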
The top 45, 50, 55, 60 and 65 local features with high TFIDF values are selected to visualize the local image. Table 5 summarizes the extracted opcodes and API function names, the embedding results and the normalization results. Because the element values of the embedded opcodes and API function names are mostly distributed between −1 and 1, pixels are defined by normalizing element values between −1 and 1 to values between 0 and 255. If the element value is less than −1, the pixel is defined as 0, and if it is greater than 1, it is defined as 255. By normalizing the embedded results to values between 0 and 255, which is the pixel range of a grayscale image, the embedded results are included in a single image; this image of the malware therefore reflects the relationships between opcodes and API function names.
Figure 4 presents the loss values of the CNN that classifies obfuscated and non-obfuscated malware into families. The loss value represents the difference between the predicted and actual values. The loss value of the CNN started at 2.6 at the first iteration, became 0.268 at the fifth iteration, and converged to 0.0543 at the 2701st iteration.
Figure 5 details the learning accuracy of the CNN that classifies obfuscated and non-obfuscated malware into families. Accuracy is the rate at which the malware family predicted by the CNN matches the actual malware family. The learning accuracy started at 15.6% at the first iteration, reached 84.4% at the 270th iteration, and converged to 100% at the 2701st iteration.
Table 6 lists the accuracy of malware classification; the proposed method achieved 99.65% accuracy in classifying malware into families. The method proposed by Jianwen Fu [17] classifies malware into different families; a global image is generated using a global feature and a local feature extracted from malware.
This method is similar to the proposed method, in that it uses local features together with images of malware. The performance of the method proposed in this study was 2.18% better than that of the method proposed by Jianwen Fu [17]. The method proposed by Sang Ni [21] classifies malware into different families; local images are created using local features extracted from malware. The performance of the method proposed in this study was 0.39% better than that of the method proposed by Sang Ni [21]. Furthermore, in comparison with the global image-based malware detection and classification methods proposed by Nataraj [19] and Kancherla [20], the method proposed in this study achieved 1.65% and 3.70% higher accuracy, respectively.
Table 6. Accuracy of malware classification.
Method | Accuracy (%) | Image and Feature Types
Jianwen Fu et al. [17] | 97.47 | Global Image, Local Feature
Sang Ni et al. [21] | 99.26 | Local Image
Nataraj et al. [19] | 98.00 | Global Image
Kancherla et al. [20] | 95.95 | Global Image

Non-Obfuscated and Obfuscated Malware Classification Results Achieved by the Proposed MMCF
All 1024 non-obfuscated test samples were classified into the correct family, resulting in 100% accuracy. Of the 128 obfuscated test samples, 124 were classified into the correct family, resulting in 96.87% accuracy. The local image of the obfuscated malware was generated from the global image through the GAN model. However, the generated local image was inaccurate in comparison with the local image of the non-obfuscated malware. The classification accuracy of the obfuscated malware was lower than that of non-obfuscated malware, because the unique patterns of each family were not clearly displayed. Table 7 presents the confusion matrix for classifying obfuscated malware into different families. In Table 7, yellow indicates the numbers of accurate classifications and red indicates the numbers of inaccurate classifications.
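The reported percentages follow directly from the counts above, and pooling the two groups gives a figure consistent with the overall accuracy of approximately 99.65% reported for the proposed method. The pooling is an observation made here rather than a computation described in the source, and the variable names are illustrative.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return classification accuracy as a percentage."""
    return 100.0 * correct / total


non_obf_correct, non_obf_total = 1024, 1024
obf_correct, obf_total = 124, 128

non_obf_acc = accuracy_percent(non_obf_correct, non_obf_total)
obf_acc = accuracy_percent(obf_correct, obf_total)
# Pooled accuracy over both groups of test samples.
overall_acc = accuracy_percent(non_obf_correct + obf_correct,
                               non_obf_total + obf_total)

print(non_obf_acc)            # 100.0
print(obf_acc)                # 96.875
print(round(overall_acc, 2))  # 99.65
```

The obfuscated accuracy of 96.875% appears in the text truncated to 96.87%.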

Comparison between the Results of the Proposed MMCF and Those of Other Research
The malware classification technique based on the global image and local features (text) proposed by Jianwen Fu et al. [17] is similar to the method proposed in this study. However, MMCF, using the proposed global and local images, yields a 2.18% higher accuracy. The local image-based malware classification method proposed by Sang Ni et al. [21] used the same dataset as the proposed MMCF. Comparing MMCF (with the global and local images) to the method proposed by Sang Ni (with local images) [21], the result of the former is 0.39% better than that of the latter. Sang Ni et al. [21] experimented using only 10,805 of the 10,868 BIG 2015 samples, whereas MMCF was evaluated using all 10,868 BIG 2015 samples. Even though Sang Ni et al. [21] experimented with a reduced dataset, MMCF yielded a higher accuracy.

Computational Complexity
The proposed MMCF has a strong advantage in that it delivers higher accuracy than other methods proposed in recent studies; however, it has the disadvantage of increased computational complexity. Computational complexity is an important consideration when malware is detected or classified in real time. However, this study does not consider computational complexity, because the proposed MMCF focuses on demonstrating the applicability of the fastText model, an embedding model frequently used in natural language processing, to expressing the relationships among malware local features in a single image. Reducing the computational complexity of generating a local image with embedding models will be studied in the future.

Security of Big Data
Big data contains a wide variety of data. Information protection for big data is necessary, especially when it contains personal information or important company information. My T. Thai et al. describe the applications of big data and social networks, and the protection of privacy and security [24].
If a system that handles big data is infected with malware, the root authority of the system may be hijacked. Personal or confidential information contained in the database is easily stolen if an attacker gains root authority over the system that controls the database. Therefore, the ability to detect and classify malware before its execution is important, because malware is considered the starting point of cyberattacks.

Conclusions
MMCF offers three methods: local feature visualization, global image-based local feature visualization, and global and local image merging. This paper described the local feature visualization method. First, the ASM and bytes files were extracted from a database. Second, the local features of each family of malware were selected, based on the TFIDF algorithm. Third, the selected local features were embedded through fastText. Fourth, the embedding results were normalized for each local feature to use them as pixels. Fifth, a local image was generated using the normalized results. Sixth, based on the generated local and global images, malware was classified into different families.
The performance of the proposed method was experimentally verified as follows: First, the local features selected based on TFIDF were derived. Second, the embedding results obtained through fastText and the normalized results obtained from the embedding results were derived. Third, the local feature visualization results of obfuscated and non-obfuscated malware based on the normalized results were derived. In comparison with the method proposed by Jianwen Fu, which is the most similar to the proposed method, the proposed method achieved approximately 2.18% higher accuracy.
Future work will focus on improving the local image of obfuscated malware visualized based on a GAN. Because the derived local image is blurry, it is necessary to obtain a simpler pattern to generate an image based on the GAN. Methods for reducing the number of local features extracted from malware or selecting meaningful local features will also be studied. In addition, MMCF will be improved so that it can detect whether files are malicious or benign and classify malicious files (malware) into families.