# A Simhash-Based Integrative Features Extraction Algorithm for Malware Detection


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Methods

#### 3.1. Static Analysis

#### 3.2. Dynamic Analysis

#### 3.3. Integrative Features Extraction

#### 3.3.1. Simhash Algorithm

#### 3.3.2. Integrative Process

**A. Data Preprocessing**

**B. Weight Calculating**

**Definition 1.**

**Definition 2.**

**C. Hash**

**D. Weighting**

**E. Merging**

**Algorithm 1** Get_IntegrativeFeature()

// Integrative feature extraction algorithm

Input: .json files from dynamic analysis, disassembled files from static analysis, sample n (1 ≤ n ≤ S).

Output: four integrative features ${Z}_{{X}_{S}\cup {F}_{S}}$, ${Z}_{{X}_{S}\cup {R}_{S}}$, ${Z}_{{X}_{S}\cup {P}_{S}}$, ${Z}_{{X}_{S}\cup {N}_{S}}$.

Step 1. Let n = 1. Read each line of the .json file and the disassembly file, and capture the API call sequence ${X}_{S}$, the file behavior sequence ${F}_{S}$, the registry behavior sequence ${R}_{S}$, the process behavior sequence ${P}_{S}$, and the network behavior sequence ${N}_{S}$.

Step 2. Calculate the weight, w, of each of the four behavior sequences paired with the API call sequence, and obtain ${W}_{{X}_{S}\cup {F}_{S}}$, ${W}_{{X}_{S}\cup {R}_{S}}$, ${W}_{{X}_{S}\cup {P}_{S}}$, ${W}_{{X}_{S}\cup {N}_{S}}$.

Step 3. Calculate the hash values of ${X}_{S}$, ${F}_{S}$, ${R}_{S}$, ${P}_{S}$, ${N}_{S}$; the results are denoted ${H}_{{X}_{S}\cup {F}_{S}}$, ${H}_{{X}_{S}\cup {R}_{S}}$, ${H}_{{X}_{S}\cup {P}_{S}}$, ${H}_{{X}_{S}\cup {N}_{S}}$.

Step 4. Weight each of the hash values ${H}_{{X}_{S}\cup {F}_{S}}$, ${H}_{{X}_{S}\cup {R}_{S}}$, ${H}_{{X}_{S}\cup {P}_{S}}$, ${H}_{{X}_{S}\cup {N}_{S}}$; the results are denoted ${H}_{{X}_{S}\cup {F}_{S},b-bits}$, ${H}_{{X}_{S}\cup {R}_{S},b-bits}$, ${H}_{{X}_{S}\cup {P}_{S},b-bits}$, ${H}_{{X}_{S}\cup {N}_{S},b-bits}$.

Step 5. Accumulate the weighted sequences bit by bit in ${H}_{{X}_{S}\cup {F}_{S},b-bits}$, ${H}_{{X}_{S}\cup {R}_{S},b-bits}$, ${H}_{{X}_{S}\cup {P}_{S},b-bits}$, ${H}_{{X}_{S}\cup {N}_{S},b-bits}$, merge each into a final b-bit sequence, and then normalize it to obtain the integrative features ${Z}_{{X}_{S}\cup {F}_{S}}$, ${Z}_{{X}_{S}\cup {R}_{S}}$, ${Z}_{{X}_{S}\cup {P}_{S}}$, ${Z}_{{X}_{S}\cup {N}_{S}}$.
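The hash, weight, accumulate, and normalize steps of Algorithm 1 follow the general shape of a simhash fingerprint, which can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the weight of a token is assumed to be its occurrence count, MD5 is an arbitrary choice of hash function, normalization is assumed to be the usual sign threshold, and the names `token_hash`, `simhash`, and `integrative_feature` are hypothetical.

```python
import hashlib
from collections import Counter


def token_hash(token: str, bits: int = 64) -> int:
    """Map a token to a b-bit integer. MD5 is an assumed, arbitrary choice."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weighted_tokens: dict, bits: int = 64) -> list:
    """Return a b-bit fingerprint as a list of 0/1 values."""
    acc = [0.0] * bits
    for token, weight in weighted_tokens.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Weighting: a 1-bit contributes +w, a 0-bit contributes -w.
            acc[i] += weight if (h >> i) & 1 else -weight
    # Merging and normalization (assumed): positive sums become 1, others 0.
    return [1 if v > 0 else 0 for v in acc]


def integrative_feature(api_seq: list, behavior_seq: list, bits: int = 64) -> list:
    """Fingerprint of an API call sequence combined with one behavior sequence."""
    # Assumed weight definition: occurrence count over both sequences.
    weights = Counter(api_seq) + Counter(behavior_seq)
    return simhash(weights, bits)


# Hypothetical sequences standing in for X_S and F_S of one sample.
x_s = ["CreateFileW", "WriteFile", "RegSetValueExW", "WriteFile"]
f_s = ["file_created", "file_written"]
z_xf = integrative_feature(x_s, f_s)
```

Calling `integrative_feature` once per behavior sequence (file, registry, process, network) would yield four fixed-length vectors, which is the reason a fingerprint of this kind is convenient as classifier input: the feature length stays at b regardless of how long the original sequences are.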

## 4. Experiments and Results

#### 4.1. Experimental Configuration

#### 4.2. Experimental Design

#### 4.2.1. Classification Effect Evaluation

#### 4.2.2. Obfuscated-Detection Evaluation

#### 4.2.3. Time Performance Evaluation

#### 4.3. Experimental Results

#### 4.3.1. Classification Effect Evaluation

#### 4.3.2. Obfuscated-Detection Evaluation

#### 4.3.3. Time Performance Evaluation

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A


**Figure 1.** The process of the simhash-based integrative feature extraction algorithm. PEiD (PE iDentifier), IDA Pro (Interactive Disassembler Professional).

**Figure 4.** The ROC (Receiver Operating Characteristic) curves of the algorithms. TP (True Positive rate), FP (False Positive), NB (Naive Bayes), SGD (Stochastic Gradient Descent), SVM (Support Vector Machine), Ada (AdaBoost) and RT (Random Trees).

**Figure 5.** Classification effect of different types of features under different algorithms. CC (Correctly Classified rate), NB (Naive Bayes), SGD (Stochastic Gradient Descent), SVM (Support Vector Machine), Ada (AdaBoost) and RT (Random Trees).

**Figure 6.** Classification effect of different algorithms under different types of features. CC (Correctly Classified rate), NB (Naive Bayes), SGD (Stochastic Gradient Descent), SVM (Support Vector Machine), Ada (AdaBoost) and RT (Random Trees).

**Figure 7.** The obfuscated-detection results. DR (Detection Rate), NB (Naive Bayes), SGD (Stochastic Gradient Descent), SVM (Support Vector Machine), Ada (AdaBoost) and RT (Random Trees).

| Class | Number of Samples | Average Size (KB) | Minimum Size (bytes) | Maximum Size (KB) |
|---|---|---|---|---|
| Backdoor | 2200 | 48 | 3500 | 9277 |
| Trojan | 2350 | 147.7 | 215 | 3800 |
| Virus | 1048 | 71.1 | 1500 | 1278 |
| Worm | 351 | 199.3 | 394 | 3087 |

| Property Item | Host | Virtual Host | Virtual Guest |
|---|---|---|---|
| Operating System | Windows 7 64-bit | Ubuntu 16.04 64-bit | Windows 7 32-bit |
| Running Memory | 16 GB | 4 GB | 2 GB |
| Processor | Core i5-4690 | Core i5-4690 | Core i5-4690 |
| Hard Disk | 1 TB | 120.7 GB | 20 GB |

Software configuration: IDA Pro 6.8; PEiD; VMware Workstation 11; INetSim 1.2.6; Cuckoo Sandbox 2.0-rc2; Wireshark 2.2.6; Process Monitor.

**Table 3.** The results of the naive Bayes algorithm. CC (Correctly Classified rate), TP (True Positive rate), TN (True Negative), FP (False Positive), FN (False Negative), T (training Time).

| Feature Type | CC | TP | TN | FP | FN | T |
|---|---|---|---|---|---|---|
| Static feature | 0.953590 | 0.973 | 0.810 | 0.190 | 0.027 | 0.32 s |
| Dynamic feature | 0.897578 | 0.952 | 0.734 | 0.266 | 0.048 | 0.07 s |
| Integrative feature | 0.908021 | 0.925 | 0.773 | 0.227 | 0.075 | 0.12 s |
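The rate columns reported in Table 3 can be related to raw confusion-matrix counts with a short sketch. It assumes the standard definitions of these rates, which the table values are consistent with (in each row TP + FN = 1 and TN + FP = 1); the function name `rates` and the counts in the usage line are hypothetical and are not taken from the experiments.

```python
def rates(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute classification rates from confusion-matrix counts.

    tp, fn: malicious samples classified correctly / incorrectly.
    tn, fp: benign samples classified correctly / incorrectly.
    """
    positives = tp + fn
    negatives = tn + fp
    return {
        "CC": (tp + tn) / (positives + negatives),  # correctly classified rate
        "TP": tp / positives,                       # true positive rate
        "TN": tn / negatives,                       # true negative rate
        "FP": fp / negatives,                       # false positive rate
        "FN": fn / positives,                       # false negative rate
    }


# Hypothetical counts, for illustration only.
example = rates(tp=90, tn=80, fp=20, fn=10)
```

Under these definitions CC depends on the class balance of the test set, whereas the TP and TN rates do not, which is one reason to read the columns together rather than relying on CC alone.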

**Table 4.** The results of the stochastic gradient descent algorithm. CC (Correctly Classified rate), TP (True Positive rate), TN (True Negative), FP (False Positive), FN (False Negative), T (training Time).

| Feature Type | CC | TP | TN | FP | FN | T |
|---|---|---|---|---|---|---|
| Static feature | 0.983857 | 0.982 | 0.879 | 0.121 | 0.018 | 406.79 s |
| Dynamic feature | 0.925664 | 0.968 | 0.717 | 0.283 | 0.032 | 126.68 s |
| Integrative feature | 0.945034 | 0.965 | 0.807 | 0.193 | 0.035 | 83.08 s |

**Table 5.** The results of the support vector machine algorithm. CC (Correctly Classified rate), TP (True Positive rate), TN (True Negative), FP (False Positive), FN (False Negative), T (training Time).

| Feature Type | CC | TP | TN | FP | FN | T |
|---|---|---|---|---|---|---|
| Static feature | 0.905541 | 0.934 | 0.791 | 0.219 | 0.066 | 1.57 s |
| Dynamic feature | 0.848652 | 0.865 | 0.702 | 0.298 | 0.135 | 0.96 s |
| Integrative feature | 0.881117 | 0.921 | 0.768 | 0.232 | 0.079 | 1.26 s |

**Table 6.** The results of the AdaBoost algorithm. CC (Correctly Classified rate), TP (True Positive rate), TN (True Negative), FP (False Positive), FN (False Negative), T (training Time).

| Feature Type | CC | TP | TN | FP | FN | T |
|---|---|---|---|---|---|---|
| Static feature | 0.914763 | 0.936 | 0.782 | 0.218 | 0.064 | 0.27 s |
| Dynamic feature | 0.835647 | 0.883 | 0.703 | 0.297 | 0.117 | 0.11 s |
| Integrative feature | 0.892046 | 0.897 | 0.807 | 0.193 | 0.103 | 0.14 s |

**Table 7.** The results of the random trees algorithm. CC (Correctly Classified rate), TP (True Positive rate), TN (True Negative), FP (False Positive), FN (False Negative), T (training Time).

| Feature Type | CC | TP | TN | FP | FN | T |
|---|---|---|---|---|---|---|
| Static feature | 0.954179 | 0.975 | 0.793 | 0.217 | 0.025 | 2.01 s |
| Dynamic feature | 0.914794 | 0.951 | 0.689 | 0.311 | 0.049 | 1.41 s |
| Integrative feature | 0.941315 | 0.968 | 0.814 | 0.196 | 0.032 | 1.73 s |

**Table 8.** The obfuscated-detection results of models trained by different types of features. All values are DR (Detection Rate). NB (Naive Bayes), SGD (Stochastic Gradient Descent), SVM (Support Vector Machine), Ada (AdaBoost) and RT (Random Trees).

| Feature Type | NB | SGD | SVM | Ada | RT |
|---|---|---|---|---|---|
| Static feature | 0.4726 | 0.5243 | 0.5137 | 0.4783 | 0.5939 |
| Dynamic feature | 0.7339 | 0.7595 | 0.5861 | 0.6754 | 0.7589 |
| Integrative feature | 0.7220 | 0.7669 | 0.578 | 0.6812 | 0.7652 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Li, Y.; Liu, F.; Du, Z.; Zhang, D. A Simhash-Based Integrative Features Extraction Algorithm for Malware Detection. *Algorithms* **2018**, *11*, 124.
https://doi.org/10.3390/a11080124
