HM^{3}alD: Polymorphic Malware Detection Using Program Behavior-Aware Hidden Markov Model

## Abstract

This paper presents HM^{3}alD, a novel program behavior-aware hidden Markov model for polymorphic malware detection. The main idea is to use an effective clustering scheme to partition the program behavior of malware instances and then train a novel hidden Markov model (called the program behavior-aware HMM) on each cluster to capture the corresponding behavior. Low-level program behavior, namely the OS-level system call sequence, is mapped to a high-level action sequence, and these actions are used as transition triggers across states in the program behavior-aware HMM topology. Experimental results on a large dataset of 6349 malware samples show that HM^{3}alD outperforms all current dynamic and static malware detection methods, especially in terms of FAR.

## 1. Introduction

^{6} malware variants were created [3]. In this regard, many methods have been proposed that focus on detecting and classifying malware [1]. As the volume of malware grows, antivirus products are usually unable to detect all of it, because malware programs typically hide themselves with obfuscation methods that make them hard to detect by static analysis [4].

#### 1.1. Malware

#### 1.2. Malware Analysis and Detection

#### 1.3. Contributions

#### 1.4. Paper Structure

## 2. Related Work

#### 2.1. Static Analysis-Based Methods

#### 2.2. Dynamic Analysis-Based Methods

## 3. Background

#### 3.1. Malware Obfuscation

#### 3.2. Hierarchical Clustering

#### 3.3. Hidden Markov Model

## 4. Proposed Method: ${\mathbf{HM}}^{\mathbf{3}}\mathbf{alD}$

#### 4.1. Training Phase

#### 4.1.1. Training Profiler

#### 4.1.2. Preprocessing

#### 4.1.3. Clustering Action Sequences

**Algorithm 1:** Preprocessing: generating an action sequence from a system call sequence

**Input:** a system call sequence $S=\{{s}_{1},{s}_{2},\dots ,{s}_{n}\}$.

**Output:** $AS$, the action sequence for the system call sequence $S$.

Step | Operation |
---|---|
1 | Define an empty array $\mathcal{B}$ in which each element stores a system call sequence during the execution of the algorithm. |
2 | Initialize $A$ with the 26 base actions: $A=\{{v}_{1},{v}_{2},\dots ,{v}_{26}\}$ |
3 | for each ${s}_{i}\in S$ do |
4 | $\ell$ = OS_Handle(${s}_{i}$) ▷ get the operating system handle pointer of ${s}_{i}$ |
5 | insert($\mathcal{B}(\ell )$, ${s}_{i}$) ▷ insert every system call into $\mathcal{B}(\ell )$ |
6 | if ${s}_{i}$ is a releasing system call then ▷ a releasing system call releases the dependent resources and invalidates the handle |
7 | $v$ = Match($A$, $\mathcal{B}(\ell )$) ▷ match $\mathcal{B}(\ell )$ to the corresponding action in $A$ by a hash function |
8 | insert($AS$, $v$) ▷ insert action $v$ into $AS$ |
9 | end if |
10 | end for |
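The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the two-entry `ACTION_TABLE`, the `RELEASING_CALLS` set, and the `(name, handle)` input format are hypothetical, and the sketch assumes that subsequences with no matching action are silently skipped and that a released handle starts a fresh subsequence if its value is reused.

```python
from collections import defaultdict

# Hypothetical lookup table: the system calls issued on one handle, in order,
# map to one high-level action. The method itself uses 26 base actions.
ACTION_TABLE = {
    ("NtCreateFile", "NtWriteFile", "NtClose"): "write file",
    ("NtOpenFile", "NtReadFile", "NtClose"): "read file",
}
RELEASING_CALLS = {"NtClose"}  # assumed set of handle-releasing system calls


def to_action_sequence(syscalls):
    """Map (name, handle) pairs, given in execution order, to an action sequence."""
    per_handle = defaultdict(list)  # plays the role of the array B, indexed by handle
    actions = []                    # the action sequence AS
    for name, handle in syscalls:
        per_handle[handle].append(name)
        if name in RELEASING_CALLS:
            # Remove the entry so a reused handle value starts a new subsequence.
            key = tuple(per_handle.pop(handle))
            action = ACTION_TABLE.get(key)  # dictionary lookup acts as the hash match
            if action is not None:
                actions.append(action)
    return actions
```

Because calls are grouped per handle, interleaved operations on different handles are still separated into distinct actions, and each action is emitted at the point where its handle is released.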

#### 4.1.4. Training HMMs

**Input sequences:** The action sequences of programs at runtime are the observation sequences.

**HMM topology:** We define the topology of HMMs based on program behavior. Recall that each program goes through three main stages at runtime: (1) initialization; (2) running; and (3) termination. In the initialization stage, each program sets its variables to their initial values and allocates its required resources. Thus, we model this stage with the serial Bakis topology. In the running stage, programs perform iterative units of work in the form of conditions and loops. Thus, we model this stage with the parallel fully connected (ergodic) topology. The termination stage is similar to the initialization stage: it publishes the program outputs, deallocates the resources, and terminates the program. Thus, we model this stage with the Bakis topology as well. Figure 2 shows the novel program behavior-aware HMM topology, in which the number of states in the running stage remains to be determined.

**Training:** To learn an HMM, ${\lambda}_{{c}_{i}}=({\pi}_{{c}_{i}},{A}_{{c}_{i}},{B}_{{c}_{i}})$, on each malware cluster ${c}_{i}$, we must set its initial state probability ${\pi}_{{c}_{i}}=\{{\pi}_{1}=1\}$; its transition probability matrix ${A}_{{c}_{i}}=\{{a}_{rj}=P({q}_{t+1}={S}_{j}|{q}_{t}={S}_{r})\}$; and its observation probability matrix ${B}_{{c}_{i}}=\{{b}_{j}({v}_{m})=P({O}_{t}={v}_{m}|{q}_{t}={S}_{j})\}$. We first determine the number of states ${N}_{{c}_{i}}$ and then initialize the ${A}_{{c}_{i}}$ and ${B}_{{c}_{i}}$ matrices.
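To make the topology and the initialization step concrete, the following Python sketch builds a boolean mask of allowed transitions and a random row-stochastic matrix that respects it. The exact connectivity is an assumption made for illustration: Bakis states allow a self-loop and a step to the next state, the last initialization state enters the first running state, and every running state may enter the first termination state. The function names and the stage sizes are hypothetical.

```python
import numpy as np


def behavior_aware_mask(n_init, n_run, n_term):
    """Allowed transitions for init (Bakis), run (ergodic), term (Bakis) stages.

    Assumes at least one state in every stage.
    """
    n = n_init + n_run + n_term
    first_term = n_init + n_run
    run = slice(n_init, first_term)
    mask = np.zeros((n, n), dtype=bool)
    # Bakis stages: each state may stay where it is or move to the next state.
    for i in list(range(n_init)) + list(range(first_term, n)):
        mask[i, i] = True
        if i + 1 < n:
            mask[i, i + 1] = True
    # Running stage: fully connected (ergodic) block.
    mask[run, run] = True
    # Assumed exit: any running state may enter the first termination state.
    mask[run, first_term] = True
    return mask


def random_transition_matrix(mask, rng):
    """Random initial A: zero outside the mask, every row summing to 1."""
    a = rng.random(mask.shape) * mask
    return a / a.sum(axis=1, keepdims=True)


mask = behavior_aware_mask(2, 2, 2)       # a six-state example
A0 = random_transition_matrix(mask, np.random.default_rng(0))
pi0 = np.zeros(6)
pi0[0] = 1.0                              # pi_1 = 1: always start in the first state
```

Baum-Welch re-estimation keeps transitions that start at zero at zero, so a mask like this is one way to make training preserve the topology.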

#### 4.1.5. Computing Decision Thresholds

**Algorithm 2:** HM^{3}alD training phase

**Input:** $\mathcal{C}=\{{c}_{1},{c}_{2},\dots ,{c}_{k}\}$, the set of malware clusters.

**Output:** the set of trained HMMs, $TrainedHMMs=\{{\lambda}_{{c}_{1}},{\lambda}_{{c}_{2}},\dots ,{\lambda}_{{c}_{k}}\}$.

Step | Operation |
---|---|
1 | $TrainedHMMs=\varnothing $ ▷ an empty set |
2 | for $i=1$ to $\lvert \mathcal{C}\rvert $ do |
3 | Estimate the number of states (${N}_{{c}_{i}}$) for cluster ${c}_{i}$ |
4 | Initialize ${\pi}_{{c}_{i}}$, ${A}_{{c}_{i}}$ and ${B}_{{c}_{i}}$ using the proposed program behavior-aware topology |
5 | Estimate ${\lambda}_{{c}_{i}}=({\pi}_{{c}_{i}},{A}_{{c}_{i}},{B}_{{c}_{i}})$ using the Baum–Welch algorithm |
6 | Add ${\lambda}_{{c}_{i}}$ to $TrainedHMMs$ |
7 | end for |


**Algorithm 3:** Computing decision thresholds for all malware HMMs, ${\lambda}_{{c}_{i}}$

**Input:**
- $fracrej={R}_{{c}_{1}}={R}_{{c}_{2}}=\dots ={R}_{{c}_{k}}$, the rejection fraction, set to the same value for every cluster.
- ${V}_{C}=\{{V}_{{c}_{1}},{V}_{{c}_{2}},\dots ,{V}_{{c}_{k}}\}$, where each ${V}_{{c}_{i}}={\left\{{O}^{l}\right\}}_{l=1}^{|{V}_{{c}_{i}}|}$ is the malware validation part of cluster ${c}_{i}$.
- ${V}_{b}={\left\{{O}^{l}\right\}}_{l=1}^{|{V}_{b}|}$, a set of benign action sequences.
- $Malware\_HMM=\{{\lambda}_{{c}_{1}},{\lambda}_{{c}_{2}},\dots ,{\lambda}_{{c}_{k}}\}$, the learned HMM set corresponding to the malware clusters.

**Output:** $TV=\{{\mathcal{T}}_{{c}_{1}},{\mathcal{T}}_{{c}_{2}},\dots ,{\mathcal{T}}_{{c}_{k}}\}$, the threshold vector corresponding to the malware HMMs. In other words, $\{({\lambda}_{{c}_{1}},{\mathcal{T}}_{{c}_{1}}),({\lambda}_{{c}_{2}},{\mathcal{T}}_{{c}_{2}}),\dots ,({\lambda}_{{c}_{k}},{\mathcal{T}}_{{c}_{k}})\}$.

Step | Operation |
---|---|
1 | $TV=\varnothing $ ▷ an empty set, created once before the loop so that thresholds accumulate |
2 | for $i=1$ to $k$ do |
3 | $PV=\varnothing $ ▷ an empty set |
4 | for each ${O}^{l}\in {V}_{{c}_{i}}$ do |
5 | $Pm=LP({O}^{l}\mid {\lambda}_{{c}_{i}})$ |
6 | insert($PV$, $Pm$) |
7 | end for |
8 | sort($PV$) ▷ sort in ascending order |
9 | $t\_index$ = round($fracrej\times \lvert PV\rvert $) |
10 | $tm=PV(t\_index)$ |
11 | Compute $P{b}_{{c}_{i}}$ by using Equation (3) |
12 | if $P{b}_{{c}_{i}}\le tm$ then |
13 | ${\mathcal{T}}_{{c}_{i}}=P{b}_{{c}_{i}}$ |
14 | else |
15 | ${\mathcal{T}}_{{c}_{i}}=tm$ |
16 | end if |
17 | insert($TV$, ${\mathcal{T}}_{{c}_{i}}$) |
18 | end for |
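For a single cluster, steps 8 to 16 of Algorithm 3 reduce to a few lines of Python. The sketch below is illustrative only: the benign-side value $P{b}_{{c}_{i}}$ is taken as an input (`benign_bound`) because it comes from Equation (3), the function name is hypothetical, and zero-based indexing with a clamp to the last element is an assumption about how $PV(t\_index)$ is meant to be read.

```python
def decision_threshold(malware_logprobs, benign_bound, fracrej):
    """Threshold for one malware HMM.

    malware_logprobs: log-probabilities of the cluster's validation sequences
    benign_bound:     the value Pb for this cluster, computed elsewhere
    fracrej:          fraction of validation sequences allowed below the threshold
    """
    pv = sorted(malware_logprobs)                         # ascending order
    t_index = min(round(fracrej * len(pv)), len(pv) - 1)  # clamp to a valid index
    tm = pv[t_index]
    # The if/else in steps 12 to 16 selects the smaller of the two candidates.
    return min(benign_bound, tm)
```

With 20 validation sequences and `fracrej=0.05`, `t_index` is 1, so exactly one validation sequence scores strictly below `tm` when the scores are distinct.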

#### 4.2. Detection Phase

In the detection phase, Algorithm 4 runs **forward**(${O}^{l}$,${\lambda}_{{c}_{i}}$) on the action sequence ${O}^{l}$ induced by running program ${P}_{l}$ and calculates $LP({O}^{l}|{\lambda}_{{c}_{i}})$ for each malware HMM in turn. The result is “Malware” when at least one of the malware HMMs returns a value $forward\_valu{e}_{i}$ greater than or equal to the corresponding threshold ${\mathcal{T}}_{{c}_{i}}$; otherwise it is “Benign”.

#### 4.3. Time Complexity Analysis

#### 4.3.1. Training Phase

#### 4.3.2. Detection Phase

**Algorithm 4:** HM^{3}alD detection phase

**Input:**
- ${O}^{l}$ = an induced action sequence from the running of program ${P}_{l}$ in the real environment.
- $Malware\_HMM=\{({\lambda}_{{c}_{1}},{\mathcal{T}}_{{c}_{1}}),({\lambda}_{{c}_{2}},{\mathcal{T}}_{{c}_{2}}),\dots ,({\lambda}_{{c}_{k}},{\mathcal{T}}_{{c}_{k}})\}$, the learned HMM set corresponding to the malware clusters.

**Output:** program type: Malware/Benign

Step | Operation |
---|---|
/* Compute $LP({O}^{l}\mid {\lambda}_{{c}_{i}})$ of action sequence ${O}^{l}$ with the forward algorithm, for all $i=1\dots k$ malware HMMs. */ | |
1 | for $i=1$ to $k$ do |
2 | $forward\_valu{e}_{i}$ = forward(${O}^{l}$, ${\lambda}_{{c}_{i}}$) ▷ forward() is presented as Algorithm 5 in Appendix B |
3 | if $forward\_valu{e}_{i}\ge {\mathcal{T}}_{{c}_{i}}$ then |
4 | return “Malware” ▷ returns “Malware” and then exits |
5 | end if |
6 | end for |
/* If none of the malware HMMs detects the program ${P}_{l}$ as malware, the algorithm decides that it is a benign program. */ | |
7 | return “Benign” |
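The early-exit decision rule of Algorithm 4 can be written compactly in Python. In this sketch each malware HMM is represented by a hypothetical scoring callable that stands in for forward(${O}^{l}$, ${\lambda}_{{c}_{i}}$), paired with its threshold; the constant lambdas in the usage lines are placeholders, not trained models.

```python
def detect(action_sequence, malware_models):
    """Return "Malware" if any model scores the sequence at or above its threshold.

    malware_models: iterable of (log_likelihood_fn, threshold) pairs.
    """
    for log_likelihood, threshold in malware_models:
        if log_likelihood(action_sequence) >= threshold:
            return "Malware"  # stop at the first HMM that accepts the sequence
    return "Benign"           # no malware HMM accepted the sequence


# Placeholder scorers that ignore their input and return a fixed log-likelihood.
models = [(lambda seq: -50.0, -40.0), (lambda seq: -30.0, -35.0)]
verdict = detect(["read file", "write file"], models)
```

Since the loop stops at the first accepting model, the cost of a detection is at most k forward passes and can be lower for malware that matches an early cluster.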

## 5. Experimental Evaluation

#### 5.1. Dataset

#### 5.2. Experimental Setup

#### 5.3. Evaluation Metrics

#### 5.4. The Performance of $H{M}^{3}alD$

#### 5.4.1. Training Phase

**Determining the number of malware clusters**

**Training HMMs**

#### 5.4.2. Detection Phase

**The impact of clustering on the performance of HM^{3}alD**

**The impact of program behavior-aware topology**

**The impact of expressing system call sequences as action sequences**

#### 5.5. Comparing with Other Work

## 6. Analysis and Discussion

## 7. Conclusions and Future Work

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Mapping a Subsequence of System Calls to Actions

- File Actions include “write file”, “read file”, “delete file”, “execute file”, “copy file”, and “move file”.
- Registry Actions indicate the program behavior regarding the registry of Windows operating system (OS), which includes writing, reading, and deleting in the registry.
- Service Actions relate to the registered services in Windows OS. These actions include creating, deleting, and executing a service in Windows.
- Network Actions cover the behavior of the executing sample in the transport layer. This action set is based on the TCP state diagram and includes opening and closing a connection (socket), listening on a socket, binding a socket, accepting a socket, sending and receiving on a socket, etc.
- Internet Actions include all actions that occur during communications of a running program in the application layer. These actions include opening a session, connecting a session, and sending and reading files via the network.
- System Actions consist of system call sequences that express the system operations needed to begin execution. Loading a library, allocating memory, and assigning addresses to procedures are some examples of system actions.
- Process Actions include all the actions related to processes and threads (e.g., creating, executing, and killing a thread).

## Appendix B. The Forward Algorithm

**Algorithm 5:** Forward algorithm

It implements the scaled forward algorithm and returns: (1) $\log P(O|\lambda )$; (2) the scale factor set $F=\{{f}_{1},{f}_{2},\dots ,{f}_{T}\}$; and (3) the scaled forward variables, ${\widehat{\alpha}}_{t}(i)$.

**Input:**
- $O$ = an induced action sequence $\{{O}_{1},{O}_{2},\dots ,{O}_{T}\}$ corresponding to a program $P$
- $\pi$ = initial state probability vector (used in step 3)
- $A$ = state transition probability matrix with the number of states ($N$)
- $B$ = observation probability matrix

**Outputs:**
- scaled forward value = $\log P(O|\lambda )$
- scale factor set $F=\{{f}_{1},{f}_{2},\dots ,{f}_{T}\}$
- scaled forward variables, ${\widehat{\alpha}}_{t}(i)$ for all $i=1\dots N$ and $t=1\dots T$

Step | Operation |
---|---|
Initialization | |
1 | ${f}_{1}=0$ |
2 | for $i=1$ to $N$ do |
3 | ${\alpha}_{1}(i)={\pi}_{i}{b}_{i}({O}_{1})$ |
4 | ${\alpha}_{1}^{\prime}(i)={\alpha}_{1}(i)$ |
5 | ${f}_{1}={f}_{1}+{\alpha}_{1}(i)$ |
6 | end for |
7 | ${f}_{1}=\frac{1}{{f}_{1}}$ |
8 | for $i=1$ to $N$ do |
9 | ${\widehat{\alpha}}_{1}(i)={f}_{1}{\alpha}_{1}(i)$ |
10 | end for |
Induction | |
11 | for $t=2$ to $T$ do |
12 | ${f}_{t}=0$ |
13 | for $i=1$ to $N$ do |
14 | $x=0$ |
15 | for $j=1$ to $N$ do |
16 | $x=x+{\widehat{\alpha}}_{t-1}(j){a}_{ji}$ |
17 | end for |
18 | ${\alpha}_{t}^{\prime}(i)={b}_{i}({O}_{t})x$ |
19 | ${f}_{t}={f}_{t}+{\alpha}_{t}^{\prime}(i)$ |
20 | end for |
21 | ${f}_{t}=\frac{1}{{f}_{t}}$ |
22 | for $i=1$ to $N$ do |
23 | ${\widehat{\alpha}}_{t}(i)={f}_{t}{\alpha}_{t}^{\prime}(i)$ |
24 | end for |
25 | end for |
Termination | |
26 | $log\_p=0$ |
27 | for $t=1$ to $T$ do |
28 | $log\_p=log\_p+\log ({f}_{t})$ |
29 | end for |
30 | $log\_p=-log\_p$ |
31 | return $log\_p$ |
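A vectorized NumPy sketch of the scaled forward recursion is shown below. It follows the same scaling idea as Algorithm 5 but returns only the log-likelihood; the function name, the integer encoding of observations, and the handling of impossible sequences (returning negative infinity) are assumptions of this sketch rather than part of the algorithm above.

```python
import numpy as np


def scaled_forward(obs, pi, A, B):
    """Return log P(O | lambda) for a discrete HMM using per-step scaling.

    obs: non-empty sequence of observation symbol indices, length T
    pi:  initial state probabilities, shape (N,)
    A:   transition matrix, A[j, i] = P(state i at t+1 | state j at t)
    B:   observation matrix, B[i, m] = P(symbol m | state i)
    """
    log_p = 0.0
    alpha = pi * B[:, obs[0]]                  # unscaled alpha_1
    for t in range(len(obs)):
        if t > 0:
            # Sum over predecessor states j, then weight by the emission probability.
            alpha = (alpha @ A) * B[:, obs[t]]
        total = alpha.sum()                    # equals 1 / f_t
        if total == 0.0:
            return -np.inf                     # sequence has zero probability
        alpha = alpha / total                  # scaled variables, summing to 1
        log_p += np.log(total)                 # accumulates -log(f_t)
    return log_p


# Small two-state example with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
log_likelihood = scaled_forward([0, 1, 1], pi, A, B)
```

Rescaling at every step keeps the forward variables in a safe numeric range, which matters for long action sequences where the unscaled probabilities would underflow.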

**Figure 3.** A six-state program behavior-aware HMM topology and its corresponding transition probability matrix A.

**Figure 4.** Variation of reconstruction error for different numbers of malware clusters. Black points show candidate k values.

**Figure 8.** The performance comparison of $H{M}^{3}alD$ with and without the program behavior-aware topology for $k=24$ and $fracrej=0.05$.

**Figure 9.** The performance comparison of $H{M}^{3}alD$ (for $k=24$), $H{M}^{3}alD\_S$ (for $k=26$) and $H{M}^{3}alD\_S\_R$ (for $k=26$).

Cluster No. | Cluster Size | Centroid |
---|---|---|
1 | 273 | 0.65 |
2 | 1638 | 0.59 |
3 | 243 | 0.89 |
4 | 318 | 0.41 |
5 | 311 | 0.82 |
6 | 268 | 0.81 |
7 | 376 | 0.56 |
8 | 510 | 0.62 |
9 | 372 | 0.73 |
10 | 588 | 0.68 |
11 | 592 | 0.83 |
12 | 860 | 0.95 |

Cluster No. | Cluster Size | Centroid |
---|---|---|
1 | 468 | 0.79 |
2 | 1170 | 0.61 |
3 | 54 | 0.13 |
4 | 264 | 0.58 |
5 | 492 | 0.90 |
6 | 100 | 0.63 |
7 | 92 | 0.72 |
8 | 284 | 0.59 |
9 | 308 | 0.79 |
10 | 64 | 0.56 |
11 | 273 | 0.65 |
12 | 243 | 0.89 |
13 | 311 | 0.82 |
14 | 268 | 0.81 |
15 | 510 | 0.62 |
16 | 588 | 0.68 |
17 | 860 | 0.95 |

Cluster No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
${N}_{{c}_{i}}$ | 21 | 22 | 17 | 18 | 16 | 15 | 17 | 23 | 26 | 23 | 15 | 21 | 20 | 19 | 22 | 18 | 14 | 23 | 23 | 10 | 12 | 16 | 21 | 9 |

Approaches | Dataset Size (Benign/Malware) | DR (%) | FAR (%) | Accuracy (%) |
---|---|---|---|---|
Shahzad (2013) [20] | 105/114 | 93.7 | 0 | 96.65 |
$H{M}^{3}alD$ | 105/114 | 100 | 0 | 100 |
Elhadi (2014) [21] | 98/416 | 97.57 | 0 | 98.05 |
$H{M}^{3}alD$ | 98/416 | 100 | 0 | 100 |
Salehi (2017) [23] | 1359/3009 | 98.4 | 4.6 | — |
$H{M}^{3}alD$ | 1359/3009 | 98.89 | 1.12 | 98.88 |
Shehata (2015) [22] | 2000/2000 | 97.6 | 2.37 | 96.89 |
$H{M}^{3}alD$ | 2000/2000 | 98.83 | 1.18 | 98.87 |

Methods | Dataset Size (Benign/Malware) | DR (%) | FAR (%) | Accuracy (%) |
---|---|---|---|---|
Kalbhor (2015) [13] | 370/760 | 88.95 | 0.2 | 97.58 |
$H{M}^{3}alD$ | 370/760 | 99.12 | 0.23 | 99.38 |
Song (2012) [16] | 8/200 | 100 | 12.5 | 99.52 |
$H{M}^{3}alD$ | 8/200 | 100 | 0 | 100 |
Shahid (2015) [5] | 2330/1020 | 98.9 | 4.5 | — |
$H{M}^{3}alD$ | 2330/1020 | 98.99 | 1.27 | 98.85 |
Furuki (2012) [12] | 2595/3639 | 98.4 | 2.7 | 97.85 |
$H{M}^{3}alD$ | 2595/3639 | 98.81 | 1.36 | 98.70 |
Ding (2013) [17] | 3760/4410 | 97.3 | — | 91.2 |
$H{M}^{3}alD$ | 2676/4410 | 98.79 | 1.85 | 98.39 |
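The DR, FAR, and accuracy columns can be related to raw counts with a short Python sketch. It assumes the conventional confusion-matrix definitions (DR = TP / (TP + FN), FAR = FP / (FP + TN), accuracy = (TP + TN) / total), with malware as the positive class. The counts in the usage lines are an inferred illustration for a test set of 8 benign and 200 malware samples, not values reported by any of the compared works.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (DR, FAR, accuracy) as percentages, with malware as the positive class."""
    dr = 100.0 * tp / (tp + fn)            # share of malware samples detected
    far = 100.0 * fp / (fp + tn)           # share of benign samples flagged as malware
    accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)
    return dr, far, accuracy


# Illustrative counts: all 200 malware detected, 1 of 8 benign programs misclassified.
dr, far, accuracy = detection_metrics(tp=200, fn=0, fp=1, tn=7)
```

Under these definitions a single false alarm among 8 benign samples already gives a FAR of 12.5%, which shows how sensitive FAR is to small benign test sets.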

