Deep Learning Model with Sequential Features for Malware Classification

Abstract: Malware is currently growing explosively, and the demand for malware classification is growing with it. The problem is the low accuracy of both malware detection and classification. Based on the static features of malware families, this study proposed TCN-BiGRU, a new deep learning method that combines a temporal convolutional network (TCN) and a bidirectional gated recurrent unit (BiGRU). First, we extracted features from the assembly code (opcode) sequences and byte code sequences of the malware. Second, we used the TCN to shorten the opcode sequences and explore the features in the data, and then used the BiGRU network to process the opcode sequences in both directions to achieve deep extraction of their features. Finally, fully connected and softmax layers were used to output predictions from the deep features. Multiple comparison and ablation experiments demonstrated that our method effectively improved the accuracy of malware detection and classification. Our overall performance was 99.72% for samples comprising nine different classes and 96.54% for samples comprising two different classes.


Introduction
With the continued development of information technology, networks are becoming increasingly sophisticated and convenient, while security incidents are growing exponentially. Since the Morris worm, one of the earliest Internet worms, was discovered in the 1980s, international concern about cyberspace security has grown steadily. Malware is now evolving at an increasingly rapid pace. Its creators have introduced polymorphism to evade detection by constantly modifying and obfuscating their code, so that malware of the same type, although exhibiting the same malicious behavior, appears to be different software. The multiplicity and polymorphism of malware have made the prevention and control of cyberspace security threats extremely difficult. The current problem is therefore to detect and classify malware quickly so that the network can be protected accordingly.
The problem of malware family detection is essentially a classification problem, i.e., the malicious samples to be detected are classified into different families for screening. Malware analysis is divided into dynamic and static analysis. Dynamic analysis runs malicious samples in a secure and controlled environment and analyzes their behavior. However, malicious samples can easily detect that they are running in such an environment, and running dynamic analysis exclusively in a real environment is too costly. Static analysis, on the other hand, examines the logical structure of the code without executing it and makes judgments accordingly. Compared with dynamic analysis, static analysis consumes much less time and fewer resources; thus, this study adopted the static analysis method.
Static analysis generally extracts features through reverse engineering to build a model. The extractable features include strings [1], opcodes [2], executable file structure [3], and function call graphs [4]. Opcodes are machine language instructions describing program execution operations, and they are relatively practical and reliable features. The n-gram method is commonly used to extract opcode features; it relies on maximum likelihood estimation and is easy to understand. After the features are extracted, a model is constructed to classify the malicious families. Santos et al. [5] proposed a method to detect the maliciousness of unknown programs by using the frequencies of opcodes appearing in the code as features. Kang et al. [6] proposed extracting the sequence of opcodes from the disassembled files to represent the temporality of malware execution and then used the n-gram algorithm to characterize the opcode sequences. Since Nataraj et al. first proposed converting malware executable files into two-dimensional grayscale images and training on their texture features, which are similar within each family, image features have been widely used in the field of malware analysis.
In recent years, deep learning algorithms have developed rapidly in areas such as natural language processing; they have powerful learning capabilities and are well suited to mining structure in high-dimensional data. Applying deep learning to malware analysis is a hot topic of current research. Deep learning algorithms such as the recurrent neural network (RNN) [7] and gated recurrent unit (GRU) can be used to implement malware detection. Kwon et al. [8] proposed an RNN approach that uses API call functions to classify malware. These authors used dynamic analysis to extract representative API call functions of nine malware families as a training set and used LSTM for classification, with an average accuracy of 71%. Messay-Kebede et al. [9] proposed a detection model using both traditional machine learning methods and autoencoder-based methods. A few classes were identified by the traditional machine learning model, and the others were classified with autoencoders. Gibert et al. [10] extracted byte and opcode sequences, which were fed into a classifier composed of two convolutional neural networks (CNNs). Although the structure was relatively simple, the accuracy failed to exceed that of complex classifiers. Yan et al. [11] proposed the Malnet detection model, which used a CNN to learn the features of grayscale images and an LSTM to learn the opcodes and then merged the classifications using a simple weighting approach. Barath et al. [12] used a CNN-LSTM approach for feature extraction and two machine learning classifiers, support vector machines and logistic regression, for classification. Ahmadi et al. [13] and Zhang et al. [14] extracted 15 and 6 features from malware, respectively. Their information extraction was more comprehensive, but feature extraction and selection were time-consuming, and some of the features had little effect on classification.
Because a single feature has limitations, the present study proposed a multi-class classification method for malware families based on TCN-BiGRU. The method aims to improve feature mining and classification accuracy and to reduce interference from malware variants and from packing and obfuscation techniques. The main contributions are as follows.

1. A malware detection and classification method (TCN-BiGRU) that fuses the temporal convolutional network and the bidirectional gated recurrent unit was proposed to improve the overall performance of the malware detection and classification model.

2. Opcode and bytecode sequences were fused and their occurrence frequencies were computed to reduce interference from packing and obfuscation techniques and improve the accuracy rate.

3. The feature extraction capability of temporal convolutional networks (TCN) for temporal data was introduced to fully learn the dependency relationships among the data.

4. The output of the maximum pooling layer and the output of the average pooling layer were fused for relatively comprehensive extraction of data features.

5. The nonlinear fitting ability of the bidirectional gated recurrent unit (BiGRU) was used for further feature extraction. The BiGRU learns the dependencies between preceding and following information in the opcode sequence and extracts time-series-based opcode features to improve the classification and detection performance of the model.

This paper proceeds as follows. Section 2 introduces the relevant background and related work. Section 3 presents the model. Section 4 presents the experimental results and analysis. Finally, Section 5 summarizes the experimental conclusions and discusses future research prospects.

N-Gram Method
N-gram is an important method for processing utterances in natural language processing; it uses the Markov assumption to make the probability of occurrence of the nth word depend only on the preceding n−1 words. Based on this assumption, the probability of a sentence occurring in a text is calculated by multiplying the conditional probabilities of its words or phrases, as expressed in Equation (1):
$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \tag{1}$$
where w_i is the ith word and m is the number of words in the sentence.
In the field of malware detection, an n-gram refers to a sequence of n consecutive opcodes or bytes occurring in a piece of code [15]; using such subsequences captures tighter contextual connections.
The algorithm is implemented with a sliding window of fixed size n that moves forward one opcode at a time. The value of n is generally an integer from 1 to 5. As n increases, more contextual information is obtained and classification accuracy tends to be higher, but the computational cost of the model also increases and the model size grows exponentially. In practical applications, the choice of n therefore affects both the accuracy of the model and the loss value.
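The sliding-window procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study; the function names `ngrams` and `ngram_counts` and the toy opcode sequence are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams obtained by sliding a window of size n one token at a time."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields no n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs in the sequence."""
    return Counter(ngrams(tokens, n))


# Hypothetical opcode sequence used only for illustration.
opcodes = ["push", "mov", "push", "mov", "call"]
counts = ngram_counts(opcodes, 2)
```

With n = 2, the five-opcode sequence yields four bigrams, and the bigram (push, mov) occurs twice. Larger values of n yield fewer windows per sequence but many more distinct n-grams across a corpus, which is the growth in model size noted above.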

Temporal Convolutional Network (TCN)
A temporal convolutional network (TCN) is a network structure proposed by Bai et al. [16] for processing time series data based on convolutional neural networks (CNNs). TCN uses causal convolution to enforce causal relationships between successive layers and uses dilated convolution and skip connections to avoid the vanishing gradient problem of RNNs. A temporal convolutional network not only maintains a large receptive field over the data but also reduces computational effort, which makes it easier to control the memory length of the model and improves time series classification accuracy [17].
Compared with ordinary 1D convolutional networks, TCN brings three main improvements.
(a) Causal convolution: The output at any time t depends only on the inputs at time t and earlier in the previous layer [18]. While traditional CNNs can see future information, causal convolution can see only past information; it therefore imposes strict temporal constraints and is a one-way structure. For a convolution kernel size of 4, a single causal convolution is shown in the left panel of Figure 1, and the overall structure is shown in the right panel of Figure 1. A kernel size of 4 means that four points are sampled from the previous layer as input to the next layer.
(b) Dilated convolution: Traditional convolutional networks must linearly stack many convolution layers to cover a long time series. TCN instead uses dilated convolution [19] to enlarge the receptive field of each layer, which reduces the number of convolution layers, lowers the computational effort, and simplifies the network structure. Figure 2 shows a dilated convolution with a kernel size of 4 and a dilation coefficient of 1. When the dilation coefficient is 1, the samples are taken from the previous layer at an interval of 1 and input to the next layer. The difference between dilated convolution and normal convolution is that dilated convolution allows interval sampling of the input during convolution, and the sampling rate depends on the dilation coefficient. The receptive field is given by Equation (2):
$$RF = (K - 1) \times d + 1 \tag{2}$$
where K is the convolution kernel size, and d is the dilation coefficient. There are two ways for the TCN to increase the receptive field: one is to increase the dilation coefficient, and the other is to choose a larger convolution kernel. In the dilated convolution operation, the dilation coefficient grows exponentially with the depth of the network, so fewer layers are needed to obtain a large receptive field.
(c) Residual block: This is another important structure in the TCN. The residual block, shown in Figure 3, contains two layers of dilated causal convolution and nonlinear mapping. It uses an identity mapping across layers, which enables the network to transfer information through cross-layer connections. Skip connections not only speed up the response and convergence of deep networks but also alleviate the slow learning caused by stacking too many layers. Dropout and batch normalization are also added to prevent model overfitting and speed up training [20].
The skip connection transforms the input x through a series of modules to output f(x), which is then added to the input; the equation for the skip connection is
$$o = \mathrm{Activation}(x + f(x)) \tag{3}$$
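The three TCN ingredients above can be illustrated with a small, dependency-free Python sketch. It is not the network used in this study: the function names are hypothetical, the convolution operates on a single channel with zero padding on the left, and ReLU is assumed as the activation in the residual output.

```python
def layer_receptive_field(kernel_size, dilation):
    """Receptive field of one dilated convolution layer: (K - 1) * d + 1."""
    return (kernel_size - 1) * dilation + 1


def causal_dilated_conv1d(x, weights, dilation=1):
    """Single-channel causal convolution: output[t] uses only x[t], x[t - d], x[t - 2d], ..."""
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - (k - 1 - i) * dilation  # the last weight aligns with x[t]
            if idx >= 0:                      # positions before the start count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_output(x, fx):
    """Skip connection o = Activation(x + f(x)), with ReLU assumed as the activation."""
    return [max(0.0, a + b) for a, b in zip(x, fx)]
```

For a kernel size of 4, `layer_receptive_field` gives 4 at dilation 1 and 7 at dilation 2, which shows how a larger dilation widens the window without adding weights. Because every index used by `causal_dilated_conv1d` is at most t, changing a later input never alters an earlier output.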

Bidirectional Gated Recurrent Unit (BiGRU)
As a variant of the RNN, the gated recurrent unit (GRU) has a similar recursive structure and a "memory" function for processing time series data. At the same time, the GRU can effectively alleviate the vanishing and exploding gradient problems that may occur during RNN training, thus mitigating the long-term memory problem. Long short-term memory (LSTM) networks are also a variant of the RNN [21] and are comparable to the GRU in terms of performance, but the GRU is structurally simpler, which reduces computational effort and improves training efficiency [22]. The internal structure of the GRU is shown in Figure 4. The GRU has two inputs, the output state at the previous time step and the input sequence value at the current time step; its output is the state at the current time step. The GRU mainly updates the model state through a reset gate and an update gate. The reset gate controls how much of the historical state information is forgotten so that the network can discard unimportant information; the update gate controls the weight with which past state information enters the present state, which helps the network remember information for a long time [23]. The internal equations of the GRU are as follows:
$$\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
\tilde{h}_t &= \tanh(W_{\tilde{h}} x_t + U_{\tilde{h}} (r_t \odot h_{t-1})) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned} \tag{4}$$
In Equation (4) and Figure 4, σ is the sigmoid activation function, which maps the intermediate states to the range of 0 to 1; r_t and z_t are the reset gate and the update gate; h_{t−1} and h_t are the output states at times t−1 and t, respectively; x_t is the input sequence value at time t; h̃_t is the candidate output state; W_r, W_z, W_h̃, U_r, U_z, and U_h̃ are the corresponding weight coefficient matrices of each component; tanh is the hyperbolic tangent function; and ⊙ is the Hadamard product.
A GRU processes the data only in the forward direction and ignores the effect of later time steps on earlier ones. To combine forward and backward information for integrated learning, a BiGRU is used to further learn the features of the malware. The BiGRU consists of a forward gated recurrent unit and a backward gated recurrent unit, so the network learns the sequence both from front to back and from back to front. The hidden layer contains two units that receive the same input and are connected to the same output. Because more of the time series is involved in training, the features are learned better, which provides higher accuracy on longer time series data.
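The gate computations and the bidirectional pass described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the trained model: the parameter dictionary keys (`W_r`, `U_r`, `W_z`, `U_z`, `W_h`, `U_h`), the zero initial state, and the concatenation of forward and backward states are all choices made for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU update: reset gate r, update gate z, candidate state, new state."""
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # Hadamard product r * h_prev
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["W_h"], x), matvec(p["U_h"], gated))]
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, cand)]


def run_gru(xs, p, hidden_size):
    """Run a GRU over a sequence, starting from a zero state; return all states."""
    h = [0.0] * hidden_size
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(xs, p_fwd, p_bwd, hidden_size):
    """Forward and backward passes; states are concatenated per time step (an assumption)."""
    fwd = run_gru(xs, p_fwd, hidden_size)
    bwd = run_gru(xs[::-1], p_bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]
```

Each output vector has twice the hidden size: the first half summarizes the sequence up to that time step, and the second half summarizes the sequence from that time step to the end.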

Malware Classification Method Based on Sequence Features and Deep Learning
This section introduces the proposed TCN-BiGRU network. The TCN extracts features of past data by one-dimensional causal convolution and has a simple structure, low memory consumption, fast operation speed, and easy stacking. The bidirectional GRU can capture long-term dependencies in both directions and can effectively use information from future time steps, which compensates for the one-way structure of causal convolution in the TCN and its lack of comprehensive information extraction. The advantages of the two models were combined into a new hybrid model, TCN-BiGRU, which conducts more comprehensive feature extraction to further improve the accuracy of malware classification and identification.
First, sample feature extraction was conducted. The extracted features were one-hot encoded and normalized, after which the TCN performed convolution to shorten the long time sequences and extract deep features. At the same time, maximum pooling and average pooling were conducted, and the extracted features were fused as the pooling output; after normalization and reshaping, they were passed into the BiGRU network for the deep extraction of temporal features to complete malware detection and classification. Finally, the most suitable hyperparameters were selected for the model to improve detection performance. The malware classification process included three stages: pre-processing, feature extraction and training, and classification. The model structure is shown in Figure 5.

Features Extraction
(1) Malware opcode features
Programs are sequential instructions, and the underlying operation of a computer consists of the execution of instructions. Instructions generally comprise two parts: opcodes and operands. One of the static features commonly used in malware detection is the opcode feature. Batch disassembly is performed on executable PE files using the IDA Pro tool to obtain .asm files containing the opcode sequences. The .asm file is generally divided into three segments: .text, .data, and .bss. The opcodes in the .text segment are shown in Figure 6, which shows the sample 0A32eTdBKayjCWhZqDOQ.
Opcodes can usually be divided into four categories: data movement, arithmetic or logic, control flow, and others, from which the more important opcodes are selected to extract the opcode sequence text. The opcode categories are shown in Table 1.

Table 1. Opcode categories.
Category | Operation codes
Data move | mov, movzx, push, pop, lea, xchg
Arithmetic/logic | add, sub, inc, dec, imul, or, xor, shl, shr, ror, rol
Control flow | jmp, jz, cmp, jnb, call, retf, retn
Other | nop

The opcodes in each .asm file are extracted in order using regular expressions. The extracted opcode sequence text is shown in Figure 7.
The opcode sequences differ in length, and some are extremely long. To extract more complete information, the n-gram method from natural language processing is used to extract the opcode features. By treating each opcode as a word, the n-gram method takes subsequences of the opcode sequence with a sliding window of size n, and the frequency of each subsequence is then calculated. A frequency threshold is set, and the subsequences whose number of occurrences exceeds the threshold are retained. Each retained subsequence is a feature of the malware.
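The opcode extraction and thresholding steps can be sketched as follows. This is a hedged illustration, not the extraction script used in this study: the regular expression, the function names, and the sample listing `SAMPLE_ASM` are assumptions, and the listing only imitates the general layout of an IDA Pro .asm file (segment, address, raw bytes, instruction).

```python
import re
from collections import Counter

# Opcodes retained for the sequence, taken from Table 1.
OPCODES = [
    "mov", "movzx", "push", "pop", "lea", "xchg",
    "add", "sub", "inc", "dec", "imul", "or", "xor", "shl", "shr", "ror", "rol",
    "jmp", "jz", "cmp", "jnb", "call", "retf", "retn", "nop",
]
OPCODE_RE = re.compile(r"\b(" + "|".join(OPCODES) + r")\b")


def extract_opcodes(asm_text):
    """Return the opcodes found in .text lines, in their original order."""
    sequence = []
    for line in asm_text.splitlines():
        if not line.startswith(".text"):
            continue                      # only the code segment is used
        code = line.split(";", 1)[0]      # drop trailing comments
        match = OPCODE_RE.search(code)    # first whitelisted mnemonic on the line
        if match:
            sequence.append(match.group(1))
    return sequence


def frequent_ngrams(sequence, n, threshold):
    """Keep the n-grams whose number of occurrences is above the threshold."""
    counts = Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))
    return {gram: c for gram, c in counts.items() if c > threshold}


# Hypothetical listing for illustration only.
SAMPLE_ASM = """\
.text:00401000 55             push    ebp
.text:00401001 8B EC          mov     ebp, esp
.text:00401003 6A 00          push    0
.text:00401005 8B C1          mov     eax, ecx
.text:00401007 E8 10 00 00 00 call    sub_40101C
.data:00402000 00 00          dd 0
"""
sequence = extract_opcodes(SAMPLE_ASM)
features = frequent_ngrams(sequence, 2, threshold=1)
```

Word boundaries in the pattern keep short mnemonics such as `or` from matching inside `xor`, and keep `sub` from matching inside an operand label such as `sub_40101C`.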
(2) Malware bytecode features
The malware itself is a file consisting of a series of bytes. One idea is to convert the binary file of the malware into a grayscale image, exploiting the fact that byte values fall in the same range as the pixel values of a grayscale image. Malware families can then be classified based on the texture similarity of grayscale images within a family and the texture differences between families, which arise from their different structures. Working directly on the binary file makes it easier to distinguish similar malware variants and reduces the impact of obfuscation.
Each malware sample is converted into a byte sequence: the hexadecimal .byte file is read, split in order into hexadecimal values, and each value is converted into a decimal value within [0, 256). The line number at the start of each line of the byte file is ignored, and only the hexadecimal values after the line number are extracted. Only the digits and letters in the byte file are kept, and the remaining symbols are replaced with zeros, thus converting the malware file into a one-dimensional vector of decimal numbers.
The steps for extracting bytecode sequence features from malware are shown in Algorithm 1.
Similarly, the length of the sequence of each sample varies. To extract more complete information, the sequence is truncated to a particular length, each decimal number within that length is used as a feature, and the frequency of each decimal value is then calculated.
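A minimal Python sketch of the bytecode steps described above is given here. It is not a reproduction of Algorithm 1: the function names, the assumption that each value is a two-digit hexadecimal token, and the example placeholder `??` for a non-hexadecimal symbol are all illustrative.

```python
import re
from collections import Counter

HEX_BYTE = re.compile(r"[0-9A-Fa-f]{2}")


def parse_byte_file(text):
    """Convert the text of a hexadecimal byte file into a list of values in [0, 256)."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:              # skip the line number at the start
            if HEX_BYTE.fullmatch(token):
                values.append(int(token, 16))
            else:
                values.append(0)              # other symbols are replaced with zero
    return values


def byte_frequencies(values, max_length):
    """Count how often each value 0..255 occurs within the first max_length entries."""
    counts = Counter(values[:max_length])
    return [counts[b] for b in range(256)]


# Hypothetical two-line byte file for illustration only.
sample = "00401000 4D 5A ?? 00\n00401004 5A FF"
values = parse_byte_file(sample)
histogram = byte_frequencies(values, max_length=6)
```

The result is a fixed-length vector of 256 counts per sample, regardless of the original file size, which is what allows samples of different lengths to share one feature layout.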

Feature Pre-Processing
After the malware features (opcode and bytecode features) were extracted, we checked whether the data had missing values, uniformly treated the missing values as 0, and then normalized the malware feature data. Data normalization rescaled the features to a smaller interval, reduced the impact of differences in magnitude between feature values, and improved the convergence rate of the model. Common normalization methods scale the values to (0, 1) or (−1, 1). The normalization method used in this study was maximum-minimum normalization, which scales the values to the interval (0, 1), as shown in Equation (5):
$$x' = \frac{x - M_{min}}{M_{max} - M_{min}} \tag{5}$$
where x' is the scaled value, x is the original value, M_min is the smallest value in the feature dimension, and M_max is the largest value in the feature dimension. In the process of malicious code feature extraction, there are many zero values. This method can retain the zeros in the features and can handle features whose values have small variances.
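The pre-processing described above (missing values set to 0, then maximum-minimum scaling per feature) can be sketched as follows. The function names are hypothetical, and mapping a constant feature to all zeros is an assumption made to avoid division by zero.

```python
def fill_missing(column):
    """Replace missing entries (None or NaN) with 0."""
    return [0 if (x is None or x != x) else x for x in column]


def min_max_scale(column):
    """Scale one feature dimension with (x - M_min) / (M_max - M_min)."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # Constant feature: the formula is undefined, so map everything to 0.
        return [0.0 for _ in column]
    return [(x - lowest) / (highest - lowest) for x in column]


# Hypothetical frequency values for one feature across four samples.
raw = [0, None, 5, 10]
scaled = min_max_scale(fill_missing(raw))
```

In this example the smallest value is 0, so the zero entries remain 0 after scaling.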

Combine TCN and BiGRU for Feature Extraction
The advantages of the TCN model are the extraction of past data by one-dimensional causal convolution to guarantee temporality, time savings via the skip connection block, the extraction of temporal features by dilated convolution, and the fusion of the average pooling layer with the maximum pooling layer. The advantage of the GRU model is its nonlinear fitting ability, which extracts the data features efficiently, and its faster convergence compared with the LSTM model [24]. The bidirectional GRU model better captures the sequence features of the opcodes by collecting information in both the forward and backward directions, thus improving the accuracy of model classification. These two models are integrated into the TCN-BiGRU model to obtain better accuracy as well as lower loss values. The structure of the integrated model is shown in Figure 8.
In Figure 8, the TCN-BiGRU model structure includes:
a. Input layer: the processed malicious code opcode feature data, with shape (total number of samples, time steps, feature dimension).
b. Temporal convolutional network layer: the feature vectors T_j were extracted via the TCN, and the residual units were set up in two layers. A residual unit consisted of two convolutional units and one nonlinear mapping, and the convolutional kernel weights were normalized. Figure 8 shows the residual unit only between the input layer and the hidden layer; the same structure applies between the hidden layer and the output layer. The convolution kernel size was 4, and the dilation coefficients were (1, 2). Dropout was added to prevent overfitting in training.
c. Pooling layers: the different features extracted by the average pooling layer and by the maximum pooling layer were fused as the pooling output.
d. Fused pooling layer: the combined pooling layer consisted of a maximum pooling layer and an average pooling layer, each of which was calculated as shown in Equation (6). Maximum pooling and average pooling were obtained by traversing the input from the previous layer of the network with the pooling window. The pooled maximum and average values were then summed and passed to the next layer of the model structure.
$$h_{max} = \mathrm{MaxPooling}(h), \quad h_{avg} = \mathrm{AvgPooling}(h), \quad h_{fuse} = h_{max} + h_{avg} \tag{6}$$
where h is the input from the upper layer of the network into the fused pooling layer; h_max is the maximum pooling output; h_avg is the average pooling output; and h_fuse is the output obtained by combining maximum pooling and average pooling in parallel.
e. Bidirectional gated recurrent unit layer: the figure shows the structure of the GRU unit with two layers. The output vector of the TCN model was first used as the input of the GRU to extract the long-term correlations in the time series. The data were then output with the results obtained from the two BiGRU layers.
f. Output layer: the result of the last time step of the BiGRU was output to the classification layer.
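The fused pooling step in item d can be illustrated with a short single-channel sketch. The function names are hypothetical, and non-overlapping windows (stride equal to the window size) are an assumption, since the window parameters are not stated here.

```python
def pooling_windows(h, size):
    """Split the sequence into consecutive, non-overlapping windows of the given size."""
    return [h[i:i + size] for i in range(0, len(h) - size + 1, size)]


def max_pool1d(h, size):
    return [max(window) for window in pooling_windows(h, size)]


def avg_pool1d(h, size):
    return [sum(window) / len(window) for window in pooling_windows(h, size)]


def fused_pool1d(h, size):
    """Sum the maximum-pooling and average-pooling outputs element by element."""
    return [m + a for m, a in zip(max_pool1d(h, size), avg_pool1d(h, size))]


# Hypothetical TCN output for one channel.
h = [1, 3, 2, 6]
h_fuse = fused_pool1d(h, 2)
```

For this input, the maximum pooling output is [3, 6], the average pooling output is [2.0, 4.0], and their sum is [5.0, 10.0]. Summing keeps the output the same length as either pooling branch, whereas concatenating the two branches would double the feature dimension passed to the BiGRU.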

The fully connected layer multiplied the weight matrix by the input vector and added a bias, mapping n real numbers in (−∞, +∞) to K real numbers (scores) in (−∞, +∞). Softmax then mapped these K real numbers to K real numbers (probabilities) in (0, 1) while ensuring that their sum was 1.

$$y = \mathrm{softmax}(z) = \mathrm{softmax}(wx + b) \qquad (7)$$

where $y$ denotes the probability of classification into malicious family type $i$; $w$ denotes the weight matrix of the fully connected layer; and $b$ is the bias vector of class $i$; at time $t$, $x$ is replaced with $h_{tn}$. The softmax layer superimposed the input features linearly with the weights. The number of neurons in the softmax layer was set by the number of malicious code types.
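The mapping performed by the classification output layer can be sketched in a few lines. The helper names `dense` and `softmax` and the toy weights are assumptions for illustration only; the trained model's weights are not reproduced here.

```python
import math
from typing import List


def dense(x: List[float], w: List[List[float]], b: List[float]) -> List[float]:
    """Fully connected layer: z = w x + b, with w of shape K x n."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(w, b)]


def softmax(z: List[float]) -> List[float]:
    """Map K real scores to K probabilities in (0, 1) that sum to 1."""
    m = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    x = [1.0, 2.0]                            # n = 2 input features (toy values)
    w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 classes (toy weights)
    b = [0.0, 0.0, 0.0]
    y = softmax(dense(x, w, b))
    print(y, sum(y))
```

The predicted family is the index of the largest entry of `y`; for the nine-family dataset, K would be 9.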

Experimental Setup
To test the performance of the malicious code classification method fusing TCN and BiGRU, the following experiments were implemented:

Experimental Environment and Data Set
The experimental environment was a computer configured with Win10, an Intel Core (TM)-9880H CPU @ 2.30 GHz, 64 GB RAM, and a Quadro RTX 4000 GPU; the programming environment was PyCharm 2021.2.2, using Python 3.7 in a CUDA 11.0 accelerated environment. The neural network model used the TensorFlow 2.4.1 and Keras 2.4.3 versions of the deep learning framework.
The experimental datasets were the open-source dataset provided by Microsoft [15] and the PE samples from the Datacon Open Data Project provided by Qianxin (China) [25]. The malicious code families in the dataset provided by Microsoft were divided into 9 categories, with 10,868 malware samples. Each sample had files in two formats: .asm and .bytes. The PE samples provided by Qianxin had two categories, containing a large amount of mining-type malicious code and non-mining samples. These are recent real samples captured from the live network; thus, they are likely to contain a large number of packed (shelled) samples and resource-obfuscated samples. To prevent accidental execution of the samples from infecting the environment, the MZ and PE headers, as well as the import and export tables, had been removed. To ensure that the dataset samples had a certain level of diversity, similar samples had been filtered out. Therefore, in actual use, the MZ and PE headers were artificially added back to extract the opcode features, and the samples were disassembled into .asm files using IDA tools. The family name, type number, number of samples, and expression number of the malware datasets used are shown in Figures 10 and 11.
To fully evaluate our method, experiments were conducted on two different datasets according to different methods to fully validate the model. The first method used the 9 malicious families in the Kaggle malicious samples, labeled 1-9, and the dataset was noted as 9-class-data. The second method used the 0/1 labels of the Datacon samples as sample labels, and the dataset was noted as 2-Class-Datacon; a five-fold cross-validation method was used to randomly divide the data into 10 parts, selecting 9 of these parts as the training set and 1 part as the test set.
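A cross-validation split of this kind can be sketched as follows. The function `k_fold_indices`, the fixed random seed, and the round-robin fold assignment are illustrative assumptions; the number of folds `k` is left as a parameter.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random division of the data
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(n_samples=10, k=5):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every sample once for evaluation.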

Experimental Evaluation Criteria
The experiment selected common evaluation criteria in the field of malware classification and detection, namely accuracy (Acc), precision (PR), recall (RR), and F1-score (F1), to evaluate the classification performance of the network. These criteria were calculated as follows: where $q$ is the number of samples; $d$ is the number of categories; the label value is the processed one-hot code (a string consisting of 0s and 1s); and $\hat{y}_{id}$ is the output value of the softmax function ($\sum_{d=1}^{9} \hat{y}_{id} = 1$). TP is the true positive class (meaning that malware was correctly classified as malware), FN is the false negative class (meaning that malware was incorrectly classified as normal software), FP is the false positive class (meaning that normal software was incorrectly classified as malware), and TN is the true negative class (meaning that normal software was correctly classified as normal software).
Model performance was presented using a visual representation of the confusion matrix, as shown in Table 2.
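For reference, the four criteria can be computed from the confusion-matrix counts using their standard definitions. The function name `classification_metrics` and the example counts are illustrative assumptions and are not results from the experiments.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total if total else 0.0
    pr = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    rr = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    f1 = 2 * pr * rr / (pr + rr) if (pr + rr) else 0.0
    return {"Acc": acc, "PR": pr, "RR": rr, "F1": f1}


if __name__ == "__main__":
    # Toy counts for illustration only.
    print(classification_metrics(tp=8, fn=2, fp=2, tn=8))
```

For the nine-class case, the same counts can be taken per family in a one-vs-rest fashion and the per-family scores reported individually.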

Feature Selection Experiments
After obtaining the opcode sequence, the n-gram method was used for feature extraction of the instructions from the .asm file. The frequency f of each instruction n-gram in the .asm file was calculated as the feature and then used as input.
The frequency values of the n-gram features extracted from the malicious code were pre-processed by normalization, and the malicious family categories were pre-processed by one-hot encoding. For example, the Ramnit family can be represented as 000000001.
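The feature extraction and pre-processing steps can be sketched as below. The helper names are hypothetical, the normalization shown (relative frequency) is an assumption because the normalization scheme is not specified, and the bit position in `one_hot` is inferred from the Ramnit example (family 1 encoded as 000000001).

```python
from collections import Counter
from typing import Dict, List, Tuple


def opcode_ngrams(opcodes: List[str], n: int = 3) -> Counter:
    """Count every run of n consecutive opcodes in the sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def normalize(freqs: Counter) -> Dict[Tuple[str, ...], float]:
    """Relative frequency of each n-gram (assumed normalization scheme)."""
    total = sum(freqs.values())
    return {gram: count / total for gram, count in freqs.items()} if total else {}


def one_hot(label: int, num_classes: int = 9) -> str:
    """Encode family label 1..num_classes; label 1 sets the rightmost bit."""
    if not 1 <= label <= num_classes:
        raise ValueError("label out of range")
    bits = ["0"] * num_classes
    bits[num_classes - label] = "1"
    return "".join(bits)


if __name__ == "__main__":
    ops = ["push", "mov", "call", "push", "mov", "call"]
    print(normalize(opcode_ngrams(ops, n=3)))
    print(one_hot(1))  # 000000001
```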
In the experiment, feature extraction was first performed with N = 3, and the instruction frequency threshold was initially set to 300. The threshold was then increased in steps of 200, up to a maximum of 1100. The experimental results showed that as the instruction frequency increased, classification accuracy first rose and then fell, and the classification effect was best at a frequency of 700, as shown in the lower panel of Figure 12. In the experiments on the N value, the comparison was conducted using the best-performing frequency from the instruction frequency experiment, i.e., a frequency of 700. As the N value increased, classification accuracy showed a trend similar to that for frequency, first increasing and then decreasing. The classification effect was best at N = 3, and the experimental results are shown in the upper panel of Figure 13. Therefore, in subsequent experiments with the classification model, features with N = 3 and a frequency of 700 were selected as the input to the model.
For the byte code feature, byte sequences with lengths below 1500 were selected first. The experiment was then repeated with the length limit increased at intervals of 1000, up to sequences of no more than 4500 in length. Accuracy gradually decreased as the limit increased, so sequences within a length of 1000 were also tested. Accuracy at this length was lower than for sequences within a length of 1500; thus, in subsequent experiments, byte code sequences with lengths below 1500 were selected for the fusion experiment.

TCN-BiGRU Model Performance Analysis Experiments
In the TCN-BiGRU model, the choice of some hyperparameters could affect the experimental results. A single feature (the n-gram method) was used when tuning the model to optimize its parameters. Two hyperparameters, the number of filters and the number of convolutional kernels, were selected from the optimization class parameters, and the number of BiGRU layers and the number of neurons per layer were selected as variables from the model class parameters. The number of model iterations was set to 50; the dilation coefficient in the TCN increased exponentially with base 2 and was set to (1, 2); the optimization algorithm was Adamax; and the learning rate was set to 0.002. To avoid overfitting, a dropout layer with a rate of 0.2 was added. To make the experimental data more accurate and valid, a five-fold cross-validation method was used. The prediction data obtained from the experiments on the classification of malicious code families with different numbers of filters, convolutional kernels, and neurons are shown in Tables 3-5, respectively. Using the grid search algorithm, parameter search experiments were conducted over the number of filters (5, 7, 10, 15, 20) and the number of convolutional kernels (2, 3, 4, 5, 6) to determine the optimal parameter settings for the model, as shown in Table 4.
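The grid search over the two hyperparameters can be sketched as an exhaustive loop over the 25 combinations listed above. The `score` callable stands in for training and cross-validating the model at one setting; it is a hypothetical placeholder, and the toy scoring function in the usage example is arbitrary and does not reflect the experimental results.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(filters: Iterable[int], kernels: Iterable[int],
                score: Callable[[int, int], float]) -> Tuple[Tuple[int, int], float]:
    """Return the (filters, kernels) pair with the highest score."""
    best_params, best_score = None, float("-inf")
    for f, k in product(filters, kernels):
        s = score(f, k)  # e.g. mean cross-validated accuracy (placeholder)
        if s > best_score:
            best_params, best_score = (f, k), s
    if best_params is None:
        raise ValueError("empty search grid")
    return best_params, best_score


if __name__ == "__main__":
    # Arbitrary toy score, used only to show the call pattern.
    toy_score = lambda f, k: -((f - 7) ** 2) - (k - 3) ** 2
    print(grid_search([5, 7, 10, 15, 20], [2, 3, 4, 5, 6], toy_score))
```

In practice, `score` would wrap one full cross-validated training run, so the cost of the search grows with the product of the two value lists.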
According to the values of each parameter obtained from the above experiments, the fusion of the two features, with N = 3 and frequency = 700 for the opcode feature and the first 1500 bytes for the byte sequence, was performed again using the TCN-BiGRU classification model. The confusion matrix for the classification of the 9-class-data dataset is shown in Figure 14, with "Real label" on the vertical axis indicating the true malicious code family and "Prediction" on the horizontal axis indicating the prediction made by the model. Table 5 shows in more detail the precision, recall, and F1-score of the predictions for each category. Note that for the Lollipop, Vundo, and Simda classes, the classification is 100%. In the Kelihos_ver1 class, the classification is poor, with an accuracy of only 94%, while the remainder reached more than 99%. The Kelihos_ver1 family belongs to the broad backdoor virus category, which contains three families in this dataset. Its poor classification was probably due to confusion with these similar families.
Table 6 details the precision, recall, and F1-score for each category of predictions on the 2-Class-Datacon dataset. The table shows that the results were better for the Not_Miner class, with an accuracy greater than 96% and a recall of 97%. The overall accuracy on the 2-Class-Datacon dataset was slightly worse, probably due to the presence of many packed (shelled) samples and resource obfuscation, as this dataset was collected from the live network. To verify whether the model generalized well on the 2-Class-Datacon dataset, model ablation experiments and different comparison tests were performed.

Model Ablation Experiments
To verify the detection effect of the proposed model, model ablation experiments were performed. Under the same experimental conditions, TCN, GRU, TCN-GRU, and our model were compared on the two datasets to obtain the results of each model for the various indexes. The detection results are shown in Table 6.
Observe from Table 7 that the proposed model improved the classification of malicious samples, with accuracy of up to 99.72% and 96.54% on the two datasets, respectively. On the 9-class-data dataset, accuracy improved by 0.36%, 0.36%, and 0.18% over TCN, GRU, and TCN-GRU, respectively. On the 2-Class-Datacon dataset, accuracy improved by 1.92%, 0.84%, and 1.02% over TCN, GRU, and TCN-GRU, respectively. Comparing the accuracy, recall, and F1 values of the three models TCN, GRU, and TCN-GRU on the two datasets, the proposed TCN-BiGRU model outperformed all three in all indexes, thus verifying that the combination of TCN and BiGRU in the model improved the detection of malicious code.

Comparison Experiments of Different Pooling Methods
To address the insufficient feature extraction ability of the model, this study proposed a pooling fusion method that applied average pooling and maximum pooling to the data simultaneously, in parallel. This section presents a comparison experiment on the effect of different pooling methods on the performance of malicious code classification. The model adopted four schemes: no pooling, average pooling, maximum pooling, and pooling fusion. The classification accuracy for both datasets is shown in Table 8.
From Table 8, observe that the method using pooling fusion had higher detection accuracy than the schemes that performed average pooling or maximum pooling alone. By using pooling fusion to combine these two features so that they complemented each other, the model better reflected the nature of the network attack data and obtained higher identification accuracy. This experiment demonstrated that our pooling fusion method improved the ability of the model to extract features.

Comparison Experiments for Classification Algorithms
Regarding model performance, comparative experiments were conducted on the 9-class-data dataset with reference to the relevant literature [9,10,26-28]. Experimental results are shown in Table 8. Among the five comparative studies, two focused on machine learning, one was related to gene sequence classification, and the remaining two concerned deep learning models. Burnaev et al. [26] extracted opcode features and grayscale map features and then used an SVM for classification. Narayanan et al. [27] processed grayscale images converted from malware, reduced the feature dimensionality by PCA, and then classified them using the K-nearest neighbor machine learning model. Drew et al. [28] used a gene sequence classification method, Strand, to classify text. Gibert et al. [10] extracted byte and opcode sequences and then used a classifier composed of two CNNs for classification. Yan et al. [11] extracted features via a CNN model for grayscale maps and an LSTM model for opcode features, then fused the results for classification. These methods produced good results, but there remained a gap between them and the method of this study. Under the accuracy evaluation criterion, our proposed TCN-BiGRU model integrating opcode and byte features achieved 99.72% accuracy; the accuracy values of the five comparison studies were all below 99.72%. Therefore, our proposed model incorporating both features and fusing TCN and BiGRU performed best.

Model                        Features                  Accuracy
One-class SVM [26]           Opcode + Grayscale map    92%
PCA and kNN [27]             Grayscale map             96.6%
Strand Gene Sequence [28]    Asm sequence              98.59%
Orthrus [10]                 Byte + Opcode             99.24%
MalNet [11]                  Opcode + Grayscale map    99.36%
Model in this paper          Opcode + Byte             99.72%

For the 2-Class-Datacon dataset, we referred to the literature [29-31] to perform comparison experiments, with the results shown in Table 9. Among these three comparison studies, one was on ensemble learning, one was on deep learning, and one was on machine learning. Guo et al. [29] extracted grayscale maps of malicious samples, extracted features with different parameters for GIST descriptors, and then combined the KNN and random forest algorithms for classification through a voting algorithm. Saadat et al. [30] also processed malicious sample images; they first pre-trained a convolutional neural network model and then used the XGBoost algorithm for classification. Liu et al. [31] extracted the assembly instructions of malware samples. The assembly instructions were then pre-processed, reduced in dimensionality using the LDA algorithm, and finally used to train a random forest classifier. These methods produced good results, but there remain gaps between them and the method proposed in this paper. Under the ACC evaluation criterion, our TCN-BiGRU model integrating opcode and byte features reached 96.54% accuracy; the ACC values of the three comparison papers were all under 96.54%. The above comparative experiments on two datasets showed that our proposed model integrating both features and fusing TCN and BiGRU performed best and had strong generalization capability.

Conclusions
Threats to cyberspace security are increasing, and classification of the massive number of viruses has become an increasingly critical issue. This study proposed a static classification model of malicious code that fused TCN and BiGRU to extract and integrate the opcode features and byte features of malicious code. The model focused on the potential features of the data and captured the long-term dependencies in the sequences through a BiGRU network in both directions. It showed several advantages on both datasets, such as a high classification detection rate and resistance to packing and obfuscation. It also showed good generalizability and adaptability to high data volume requirements. However, the method used for feature extraction was relatively simple and did not bring out the full performance of the features. In follow-up work, we will use a natural language classification model to further process the samples.

Figure 1. Causal convolution.
(b) Dilated convolution: With the gradual increase in the number of dilated convolution layers, the dilation coefficient increases exponentially, and the increase in the range of the receptive field of each layer reduces the number of convolution layers, which reduces computational effort and simplifies the network structure. To address the problem of traditional neural networks that require the linear stacking of multiple layers of convolution to extend the model of time series, TCN achieves a reduction in the number of convolutional layers by increasing the range of the receptive field of each layer.

Experiment 1: Feature selection experiment
Experiment 2: TCN-BiGRU model performance analysis experiment
Experiment 3: Comparison experiments of different pooling methods
Experiment 4: Model ablation comparison experiment
Experiment 5: Comparison experiments of different classification algorithms

Figure 12. Experiment on the selection of N value and frequency.

Figure 14. Confusion matrix.

Table 3. Parameter setting of model.

Table 4. Parameter setting of model.

Table 5. Parameter setting of model.

Table 6. Ablation experiment of model.

Table 7. Accuracy for different pooling methods.