A Malicious Program Behavior Detection Model Based on API Call Sequences

To address the low accuracy of malicious program behavior detection in new power system edge-side applications, we present a detection model based on API call sequences that combines rule matching and deep learning. We first use the PrefixSpan algorithm to mine frequent API call sequences across different threads of the same program in a malicious program dataset to create a rule base of malicious behavior sequences. The API call sequences under examination are then matched against this rule base by the malicious behavior sequence matching model, and those that do not match are fed into the TextCNN deep learning detection model for additional detection. The two models collaborate to accomplish program behavior detection. Experimental results demonstrate that the proposed detection model can effectively identify malicious samples and discern malicious program behaviors.


Introduction
Compared to traditional power systems, the structure of the new power system is more intricate. The integration of new energy sources, such as renewable energy, into the power system and the growing number of distributed power sources connected to the grid from the user's side are posing greater challenges to the stability and security of the power system. Edge-side applications in the new power system are those that directly interact with power equipment, sensors, controllers, and other components. These applications play a crucial role in collecting, processing, transmitting, and controlling power data, significantly influencing the operational status, fault diagnosis, and dispatch control of the power system. Abnormal or malicious behaviors in edge-side applications, such as data tampering, command errors, or cyber-attacks, can lead to failures, damage, or even paralysis of the power system, with potentially severe socio-economic repercussions [1]. Consequently, real-time behavior detection of edge-side applications is essential to detect and prevent such adverse behaviors, ensuring the safe and stable operation of the new power system.
There are two primary approaches to analyzing malicious programs: static analysis and dynamic analysis [2]. Static analysis does not involve executing the code; instead, it assesses whether a program is malicious based on code attributes, control flow diagrams, function call diagrams, system call sequences, and other characteristics [3][4][5][6][7]. While static analysis offers the benefits of rapid execution and high efficiency, its capacity for behavioral analysis is notably limited. It struggles to detect obfuscation and encryption techniques frequently employed by malicious programs, such as packing, modifications to the PE header, and code obfuscation. In contrast, dynamic analysis evaluates the behavior of an executable file by running it. This method's advantage lies in its resilience to code obfuscation, shelling, and polymorphism, providing the most authentic representation of the program's behavior [8].
Behavior detection technology is a form of dynamic analysis that involves monitoring a program's behavior during execution and determining its maliciousness based on this behavior. An Application Programming Interface (API) provides a means for applications to interact with the system. The sequences of API calls can reveal the functionality and behavioral traits of an application, making them a crucial element in the detection of malicious code [9].
At present, the dynamic features of malicious programs are usually based on API call sequences, with machine learning employed for their detection. However, traditional machine learning methods face challenges in feature selection and struggle to address anomaly detection in high-dimensional, massive network traffic, resulting in a high false detection rate. In the detection of malicious programs, rule matching is a commonly used method characterized by its simplicity and low false detection rate. However, it fails to capture the deep relationships within API call sequences and does not leverage the temporal nature of APIs, which leads to low model accuracy. Deep learning models, known for their robust learning and classification capabilities, have found applications in power systems [10]. These models can automatically extract features and categorize samples based on these features, leading to enhanced performance. Consequently, many researchers have explored using deep learning for the analysis of malicious programs. To ensure the stable operation and data security of edge-side applications in new power systems and to reduce the false positive rate of malicious program detection, we propose a malicious program behavior detection model based on API call sequences that integrates rule matching and deep learning. Initially, the API call sequence is processed to complete the preprocessing of the dataset. Subsequently, the PrefixSpan algorithm is employed to mine frequent API call sequences and establish a rule base of malicious behavior sequences. Finally, the behavior sequence matching model and the TextCNN [11] deep learning detection model are jointly utilized to detect test API call sequences, achieving program behavior detection.
The contributions of this study can be summarized as follows: • We used the PrefixSpan algorithm to mine the frequent API call sequences from various threads within a malicious program. These sequences serve as a basis for directly discriminating malicious API execution behaviors, yielding malicious behavior sequences and constructing a malicious behavior sequence rule base.

•
We utilized two models, the behavior sequence matching model and the TextCNN deep learning detection model, to collaboratively detect the tested API call sequences. Firstly, the tested API call sequences are input into the behavior sequence matching model to compare them against the malicious behavior sequence rule base. If a match is found, the sequence is deemed malicious; otherwise, it is passed to the TextCNN neural network model for further analysis to obtain the detection results, thereby enhancing the accuracy of the detection.

Related Work
Malicious programs represent a significant security threat to the Internet, with attackers leveraging them for profit through remote control, private information theft, and occasionally targeting network infrastructures. Malware typically disseminates and infects susceptible systems via diverse propagation methods for nefarious purposes, including spam distribution, privacy compromise, system disruption, and denial-of-service attacks. Notable examples of malicious programs include Trojan horses, worms, viruses, ransomware, adware, and spyware [12,13].
On the Windows platform, program behaviors are predominantly executed through system API calls, making the use of APIs for dynamic analysis a focal point in malicious behavior detection research. API calls, whereby an application performs services by invoking functions provided by the operating system, encompass activities such as registry operations, process manipulation, accessing network resources, and file reading. Malicious code typically executes specific behaviors by calling a series of APIs, rather than a single API. Analysis of API call sequences reveals that malicious code often invokes fixed sequences to carry out destructive actions, with different malicious behaviors calling distinct sequences rarely used in normal programs [14]. Therefore, API call sequences provide a more accurate representation of program behavior, and malicious code can be identified by analyzing these sequences and extracting subsequence features that differentiate malicious from normal programs [15]. The authors of [16] modeled the API call behavior of malware and benign software, representing the actual relationships between API functions as a semantic transformation matrix. The author of [17] developed a malware detection model by integrating statistical, contextual, and graph mining features of API call sequences, bridging a gap in dynamic detection methods. The authors of [18] proposed a deep neural network-based malware detection method for the Windows platform, which learns parameter-enhanced API call sequences and employs rule-based and clustering-based classification methods to evaluate the sensitivity of the parameters of the API call sequences to malicious behavior. Kim [19] proposed a malware detection and classification system that generates chains of behavioral sequences for several malware families and calculates the similarity between these API behavioral sequence chains and the sequences of target processes. Dynamic malware detection methods like CTIMD [20] utilize Cyber Threat Intelligences (CTIs) to improve the learning of API call sequences with runtime parameters, offering better accuracy and efficiency compared to traditional methods. However, this approach depends on external threat knowledge and may have limited generalization capabilities.
With the ongoing advancements in natural language processing (NLP), API sequences can be viewed as semantically rich text, enabling the application of machine learning (ML) and deep learning techniques to analyze API call sequences [21][22][23]. K-nearest neighbor (KNN), naive Bayes (NB), decision tree (DT), and support vector machine (SVM) are widely used in the analysis of API sequences [24]. The authors of [25] propose an efficient malware detection system based on deep learning, which uses a reweighted class-balanced loss function in the final classification layer of the DenseNet model to significantly improve the performance of malware classification by addressing imbalanced data issues. Huang [26] proposed a hybrid visualization approach for malware that integrates static and dynamic analysis, transforming code into images and conducting malicious detection based on the VGG16 network, thereby improving the detection model's performance. However, due to the typically unbalanced distribution of malware and normal software, deep learning-based methods for detecting malicious programs still exhibit a high false detection rate and require further refinement.
A frequent sequence pattern is a subsequence that occurs more often than a specified threshold in a sequence database, indicating a regular behavior or trend. The PrefixSpan algorithm [27] is a prominent method for mining frequent sequence patterns from sequence data; it employs prefix projection and a depth-first search strategy to efficiently discover frequent patterns. This algorithm is highly effective in various fields, including text processing, web log analysis, and bioinformatics. The core of the PrefixSpan algorithm lies in identifying frequent items in each prefix projection, adding them to the prefix to form a new prefix, and continuing this recursive projection and expansion until no further frequent items can be added [28]. Consequently, the PrefixSpan algorithm starts with frequent sequences of length 1, progressively generates longer frequent sequences, and ultimately obtains all frequent sequence patterns in the database. The advantage of the PrefixSpan algorithm is that it does not require generating candidate sequences, which reduces computational and storage requirements, and it leverages the prefix information of the sequence, minimizing unnecessary search space and enhancing efficiency.

Model Framework
In this paper, we propose a malicious program detection model based on API call sequences, and its architecture is shown in Figure 1. Initially, the API call sequences to be tested are extracted and subsequently preprocessed. The preprocessed API call sequences are then fed into the behavior sequence matching model for sequence matching. A successful match is denoted as 1, indicating a malicious sequence, while an unsuccessful match is denoted as 0, indicating a normal sequence. Sequences marked as 0 are subsequently input into the TextCNN model for detection, and the detection results from the TextCNN model are considered the final detection results, superseding the results from the behavior sequence matching model.
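The dispatch logic of this two-stage framework can be sketched as follows; `rule_matcher` and `textcnn_predict` are hypothetical stand-ins for the two trained components, each returning 1 (malicious) or 0 (normal):

```python
def detect(api_sequence, rule_matcher, textcnn_predict):
    """Two-stage detection: rule matching first, TextCNN on non-matches."""
    if rule_matcher(api_sequence) == 1:
        return 1  # matched a sequence in the malicious behavior rule base
    # Unmatched (label 0) sequences get their final verdict from TextCNN.
    return textcnn_predict(api_sequence)
```

Note that the TextCNN verdict is only consulted for sequences the rule base fails to match, which is what lets the cheap matcher filter obvious malicious samples before the neural network runs.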

Data Pre-Processing
Preprocessing API call sequences to remove redundant behavior is an important step in sequence mining. Malicious code frequently incorporates numerous redundant behaviors into normal behaviors, resulting in sequences characterized by multiple consecutive identical APIs or API sequence fragments. This increases the length and complexity of the sequences, leading to an increase in the time and space overhead of sequence mining. Moreover, it impacts the precision and interpretability of the mining process, making it difficult to identify the core behaviors of the malicious code.
In this paper, we deduplicate API call sequences to reduce their complexity. The deduplication process retains only one instance of APIs that are repeated consecutively within a sequence. The specifics of this process are outlined in Algorithm 1.
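A minimal Python sketch of this deduplication, equivalent in spirit to Algorithm 1 and assuming the sequence is given as a whitespace-joined string of API names, is:

```python
def deduplicate(api_sequence):
    """Collapse runs of consecutive identical APIs, keeping one per run."""
    new_api_list = []
    for api in api_sequence.split():
        # Append only when the API differs from the previous one kept.
        if not new_api_list or new_api_list[-1] != api:
            new_api_list.append(api)
    return " ".join(new_api_list)
```

Non-adjacent repeats are deliberately preserved, since only consecutive duplicates are treated as redundant behavior.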

Model Construction
Building a malicious program detection model based on the API call sequence is the key to this study. In the process of API sequence matching, we employ the PrefixSpan algorithm to mine frequent API call sequences and establish a rule base of malicious behavior sequences. This rule base, containing malicious behavior sequences, serves as the foundation for subsequent detection.
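A compact, illustrative implementation of the PrefixSpan idea, simplified to single-item events (which matches API call sequences, where each event is one API name), might look like:

```python
from collections import defaultdict

def prefixspan(sequences, min_support, prefix=None):
    """Recursively mine frequent subsequences via prefix projection.

    Returns a list of (pattern, support) pairs; support counts the number
    of sequences containing the pattern as an ordered subsequence."""
    prefix = prefix or []
    results = []
    counts = defaultdict(int)
    for seq in sequences:
        for item in set(seq):  # count each item at most once per sequence
            counts[item] += 1
    for item, support in counts.items():
        if support < min_support:
            continue
        new_prefix = prefix + [item]
        results.append((new_prefix, support))
        # Project the database: keep each sequence's suffix after the
        # first occurrence of item, then recurse on the projection.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        projected = [s for s in projected if s]
        results.extend(prefixspan(projected, min_support, new_prefix))
    return results
```

Because each recursion works only on projected suffixes, no candidate sequences are generated, which is the efficiency property the algorithm is known for.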
In this paper, we consider API call sequences from different processes of the same program. Initially, the PrefixSpan algorithm is employed to mine frequent API call sequences within each process of a malicious program, identifying key API call sequences for each program. Subsequently, the PrefixSpan algorithm is further applied to mine frequent sequences from the key API call sequences of all programs, yielding malicious API call sequences that are stored in the malicious behavior sequence repository. This repository is a collection of sequence sets that encapsulate the typical behavioral characteristics of various malicious programs and can be used to assess the behavior of unknown programs. By analyzing the API call sequences of different processes, we can more comprehensively capture the behavior patterns of malicious programs, enhancing the accuracy and interpretability of the detection model. Examples of malicious API call sequences are illustrated in Figure 2.

In this study, we employ regular expressions for sequence matching. Regular expressions utilize metacharacters, quantifiers, grouping, assertions, and additional grammatical constructs to formulate complex matching rules, catering to a wide range of matching requirements. Sequence matching is conducted on the API call sequences of the test set, using the malicious behavior sequence repository. This results in a label of 1 for successful matches, denoting a malicious sequence, and a label of 0 for unsuccessful matches, signifying a normal sequence.
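The regular-expression matching step can be sketched as follows; the rule-base entries here are hypothetical examples for illustration, not sequences actually mined in this paper:

```python
import re

# Hypothetical malicious behavior sequences (stand-ins for mined rules).
RULE_BASE = [
    ["RegOpenKey", "RegSetValue", "CreateProcess"],
    ["FindFirstFile", "ReadFile", "WriteFile", "DeleteFile"],
]

def match_sequence(api_sequence, rule_base=RULE_BASE):
    """Return 1 if any rule matches as an ordered subsequence, else 0."""
    text = " ".join(api_sequence)
    for rule in rule_base:
        # Allow arbitrary APIs between the rule's calls, preserving order.
        pattern = r"\b" + r"\b.*\b".join(map(re.escape, rule)) + r"\b"
        if re.search(pattern, text):
            return 1
    return 0
```

Joining the rule's APIs with `.*` is what makes the match tolerant of noise APIs interleaved between the malicious calls, at the cost of scanning each test sequence against every rule one by one.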
After API sequence matching, the sequences marked as 0 are further detected using the TextCNN model. The detection process of malicious programs in the neural network model is shown in Figure 3.
1. Data preprocessing: remove redundant APIs that appear consecutively in the sequence, retaining only one instance of each API;
2. Data vectorization: the Keras toolkit is used for word vectorization. Treating the API sequence as text, the "fit_on_texts" method builds the vocabulary from the text, the "texts_to_sequences" method converts the text into integer-id vectors, and the "pad_sequences" method pads or truncates text sequences of different lengths to a uniform length, ensuring consistency in the input data, with the length set to 5000;
3. Training model: the detailed structure of the TextCNN model is shown in Figure 4.
The training set data are input to the Input layer, and word embedding is performed in the Embedding layer to obtain the vector X = {X_1, X_2, ..., X_n}, where X_i denotes the vector representation of the ith API sequence, x_i denotes the vector of the ith API in a sample's API call sequence, and n denotes the number of samples in the training set. The vector X_i is input into a dropout layer to prevent overfitting, and then into the TextCNN module. Firstly, feature extraction is performed by three convolutional modules with kernel sizes of 1, 3, and 5; each branch is then pooled with MaxPooling1D, and Concat is applied for feature fusion to obtain the vector M_i = {m_1, m_2, ..., m_5000}; finally, the vector F_i is obtained by flattening with the flatten layer.
The vector F_i is input into a dropout layer to obtain the vector D_i, preventing overfitting. D_i is then passed through a dense layer, followed by a further dropout layer and a final dense layer. The softmax function is employed as the classifier to obtain the classification results.
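The multi-branch convolution, pooling, and fusion described above can be sketched numerically. This is an illustrative NumPy sketch with random weights and a toy embedding, showing the shape logic of the three kernel-size branches, not the trained Keras model:

```python
import numpy as np

def conv1d_maxpool(x, kernel_size, num_filters=2, seed=0):
    """One TextCNN branch: valid 1-D convolution over an embedded
    sequence x of shape (length, emb_dim), then global max pooling."""
    rng = np.random.default_rng(seed)
    length, _ = x.shape
    w = rng.standard_normal((num_filters, kernel_size, x.shape[1]))
    out_len = length - kernel_size + 1
    feats = np.empty((num_filters, out_len))
    for f in range(num_filters):
        for i in range(out_len):
            feats[f, i] = np.sum(x[i:i + kernel_size] * w[f])
    return feats.max(axis=1)  # global max pooling per filter

# Fuse branches with kernel sizes 1, 3, 5, as in the paper's TextCNN.
x = np.random.default_rng(1).standard_normal((50, 8))  # 50 APIs, 8-dim embedding
fused = np.concatenate([conv1d_maxpool(x, k) for k in (1, 3, 5)])
```

With 2 filters per branch the fused feature vector has 6 entries; in the actual model each branch has many more filters and the result is flattened before the dense layers.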
The TextCNN configuration is shown in Table 1.
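The vectorization in step 2 can be mirrored without the Keras toolkit. This is a simplified sketch: ids follow first appearance rather than Keras's descending-frequency ordering, while the padding/truncation mimics Keras's default "pre" behavior:

```python
def fit_on_texts(samples):
    """Build an API-name -> integer-id vocabulary (1-indexed)."""
    vocab = {}
    for sample in samples:
        for api in sample.split():
            vocab.setdefault(api, len(vocab) + 1)
    return vocab

def texts_to_sequences(samples, vocab):
    """Map each whitespace-joined API string to a list of integer ids."""
    return [[vocab[a] for a in s.split() if a in vocab] for s in samples]

def pad_sequences(seqs, maxlen):
    """Left-pad with 0 (or keep the last maxlen ids) so every sequence
    has length maxlen, as Keras does by default."""
    out = []
    for s in seqs:
        s = s[-maxlen:]
        out.append([0] * (maxlen - len(s)) + s)
    return out
```

In the paper `maxlen` is 5000; the sample API names below are hypothetical.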

Datasets
In this paper, we utilize the dataset of the AliCloud Security Malicious Program Detection Challenge [29]. The data provided by the competition are derived from the API command sequences of Windows binary executable programs after simulated execution in a sandbox environment, with a total of 13,887 files and nearly 90 million call records, of which 4978 are normal files and 8909 are malicious files. There are a total of 7 types of malicious files: worms, infection viruses, Trojans, mining programs, ransomware, backdoor programs, and DDoS Trojans. The distribution of sample types and specific numbers are presented in Table 2. The statistics of API sequence length before and after deduplication are shown in Figure 5. We quantified the number of APIs in the samples before and after deduplication. As shown in the figure, before deduplication, 65% of the samples had an API sequence length of less than 5000. Following deduplication, this proportion increased to 83%. The deduplication process effectively shortened the length of the API sequences, thereby enhancing the efficiency of subsequent analyses.

Performance Metrics Used
For model evaluation, we employed Accuracy, Precision, Recall, F1-score, and execution time as metrics. The specific formulas are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = (2 × Precision × Recall) / (Precision + Recall)

where TP (True Positive) represents the number of instances that are actually positive and are correctly predicted as positive, TN (True Negative) represents the number of instances that are actually negative and are correctly predicted as negative, FP (False Positive) represents the number of instances that are actually negative but are predicted as positive, and FN (False Negative) represents the number of instances that are actually positive but are predicted as negative. Accuracy represents the ratio of correctly classified samples to all samples; Precision represents the proportion of true positive instances among the instances predicted as positive; Recall represents the proportion of correctly predicted positive instances among all positive instances; the F1-score is the harmonic mean of Precision and Recall, balancing both. A higher value of each metric indicates better classification performance.
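The four metrics can be computed directly from confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

The counts in the usage below are illustrative, not results from this paper's experiments.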

Experimental Setup
Experiments were performed on a PC with an Intel Core i5-1240P CPU running at 1.70 GHz.We completed the experiment based on Python 3.8, Keras, Scikit-learn and Matplotlib.The experimental hyper-parameter settings for the TextCNN in this paper are shown in Table 3.

Results
To verify the performance of the model, we compared it on the same dataset with the following models: sequence matching, the CNN model, and the TextCNN model.

From the comparative results in Table 4, it can be observed that the Recall value of sequence matching is relatively high, indicating its effectiveness in capturing malicious samples. A network with multiple convolutional kernel sizes allows for a more comprehensive extraction of API features, resulting in better detection performance than a model with a single convolutional layer; hence, the CNN model's performance is not as strong as that of the TextCNN model. By combining sequence matching and the TextCNN model, high values for Accuracy, Precision, Recall, and F1-score are achieved. However, since sequence matching is performed one by one using regular expressions, it results in a longer detection time; our model, which combines sequence matching and a deep learning model, exhibits the longest execution time.

To obtain optimal classification performance, determining the appropriate sizes of the convolutional kernels in the TextCNN model is essential. The effectiveness of one-dimensional convolutional neural networks for text categorization lies in employing convolutional kernels of varying lengths to extract local features of the sequence. Utilizing multiple kernel sizes allows for a more comprehensive capture of the information between APIs. In this paper, we fuse multiple convolutional kernels of different sizes to generate various TextCNN models. The experimental results are presented in Table 5. From the comparison results, it can be seen that the TextCNN model achieves its highest accuracy of 0.9279 when the convolutional kernel sizes are 1, 3, and 5. Therefore, in this paper, we choose the TextCNN model with kernel sizes 1, 3, and 5 for the detection of malicious programs.
The confusion matrix of the TextCNN model (1, 3, 5) is shown in Figure 6. Furthermore, to validate the efficacy of deep learning, machine learning algorithms are employed for comparison. The N-Gram algorithm is utilized for feature extraction from API call sequences, and the optimal size of n in the N-Gram algorithm must be determined to achieve the best classification performance. The value of n is set to 2, 3, and 4, and the random forest algorithm (with the number of decision trees set to 500), logistic regression, and the KNN algorithm are selected for model training.
The results of machine learning experiments with different values of n in the N-Gram algorithm are presented in Table 6. The detection results of each machine learning algorithm are shown in Figure 9. According to Figure 9 and Table 6, for the same value of n, the random forest algorithm outperforms logistic regression and KNN in terms of accuracy, precision, and recall, and when n is 3, the random forest algorithm achieves its highest Accuracy, Precision, and Recall. All three algorithms exhibit slightly better detection results when n = 3 compared to n = 2 or 4. However, their performance is still inferior to that of deep learning. The comparison of the detection results of the machine learning algorithms, TextCNN, and our model is shown in Figure 10.
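The N-Gram feature extraction over API names can be sketched as a sliding-window count; the API names in the test are hypothetical:

```python
from collections import Counter

def ngram_features(api_sequence, n=3):
    """Count sliding n-grams of API names as features for ML classifiers."""
    # zip over n staggered views of the sequence yields all n-grams in order.
    grams = zip(*(api_sequence[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)
```

The resulting counts can be fed to scikit-learn classifiers such as random forest after vectorization (e.g., via `DictVectorizer`).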
According to Figure 10, it can be seen that the model we proposed outperforms other machine learning algorithms and TextCNN in terms of performance.

Discussion
In this paper, we use the public dataset from the AliCloud Malicious Program Detection Challenge. This dataset is derived from the API instruction sequences of Windows executable programs after sandbox simulation. Imbalance between categories exists in real-world scenarios, so an unbalanced dataset better simulates the actual situation. However, the disparity between different categories and the relatively small size of this dataset compared to the expanding family of malicious code may impact detection results.
In this paper, we focus on dynamic API call sequences as the primary subject of study. However, dynamic analysis relies on the samples executing in the sandbox and producing an execution report. As malicious code countermeasures improve, anti-virtualization techniques used by malicious code can evade detection, preventing the samples from successfully executing in the sandbox. In the future, we will consider employing static analysis for further feature extraction and combining dynamic and static approaches for malicious code detection.
Although the malicious samples in the training set encompass various types of malicious code, their behavior remains constrained. In the future, a more comprehensive analysis of program behaviors can be conducted manually to uncover potential connections among multiple APIs. Enriching the behavioral dataset will be more conducive to the detection of malicious programs.

Conclusions
In this paper, we initially mitigate the impact of repetitive information in API call sequences, then analyze the API call sequences of the program to extract behavioral characteristics. Based on these characteristics, we employ a combination of rule matching and deep learning to detect malicious programs. Firstly, malicious sequences are filtered out using behavior sequence matching. The remaining sequences are then examined using the TextCNN model, whose detection results are taken as the final outcomes, superseding those of the behavioral sequence matching model, to achieve more effective detection of malicious sequences. Since this study only considers the names of the APIs, disregarding information such as parameters, a future direction is to incorporate API parameter information to augment the expressive capability of the sequences and enhance the accuracy of malicious program detection. For instance, parameter information of API calls related to file operations could include file names, paths, sizes, and permissions; such information can aid in identifying potentially malicious behaviors in the program, such as deleting, modifying, and hiding important files. In future work, we will also continue to collect samples to balance the amount of data across categories, since category imbalance may lead the model to favor categories with more data while overlooking the characteristics of categories with less data.

Figure 1 .
Figure 1. Malicious program detection model framework based on API call sequences.

Figure 3 .
Figure 3. Neural network model malicious program detection process. The specific steps for constructing the TextCNN malicious program detection model are as follows:

Figure 5 .
Figure 5. API sequence length statistics before and after deduplication. (a) API sequence length statistics before deduplication. (b) API sequence length statistics after deduplication.

•
TextCNN model: This model employs multiple convolutional kernels of different sizes (1, 3, 5) to extract key information from the program;
• Sequence Matching: We use the PrefixSpan algorithm to mine the API call sequences of all malicious programs, obtaining the malicious API call sequences, and then perform sequence matching using these sequences;
• CNN model: We use Conv1D with a convolutional kernel size of 3 to extract local features by sliding the kernel over the input sequences;
• Our model: We initially employ the behavioral sequence matching model to match the API call sequences against malicious sequences, marking successful matches as 1 and unsuccessful matches as 0. Sequences marked as 0 are then input into the TextCNN model for detection, and the TextCNN detection result is taken as the final result, overriding the outcome of the behavioral sequence matching model.

For the neural network models, this paper carried out 5-fold cross-validation on the training data, dividing the training set into training and validation sets and training the model for each fold.

Figure 6 .
Figure 6. Confusion matrix of the TextCNN model (1, 3, 5). The AUC value represents the area under the ROC curve. It serves as an evaluation metric to assess the quality of a classification model and effectively describes the model's overall performance. The ROC curves for each class of the TextCNN model with convolutional kernel sizes of 1, 3, and 5 are illustrated in Figure 7.

Figure 8
Figure 8 shows the ROC curves of the four TextCNN models, from which it can be seen that the TextCNN model with convolutional kernel sizes 1, 3, and 5 exhibits the largest AUC value, indicating superior performance compared to the other models.

Figure 8 .
Figure 8. ROC curves of the four TextCNN models.

Figure 9 .
Figure 9. The detection results of three machine learning algorithms. (a) The statistical chart of Accuracy. (b) The statistical chart of Precision. (c) The statistical chart of Recall.

Figure 10 .
Figure 10. Comparison chart of performance metrics of different classifiers.

Table 4 .
Comparison of evaluation results of the models mentioned above.

Table 5 .
Comparison of evaluation results of the models mentioned above.The contents of the parentheses refer to the size of convolutional kernels.

Table 6 .
Results of machine learning experiments with different values of n in N-Gram.