Some Improvements of Behavioral Malware Detection Method Using Graph Neural Networks

Tarapata, Zbigniew; Romańczuk, Jan

doi:10.3390/app152111686

by Zbigniew Tarapata * and Jan Romańczuk

Faculty of Cybernetics, Military University of Technology, Gen. Sylwestra Kaliskiego 2 Street, 00-908 Warsaw, Poland

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(21), 11686; https://doi.org/10.3390/app152111686

Submission received: 16 October 2025 / Revised: 29 October 2025 / Accepted: 30 October 2025 / Published: 31 October 2025

Abstract

This study proposes improvements to a behavioral malware detection method based on graph convolutional networks (GCNs). Three main modifications were investigated: improved normalization of the adjacency matrix, a multi-layer GCN architecture, and a parallel dual-normalization model. The models were trained on a dataset of 44,000 Windows API call sequences and evaluated using standard metrics—accuracy, precision, recall, F1 score, and ROC AUC. The best performance was achieved by the four-layer GCN, which outperformed the baseline in most metrics. The results also showed a non-monotonic relationship between model quality and network depth, likely caused by over-smoothing effects. This study confirms that properly tuned GCN architectures can significantly improve the accuracy and robustness of malware detection.

Keywords: graph neural networks; graph convolutional networks; malware detection; behavioral analysis; cybersecurity; deep learning

1. Introduction

A graph neural network (GNN) is a neural network that analyzes data represented by a graph. The primary purpose of such a network is to incorporate vertex neighborhood information during machine learning [1]. In the field of cybersecurity, GNNs have several applications, including the detection of malicious accounts [2,3], malware detection [4], detection of deep fakes in images and video [5], fraud detection [6,7], software vulnerability detection [8,9,10], forensic analysis of memory [11], and binary code analysis [12,13,14]. The problem of detecting deep fakes in images and video, a recent and highly relevant issue, has been the subject of thorough literature reviews [15,16].

In this paper, we focus on malware detection based on graph machine learning. Malware detection methods are based on several approaches. One group of methods is based on machine learning, particularly graph-based learning; we will focus exclusively on this group. A graph or network of software behavior can be a suitable representation to characterize malware. Such a graph can represent both low-level activities, such as calls to operating system API functions [17], and high-level representations of communication between services in a computer network. Classic malware detection methods rely on heuristic feature engineering based on expert knowledge and previously detected virus knowledge bases, which puts them several steps behind the latest types of malware. The use of graph machine learning (GML) for automatic feature extraction from function call graphs can increase malware detection effectiveness in a dynamic manner, as early as during the execution of the tested program.

Despite numerous studies exploring graph-based malware detection, most existing works have focused on high-level architecture modifications (e.g., attention mechanisms or hybrid LSTM–GNN models) or heuristic graph similarity measures. However, they rarely examine the structural and normalization aspects that fundamentally affect the stability and expressiveness of GCN models.

The baseline method proposed by Oliveira and Sassi [4] uses a single or double convolutional layer and a simple left-normalization of the adjacency matrix, which may lead to gradient instability and limited neighborhood propagation.

In contrast, our work addresses three under-explored directions:

(i): the use of symmetric normalization, which better preserves feature balance and spectral smoothness in directed behavioral graphs;
(ii): the investigation of multi-layer architectures, allowing deeper propagation of contextual information while studying over-smoothing effects; and
(iii): a parallel dual-normalization model, designed to capture bidirectional dependencies between API call sequences.

These enhancements fill a clear research gap by systematically analyzing architectural and normalization refinements that directly influence the representation of malware behavior—an aspect largely overlooked in previous behavioral malware detection studies.

This paper is organized into five main sections. Section 1 (Introduction) presents the motivation for using graph-based machine learning methods in cybersecurity, especially for behavioral malware detection. It also outlines the main objectives of the study and summarizes related work. Section 2 (Materials and Methods) provides an overview of graph neural network (GNN) architectures, with a focus on graph convolutional networks (GCNs) and graph attention networks (GATs). It also describes the original behavioral malware detection method and introduces three proposed improvements, including normalization refinements, multi-layer configurations, and a parallel dual-GCN model. Section 3 (Assumptions and Implementation) details the dataset, experimental setup, and hyperparameter configuration. It also explains how the improved models were implemented and trained using the PyTorch 2.1 framework. Section 4 (Results and Discussion) presents the evaluation results, comparing the performance of different GCN configurations. It analyzes the impact of the number of convolutional layers on model quality and discusses the observed non-monotonic dependence of the metrics. Section 5 (Conclusions) summarizes the findings, confirms the effectiveness of the proposed improvements, and outlines directions for future research on hybrid and more stable graph-based malware detection architectures.

In this paper, we use abbreviations that are frequently used in the literature. Nevertheless, we will list them at the beginning to make the article easier to read:

API—Application Programming Interface;
CFG—Control Flow Graph;
GAT—Graph Attention Network;
GCN—Graph Convolutional Network;
GED—Graph Edit Distance;
GML—Graph Machine Learning;
GNN—Graph Neural Network.

2. Materials and Methods

2.1. Graph Convolutional Networks and Graph Attention Networks: Short Overview

Here, we review two networks that we specifically use in this study: graph convolutional networks (GCNs) and graph attention networks (GATs). Convolutional neural networks are among the most relevant deep machine learning methods and excel at aggregating features of data that can be projected onto a plane, such as an image. Their graph generalizations are graph convolutional networks, which allow the aggregation of multidimensional and non-Euclidean data features. In graph neural networks, convolution is realized analogously to that in classical convolutional networks. The convolutional layer aggregates neighboring values for a given node in the graph, similarly to what happens to a pixel and its neighboring pixels in an image. If we consider an image to be an array of pixels on a plane, then its graph counterpart is a simple undirected graph, where the image pixels are the graph’s vertices, and the relations between neighboring pixels are its edges. In both classical and graph convolution, a major role is played by the aggregation method, that is, the convolution of the original layer with the indicated filter. The choice of size, value, and aggregation function determines with what intensity and in which directions the features of a given vertex or pixel propagate. The difference between classical and graph convolution is that the filter window in classical convolution is determined arbitrarily, or perhaps even heuristically, whereas graph convolution benefits from the structure of the graph itself. One of the most important characteristics of a vertex is its degree, which is very simple to determine. The degree is defined by the number of edges incident to a given vertex and can easily be used to identify the most important nodes and, more generally, to rank them.
This parameter can be used in defining a graph convolution aggregation method, which is based precisely on ranking the neighbors of a given vertex; this ranking makes it possible to determine the relation through which the feature of the analyzed vertex will propagate the most. The result of the convolution aggregation method can be regarded as a weighted average, in which the weights represent the degrees of importance of the relations and the objects averaged are the features of the neighboring vertices. Thus, a convolution operation on a single node can be defined as the product of a vector of its neighbors’ features with a vector of weights assigned to each relation [1]. By defining the inverse of the vertex’s degree as the weight of a given neighboring relation, we cause the feature of the vertex to propagate most to the neighbors that are proportionally most influenced by the original vertex and least to those for which the neighborhood is less important.

Graph attention networks (GATs) address some of the shortcomings of graph convolutional networks, as they provide a more expressive mechanism for reflecting the neighborhood relations of vertices, beyond just their degrees. The motivation becomes clear when analyzing message passing in a graph. By treating the feature vector of a given vertex as an information resource, one can interpret the propagation of these features as a message-passing process in the graph. Following this line of reasoning, neighbor feature convolution methods implement message propagation in which the transition probabilities to each neighbor follow a uniform distribution. This approach is quite unsophisticated, as feature propagation occurs in a purely random manner. The aim, therefore, is to find a transition probability distribution that assigns the highest propagation probabilities to the edges representing the most important relations. Such a distribution can, of course, be found heuristically or based on statistical analysis of the flows in the graph. A simple example is a network representing a street layout that stores traffic volume information in the edges, which, in this case, are natural determinants of the attractiveness, or importance, of vertex connections. Nevertheless, these methods have many disadvantages that are inherent in heuristics. They are static (historical data and expert knowledge are needed to use them), whereas graphs effectively represent dynamic data that can be analyzed in real time. The possibilities for expert analysis are also naturally limited by the size of the graph. Therefore, graph attention networks (GATs), based on the attention mechanism, were proposed by the authors of [18].

2.2. A Short Review of Malware Detection Methods Using GNNs

Malware detection methods are based on several approaches. GML graph similarity measures were first used for malware detection in the method described in [19]. This method uses the control flow graph (CFG) of a computer program at the function call level. Constructing such a graph allows the examined software samples to be effectively cataloged in a knowledge base. The flow graph can distinguish between different types of called functions: local, statically referenced, and dynamically referenced during program execution. The ability to characterize and distinguish between these functions can significantly influence whether software is classified as malware. To enable the cataloging and searching of known viruses based on the program CFG, the authors of [19] proposed a database engine in 2009 that can be used to effectively search previously classified programs. The authors used graph edit distance (GED) [20] as the main measure of similarity between function call graphs. This measure was improved by constructing bipartite graphs based on the vertices of the tested graph and comparing them with a database. A bipartite graph of the sets of vertices represents the similarity of the vertices between the graphs and constitutes their bijection. The Hungarian algorithm [19] is used to construct a complete matching in the bipartite graph.

The GED measure is not the only one used to determine the similarity of CFGs in malware detection. The authors of [21] use a normalized number of common edges between the graph of a program (or behavior) suspected of being a virus and a comparative graph from a database [17]. The graphs are compared vertex by vertex based on their feature vectors. A vertex feature can be, for example, the name of the called function or its signature in the form of a character string.

The comparability of graphs representing suspicious programs and graphs in the knowledge base can be improved by generalizing behavioral program control graphs to graphs representing high-level behavior sequences, which operate on groups of methods that perform similar functions instead of binary function records [22]. Building higher-level graphs can improve the generalization of software classification models, for example, to different programming environments or operating systems. The authors of [22] use high-level representation graphs for binary verification of whether the software, represented by a given graph, is malicious. Moreover, if a program is determined to be malware, the virus can be assigned to a specific family using multiple similarity measures, such as the Jaccard measure, the Bray–Curtis measure, and the cosine similarity of adjacency matrices.

Another example application of graph exploration methods is malware classification using the random forest method. The construction of a CFG can also be useful for extracting features, first from individual vertices (functions) and then from entire graphs. The authors of [17] use decompiled binary files representing suspicious software to build a directed CFG containing data on the sequence of system function calls, library functions, and local functions. Unfortunately, after decompilation, local functions do not have a human-readable signature, but this problem was solved by applying clustering to unlabeled graph vertices using the GED measure [20]. As described earlier, the GED measure has the disadvantage of high computational complexity, which increases very rapidly with the number of vertices in the graph. The authors have improved this proximity measure by applying the Locality-Sensitive Hashing (LSH) method, the idea of which is to map objects from a multidimensional space to a hashing space that assigns the same hash values to objects with a high probability of mutual similarity. This allows a suboptimal similarity solution to be found much faster than when using GED alone. However, due to the random nature of the LSH method, the authors [12] perform parallel sequences of function clustering, each of which processes its vertex feature embedding separately. Functions with assigned classes are then subjected to feature extraction into a vector. The vector embeddings generated from behavioral CFGs are then passed through random forest classifiers, which determine the probability distribution of the locations of a given embedding in each class. The final stage is to collect the results of the preliminary random forest classification in the final classifier layer, which returns the probability distribution for assigning the software to a given class (a certain family of malware or a class of harmless software).

Another example is malware detection using autoencoders [23]. The use of an autoencoder architecture in graph machine learning opens up many opportunities for advanced generation of embeddings of graph vertex features. This architecture involves defining a symmetric neural network with output and input layers of the same size. The task of such a network is to reproduce the input data as accurately as possible at the output. This problem is trivial when each layer has the same number of neurons. The situation becomes more interesting when the number of neurons in the hidden layers decreases from the input layer to the middle layer of the network. In this situation, the network will try to generate a compressed form of the feature vector in the smallest layer. The embeddings generated by the middle layer of this network can be used to effectively store features for the graph’s adjacency matrix. In [23], a dual architecture of autoencoders was used to generate embeddings for two different program behavior graphs. The first is the CFG of the suspicious program. Using the node2vec method [24], the dimension of its adjacency matrix is reduced to 500. Then, the low-dimensional matrix embedded in this way is fed into the SDA1 autoencoder. The second embedding, built based on 22,000 Windows system API functions, is a vector that takes the value 1 in cells corresponding to functions called in the suspicious program. The embeddings generated by both autoencoders are aggregated into a single output vector, which is passed through the ReLU (Rectified Linear Unit) activation function, $f(x) = \max\{0, x\}$, which returns a value classifying the software as malicious or harmless. The authors trained their model on a set containing both harmless and malicious software of various types and obtained an accuracy of over 99% on the test set.

Many methods for detecting malware based on user behavior analysis use graph machine learning. In the rest of this section, we describe several such approaches.

The authors of [25] present an approach to malware detection and classification based on network flow analysis using graph neural networks (GNNs). The NF-GNN model constructs data flow graphs and uses edge features for classification, which better captures communication patterns in the network. HAWK [26], on the other hand, is a system that detects malware in Android applications by modeling them as heterogeneous information graphs. It uses attention mechanisms in graphs to quickly and accurately detect malware, achieving high effectiveness with rapid detection. Another approach [27] uses Group Sequence Graphs to model the relationships between system calls during program execution. Analysis of the structural features of these graphs allows for effective detection and classification of malware, even if mutations have occurred. The research in [28] focuses on the use of control flow graphs (CFGs) to analyze program behavior. It integrates rule-based approaches with autoencoders to better understand and explain the decisions made by GNN models in the context of malware detection. The hybrid DeepCatra model, described in [29], combines recurrent neural networks (BiLSTM) with GNNs to analyze the behavior of Android applications. The model analyzes API call sequences and data flow graphs, enabling effective detection of malware on the Android platform. The DEGCN method proposed in [30] models software behavior as sequences of API call graphs, considering both local dependencies between calls and their evolution over time. The model uses a Dynamic Evolving Graph Convolutional Network for feature extraction and a Graph-encoding-based Gate Recurrent Unit (GGRU) for temporal pattern analysis. Research has shown that the DEGCN outperforms existing methods in malware detection accuracy. The method described in [31] uses system audit logs to extract semantic features of user activity. It then uses a GNN to analyze these data to detect malware.

A relevant line of work is represented by the study of Hong [32], which proposes a resilience recovery method for complex traffic networks based on trend forecasting. It introduces a SIRD-R fault propagation model and uses LSTM to forecast network resilience, followed by recovery strategies.

Our work, although in the domain of cybersecurity and behavioral malware detection, aligns with this paradigm of modeling propagation and forecasting to enhance resilience. Instead of traffic networks, we apply graph-based neural methods to capture malware behavior; instead of forecasting resilience, we forecast malicious activity and possible propagation paths; and instead of a recovery strategy for transport networks, we discuss detection and response strategies in IT infrastructures.

The example we focus on here is based on a publication by Oliveira and Sassi [4], who proposed a GCN architecture that classifies the graph representation of system function calls as malware or harmless software. A detailed description of this approach is presented in Section 2.3.2.

2.3. Improvement of Malware Detection Method Using Graph Neural Networks

2.3.1. Goals

The main objectives of our research were to propose and implement some potential improvements to the malware detection method defined in [4] and to verify whether other commonly used graph-oriented methods and fundamentals can be applied in the field of malware detection.

2.3.2. Detailed Description of the Method Being Improved

The method being improved is based on a publication by Oliveira and Sassi [4], who proposed a GCN architecture that classifies the graph representation of system function calls as malware or non-malware. A behavior graph is defined as G = (N, A), where N is an ordered set of vertices representing ordered API calls (control flow graph, CFG), and A ⊆ N × N is a set of arcs, in which an arc (a_i, a_j) ∈ A corresponds to the temporal relationship between two consecutive API calls a_i and a_j, with a_i being called before a_j. The authors collected about 50,000 samples of software behavior and packaged them into a data frame containing, among other information, a record of the sequence of calls to the various APIs in the form of $x = (x_{0}, x_{1}, \dots, x_{L - 1})$, $x_{i} \in N$, where L is the number of calls. Each element of the ordered tuple x is an index in the interval [0, 306], identifying the API function that the analyzed program, or user, calls. The behavioral sequences are classified as malware with a class value of 1 and as goodware with a value of 0. The tuple x defined above, representing a single behavioral sequence, can be interpreted as a behavioral directed graph with a matrix of vertex transitions $P = [P_{ij}]_{n \times n}$, where $n = |N| = 306$, and

P_{ij} = \begin{cases} 1, & \text{if } (x_{i}, x_{j}) \subseteq x; \\ 0, & \text{otherwise}. \end{cases}

(1)

Constructing the transition matrix in the considered situation is not encumbered by the so-called curse of dimensionality, as the total number of network nodes under study is 306, which is an acceptable order of magnitude for such calculations. For example, let N = (0, 1, 2, 3) be an ordered set of system API calls. Figure 1 shows the behavior graph resulting from the vertex transition matrix derived from (1) and applied to the following sequence of API calls: x = (0, 1, 2, 0, 2, 3). The input vector for the diagram shown in Figure 1 is an ordered set of indices representing called system functions. From this set, a sequence of calls is constructed, which is already a directed graph in the form of a singly linked list, with some vertices repeating. Step I in the figure generates a behavioral graph in which the multiple occurrences of vertices are replaced by cycles. The behavior graph, in contrast to the call sequence, ceases to be a list in most cases. Step II in Figure 1 involves processing the input to the network’s convolutional layer. The convolutional layer is based on the product of the adjacency matrix of the behavior graph and the feature matrix representing the list of call sequences. The convolutional layer propagates the features of vertices to their neighbors using the inverse of the vertex degree matrix. Step III generates new embeddings of the behavior graph features, which can be passed through the activation function, as shown in Figure 1, or can be used as input to the next convolutional layer.
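Step I of the worked example can be traced with a few lines of Python. This is a minimal sketch, not the code of [4]: the function name `transition_matrix` is hypothetical, and it assumes that the condition in (1) means "call j directly follows call i somewhere in x".

```python
def transition_matrix(calls, n):
    """Return the n x n 0/1 matrix P with P[i][j] = 1 when call j directly follows call i."""
    P = [[0] * n for _ in range(n)]
    for src, dst in zip(calls, calls[1:]):  # every pair of consecutive calls
        P[src][dst] = 1                     # repeated pairs collapse into a single arc
    return P


x = (0, 1, 2, 0, 2, 3)           # the call sequence used in the text
P = transition_matrix(x, n=4)    # N = (0, 1, 2, 3)
for row in P:
    print(row)
```

Under this reading, the arcs 0 → 1, 1 → 2, and 2 → 0 form a cycle, which is why the behavior graph is no longer a list.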

The architecture proved to be as effective as the LSTM (Long Short-Term Memory, a type of recurrent neural network) network used for the same purpose. Although both models were able to achieve over 98% accuracy, the GCN model is far more promising. It can provide a development base for increasing accuracy by extending the behavior graph to include system function call parameters, execution time, the process ID of the current program, and so on. In the next section, we show how to improve the presented method.

2.3.3. Description of Research Methodology

This research was conducted by defining three standalone potential improvements to the method in [4] for malware detection. Two of the three ideas are based on conclusions from fundamental publications in the field of graph-based machine learning. The first potential improvement to the graph convolutional network (GCN) method involves better normalization of the adjacency matrix in the graph convolution mechanism. The method developed by the authors of [4] utilizes a deep neural network, a binary classifier that takes directed graphs as input, each representing the sequence of Windows API calls made by the analyzed software. A substantial part of graph convolution is the normalization of the input graph’s adjacency matrix. The authors perform normalization as follows:

S = D^{- 1} \hat{A}

(2)

where

$S$: normalized adjacency matrix;
$\hat{A}$: adjacency matrix with self-loops;
$D^{-1}$: inverse of the degree matrix of the vertices.

As a result of normalization, we obtain an adjacency matrix representing a graph embedding where the vertex feature value is derived from the weighted average of the feature values of neighboring vertices. The weights of the weighted average are the inverses of the degrees of these vertices. This aggregation method is simpler than the one defined in [33], which is considered the most popular and effective method of directional feature propagation in graphs. This method allows us to define a normalized matrix as follows:

S = D^{- \frac{1}{2}} \hat{A} D^{- \frac{1}{2}}

(3)

where

$D^{- \frac{1}{2}}$ : the degree matrix D raised to the power of $- \frac{1}{2}$ .

The application of method (3) could improve the quality of normalization of the adjacency matrix in the API call CFG.
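The difference between (2) and (3) can be seen on a four-vertex graph. The sketch below is illustrative only; the helper names are hypothetical, and it assumes that D holds the row sums of $\hat{A}$ (out-degree plus the self-loop), which the text does not state explicitly for directed graphs.

```python
import math


def add_self_loops(A):
    """Return A_hat: the adjacency matrix with ones on the diagonal."""
    n = len(A)
    return [[1 if i == j else A[i][j] for j in range(n)] for i in range(n)]


def row_normalize(A_hat):
    """Equation (2): S = D^{-1} A_hat, with D the diagonal matrix of row sums."""
    return [[a / sum(row) for a in row] for row in A_hat]


def sym_normalize(A_hat):
    """Equation (3): S = D^{-1/2} A_hat D^{-1/2}, with the same D."""
    d = [sum(row) for row in A_hat]
    n = len(A_hat)
    return [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]


# Example directed graph on four vertices (arcs 0->1, 0->2, 1->2, 2->0, 2->3).
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0]]
A_hat = add_self_loops(A)
S_row = row_normalize(A_hat)
S_sym = sym_normalize(A_hat)
```

With (2), every row of S sums to one, so each vertex receives a plain average. With (3), the entry for an arc is scaled by the degrees of both endpoints, so the rows no longer sum to one.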

The authors of [4] utilize either a single-layer or two-layer GCN model. The architecture of the two-layer network is implemented conventionally: convolutional layers are connected in sequence, and nonlinear operations are applied to their outputs to counteract the vanishing gradient problem. Furthermore, the outputs of both layers are concatenated in the final fully connected layer. In the work of [34], it was shown that simplifying the multi-layer architecture of a graph deep network can improve the model’s quality while simultaneously reducing the time and memory complexity of the learning algorithm. Therefore, for our purposes, the multi-layer architecture of the network is implemented as follows (Improvement 1):

\hat{S} = S^{K} X W

(4)

where

$K$ : number of layers;
$S^{K}$ : normalized matrix to the power of $K$ ;
$X$ : feature vector;
$W$ : weight vector.
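Equation (4) amounts to K propagation steps with the normalized matrix followed by a single linear map. The following pure-Python sketch illustrates this; the function names and the small matrices are made up for the example and are not taken from the implementation described here.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def simplified_gcn(S, X, W, K):
    """Equation (4): compute S^K X W without nonlinearities between hops."""
    H = X
    for _ in range(K):
        H = matmul(S, H)   # one hop of neighborhood propagation
    return matmul(H, W)    # a single learned linear transformation


S = [[0.5, 0.5],
     [0.0, 1.0]]           # toy normalized adjacency matrix
X = [[1.0, 0.0],
     [0.0, 1.0]]           # one-hot vertex features
W = [[1.0],
     [2.0]]                # toy weights
out = simplified_gcn(S, X, W, K=2)
print(out)
```

Because no activation separates the hops, $S^{K} X$ can be computed once before training, which is where the reduction in time and memory comes from.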

The second potential improvement to the malware detection method [4] is the addition of the attention mechanism (GAT) [18]. The GAT allows for the application of a feature aggregation function that is more tailored to the graph structure than a weighted average, since the selection of appropriate weights for normalizing the adjacency matrix is handled by the simplest neural network—a single-layer perceptron. GAT operation weights, according to [18], are calculated as follows:

\alpha_{ij} = \mathrm{softmax}_{j}\left(a(W h_{i}, W h_{j})\right) = \frac{\exp\left(a(W h_{i}, W h_{j})\right)}{\sum_{k \in \mathcal{N}_{i}} \exp\left(a(W h_{i}, W h_{k})\right)}

(5)

where

$\mathrm{softmax}$: activation function;
$\alpha_{ij}$: feature transmission weight from vertex $i$ to $j$;
$a$: linear attention operation;
$W$: weights vector;
$h_{i}, h_{j}$: feature vectors for vertices $i$ and $j$;
$\mathcal{N}_{i}$: set of neighbors of vertex $i$.
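The coefficients in (5) can be computed by hand for one vertex. The sketch below makes two assumptions: the attention operation a is a dot product of a vector `a` with the concatenation of the two transformed feature vectors, and no extra nonlinearity is applied, as in (5) as written (the formulation in [18] additionally applies a LeakyReLU to the score). All names and numbers are illustrative.

```python
import math


def linear(W, h):
    """Apply the weight matrix W (list of rows) to the feature vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_weights(i, neighbors, h, W, a):
    """Equation (5): softmax over the neighbors j of vertex i of a(W h_i, W h_j)."""
    zi = linear(W, h[i])
    scores = {}
    for j in neighbors:
        zj = linear(W, h[j])
        # Assumed form of the attention operation: a . [W h_i || W h_j]
        scores[j] = sum(ak * zk for ak, zk in zip(a, zi + zj))
    m = max(scores.values())   # subtract the maximum for numerical stability
    exps = {j: math.exp(s - m) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}   # toy vertex features
W = [[1.0, 0.0], [0.0, 1.0]]                        # identity, for readability
a = [0.0, 0.0, 1.0, 1.0]                            # toy attention vector
alpha = attention_weights(0, [1, 2], h, W, a)
print(alpha)
```

Unlike the fixed degree-based weights of (2) and (3), these weights depend on the vertex features, so two neighbors of the same vertex can receive different shares.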

Our final proposed improvement to [4] is a parallel network that utilizes the same GCN as in Improvement 1. The model implementing this method consists of two GCNs that are trained simultaneously, with their outputs concatenated and passed through a fully connected layer. The first network uses the adjacency matrix normalized by the degree matrix of the output vertices, while the second one uses the degree matrix of the input vertices. The concatenation of the outputs from both networks would allow for better capture of the sequentiality in the graphs analyzed.
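One possible reading of the two branches is sketched below. It assumes that the first branch divides each row of $\hat{A}$ by the vertex out-degree and that the second does the same with the transposed matrix, so that it averages over predecessors. The text does not fix these details, so the function names and the choice of transposition are assumptions, and the fully connected layer is omitted.

```python
def normalize_rows(M):
    """Divide every row by its sum (self-loops keep the sums non-zero)."""
    return [[v / sum(row) for v in row] for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def dual_branch_features(A_hat, X):
    """Propagate X along outgoing and incoming arcs and concatenate per vertex."""
    S_out = normalize_rows(A_hat)              # normalized by out-degree
    S_in = normalize_rows(transpose(A_hat))    # normalized by in-degree (assumed form)
    H_out = matmul(S_out, X)
    H_in = matmul(S_in, X)
    return [ho + hi for ho, hi in zip(H_out, H_in)]


A_hat = [[1, 1, 1, 0],
         [0, 1, 1, 0],
         [1, 0, 1, 1],
         [0, 0, 0, 1]]                          # toy graph with self-loops
X = [[1 if i == j else 0 for j in range(4)] for i in range(4)]   # one-hot features
features = dual_branch_features(A_hat, X)
print(features[3])
```

In this toy graph, vertex 3 has no successors, so its outgoing branch sees only the vertex itself, while its incoming branch also sees vertex 2. This is the kind of directional information the concatenation is meant to retain.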

2.3.4. Assumptions

The aforementioned potential improvements were implemented based on the codebase included in [4]. The functions written by the authors of [4] for preprocessing the data and splitting the dataset were utilized, as were the definitions of neural network models that we extensively enhanced for Improvement 1. The functions for normalizing the adjacency matrices were rewritten from scratch, and the aspect of the multi-layer architecture was introduced into the GCN from Improvement 1 and the GAT from Improvement 2. For each of the improvements, a separate program was written, and distinct models were defined.

The improvements were implemented in Python 3.10.10 with the PyTorch 2.1 library, which was used to run the graph models on a local GPU through NVIDIA CUDA, a toolkit for parallelizing tensor computations (hardware: NVIDIA GeForce GTX 1060 GPU, 8 × Intel Core i7-6700HQ CPU @ 2.60 GHz, 16 GB RAM; NVIDIA, Santa Clara, CA, USA).

The dataset [35], created by the authors of [4], was used. It consists of 44,000 programs recorded as sequences of indexed function calls from the Windows API. The data also contain binary values of 1 or 0, corresponding to a malware or goodware classification for a given program. Each graph in the dataset corresponds to a previously decompiled program. Each data row is a sequence of 307 identifiers, in which each column represents one of 100 low-level system functions. Every identifier is treated as a node in the final graph. The function identifiers are ordered in a fixed direction that corresponds to the execution of the given decompiled program, which results in a directed graph. If a program contains two immediately consecutive instructions, this is reflected in the graph as an edge between the nodes that these functions (OS instructions) represent. Programs typically contain loops and if-statements, which correspond in the graph to cycles and to multiple outgoing edges from one node.

An example visualization of a single row from the training data is shown in Figure 2.

The GridSearchCV mechanism from the scikit-learn library was used to evaluate the implemented models. It can perform model training for each value from the defined hyperparameter space (Table 1) and then return the model with the highest accuracy.

Table 1 presents the set of basic configuration hyperparameters: number of epochs, batch size, dropout, and the second dimension of the weight matrix. A fixed vector of values was tested for each of these four parameters. Preprocessing was performed, based on which twenty-four (2 × 2 × 2 × 3) combinations of the four parameter values were tested. The best set of parameters was selected: a batch size of 32, 20 epochs, a dropout of 0.1, and a second dimension of the weight matrix equal to 31. For these hyperparameter values, the characteristics were calculated for all layers.
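The exhaustive search over 2 × 2 × 2 × 3 combinations can be sketched in plain Python as a stand-in for what GridSearchCV does, without its cross-validation. The grid values below are placeholders chosen for illustration, not the contents of Table 1, and it is an assumption that the weight matrix dimension is the parameter with three candidate values.

```python
from itertools import product

# Hypothetical search space: placeholder values, not the contents of Table 1.
grid = {
    "epochs": [20, 30],
    "batch_size": [32, 64],
    "dropout": [0.1, 0.6],
    "hidden_dim": [31, 62, 124],
}


def all_configs(grid):
    """Enumerate every combination of hyperparameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def best_config(configs, score_fn):
    """Return the configuration for which score_fn gives the highest value."""
    return max(configs, key=score_fn)


configs = all_configs(grid)
print(len(configs))   # number of combinations to train and compare
```

In an actual search, `score_fn` would train a model with the given configuration and return its validation accuracy.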

The set chosen for the n-layer GCN improvements was also used in the original GCN model to check whether any improvements could be attributed to architectural changes.

3. Results

All implemented models were evaluated using the typical key metrics for binary classifiers: accuracy, precision, sensitivity, F1 score, and ROC AUC. The values are presented in Table 2 and Table 3. In brief, accuracy (Acc) is the ratio of the number of correctly classified samples (positive and negative) to the total number of samples; precision (Prec) is the ratio of the number of samples correctly classified as positive to the number of all samples classified as positive; sensitivity (Rec, also called recall) is the ratio of the number of correctly classified positive samples (true positives) to the sum of the number of true positives and the number of false negatives; the F1 score combines precision and sensitivity in a single value, which is the harmonic mean of these measures; and ROC AUC is the area under the ROC (Receiver Operating Characteristic) curve, which plots sensitivity against the false positive rate as the decision threshold varies.
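The definitions above translate directly into code. The following sketch is illustrative only: the function names and the counts and scores in the example are hypothetical, the denominators are assumed to be non-zero, and ROC AUC is computed through its equivalent interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(scores_pos, scores_neg):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and scores, used only to exercise the formulas.
m = binary_metrics(tp=90, fp=10, tn=85, fn=15)
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

print(m)
print(auc)
```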

Training was repeated 10 times for each implemented improvement and for the original model. Table 2 and Table 3 report, for each series of runs, the mean value of each metric, its standard deviation, and the 95% confidence interval for the mean.
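A per-series summary of this kind could be computed as sketched below. The source does not state how the confidence intervals were obtained, so the use of a Student-t interval (critical value 2.262 for 9 degrees of freedom) is an assumption, and the accuracy values in `runs` are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (n = 10 runs).
# Using the t distribution here is an assumption, not something the text states.
T_CRIT_95_DF9 = 2.262


def summarize_runs(values, t_crit=T_CRIT_95_DF9):
    """Return (mean, sample std, 95% confidence interval for the mean)."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical accuracies from ten runs with different random seeds.
runs = [0.92, 0.93, 0.91, 0.92, 0.93, 0.92, 0.91, 0.93, 0.92, 0.92]
mean, std, (lower, upper) = summarize_runs(runs)

print(f"mean={mean:.3f} std={std:.3f} ci=[{lower:.3f}; {upper:.3f}]")
```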

In order to verify the effectiveness of the implemented improvements, the original model from [4] was also compiled locally and used in comparative analyses. The comparative metric values for this model are shown in the first row of Table 2 and Table 3.

The best results were achieved by the single-layer GCN model from Improvement 1. It has the highest value for every metric except precision and sensitivity, which are still quite high for this model. The five-layer model of the same improvement stands out with the highest precision, and the three-layer GCN with the highest sensitivity.
Some of the implemented models achieved better results than the original model. It should be noted, however, that the values for the original model were obtained from a local compilation of this model, which is why they differ from the accuracy values given by the authors in [4]. It is also important to note that both the original model and the n-layer GCN improvements were trained with the same set of hyperparameters (dropout = 0.1, max epochs = 20, batch size = 32, weights matrix size = 31). The set chosen for the n-layer GCN improvements was applied to the original GCN model (from [4]) to check whether any improvements could be attributed to architectural changes; because the training conditions were the same for all GCNs, the improvement can be attributed to changes in the models' architecture. It is worth mentioning that in the research conducted in [4], the authors used pre-selected hyperparameter values: a batch size of 32, 30 epochs, and a dropout of 0.6. The GAT and concurrent single-layer GCN models were built using separately chosen sets of hyperparameters because they differ substantially from the GCN models.

As for the role of the multi-layer GCN architecture, accuracy, precision, and recall drop slightly as the number of layers increases. This can be attributed to the directional, sequential nature of the graphs in the training data, in which feature transfer between nodes does not play a major role in graph classification (Figure 3, Figure 4 and Figure 5).

The computational complexity of the implemented models was estimated based on the calculations presented in [18,34], and the values are listed in Table 4.

Execution time measurements were taken for each of the trained models during the initialization, training, and evaluation phases. The results are presented in Table 5. It is important to note that the implementation and model-building processes were conducted on local machines with limited computational capacity. While the exact execution times are not the primary focus, the ratios between them are significant. It is evident that the original GCN model excels in overall speed for both training and evaluation. In comparison to the original model, multilayer GCNs and concurrent GCNs require approximately three times as long for training, while GAT models take about six times longer.
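The ratios quoted above can be reproduced from the mean training times in Table 5. The short calculation below copies a few of those means; the dictionary labels are informal names for the table rows.

```python
# Mean training times in seconds, copied from Table 5.
baseline = 17.262  # original GCN [4]
training_time = {
    "extended GCN, 1 layer": 46.543,
    "extended GCN, 5 layers": 60.333,
    "parallel GCNs, 2 x 1": 62.273,
    "GAT, 1 head": 97.060,
    "GAT, 2 heads": 105.626,
}

# Ratio of each model's training time to the original model's training time.
ratios = {name: round(t / baseline, 1) for name, t in training_time.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the original GCN")
```

For these rows, the multilayer and parallel GCN variants fall between roughly 2.7 and 3.6 times the baseline, and the GAT variants fall between roughly 5.6 and 6.1 times, consistent with the "about three times" and "about six times" figures in the text.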

A more extensive discussion of all the results presented is provided in the next section.

4. Discussion

Table 2 and Table 3 show that the best results were achieved by the single-layer GCN model from Improvement 1, which has the highest value for all metrics except precision and sensitivity; its values for these two metrics are nonetheless still quite high. The highest precision is exhibited by the five-layer model of the same improvement, and the highest sensitivity is achieved by the three-layer model. Some of the implemented models outperformed the original model. It should be noted, however, that the values for the original model were obtained from a local compilation of this model, which is why they differ from the accuracy values provided by the authors in [4].

The variation in performance metrics (accuracy, precision, recall, F1 score, and AUC) across different model configurations, as shown in Table 2 and Table 3, is commonly observed in graph neural network (GNN) research. These differences arise from the complex interplay between the network depth, the aggregation mechanism, and the specific characteristics of the dataset. Increasing the number of convolutional layers does not necessarily guarantee performance improvement—beyond a certain point, deeper GCN architectures may suffer from the over-smoothing phenomenon, where node features become indistinguishable after multiple message-passing steps, degrading classification accuracy [33,34]. In addition, deeper networks are more prone to overfitting, especially when trained on limited or noisy data, which can cause instability in metrics such as precision and recall.

The observed non-uniformity of metric values may also result from stochastic factors such as random weight initialization, batch selection, or variations in gradient propagation during training. Furthermore, GCN-based models are sensitive to hyperparameter settings (e.g., dropout rate, learning rate, normalization scheme), which can amplify differences between configurations. Overall, such variability is typical for deep learning models on graph-structured data and has been documented in previous studies (e.g., [1,33,34]) as an inherent aspect of balancing model complexity, generalization, and stability.

The plotted metrics (Figure 3, Figure 4 and Figure 5) as a function of the number of graph convolutional layers reveal a non-monotonic and partially divergent trend. Although all metrics exhibit an overall decreasing tendency with increasing depth, each reflects a different sensitivity to layer count. Precision dips at two and three layers, rises to its maximum at five layers (0.943 in Table 2), then decreases, and unexpectedly rises again at ten layers. This oscillation suggests that deeper models occasionally capture more distinctive malware-specific substructures, but at the cost of generalization stability. Accuracy remains approximately constant across the first five layers and then clearly drops for deeper architectures. This plateau followed by a decline indicates that the added propagation depth no longer contributes useful contextual information: node embeddings become increasingly homogenized due to over-smoothing, and classification boundaries blur. Recall (sensitivity) decreases only gradually with the number of layers, implying that the model retains its ability to detect most malicious samples even when deeper, though at the expense of more false positives (lower precision). Together, these patterns reflect the trade-off between local feature aggregation and over-smoothing predicted by the spectral theory of GCNs [36,37]. The local optimum at four to five layers corresponds to the point where contextual information is sufficiently propagated without causing feature convergence. Beyond this threshold, the growing receptive field and gradient attenuation jointly deteriorate discrimination ability. The minor rebound in precision at ten layers is most likely a stochastic artifact: random initialization and dropout effects occasionally emphasize narrow feature subspaces that still separate classes.

The non-monotonic trend observed in Figure 3, Figure 4 and Figure 5 can be better understood through the theoretical framework of feature convergence in deep graph convolutional networks. As the number of layers increases, repeated propagation of features via the normalized adjacency matrix $\tilde{A}$ effectively acts as a low-pass filter that smooths node representations. According to [36,37], this process causes node features to asymptotically converge toward the principal eigenvector of $\tilde{A}$, leading to loss of inter-class distinguishability (the over-smoothing phenomenon). Beyond the threshold of approximately five layers, gradient vanishing and over-smoothing effects dominate, consistent with theoretical predictions of spectral convergence and the observed performance degradation. These findings empirically validate the trade-off between expressive depth and feature stability predicted in GCN theory.
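The convergence toward the principal eigenvector can be observed on a toy graph. The sketch below is a simplified illustration, not the trained model: it assumes an undirected five-node path graph with self-loops and the symmetric normalization D^{-1/2}(A + I)D^{-1/2}, omits weights and non-linearities, and uses hypothetical feature values. For this normalization the principal eigenvector is proportional to the square roots of the node degrees, so the part of the feature vector orthogonal to it measures how much distinguishing information remains.

```python
import math


def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} as nested lists, plus the degrees."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loops
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    deg = [sum(row) for row in a]
    a_norm = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    return a_norm, deg


def propagate(a_norm, x):
    """One propagation step: multiply the feature vector by a_norm."""
    n = len(x)
    return [sum(a_norm[i][j] * x[j] for j in range(n)) for i in range(n)]


def residual_norm(x, deg):
    """Norm of the component of x orthogonal to the principal eigenvector."""
    u = [math.sqrt(d) for d in deg]
    norm_u = math.sqrt(sum(v * v for v in u))
    u = [v / norm_u for v in u]
    coeff = sum(xi * ui for xi, ui in zip(x, u))
    return math.sqrt(sum((xi - coeff * ui) ** 2 for xi, ui in zip(x, u)))


# Toy path graph 0-1-2-3-4 with one hypothetical scalar feature per node.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
a_norm, deg = normalized_adjacency(edges, 5)
x = [1.0, -2.0, 0.5, 3.0, -1.0]

residuals = []
for _ in range(10):  # ten propagation steps, mimicking ten layers
    x = propagate(a_norm, x)
    residuals.append(residual_norm(x, deg))

print([round(r, 4) for r in residuals])
```

The printed residuals shrink with every step, which is the over-smoothing effect in its simplest form: after enough propagation steps the features are determined almost entirely by node degree rather than by the initial values.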

Further improvements to the methods described (Table 4) may include adding attributes to the graphs of function call sequences, such as data on the identifier of the process calling the function, the arguments and result of the function call, and the time of the call.

Training the proposed GCN variants takes roughly 45 to 62 s on a single GPU (NVIDIA GeForce GTX 1060), and evaluation takes about 1 s (Table 5), suggesting that near real-time operation is feasible. For online malware detection tasks (1–10 Hz inference), the proposed approach can achieve practical latency when deployed on CUDA-capable hardware via optimized inference engines such as ONNX Runtime or TensorRT. Multi-GPU parallelization and weight quantization could further enhance throughput while maintaining model accuracy.

The analogy between transport-network resilience and cyber-network resilience is compelling: in [32] the network structure, propagation of faults and recovery capacity constitute the resilience cycle. In our work, the detection of malware via graph neural networks can be seen as the ‘early warning’ and forecasting component of such a cycle. This suggests that beyond detection, future work could extend our method into a full resilience framework (detection → reaction → recovery → forecasting) for cyber-networks, mirroring the transport network model. Additionally, the forecasting of behavior/trends (analogous to the LSTM forecasting of resilience in [32]) could be incorporated in our pipeline, enabling proactive rather than reactive cybersecurity interventions.

It should be noted that the original experimental setup did not preserve model probability outputs, which are required to compute the PR-AUC and calibration metrics. However, since the F1-score inherently balances Precision and Recall, the observed improvements in F1 across all enhanced GCN variants suggest a consistent enhancement in the Precision–Recall trade-off. Future work will extend the evaluation to include PR-AUC, confusion matrices, and reliability analyses once probabilistic outputs are retained.

The present study concentrated on improving the internal design of the GCN-based behavioral malware detector rather than comparing it with fundamentally different model families. Nevertheless, models such as BiLSTM, 1D-CNN, and Transformer-based architectures could serve as valuable baselines for future research, especially for sequential API-call data. Similarly, advanced GNN variants such as GraphSAGE or residual GCNs could further enhance feature propagation and gradient stability. Future work will extend the current analysis to include these models, enabling a comprehensive performance comparison across both intra-GCN and cross-architecture perspectives.

In this study, robustness was interpreted as the stability of classification metrics across multiple runs and GCN configurations. Nevertheless, a more rigorous notion of robustness—resistance to perturbations in behavioral API-call sequences—remains an important direction for further work. Future experiments will therefore introduce controlled perturbations (inserting benign no-ops, slight reordering, and wrapper/alias substitutions) and measure corresponding performance degradation curves. Such analysis will enable a quantitative evaluation of resilience to realistic behavioral noise, complementing the structural robustness reported in this paper.
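Two of the perturbations mentioned above, no-op insertion and slight reordering, might be generated as sketched below. This is a hypothetical illustration of the planned experiments rather than an existing part of the method: the function names, the `noop_id` value, and the perturbation rates are all assumptions.

```python
import random


def insert_noops(calls, noop_id, rate, rng):
    """Insert `noop_id` after each call with probability `rate`."""
    out = []
    for c in calls:
        out.append(c)
        if rng.random() < rate:
            out.append(noop_id)
    return out


def swap_adjacent(calls, rate, rng):
    """Swap neighbouring calls with probability `rate` (slight reordering)."""
    out = list(calls)
    i = 0
    while i < len(out) - 1:
        if rng.random() < rate:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the swapped pair so a call moves at most one position
        else:
            i += 1
    return out


# Hypothetical sequence; 999 stands in for a benign no-op identifier.
rng = random.Random(0)
calls = [5, 12, 7, 12, 7, 7, 3]
with_noops = insert_noops(calls, noop_id=999, rate=0.3, rng=rng)
reordered = swap_adjacent(calls, rate=0.3, rng=rng)

print(with_noops)
print(reordered)
```

Sweeping `rate` from 0 upward and re-evaluating the trained model on the perturbed sequences would give one point per rate on a performance degradation curve.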

5. Conclusions

This study proposed and experimentally verified several improvements to a behavioral malware detection method based on graph convolutional networks (GCNs). The results confirmed that enhanced normalization of the adjacency matrix and moderate network depth can improve model performance compared to the original approach. Among all tested configurations, the single-layer extended GCN achieved the highest accuracy and the best overall quality metrics. However, the results also demonstrate that increasing the number of convolutional layers does not monotonically enhance performance due to well-known effects such as over-smoothing, overfitting, and training instability, which are typical of deep GNN architectures.

Future research should focus on further stabilizing the learning process, optimizing hyperparameters using automated search techniques (e.g., Bayesian optimization), developing hybrid architectures that combine convolutional and attention-based mechanisms (GCN–GAT), and implementing the improvements indicated at the end of the Discussion section. Additionally, expanding the dataset with richer contextual information, such as function call parameters, timing, and process metadata, could improve the generalization and robustness of malware detection models. Inspired by the modeling paradigm of Hong et al. [32], future research could integrate a forecasting module (e.g., LSTM or other temporal models) on top of our graph-based detector to predict not only present malicious behavior but also future propagation risk. This would enable a resilience-oriented architecture for cyber-defense: forecasting → detection → mitigation → recovery.

Supplementary Materials

The following supporting information can be downloaded at https://1drv.ms/f/c/e00eabb71ae12756/EhtRuoaW9ihJsF202o7PNCIBB5EiZdqUpE7XzZsOrJVC-Q?e=o3oQHE (accessed on 29 October 2025): source codes (scripts) in Python of presented method modifications, link to source data, dynamic API call sequence per malware (csv), results, and output files.

Author Contributions

Z.T. conceptualized the idea behind this research, performed the formal analysis and investigation, conducted a discussion of the research results and formulated the conclusions, reviewed malware detection methods using graph machine learning, reviewed the draft, and supervised the project. J.R. adapted the method from [4] to this specific application, modified the software for malware detection, performed the experiments (collected test data and adjusted and converted them to the format required by the project), collected the results, and drafted the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by Grant UGB 4720500000-000023 W500-22-MPK W511000 from the Faculty of Cybernetics at the Military University of Technology in Warsaw (WAT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the Supplementary Materials. In the file README.txt (link to the directory in the Supplementary Materials) we specify which scripts correspond to each stage of the workflow.

Acknowledgments

The authors acknowledge the Military University of Technology in Warsaw for partially funding the research, IEEEDataPort for Malware Analysis Datasets: API Call Sequences (https://ieee-dataport.org/open-access/malware-analysis-datasets-api-call-sequences (accessed on 15 September 2025)), and GitHub for the original GCN script from [4] (https://github.com/gptcod/behavioral_malware_detection_dgcnn/blob/master/Model-1_Balanced_Dataset.ipynb (accessed on 15 September 2025)).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learning Syst. 2021, 32, 4–24.
2. Liu, Z.; Chen, C.; Yang, X.; Zhou, J.; Li, X.; Song, L. Heterogeneous Graph Neural Networks for Malicious Account Detection. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; ACM: New York, NY, USA, 2018; pp. 2077–2085.
3. Wang, J.; Wen, R.; Wu, C.; Huang, Y.; Xiong, J. FdGars: Fraudster Detection via Graph Convolutional Networks in Online App Review System. In Proceedings of the Companion Proceedings of the 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 310–316.
4. Schranko De Oliveira, A.; Sassi, R.J. Behavioral Malware Detection Using Deep Graph Convolutional Neural Networks. TechRxiv 2019.
5. El-Gayar, M.M.; Abouhawwash, M.; Askar, S.S.; Sweidan, S. A Novel Approach for Detecting Deep Fake Videos Using Graph Neural Network. J. Big Data 2024, 11, 22.
6. Dou, Y.; Liu, Z.; Sun, L.; Deng, Y.; Peng, H.; Yu, P.S. Enhancing Graph Neural Network-Based Fraud Detectors against Camouflaged Fraudsters. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; ACM: New York, NY, USA, 2020; pp. 315–324.
7. Liu, Z.; Dou, Y.; Yu, P.S.; Deng, Y.; Peng, H. Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; ACM: New York, NY, USA, 2020; pp. 1569–1572.
8. Cao, S.; Sun, X.; Bo, L.; Wei, Y.; Li, B. BGNN4VD: Constructing Bidirectional Graph Neural-Network for Vulnerability Detection. Inf. Softw. Technol. 2021, 136, 106576.
9. Cheng, X.; Wang, H.; Hua, J.; Xu, G.; Sui, Y. DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network. ACM Trans. Softw. Eng. Methodol. 2021, 30, 1–33.
10. Zhou, Y.; Liu, S.; Siow, J.; Du, X.; Liu, Y. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. arXiv 2019, arXiv:1909.03496.
11. Song, W.; Yin, H.; Liu, C.; Song, D. DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; ACM: New York, NY, USA, 2018; pp. 606–618.
12. Jafari, O.; Maurya, P.; Nagarkar, P.; Islam, K.M.; Crushev, C. A Survey on Locality Sensitive Hashing Algorithms and Their Applications. arXiv 2021, arXiv:2102.08942.
13. Li, Y.; Gu, C.; Dullien, T.; Vinyals, O.; Kohli, P. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. arXiv 2019, arXiv:1904.12787.
14. Xu, X.; Liu, C.; Feng, Q.; Yin, H.; Song, L.; Song, D. Neural Network-Based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; ACM: New York, NY, USA, 2017; pp. 363–376.
15. Mary, A.; Edison, A. Deep Fake Detection Using Deep Learning Techniques: A Literature Review. In Proceedings of the 2023 International Conference on Control, Communication and Computing (ICCC), Thiruvananthapuram, India, 19–21 May 2023; pp. 1–6.
16. Rana, M.S.; Nobi, M.N.; Murali, B.; Sung, A.H. Deepfake Detection: A Systematic Literature Review. IEEE Access 2022, 10, 25494–25513.
17. Hassen, M.; Chan, P. Scalable Function Call Graph-Based Malware Classification. In Proceedings of the Seventh ACM Conference on Data and Application Security and Privacy, Scottsdale, AZ, USA, 22–24 March 2017; pp. 239–248.
18. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903.
19. Hu, X.; Chiueh, T.; Shin, K. Large-Scale Malware Indexing Using Function-Call Graphs. In Proceedings of the 16th ACM Conference on Computer and Communications Security 2009, Chicago, IL, USA, 9–13 November 2009; pp. 611–620.
20. Gao, X.; Xiao, B.; Tao, D.; Li, X. A Survey of Graph Edit Distance. Pattern Anal. Appl. 2010, 13, 113–129.
21. Xu, M.; Wu, L.; Xi, S.; Xu, J.; Zhang, H.; Ren, Y.; Zheng, N. A Similarity Metric Method of Obfuscated Malware Using Function-Call Graph. J. Comput. Virol. Hacking Tech. 2013, 9, 35–47.
22. Nikolopoulos, S.; Polenakis, I. A Graph-Based Model for Malware Detection and Classification Using System-Call Groups. J. Comput. Virol. Hacking Tech. 2017, 13, 29–46.
23. Jiang, H.; Turki, T.; Wang, J. Malware Detection Using Deep Learning and Graph Embedding. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018.
24. Grover, A.; Leskovec, J. Node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
25. Busch, J.; Kocheturov, A.; Tresp, V.; Seidl, T. NF-GNN: Network Flow Graph Neural Networks for Malware Detection and Classification. In Proceedings of the 33rd International Conference on Scientific and Statistical Database Management, Tampa, FL, USA, 6–7 July 2021.
26. Chen, S.; Lang, B.; Liu, H.; Chen, Y.; Song, Y. Android Malware Detection Method Based on Graph Attention Networks and Deep Fusion of Multimodal Features. Expert Syst. Appl. 2024, 237, 121617.
27. Nikolopoulos, S.D.; Polenakis, I. Behavior-Based Detection and Classification of Malicious Software Utilizing Structural Characteristics of Group Sequence Graphs. J. Comput. Virol. Hack. Tech. 2022, 18, 383–406.
28. Shokouhinejad, H.; Higgins, G.; Razavi-Far, R.; Mohammadian, H.; Ghorbani, A.A. On the Consistency of GNN Explanations for Malware Detection. Inf. Sci. 2025, 721, 122603.
29. Wu, Y.; Shi, J.; Wang, P.; Zeng, D.; Sun, C. DeepCatra: Learning Flow- and Graph-Based Behaviours for Android Malware Detection. IET Inf. Secur. 2023, 17, 118–130.
30. Zhang, Z.; Li, Y.; Wang, W.; Song, H.; Dong, H. Malware Detection with Dynamic Evolving Graph Convolutional Networks. Int. J. Intell. Syst. 2022, 37, 7261–7280.
31. Zhen, Y.; Tian, D.; Fu, X.; Hu, C. A Novel Malware Detection Method Based on Audit Logs and Graph Neural Network. Eng. Appl. Artif. Intell. 2025, 152, 110524.
32. Hong, S.; Yue, T.; You, Y.; Lv, Z.; Tang, X.; Hu, J.; Yin, H. A Resilience Recovery Method for Complex Traffic Network Security Based on Trend Forecasting. Int. J. Intell. Syst. 2025, 2025, 3715086.
33. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
34. Wu, F.; Zhang, T.; de Souza, A.H., Jr.; Fifty, C.; Yu, T.; Weinberger, K.Q. Simplifying Graph Convolutional Networks. arXiv 2019, arXiv:1902.07153.
35. Oliveira, A. Malware Analysis Datasets: API Call Sequences. TechRxiv 2019. Available online: https://www.kaggle.com/datasets/ang3loliveira/malware-analysis-datasets-api-call-sequences (accessed on 29 October 2025).
36. Li, Q.; Han, Z.; Wu, X.-M. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. arXiv 2018, arXiv:1801.07606.
37. Oono, K.; Suzuki, T. Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. arXiv 2021, arXiv:1905.10947.

Figure 1. Construction of a behavior graph based on a sequence of API function calls of the operating system. 0*, …, 3* represent the high-level features grouped by the natural order of the nodes. Source: [4].

Figure 2. Windows API call sequence graph that represents malicious software. Source: Own study.

Figure 3. The dependence of improved GCN’s Accuracy on the number of convolution layers. Dataset: behavioral API-call graphs (n = 44,000 samples). Hardware: GPU NVIDIA GeForce GTX 1060, processor 8 × Intel Core i7-6700HQ CPU @ 2.60 GHz, RAM 16 GB. Software: Python 3.10/PyTorch 2.1. Key hyperparameters: dropout = 0.1, max epochs = 20, batch size = 32, weights matrix size = 31.

Figure 4. The dependence of improved GCN’s Recall on the number of convolution layers. Dataset: behavioral API-call graphs (n = 44,000 samples). Hardware: GPU NVIDIA GeForce GTX 1060, processor 8 × Intel Core i7-6700HQ CPU @ 2.60 GHz, RAM 16 GB. Software: Python 3.10/PyTorch 2.1. Key hyperparameters: dropout = 0.1, max epochs = 20, batch size = 32, weights matrix size = 31.

Figure 5. The dependence of improved GCN’s Precision on the number of convolution layers. Dataset: behavioral API-call graphs (n = 44,000 samples). Hardware: GPU NVIDIA GeForce GTX 1060, processor 8 × Intel Core i7-6700HQ CPU @ 2.60 GHz, RAM 16 GB. Software: Python 3.10/PyTorch 2.1. Key hyperparameters: dropout = 0.1, max epochs = 20, batch size = 32, weights matrix size = 31.

Table 1. Summary of hyperparameter value spaces for each of the implemented models.

Hyperparameter	Improvement 1 (Extended GCN Model)	Improvement 2 (GAT Model)	Improvement 3 (Parallel GCN Model)
W matrix size	[31, 62]	[31, 62]	[31, 62]
dropout rate	[0.1, 0.4, 0.6]	[0.1, 0.4, 0.6]	[0.4, 0.6]
batch size	[32, 64]	[32, 64]	[32, 64]
number of epochs	[20, 30]	[30, 40]	[30, 40]
GCN layers/GAT heads	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]	[1, 2]	[1]

Table 2. Evaluation metric values for Accuracy, Precision and Recall, highlighting the best (in green) and worst (in red) models for each metric. Each model configuration was run ten times with different random seeds to evaluate performance stability; confidence intervals for the mean were estimated at the 95% confidence level.

Type of Model	No. Layers/Attention Heads	Accuracy Mean	Accuracy Std	Accuracy 95% Confid. Interval for the Mean	Precision Mean	Precision Std	Precision 95% Confid. Interval for the Mean	Recall Mean	Recall Std	Recall 95% Confid. Interval for the Mean
GCN [4]	1	0.922	0.005	[0.918; 0.925]	0.935	0.006	[0.930; 0.939]	0.908	0.009	[0.901; 0.914]
Improvement 1. (extended GCN model)	1	0.935	0.005	[0.931; 0.938]	0.927	0.011	[0.919; 0.934]	0.942	0.005	[0.938; 0.946]
	2	0.925	0.003	[0.923; 0.927]	0.914	0.005	[0.910; 0.918]	0.937	0.004	[0.934; 0.940]
	3	0.929	0.003	[0.927; 0.931]	0.915	0.006	[0.910; 0.919]	0.945	0.002	[0.943; 0.946]
	4	0.932	0.006	[0.928; 0.937]	0.934	0.012	[0.926; 0.943]	0.928	0.008	[0.922; 0.933]
	5	0.931	0.004	[0.928; 0.933]	0.943	0.007	[0.939; 0.948]	0.914	0.005	[0.910; 0.917]
	6	0.923	0.004	[0.920; 0.926]	0.926	0.009	[0.920; 0.932]	0.918	0.003	[0.915; 0.920]
	7	0.908	0.006	[0.904; 0.912]	0.901	0.011	[0.893; 0.909]	0.914	0.004	[0.911; 0.917]
	8	0.891	0.014	[0.881; 0.901]	0.867	0.025	[0.849; 0.884]	0.921	0.006	[0.917; 0.925]
	9	0.877	0.018	[0.864; 0.890]	0.848	0.037	[0.822; 0.874]	0.917	0.016	[0.905; 0.928]
	10	0.874	0.014	[0.865; 0.884]	0.859	0.036	[0.833; 0.884]	0.895	0.023	[0.879; 0.911]
Improvement 2. (GAT model)	1	0.878	0.009	[0.872; 0.885]	0.865	0.015	[0.854; 0.876]	0.892	0.012	[0.884; 0.901]
	2	0.780	0.140	[0.679; 0.880]	0.674	0.338	[0.432; 0.916]	0.679	0.344	[0.432; 0.925]
Improvement 3. (parallel GCNs model)	2 × 1	0.913	0.011	[0.905; 0.921]	0.910	0.018	[0.897; 0.923]	0.919	0.016	[0.907; 0.930]

Table 3. Evaluation metric values for F1 and ROC AUC, highlighting the best (in green) and worst (in red) models for each metric. Each model configuration was run ten times with different random seeds to evaluate performance stability; confidence intervals for the mean were estimated at the 95% confidence level.

Type of Model	No. Layers/Attention Heads	F1 Mean	F1 Std	F1 95% Confid. Interval for the Mean	ROC AUC Mean	ROC AUC Std	ROC AUC 95% Confid. Interval for the Mean
GCN [4]	1	0.921	0.005	[0.917; 0.925]	0.971	0.002	[0.970; 0.972]
Improvement 1. (extended GCN model)	1	0.934	0.004	[0.931; 0.937]	0.981	0.002	[0.980; 0.982]
	2	0.925	0.003	[0.923; 0.927]	0.978	0.002	[0.977; 0.979]
	3	0.930	0.002	[0.928; 0.931]	0.978	0.001	[0.977; 0.978]
	4	0.931	0.006	[0.927; 0.935]	0.973	0.002	[0.972; 0.974]
	5	0.928	0.004	[0.925; 0.931]	0.965	0.001	[0.964; 0.966]
	6	0.922	0.004	[0.919; 0.925]	0.962	0.001	[0.961; 0.963]
	7	0.907	0.006	[0.903; 0.911]	0.951	0.002	[0.950; 0.953]
	8	0.893	0.012	[0.884; 0.901]	0.952	0.001	[0.951; 0.953]
	9	0.880	0.015	[0.870; 0.891]	0.948	0.002	[0.947; 0.950]
	10	0.876	0.010	[0.869; 0.882]	0.943	0.002	[0.942; 0.944]
Improvement 2. (GAT model)	1	0.878	0.008	[0.872; 0.884]	0.938	0.008	[0.932; 0.943]
	2	0.676	0.340	[0.432; 0.919]	0.836	0.170	[0.714; 0.957]
Improvement 3. (parallel GCNs model)	2 × 1	0.914	0.011	[0.906; 0.922]	0.954	0.005	[0.951; 0.958]

Table 4. Computational complexity summary for each of the methods described. |E| is the cardinality of the set of edges in the graph, and |V| is the cardinality of the set of vertices. F′ is the number of classes in the classification; for the analyzed binary classification problem, F′ = 2. F is the size of the feature vector.

Model	Estimated Computational Complexity
GCN [4]	$O(|E| F')$
Modification 1 (modified GCN model)	$O(|E| F')$
Modification 2 (GAT model)	$O(|V| F F' + |E| F')$ [34]
Modification 3 (parallel GCN model)	$O(|E| F')$

Table 5. Execution times, in seconds, of each phase of the model-building algorithm for each model. Hardware: GPU NVIDIA GeForce GTX 1060, processor 8 × Intel Core i7-6700HQ CPU @ 2.60 GHz, RAM 16 GB. Software: Python 3.10/PyTorch 2.1. Key hyperparameters: dropout = 0.1, max epochs = 20, batch size = 32, weights matrix size = 31.

Type of Model	No. Layers/Attention Heads	Initialization Mean (s)	Initialization Std (s)	Training Mean (s)	Training Std (s)	Evaluation Mean (s)	Evaluation Std (s)
GCN [4]	1	0.0005	0.0005	17.262	1.209	0.317	0.050
Improvement 1. (extended GCN model)	1	0.0002	0.0003	46.543	5.514	0.846	0.147
	2	0.0000	0.0000	44.526	2.655	0.876	0.199
	3	0.0002	0.0004	54.217	5.478	1.049	0.131
	4	0.0017	0.0047	58.483	0.961	1.123	0.031
	5	0.0017	0.0047	60.333	0.240	1.158	0.041
	6	0.0000	0.0000	61.409	0.720	1.171	0.033
	7	0.0002	0.0004	61.709	3.042	1.181	0.118
	8	0.0002	0.0004	51.027	2.634	1.070	0.142
	9	0.0005	0.0005	52.283	4.048	1.019	0.121
	10	0.0007	0.0005	60.504	10.117	1.252	0.250
Improvement 2. (GAT model)	1	0.0001	0.0003	97.060	1.101	1.344	0.078
	2	0.0004	0.0005	105.626	6.163	1.523	0.093
Improvement 3. (parallel GCNs model)	2 × 1	0.0003	0.0005	62.273	1.235	1.202	0.038

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

