LLM-SGCF: A Robust Malware Detection Framework with Spatially Guided Convolution

Zhao, Lina; Huang, Hua; Li, Ning; Wang, Yunxiao; Li, Ming

doi:10.3390/computers15060329


by Lina Zhao, Hua Huang *, Ning Li, Yunxiao Wang and Ming Li

Information and Telecommunications Company, State Grid Shandong Electric Power Company, Jinan 250000, China

* Author to whom correspondence should be addressed.

Computers 2026, 15(6), 329; https://doi.org/10.3390/computers15060329

Submission received: 15 April 2026 / Revised: 15 May 2026 / Accepted: 17 May 2026 / Published: 22 May 2026

(This article belongs to the Special Issue AI-Powered IoT (AIoT) Systems: Advancements in Security, Sustainability, and Intelligence)


Abstract

With the rapid evolution of cyberattack techniques, identifying dynamic behavioral intents from Application Programming Interface (API) call sequences has become a fundamental modality for ensuring reliable malware detection and information security. However, existing detection methods face the dual challenges of semantic sparsity and inadequate spatial dependency modeling when processing these sequences, which fundamentally undermines their stability against complex structural variations and in-the-wild evasive patterns. To address these critical vulnerabilities, we propose LLM-SGCF, a highly effective malware detection framework that jointly models deep behavioral semantics and spatial structures. Specifically, our framework leverages a generative Large Language Model to transform sparse API calls into rich behavioral descriptions, which are subsequently encoded by BERT into contextualized semantic representations. Concurrently, it employs a novel Spatially Guided Convolution (SGC) module to localize critical malicious segments and extract cross-position dependencies in a two-dimensional semantic space. Extensive experiments on the public Aliyun and Catak datasets demonstrate that LLM-SGCF exhibits exceptional resilience to real-world structural complexity and significantly outperforms state-of-the-art baselines, achieving a peak binary-classification accuracy of 95.82%. Further ablation analyses confirm that the synergistic fusion of semantic enhancement driven by Large Language Models and spatial structural modeling dramatically improves the resilience of the framework against complex attack chains, providing a highly reliable paradigm for next-generation malware recognition systems.

Keywords:

large language model; malicious code detection; spatially guided convolution

1. Introduction

Driven by the continuous expansion of cyberspace and increasingly sophisticated attack techniques, the proliferation and complexity of malware have surged, posing severe threats to information security [1,2]. Contemporary malware frequently exhibits structural polymorphism and evasive behaviors, leveraging automated mechanisms to dynamically mutate in both structure and behavior. These characteristics significantly undermine the efficacy of traditional detection paradigms that rely on signature matching or static features [3,4]. Consequently, it is imperative to develop intelligent detection models capable of understanding deep behavioral semantics and exhibiting stable adaptability against complex structural variations.

Application Programming Interface (API) call sequences have emerged as a fundamental data modality for capturing the dynamic behavioral intent of executing programs in malware analysis. Consequently, deep learning paradigms, notably Convolutional Neural Networks [5], Recurrent Neural Networks [6], and Transformers [7], have become the prevailing approaches for end-to-end malware identification. This prominence stems from their ability to automatically extract high-dimensional feature representations from raw API sequences. By mining latent behavioral patterns within these call chains, these models accurately distinguish between benign and malicious activities in standard scenarios.

Despite significant progress in detection performance, existing API sequence-based methods [8,9,10] remain vulnerable to complex modern malware. Evasive tactics, such as structural manipulation and the use of infrequent API combinations, pose challenges that traditional models struggle to withstand [11,12]. Based on [13,14], we identify three primary limitations of current methodologies. First, existing input representations are typically limited to symbolic tokens, failing to capture the rich behavioral semantics inherent in API calls. This “semantic sparsity” often leads to misclassifications when models encounter structurally mutated or in-the-wild evasive variants. Second, conventional convolutional and sequential models tend to prioritize local pattern extraction while inadequately modeling long-range dependencies and global spatial structures along the call chain. This deficiency fundamentally limits their stability against complex attack chains. Third, although Large Language Models have been introduced for semantic extraction, current approaches lack systematic solutions to jointly model deep semantic features and spatial structural dependencies within an end-to-end framework.

To address these limitations, we propose LLM-SGCF, a highly effective malware detection framework that jointly models semantic completion and spatial structures to resiliently identify complex malicious behaviors. First, to overcome semantic sparsity, LLM-SGCF employs a generative Large Language Model to produce interpretable behavioral descriptions for APIs and utilizes BERT to encode these descriptions into contextual semantic vectors, thereby establishing a rich semantic representation of the call chain. Second, to capture structural dependencies, we design a novel Spatially Guided Convolution (SGC) module that reshapes sequence semantics into a two-dimensional feature map spanning semantic channels and sequence positions. This module leverages geometric attention to localize critical segments and applies depthwise separable convolutions, coupled with a multiplicative interaction mechanism, to strengthen the representation of key behavioral regions. Inherently adaptive and semantically deep, the framework further utilizes multi-scale convolutions to model behavioral subchains of varying lengths, yielding a robust, unified representation that integrates local evidence with global contexts. Extensive evaluations on the Aliyun [15] and Catak [16] datasets demonstrate that LLM-SGCF achieves superior accuracy and resilience to real-world structural complexity compared with mainstream baselines, confirming the synergistic advantages of our semantic enhancement and spatial fusion strategies.

Fundamentally, the critical role of multi-dimensional feature fusion in sequence analysis lies in its ability to construct a holistic and obfuscation-resistant representation of malware behavior. In real-world scenarios, advanced malware frequently employs evasion techniques such as API substitution to alter underlying semantics or dummy API insertion to manipulate structural sequences. Relying solely on a single dimension is inherently vulnerable: purely semantic analysis may fail to recognize the malicious attack chain if the execution order is intentionally disrupted, while purely structural analysis struggles to differentiate between benign and malicious operations that share similar topological shapes. By fusing multi-dimensional features, specifically the deep contextual semantics extracted by the LLM to identify the execution intent and the spatial structural dependencies captured by the SGC module to pinpoint critical action locations, the framework achieves mutually reinforcing verification. This fusion empowers the model to effectively penetrate structural noise and semantic sparsity, ensuring highly robust and accurate anomaly identification even when dealing with complex, evasive malware sequences.

In summary, our main contributions are as follows:

We propose an LLM-based semantic completion strategy for API behaviors. By generating interpretable behavioral descriptions and encoding them into vector representations, this approach translates sparse symbolic API sequences into rich contextual semantic features, thereby substantially enhancing the model’s robustness against code obfuscation and variants at the input level.
We design a novel Spatially Guided Convolution (SGC) module to extract complex structural dependencies. By projecting sequence semantics into a two-dimensional feature space, this module leverages attention guidance to localize critical malicious segments and employs convolutional modeling to capture cross-position dependencies, effectively strengthening the representation of key call intervals.
We comprehensively validate the proposed framework through extensive experiments on multiple benchmark datasets. Empirical results demonstrate that our method achieves peak accuracy rates of 84.88%, 95.82%, and 63.15% on the Aliyun multi-class, Aliyun binary, and Catak multi-class tasks, respectively, substantially outperforming state-of-the-art baselines and exhibiting superior robustness.

The remainder of this paper is structured as follows: Section 2 reviews the related literature on malware detection and highlights the limitations of current methodologies. Section 3 details the proposed LLM-SGCF framework, elaborating on the LLM-driven semantic enhancement and the Spatially Guided Convolution module. Section 4 presents the experimental setup, comparative evaluations, and ablation studies to validate the framework’s effectiveness and robustness. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Traditional Malicious Code Detection Methods

Early malicious code detection technologies mainly relied on feature engineering and manual rules. Common methods include signature-based detection [17], heuristic detection [18], and static feature-based pattern matching [19]. Signature detection achieves rapid identification by comparing samples with known malicious code feature databases (such as byte patterns, hash values, opcode sequences, etc.), but it cannot detect unknown or variant samples [20]. To address the issues of feature diversification and packing obfuscation, researchers have proposed heuristic and behavior analysis methods. Static features such as PE file structure, API call frequency, and system registry modification count are used to build feature vectors, and thresholds or rules are applied to determine malicious behavior [21]. However, these methods have obvious limitations: strong feature dependency and high cost of updating feature libraries; detection accuracy significantly drops when facing code obfuscation, polymorphism, or automatically generated samples; lack of self-learning ability and semantic understanding, making it difficult to capture the contextual relationships between API calls. As a result, traditional detection methods have gradually been replaced by data-driven machine learning and deep learning models.

2.2. Deep Learning-Based Malicious Code Detection Methods

For static feature-based detection methods, some studies transform the binary files of malicious code into images or vectors and use Convolutional Neural Networks to extract local features and perform classification. Executable file byte streams are mapped to gray-scale images, and local byte distribution features are extracted through multiple layers of convolution to achieve malicious sample recognition [22]. However, such methods perform poorly when samples are encrypted, compressed, or packed, and they struggle to express semantic relationships. For dynamic behavior sequence-based detection methods, other research utilizes sandbox technology to collect the running logs or API call sequences of malicious programs and models their temporal dependencies using Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, or Bidirectional LSTM (BiLSTM). API call sequences are treated as time series, and behavior patterns are captured through sequence learning to distinguish benign from malicious behavior [23]. This approach is more robust than static features but still faces limitations in long-sequence modeling and semantic generalization. For detection methods based on graph structures and hybrid models, researchers have attempted to convert function call relationships, Control Flow Graphs (CFGs), or Data Flow Graphs (DFGs) into graph representations, use Graph Neural Networks (GNNs) for structural feature learning, and also combine CNNs with Transformer to achieve the fusion of local and global features [24,25,26,27]. These methods improve the model’s structural representation ability, but the use of semantic layer information is still limited. 
Overall, deep learning methods have greatly improved detection performance, but three main issues remain. First, models have insufficient semantic understanding of API calls, making it difficult to handle obfuscation at the behavioral level. Second, the feature extraction process lacks spatial context modeling, which limits cross-sample generalization. Third, traditional deep models have weak interpretability, making it difficult to support decision analysis by security experts. Recent work has also shown that malicious code can evade detection by manipulating its code structure [28], which poses a further challenge in this field.

2.3. Semantic Enhancement Detection Methods Based on Large Language Models

The rise of LLMs has injected new vitality into malicious code detection research [29,30]. LLMs, through self-supervised pretraining on large-scale corpora, possess strong natural language understanding and code semantic modeling capabilities, establishing a unified representation space between “code” and “language”. Researchers have begun exploring the application of LLMs in various security fields, including vulnerability detection [31,32], code completion [33], security log analysis [34,35], and malicious sample interpretation [36]. In the field of malicious code detection, researchers have attempted to input code snippets, API calls, or operation sequences into models such as BERT, CodeBERT, and GPT to extract semantic features for classification or similarity computation [37,38]. For example, CodeBERT can capture the semantic consistency of function calls, and GPT models can generate functional descriptions of APIs, which helps improve a detector's ability to understand unknown samples. However, existing research primarily focuses on semantic extraction or feature transfer and rarely considers the joint modeling of semantic features and spatial features. To address these shortcomings, this paper designs a Spatially Guided Convolution (SGC) module based on LLM semantic representation. Through a geometric attention mechanism and multi-scale convolution fusion, this method achieves the joint learning of semantic features and spatial structural features, thereby enhancing the model’s ability to represent complex behavior sequences and improving classification performance.

3. Method

This paper proposes a malware detection framework that jointly models semantic completion and spatial structures. As illustrated in Figure 1, the data processing flow of the proposed LLM-SGCF framework consists of three interconnected modules operating on API call sequences. First, in the Representation Generation module, a Large Language Model (LLM) generates high-quality behavioral explanations for raw API calls, which are subsequently encoded by BERT to construct contextual semantic representations and form a sequence-level semantic tensor S. Next, the Spatially Guided Convolution (SGC) module reshapes this tensor and utilizes multi-branch bottleneck convolutions alongside a geometric attention mechanism to extract and fuse key structured behavioral features. Finally, the Multi-Scale Convolution and Classification module employs parallel multi-scale convolutions to capture and fuse behavioral patterns of varying lengths and feeds the result into a fully connected layer for malware recognition and classification.

3.1. API Call Sequence Preprocessing

During dynamic execution, malware generates extensive API call sequences that typically exhibit significant noise, frequent adjacent repetitions, and structural instability. Directly utilizing these raw sequences for modeling fundamentally hinders the extraction of robust feature representations. To enhance the quality and learnability of the data, we apply a systematic preprocessing pipeline to the API sequences. Formally, let a raw API call sequence be denoted as:

S = [s_{1}, s_{2}, \dots, s_{n}]

(1)

To mitigate redundancies arising from cyclic behaviors or repetitive API invocations, we eliminate adjacent duplicates by applying a contiguous deduplication strategy:

\tilde{S} = RemoveAdjacentRepeat (S)

(2)

This procedure effectively preserves critical behavioral transitions and key operations while mitigating the noise introduced by redundant API invocations during model training. Following deduplication, we employ a tokenizer to construct a unified vocabulary for the API sequences, mapping each API token to a unique integer identifier:

X = Tokenizer (\tilde{S})

(3)

Subsequently, to ensure dimensional consistency, all tokenized sequences are padded or truncated to a fixed length of $L = 100$:

\hat{X} = Pad (X, L)

(4)

This normalization procedure ensures dimensional uniformity across all input sequences, establishing a robust foundation for the subsequent LLM-driven semantic enhancement and deep spatial feature extraction.
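The four preprocessing steps in Equations (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the reserved padding id `0`, right-side padding, and keeping the first $L$ tokens on truncation are all assumptions, since the text does not specify them.

```python
from itertools import groupby
from typing import Dict, Iterable, List

PAD_ID = 0  # assumption: id 0 is reserved for padding


def remove_adjacent_repeat(seq: Iterable[str]) -> List[str]:
    """Eq. (2): collapse runs of identical consecutive API names into one."""
    return [name for name, _ in groupby(seq)]


def build_vocab(sequences: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign each distinct API name a unique integer id, starting at 1."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for name in seq:
            if name not in vocab:
                vocab[name] = len(vocab) + 1
    return vocab


def tokenize(seq: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Eq. (3): map API names to their integer ids."""
    return [vocab[name] for name in seq]


def pad(ids: List[int], length: int = 100) -> List[int]:
    """Eq. (4): right-pad with PAD_ID or truncate to exactly `length` ids."""
    return (ids + [PAD_ID] * length)[:length]


raw = ["NtOpenFile", "NtReadFile", "NtReadFile", "NtReadFile", "NtClose", "NtReadFile"]
deduped = remove_adjacent_repeat(raw)
vocab = build_vocab([deduped])
padded = pad(tokenize(deduped, vocab), length=6)
```

Note that only adjacent repeats are removed: the final `NtReadFile` survives because it follows `NtClose`, which preserves the behavioral transition the text describes.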

3.2. Semantic Enhancement Driven by Large Language Models

Traditional API-based malware detection methods suffer from a critical “semantic sparsity” problem: raw API strings inherently fail to articulate their actual functions, potential risks, and behavioral significance within an attack chain. This semantic gap fundamentally restricts a model’s capacity to comprehend complex malicious patterns. To bridge this gap, we introduce a semantic enhancement pipeline that sequentially transforms the raw API into a natural language explanation and subsequently into a contextual semantic vector to deeply enrich the behavioral representations of APIs.

Employing a structured prompting paradigm, we combine a predefined template with the target API name to query the Large Language Model (LLM). This process automatically generates a comprehensive natural language explanation of the API’s behavior (approximately 300 words). Formally, let $api$ denote the input API and let the generative LLM be modeled by the function $g(\cdot)$. The resulting explanatory text, $T_{api}$, is formulated as:

T_{api} = g (api)

(5)

The generated explanations typically encompass four key dimensions: the core functionality of the API, including file operations, memory management, and network communication; its standard usage patterns within the operating system; high-risk execution scenarios indicative of potential malicious activities; and specific behavioral steps or malicious intents embedded within an attack chain. Consequently, sparse symbolic API calls, which were previously restricted to mere function names, are translated into natural language descriptions enriched with behavioral semantics. This semantic augmentation enables the model to comprehend malware logic at a higher cognitive level.

Specifically, enabling the model to comprehend malware logic at a “higher cognitive level” implies a fundamental transition from rigid, symbolic pattern matching to abstract, intent-based reasoning. Traditional models treat APIs as isolated, opaque tokens, merely memorizing the statistical co-occurrence of symbols, such as observing one specific API followed by another. In contrast, by leveraging the contextual dimensions generated by the LLM, including core functionalities and high-risk scenarios, our framework explicitly encodes the underlying execution intent. For instance, instead of merely seeing VirtualAllocEx and WriteProcessMemory as arbitrary integers, the semantically augmented model recognizes them as consecutive steps forming a “process injection” attack chain. This intent awareness allows the framework to group superficially different but semantically equivalent API combinations, thereby grasping the true malicious logic and remaining robust against advanced structural obfuscations.

To mitigate the inherent risk of LLM hallucinations and ensure the factual accuracy of the generated descriptions, we implement a rigorous quality control protocol during the generation phase. First, the LLM’s generation temperature is set to an extremely low value of $T = 0.1$ to prioritize deterministic, factual outputs and suppress creative extrapolation. Second, our structured prompting paradigm explicitly constrains the model to derive functional explanations strictly from official technical documentation, including Microsoft MSDN, and prohibits any speculative behavioral descriptions. Finally, to empirically validate the generation quality, we conduct a manual expert review by randomly sampling 10% of the unique generated API descriptions. These samples are cross-referenced against official documentation. The manual inspection confirms that the LLM faithfully articulates the API functionalities without critical hallucinations, thereby providing a highly reliable and factual semantic foundation for the subsequent BERT encoding.
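The generation settings described above can be illustrated as follows. The prompt wording, the `build_request` and `sample_for_review` helpers, and the layout of the request dictionary are hypothetical: the text does not publish its template or name the LLM client it uses. Only the temperature of 0.1, the roughly 300-word length, the four explanation dimensions, the documentation-only constraint, and the 10% review sample come from the text.

```python
import random
from typing import Dict, List

# Hypothetical template: the actual prompt wording is not given in the text.
PROMPT_TEMPLATE = (
    "Using only official technical documentation (for example, Microsoft MSDN), "
    "explain the API `{api}` in about 300 words. Cover: (1) core functionality, "
    "(2) standard usage patterns, (3) high-risk execution scenarios, and "
    "(4) its possible role in an attack chain. "
    "Do not speculate beyond the documentation."
)


def build_request(api_name: str, temperature: float = 0.1) -> Dict[str, object]:
    """Assemble a provider-agnostic request; no LLM is actually called here."""
    return {
        "prompt": PROMPT_TEMPLATE.format(api=api_name),
        "temperature": temperature,
    }


def sample_for_review(
    descriptions: Dict[str, str], fraction: float = 0.10, seed: int = 0
) -> List[str]:
    """Pick a random subset of unique APIs whose descriptions get manual review."""
    names = sorted(descriptions)
    if not names:
        return []
    k = max(1, round(len(names) * fraction))
    return random.Random(seed).sample(names, k)


request = build_request("WriteProcessMemory")
```

Fixing the random seed makes the reviewed subset reproducible, which is a convenience added here and not something the text states.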

To convert these natural language descriptions into trainable, high-dimensional vector representations, we employ the BERT model for text encoding. Following tokenization, padding, and embedding processes, BERT yields contextualized semantic vectors for each API description:

E_{api} = BERT (T_{api}) \in R^{768}

(6)

Through this process, a distinct semantic representation is established for each unique API. Given a padded API call sequence $\hat{X} = [x_{1}, x_{2}, \dots, x_{L}]$, its corresponding sequence-level semantic tensor is formulated as $S = [E_{x_{1}}, E_{x_{2}}, \dots, E_{x_{L}}] \in R^{L \times 768}$. Ultimately, this semantic enhancement layer preserves the inherent sequential structure of the API calls while projecting them into a dense semantic space. This provides contextualized behavioral representations that serve as a robust foundation for the subsequent Spatially Guided Convolution and behavior pattern extraction.
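Because each unique API has a single embedding, assembling $S$ amounts to a table lookup over the padded id sequence. The sketch below uses 4-dimensional stand-in vectors in place of the 768-dimensional BERT outputs. Mapping padding positions to a zero vector is an assumption, and the text does not state how BERT's token-level outputs are pooled into one vector per description, so the table is treated as given.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: padding id, matched to a zero vector below


def build_semantic_tensor(
    padded_ids: List[int], api_embeddings: Dict[int, List[float]], dim: int
) -> List[List[float]]:
    """Stack per-API embeddings in call order, giving an L x dim matrix."""
    rows: List[List[float]] = []
    for token_id in padded_ids:
        if token_id == PAD_ID:
            rows.append([0.0] * dim)  # assumed representation of padding
        else:
            rows.append(list(api_embeddings[token_id]))
    return rows


# Stand-in embeddings; in the framework these would be precomputed
# once per unique API description and reused for every sequence.
toy_table = {
    1: [0.1, 0.2, 0.3, 0.4],
    2: [0.5, 0.6, 0.7, 0.8],
}
S = build_semantic_tensor([1, 2, 1, 0, 0], toy_table, dim=4)
```

Repeated calls to the same API share one row value, so the cost of LLM generation and BERT encoding scales with the vocabulary size, not with the number of sequences.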

Regarding the extraction of execution intent, our framework employs a structured prompting paradigm to query the generative Large Language Model. The model generates a comprehensive natural language explanation for each isolated API. This explanation specifically details the core functionality, standard usage patterns, and high-risk execution scenarios indicative of malicious activities. By extracting these contextual dimensions, the framework explicitly decodes the underlying execution intent rather than treating the API as an opaque symbolic token. Subsequently, a BERT encoder transforms these natural language descriptions into dense contextual semantic vectors.

3.3. Spatially Guided Convolution Module

Algorithm 1 details the Spatially Guided Convolution (SGC) module, a core architectural innovation of our framework designed to efficiently extract cross-position dependencies, local behavioral patterns, and malicious feature saliency from API sequences. This module comprises five primary components: input space reconstruction, multi-branch bottleneck convolutions, spatially guided attention, feature fusion, and residual enhancement.

During the initial input space reconstruction, the sequence-level semantic matrix $S \in R^{L \times 768}$ is reshaped into a spatial tensor $X^{'} \in R^{8 \times 768 \times L}$, thereby leveraging the inherent positional relationships within a two-dimensional semantic space. Specifically, these “positional relationships” refer to the temporal proximity and execution order of adjacent API calls within an attack chain. In a standard 1D sequence, elements are processed strictly linearly. However, by projecting them into a 2D semantic space, a convolutional receptive field, such as a $3 \times 3$ kernel, can encompass a localized window of consecutive APIs simultaneously. This mechanism empowers the model to compute cross-correlations between adjacent execution steps, including instances where a memory allocation API is immediately followed by a process injection API, along with their corresponding semantic dimensions. Consequently, local temporal behavioral patterns are mathematically transformed into distinctive spatial–semantic clusters, enabling the network to recognize complex malicious structural combinations that are difficult to isolate in a purely linear space.

Algorithm 1: Spatially Guided Convolution (SGC) forward propagation

Require: Sequence-level semantic matrix $S \in R^{L \times 768}$
Ensure: Enhanced spatial–semantic feature representation Y
1: Reshape S into a 4D tensor $X^{'} \in R^{8 \times 768 \times L}$
// 1. Extract Local & Deep Features via Bottleneck Branches
2: Compute main branch initial feature: $H_{1} \leftarrow f_{block 1} (X^{'})$
3: Compute main branch deep feature: $H_{2} \leftarrow f_{block 2} (H_{1})$
4: Compute auxiliary branch feature: $H_{3} \leftarrow f_{block 3} (X^{'})$
// 2. Spatially Guided Attention
5: Generate spatial attention map: $A \leftarrow {Conv}_{3 \times 3} (ReLU ({Conv}_{3 \times 3} (H_{1})))$
// 3. Semantic-Space Feature Fusion
6: Fuse branch features with attention: $F \leftarrow H_{1} ⊙ H_{3} ⊙ A$
7: Refine fused representation: $H_{4} \leftarrow f_{block 4} (F)$
// 4. Residual Enhancement
8: Apply residual connection: $Y \leftarrow H_{4} + X^{'}$
9: return Y

This reshaping operation conceptualizes the API call sequence as a “single-width semantic feature map,” where the 768-dimensional semantic channels represent the depth and the execution order dictates the width. This structural formulation enables convolutional operators to extract local spatial features along the sequence dimension, thereby leveraging the inherent positional relationships within a two-dimensional semantic space. To capture local features across diverse scales and semantic levels, the SGC module incorporates four bottleneck blocks: two on the main branch, one on an auxiliary branch, and one that refines the fused features. Each block comprises a sequential pipeline: a $1 \times 1$ convolution for dimensionality reduction, a depthwise separable convolution (using either a $3 \times 3$ or $1 \times 1$ kernel), a $1 \times 1$ convolution for dimensionality expansion, normalization, and a ReLU activation function, culminating in the following structure:

H_{1} = f_{block 1} (X^{'}), H_{2} = f_{block 2} (H_{1}), H_{3} = f_{block 3} (X^{'})

(7)

Specifically, the functional roles of these individual blocks are defined as follows: Block 1 extracts the initial local semantic structures of the API sequence. Block 2 enhances local continuity and deepens the semantic representation. Block 3 acts as an auxiliary pathway to compress the enhanced semantic features. Block 4 performs final feature fusion and refinement. This bottleneck architecture effectively reduces the parameter count while substantially boosting the model’s representational capacity within the semantic space. Furthermore, the SGC module incorporates spatially guided attention to assign higher weights to critical positions within the API sequence. The spatial attention map is generated via two consecutive $3 \times 3$ convolutions:

A = {Conv}_{3 \times 3} (ReLU ({Conv}_{3 \times 3} (H_{1})))

(8)

In this formulation, $H_{1}$ denotes the output representation from Block 1, and the resulting attention map A captures feature saliency across the following dimensions.

Semantic Dimension ($d = 768$): It identifies key semantic tokens within the explanatory text that correlate with malicious behavior.
Sequence Dimension ($L = 100$): It evaluates the positional saliency of critical APIs within the malicious execution chain.

To deeply strengthen the coupling between API semantic structures and their spatial significance, our framework employs a dynamic fusion mechanism driven by the spatial attention map. Intuitively, the semantic representation answers the “what” by identifying the underlying execution intent of an API, such as memory injection, while the spatial significance answers the “where” by pinpointing the critical locations of these execution steps within the entire sequence. Rather than treating them independently, we fuse the main branch features $H_{1}$, auxiliary branch features $H_{3}$, and attention map A via element-wise multiplication:

F = H_{1} ⊙ H_{3} ⊙ A

(9)

This fusion strategy offers a clear and intuitive interpretation: through this tight coupling, the network forces the semantic and spatial dimensions to interact directly. If an API carries high-risk semantics across multiple pathways to represent the “what” and simultaneously appears in a structurally critical position highlighted by the attention map to represent the “where”, its activation features are significantly amplified. Conversely, if an API shows only isolated responses or represents benign background noise, it is aggressively suppressed due to the lack of attentional reinforcement. Consequently, the model achieves a deeply fused representation where behavioral meaning and structural importance mutually reinforce each other, making the detection highly robust against advanced obfuscations. The fused feature F is subsequently refined through Block 4 to obtain:

H_{4} = f_{block 4} (F)

(10)

Finally, a residual connection adds $H_{4}$ to the input tensor $X^{'}$:

Y = H_{4} + X^{'}

(11)
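Equations (9) to (11) can be traced on a toy feature map to see the amplify-or-suppress behavior. The sketch below is illustrative only: the maps are small hand-picked matrices standing in for learned activations, `sgc_fuse` is a hypothetical helper name, and Block 4 is replaced by an identity function so that the effect of the three-way product stays visible.

```python
from typing import Callable, List

Matrix = List[List[float]]  # rows = semantic channels, columns = sequence positions


def hadamard(*maps: Matrix) -> Matrix:
    """Element-wise product of equally shaped 2D maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all maps must share the same shape")
    out = [[1.0] * cols for _ in range(rows)]
    for m in maps:
        for i in range(rows):
            for j in range(cols):
                out[i][j] *= m[i][j]
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used for the residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sgc_fuse(
    x: Matrix, h1: Matrix, h3: Matrix, attn: Matrix,
    block4: Callable[[Matrix], Matrix] = lambda f: f,  # identity stand-in
) -> Matrix:
    fused = hadamard(h1, h3, attn)  # Eq. (9): F = H1 * H3 * A
    h4 = block4(fused)              # Eq. (10): refinement by Block 4
    return add(h4, x)               # Eq. (11): Y = H4 + X'


x = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h1 = [[2.0, 2.0, 2.0], [1.0, 1.0, 1.0]]
h3 = [[3.0, 1.0, 3.0], [1.0, 1.0, 1.0]]
attn = [[1.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
y = sgc_fuse(x, h1, h3, attn)
```

In the first channel, position 0 is strong in both branches and attended, so it dominates the output; position 2 is equally strong in both branches but receives zero attention and is removed entirely.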

While the SGC module incorporates foundational operations akin to standard inverted bottlenecks such as MobileNet and attention mechanisms including SENet and CBAM, its design rationale and spatiotemporal mechanics are fundamentally distinct and specifically tailored for malware API sequence analysis. First, regarding the input domain, traditional blocks operate on isomorphic visual spaces represented by $H \times W$, whereas SGC operates on a highly asymmetric semantic–temporal space. In SGC, the depthwise convolutions do not extract visual edges; rather, they capture local temporal behavioral patterns within specific LLM-generated semantic pathways. Second, unlike SENet or ECA, which heavily rely on Global Average Pooling to compute channel-wise scaling factors, our spatially guided attention intentionally discards this global pooling mechanism. Instead, it utilizes consecutive $3 \times 3$ convolutions to maintain sequence resolution, thereby computing a precise spatiotemporal saliency map that localizes “where” the high-risk actions occur along the attack chain without compressing the temporal dimension. Finally, rather than the standard linear residual additions or simple channel scaling seen in MobileNet and CBAM, SGC introduces a unique three-way multiplicative interaction defined as $F = H_{1} ⊙ H_{3} ⊙ A$. This specific fusion strategy acts as a strict structural filter, enforcing a hard coupling between primary semantics, compressed auxiliary semantics, and spatial saliency, thereby aggressively suppressing structural obfuscation noise such as dummy APIs that traditional attention mechanisms might otherwise preserve.

Furthermore, regarding the capture of spatial structural dependencies, the Spatially Guided Convolution module reshapes the sequence-level semantic matrix into a two-dimensional spatial tensor. In this reconstructed semantic space, the temporal execution order of adjacent API calls transforms directly into spatial proximity. Consequently, a convolutional receptive field simultaneously encompasses a localized window of consecutive APIs. This mechanism empowers the network to compute cross-correlations between adjacent execution steps along their semantic dimensions, thereby mathematically transforming local temporal behavioral patterns into distinct spatial semantic clusters.

3.4. Multi-Scale Convolution Fusion and Classification

After the Spatially Guided Convolution module extracts the fused feature Y, this paper further introduces a multi-scale convolution structure to capture behavioral dependency patterns of different lengths in the API call sequence. Specifically, the output of SGC is fed into three convolutional branches with different receptive fields: the 1 × 3 convolutional branch focuses on modeling short-range call chain features; the 1 × 4 convolutional branch extracts medium-range behavior patterns and local function combinations; the 1 × 5 convolutional branch is used to identify behavioral associations in longer attack sequences.

Each convolutional branch sequentially passes through Batch Normalization, ReLU activation function, and max pooling operations to obtain the corresponding feature vector for that scale. Then, the three sets of features are concatenated along the channel dimension to form the fused global feature representation:

F_{global} = [F_{3} ‖ F_{4} ‖ F_{5}]

(12)

Finally, the global feature vector

F_{global}

is input into a fully connected layer for classification, and the final class prediction is generated using the Softmax function:

\hat{y} = \mathrm{Softmax}(W F_{global} + b)

(13)

This multi-scale convolution fusion strategy can simultaneously capture behavior structural features at multiple granularities, thereby enhancing the robustness and accuracy of malware classification.
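The pipeline of Equations (12) and (13) can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: each branch uses a single filter with arbitrary all-ones weights, Batch Normalization is omitted, and the input `Y`, the weight matrix `W`, and the bias `b` are made-up toy values rather than learned parameters.

```python
import math


def conv_branch(seq, kernel):
    """Slide a width-k kernel along the time axis, apply ReLU, then max-pool."""
    k = len(kernel)
    responses = []
    for t in range(len(seq) - k + 1):
        s = sum(w * x
                for w_row, x_row in zip(kernel, seq[t:t + k])
                for w, x in zip(w_row, x_row))
        responses.append(max(0.0, s))  # ReLU
    return max(responses)              # max pooling over all positions


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy SGC output: 6 time steps, 2 features per step.
Y = [[0.1, 0.0], [0.9, 0.2], [0.8, 0.7], [0.1, 0.9], [0.0, 0.1], [0.3, 0.3]]

# One filter per branch, with temporal widths 3, 4 and 5.
kernels = {k: [[1.0, 1.0] for _ in range(k)] for k in (3, 4, 5)}

# Equation (12): concatenate the pooled branch outputs.
F_global = [conv_branch(Y, kernels[k]) for k in (3, 4, 5)]

# Equation (13): fully connected layer followed by Softmax.
W = [[0.5, 0.2, 0.1], [-0.5, -0.2, -0.1]]
b = [0.0, 0.0]
logits = [sum(w * f for w, f in zip(row, F_global)) + bi
          for row, bi in zip(W, b)]
y_hat = softmax(logits)
```

In a real model each branch would hold many filters, so each of F_{3}, F_{4}, and F_{5} would be a vector rather than a single scalar.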

4. Experiments

To thoroughly validate the robustness and effectiveness of the proposed LLM-SGCF framework, which is fundamentally driven by Large Language Model Semantic Enhancement and Spatially Guided Convolution Fusion, we design a comprehensive evaluation pipeline. This section systematically presents the experimental design and findings across eight key dimensions: dataset description, baselines, evaluation metrics, parameter settings, quantitative results, classification performance visualization, parameter sensitivity analysis, and ablation studies. Extensive experiments are conducted on two publicly available datasets, Aliyun [15] and Catak [16], encompassing both multi-class and binary-classification scenarios to rigorously test the framework’s resilience against complex malware mutations.

4.1. Dataset Description

Evaluating the robustness of a malware detection model requires diverse and highly obfuscated behavioral data. Therefore, we select the Aliyun and Catak dynamic behavior datasets, both of which provide rich API call sequences. These datasets pose significant challenges, including imbalanced family distributions and complex malicious structural patterns, making them suitable testbeds for our method. Figure 2 illustrates the distribution of malware categories alongside benign samples in the Aliyun dataset, which is used for both multi-class and binary-classification tasks. Figure 3 details the category distribution in the Catak dataset, which captures distinct execution sequences of malicious behaviors for multi-class evaluation. To ensure reliable and unbiased assessment, both datasets are partitioned into training, validation, and test sets using an 8:1:1 ratio with stratified sampling. This keeps the class proportions, particularly those of extreme minority categories such as “Mining”, consistent across the three subsets, thereby mitigating the risk of distribution shift and instability in small-sample evaluations.
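A stratified 8:1:1 partition of this kind can be sketched as follows. The function name `stratified_split`, the toy label counts, and the rounding rule are assumptions made for illustration; the paper does not specify the exact splitting routine.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists with per-class proportions kept."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy label list with one small minority class (counts are made up).
labels = ["Benign"] * 100 + ["Mining"] * 20 + ["Worm"] * 50
train, val, test = stratified_split(labels)
```

Because the split is performed class by class, a minority class such as "Mining" contributes samples to all three subsets in roughly the same 8:1:1 proportion as the majority classes.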

Regarding the utilization of the Large Language Model, it is imperative to address potential concerns regarding data leakage, specifically whether its pretraining data intersect with the Aliyun or Catak datasets. In the LLM-SGCF framework, the LLM is strictly isolated from the dataset samples. The Aliyun and Catak datasets consist of dynamic API call sequences and their corresponding malware class labels. However, the LLM does not process these sequences, nor does it perform any end-to-end classification. Instead, the LLM is utilized exclusively as an offline dictionary generator that takes a single, isolated API name, such as VirtualAllocEx, as input to generate a generic functional description. The LLM remains completely blind to the sequence context, structural dependencies, and ground-truth labels of the dataset instances. The actual sequence modeling and classification are entirely driven by the proposed SGC module, which is trained from scratch. Therefore, the prior knowledge of public API documentation embedded within the LLM provides a fair semantic foundation without creating any unfair algorithmic advantage or data contamination risk in the comparative evaluation.

4.2. Baselines

To benchmark the superiority and robustness of LLM-SGCF, we compare it against a broad spectrum of state-of-the-art deep learning models, spanning four architectural paradigms: RNN, CNN, hybrid, and Transformer models. The RNN baseline group includes BiLSTM [39], BiGRU [40], CatakNet [16], and ZhangNet [41]. The CNN group is represented by TextCNN [42]. Hybrid architectures that combine local and sequential modeling include Kolosnjaji [43], LiNet [44], and Mal-ASSF [45]. Finally, for a comparison against modern attention-driven models, the Transformer series includes Transformer [46], Nebula [47], and MalBERT [48]. All baseline models are trained independently under identical data partitioning and hardware conditions, with hyperparameters carefully tuned to extract their peak performance, ensuring a fair comparative environment.

4.3. Evaluation Metrics

To comprehensively quantify the detection performance and the robustness of the models—particularly their ability to minimize false positives while resisting evasion tactics—we employ accuracy, precision, and recall as our primary evaluation metrics. These metrics are evaluated across three distinct experimental settings: the Aliyun multi-class task, the Aliyun binary-class task, and the Catak multi-class task. The specific formulas are defined as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(14)

Precision = \frac{TP}{TP + FP}

(15)

Recall = \frac{TP}{TP + FN}

(16)

Here, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Together, these metrics provide a holistic view of the model’s discriminative power and its resilience when identifying diverse malicious behaviors.
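As a worked instance of Equations (14) to (16), the short sketch below computes the three metrics from confusion counts. The counts are invented for illustration, and the zero-denominator fallback is an assumption rather than something stated in the paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy counts: 100 malicious and 100 benign test samples.
acc, prec, rec = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

For the multi-class tasks, the same counts would be taken per class in a one-vs-rest fashion; how the per-class values are averaged is not specified here.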

4.4. Parameter Setting

To guarantee experimental reproducibility and fairness, consistent training configurations are applied across all models. All experiments are implemented using PyTorch 2.1.0 with CUDA 12.1 and Python 3.10. As summarized in Table 1, optimization is performed using the Adam optimizer with a training batch size of 8 over 100 training epochs. The learning rate is initialized at 0.001 and dynamically regulated using a StepLR scheduler (step = 20,

γ = 0.9

) to ensure stable convergence. Additionally, a weight decay of

6 \times 10^{-5}

and a dropout rate of 0.3 are applied to mitigate overfitting. All runs are executed on a GPU, with a fixed random seed (42) used across all trials to reduce variance induced by initialization.
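The resulting learning rate trajectory can be written in closed form. The sketch below assumes the standard step-decay rule, in which the rate is multiplied by γ once every `step_size` epochs, and uses the settings listed above; the function name `step_lr` is illustrative.

```python
def step_lr(epoch, base_lr=0.001, step_size=20, gamma=0.9):
    """Learning rate at a given 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 100 training epochs.
schedule = [step_lr(e) for e in range(100)]
```

With these settings the rate is decayed four times over the run, so the final epochs still train at roughly two thirds of the initial rate.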

To ensure the full reproducibility of the proposed framework, we further detail the specific micro-architectural parameters of the Spatially Guided Convolution module. Specifically, regarding the bottleneck branches, the intermediate number of filters is set to 96 for both the main branches,

H_{1}

and

H_{2}

, as well as auxiliary branch

H_{3}

, significantly reducing the parameter overhead from the original 768 input channels. For the depthwise separable convolutions within these branches, the depth multiplier is fixed at 1, ensuring a strict one-to-one mapping per channel. The spatially guided attention mechanism is constructed using a two-layer convolutional network comprising a sequence of a

3 \times 3

convolution, a ReLU activation, and another

3 \times 3

convolution. Finally, to ensure stable gradient propagation during the initial training phase, all convolutional and linear layer weights within the framework are initialized using the default Kaiming Uniform method.

During the training phase, the model employs Cross-Entropy Loss as the objective function to evaluate the discrepancy between the predicted probability distribution and the actual ground-truth labels. The Cross-Entropy Loss for a single sample i is mathematically defined as follows:

L_{i} = -\log \left( \frac{\exp(s_{y_{i}})}{\sum_{j=1}^{C} \exp(s_{j})} \right)

(17)

where C denotes the total number of classes,

s_{y_{i}}

represents the raw logit score for the ground-truth class, and

s_{j}

indicates the score for class j. Mechanistically, the term inside the logarithm represents the Softmax probability of the true class. Because of the properties of the negative logarithmic function, the Cross-Entropy loss heavily penalizes incorrect predictions made with high confidence. By minimizing this loss value through backpropagation, the framework continuously adjusts its network weights to narrow the distributional gap between predictions and actual labels, which ultimately drives the model to converge towards an optimal classification boundary.
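The behavior of Equation (17) can be checked numerically with a small sketch. The logits below are toy values, and the log-sum-exp rearrangement is a standard numerical-stability device rather than a detail taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """L_i = -log(exp(s_y) / sum_j exp(s_j)), computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_sum_exp - logits[target]


# Same confident logits, scored against a correct and an incorrect label.
confident_right = cross_entropy([5.0, 0.0, 0.0], target=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], target=1)

# Uninformative logits give a loss of log(C) for C = 3 classes.
uniform = cross_entropy([0.0, 0.0, 0.0], target=0)
```

The confidently wrong prediction incurs a much larger loss than the uniform guess, which in turn exceeds the loss of the confidently correct prediction.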

4.5. Quantitative Results

The comprehensive experimental results, summarized in Table 2, clearly highlight the performance disparities among different architectural paradigms. Traditional RNN variants (BiLSTM, BiGRU, and CatakNet) demonstrate acceptable baseline performance on the Aliyun dataset but struggle with the semantic complexity of the Catak multi-class task, indicating a vulnerability to long-range dependency issues. CNN and hybrid models (e.g., TextCNN and Mal-ASSF) show marginal improvements due to local feature extraction but still lack deep semantic understanding. While Transformer-based models (such as Nebula and MalBERT) exhibit strong sequence representation capabilities, they are ultimately surpassed by the proposed framework.

Crucially, the LLM-SGCF model achieves the best results across all benchmarks, attaining accuracy rates of 84.88% on the Aliyun multi-class task, 95.82% on the Aliyun binary-class task, and 63.15% on the Catak multi-class task. To verify that the improvement over the strongest baseline on the binary task, TextCNN at 94.53%, is not merely a byproduct of random variance, we apply McNemar’s test to the test-set predictions. The test yields

p < 0.01

, indicating that the accuracy gain achieved by our framework is statistically significant at the 0.01 level. This performance gain provides empirical evidence that our core strategy of fusing LLM-driven deep semantic completion with the cross-position structural modeling of Spatially Guided Convolutions enhances the framework’s robustness against complex, structurally mutated malware attacks.
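For readers who wish to reproduce this kind of paired comparison, a minimal sketch of McNemar’s test is given below. It uses the exact two-sided binomial form on the discordant pairs; the paper does not state which variant (exact or chi-square) was used, and the predictions here are synthetic.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    b counts samples that only model A classifies correctly,
    c counts samples that only model B classifies correctly.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Synthetic example: model A is right on all 20 samples, model B misses 12.
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [1] * 8 + [0] * 12
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

Only the samples on which the two models disagree enter the statistic, which is why the test suits comparisons made on a single shared test set.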

4.6. Model Convergence Stability Analysis

To empirically demonstrate the stability of the proposed LLM-SGCF framework and confirm the absence of overfitting during the training phase of 100 epochs, we plot the learning curves for the training and validation losses. As illustrated in Figure 4, both the training and validation losses decrease rapidly in the initial epochs and subsequently stabilize. Crucially, the validation loss curve closely tracks the training loss without exhibiting any divergence or upward rebound trend in the later epochs. This robust convergence behavior provides concrete evidence that the synergistic effect of our integrated regularization techniques, specifically the dropout strategy with a rate of 0.3 and a weight decay of

6 \times 10^{-5}

, effectively mitigates overfitting risks, ensuring optimal generalization capabilities without the strict necessity of an explicit early stopping mechanism.

4.7. Classification Performance Visualization

To further investigate the fine-grained classification performance and class-specific discriminative power of the LLM-SGCF framework, we visualize the confusion matrix for the Aliyun multi-class task in Figure 5.

The matrix reveals that our model achieves exceptional diagonal concentration, particularly for predominant behavioral modalities. For instance, in the Infectious Virus and DDoS Trojan categories, respectively denoted Class 1 and Class 6, the framework successfully identifies 486 and 397 samples with minimal inter-class leakage. Even in categories with smaller sample sizes, such as the Mining and Benign categories representing Class 2 and Class 3, the LLM-SGCF framework maintains robust recognition counts of 47 and 105.

A critical observation is the resilience of the model when distinguishing between categories with subtle behavioral overlaps. For example, while traditional models often confuse the Worms category, denoted Class 8, with other malicious types due to structural similarities, our framework correctly classifies 82 samples, significantly reducing the false-positive rate.

To provide deeper insights into the limitations of the model and the inherent challenges of the dataset, we further conduct a detailed error analysis based on the off-diagonal elements of the confusion matrix. The most notable misclassifications occur between the Benign and Trojan categories, denoted Class 3 and Class 5 respectively. An analysis of the underlying API sequences reveals that this confusion primarily stems from behavioral overlap and benign obfuscation. First, legitimate software updaters and background synchronization tools frequently execute API sequences involving network communication, file dropping, and child process creation, including sequential transitions from InternetOpenUrl to WriteFile and finally to CreateProcess. Trojans exhibit nearly identical structural topologies when establishing command and control communications or downloading secondary payloads. Because the LLM-SGCF framework is highly sensitive to intent and spatial structure, benign but aggressive intents occasionally trigger the spatial attention map designed to detect malicious downloading chains, leading to false positives.

Second, a portion of benign samples employs commercial packers to protect intellectual property. These applications generate anti-analysis API patterns during execution, such as the invocation of IsDebuggerPresent and extensive calls to VirtualAllocEx for unpacking, which are fundamentally indistinguishable from malware evasion tactics at the API level. Consequently, the model correctly identifies the obfuscation semantics but inevitably misclassifies the benign sample as malicious due to the lack of transparent payload APIs.

Overall, these results indicate that the integration of LLM-driven semantic enhancement effectively resolves the semantic sparsity problem by providing deep contextual descriptions, while the SGC module successfully localizes critical malicious segments within the API call sequences. This synergistic structural and semantic fusion allows the framework to maintain high robustness and fine-grained accuracy even when processing highly obfuscated or structurally mutated malware variants.

4.8. Sensitivity Analysis

To thoroughly evaluate the robustness, stability, and structural adaptability of the proposed LLM-SGCF framework under varying conditions, we conduct comprehensive sensitivity analyses focusing on two critical aspects: the input API sequence length and the optimization hyperparameters of the learning rate scheduler.

4.8.1. Sensitivity Analysis of Sequence Length

To determine the optimal API sequence length, denoted by L, and investigate the impact of truncation and padding on long-sequence malicious behaviors, we conduct a sensitivity analysis comparing sequence lengths of

L \in {50, 100, 150}

. As shown in Table 3, when the sequence is truncated too severely to a length of 50, the framework misses key downstream attack chain information, resulting in a noticeable drop in accuracy. Conversely, extending the sequence length to 150 yields only marginal improvements in performance while substantially increasing GPU memory overhead. Furthermore, processing extended sequences introduces excessive padding noise for naturally shorter benign samples. This empirical phenomenon is consistent with the behavioral characteristics of malware, which typically executes its core initialization, unpacking, and evasion routines early in its lifecycle to establish persistence. Therefore, a length of 100 is empirically validated as the optimal threshold, offering the best trade-off between preserving essential malicious semantics and maintaining computational efficiency.
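The truncation and padding step discussed above can be sketched as follows. Keeping the head of the sequence, right-padding, and using 0 as the padding identifier are assumptions made for illustration; the helper name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(api_ids, length=100, pad_id=0):
    """Keep the first `length` calls; right-pad shorter sequences with pad_id."""
    clipped = api_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))


short = pad_or_truncate(list(range(1, 31)))    # 30 calls, 70 padding slots
long = pad_or_truncate(list(range(1, 251)))    # 250 calls, last 150 dropped
```

Raising `length` to 150 would add 50 more padding slots to every short sample like the first one, which illustrates the padding noise mentioned above.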

4.8.2. Parameter Sensitivity Analysis

Furthermore, to explicitly evaluate the optimization stability of the framework, we analyze two critical hyperparameters of the learning rate scheduler, specifically Step Size and Gamma, denoted by

γ

. These parameters govern the pace of the StepLR strategy, which is essential to balancing the exploration and exploitation phases during the training process.

Figure 6 illustrates the 2D marginal sensitivity of the model for accuracy, precision, and recall. When the Step Size is fixed at 20, a steady improvement is observed across all performance metrics as Gamma increases. Specifically, as

γ

approaches 0.9, the model converges to a robust near-optimal level. This suggests that higher retention of the previous learning rate, indicated by a higher

γ

, helps the model navigate the complex loss landscape formed by the contextualized semantic features and spatial dependencies. Conversely, when Gamma is fixed at 0.9 and the Step Size varies from 10 to 30, performance exhibits a distinct inverted-U-shaped trend. The metrics peak sharply at a Step Size of 20. This indicates that an excessively short decay interval, such as a Step Size of 10, causes the learning rate to diminish prematurely, preventing the model from adequately learning subtle malicious patterns. Meanwhile, an overly prolonged interval, such as a Step Size of 30, leads to insufficient weight refinement in the later stages of training, and both of these extremes impede the model’s ability to optimally capture the complex semantic spatial features extracted by the Spatially Guided Convolution module.

Furthermore, Figure 7 visualizes the interaction between these parameters via a 3D joint sensitivity analysis. The response surfaces for accuracy, precision, and recall all manifest a distinct and stable localized plateau. The highest detection efficacy is heavily concentrated around the coordinate where the Step Size is near 20 and Gamma is close to 0.9. Deviations towards the extremes of the hyperparameter space, such as a Step Size of 10 and a Gamma of 0.2, result in a steep performance decline. This decline is likely due to the stagnation of the gradient descent process under aggressive decay settings, which fails to overcome local minima in the behavioral semantic space.
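One way to see why aggressive decay settings may stall training is to compare the learning rate that remains in the final epoch for different configurations. The sketch below assumes the standard step-decay rule with an initial rate of 0.001 over 100 epochs; the grid contains only values mentioned in this section, and the function name `final_lr` is illustrative.

```python
def final_lr(step_size, gamma, base_lr=0.001, epochs=100):
    """Learning rate in the last epoch under step decay."""
    return base_lr * gamma ** ((epochs - 1) // step_size)


# Final learning rate for each (Step Size, Gamma) pair.
grid = {(s, g): final_lr(s, g) for s in (10, 20, 30) for g in (0.2, 0.9)}
```

With a Step Size of 10 and a Gamma of 0.2, the rate is cut nine times and falls below 1e-9, so weight updates in the later epochs become vanishingly small. With a Step Size of 20 and a Gamma of 0.9, it remains above 6e-4.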

To clarify how the multi-dimensional topological analysis proves the optimization stability of the framework, we examine the geometric morphology of the 3D response surfaces to address both the distinctly robust central hyperparameter plateau and the cliff-shaped boundary. First, regarding the central plateau, when the Step Size is near 20 and

γ

is close to 0.9, the performance metrics form a broad and flat elevated region. In this central zone, the model maintains consistently high accuracy with minimal variance. This flatness indicates that the framework is resilient to minor perturbations in optimization settings. Second, regarding the cliff-shaped boundary, extreme parameter configurations, including an excessively low

γ

of 0.1 or a boundary Step Size, lie at the margins of the grid. These boundary regions exhibit steep declines, effectively forming performance cliffs. For instance, moving from the central plateau towards the extreme edges results in sharp and non-linear degradation across all evaluation metrics. This steep drop suggests that the model requires a balanced learning rate schedule to navigate the complex semantic space, and that the high performance results from stable convergence within the plateau rather than random chance. This characteristic supports reliable and stable performance in real-world malware detection scenarios where dynamic structural mutations are frequent.

4.9. Ablation Experiment

To identify the specific mechanisms contributing to the framework’s robustness, we conduct ablation experiments focusing on two core components: the Spatially Guided Convolution (SGC) module and the multi-scale convolution kernel combinations.

4.9.1. Effectiveness of the Spatially Guided Convolution Module

We first isolate the SGC module to evaluate its direct contribution to the model’s discriminative power. As illustrated in Figure 8 and Figure 9, the integration of the SGC module yields consistent performance improvements across different datasets. In the Aliyun binary-classification task, the inclusion of SGC raises precision, recall, and accuracy from 94.78%, 95.91%, and 95.61% to 95.02%, 96.13%, and 95.82%, respectively. The gain in recall, although modest in absolute terms (0.22 percentage points), is notable because it reflects the SGC module’s ability to identify evasive malicious API sequences that standard models might overlook.

This effect extends to the more complex multi-class tasks, where the overall accuracy on the Aliyun and Catak datasets reaches 84.88% and 63.15%, respectively. The empirical data indicate that the SGC module functions as a critical structural anchor within the architecture. By projecting semantic features into a spatial domain, it strengthens the model’s ability to decode complex behavioral dependencies and capture features from minority classes.

4.9.2. Impact of Multi-Scale Convolution Kernel Combinations

Furthermore, to investigate the multi-scale robustness of our spatial modeling, we experimented with varying convolution kernel size combinations within the fusion layer. The results, evaluated on the Aliyun dataset and summarized in Table 4, reveal how receptive field sizes influence detection performance.

The combination of kernel sizes (3, 4, 5) emerged as the optimal configuration, achieving the highest overall metrics, with a peak accuracy of 95.82% for binary classification and 84.88% for multi-class classification. Notably, when transitioning to excessively large kernel combinations, such as (5, 7, 9), the binary accuracy experienced a slight degradation to 95.04%. This decline indicates that overly broad receptive fields tend to smooth out or dilute highly localized, subtle malicious signals embedded within the API sequences. Overall, employing a compact and diverse kernel combination like (3, 4, 5) strikes an ideal balance. It successfully captures both local behavioral anomalies and mid-range structural semantics, thereby solidifying the robust cross-category recognition capability of the LLM-SGCF framework.

4.10. Discussion on Computational Complexity

While Large Language Models inherently introduce substantial computational overhead, the proposed LLM-SGCF framework decouples LLM inference from real-time malware detection via an offline pre-computation strategy. Since the vocabulary of unique APIs within the operating system and our datasets is finite and relatively small, typically on the order of hundreds, we avoid querying the model dynamically during the inference phase. Instead, during an initial offline setup phase, the language model is queried exactly once for each unique API to generate its behavioral description, which is subsequently encoded into a 768-dimensional semantic vector by the BERT-base model implemented using Transformers 4.37.2.

These pre-computed semantic vectors are then persistently cached in a local hash-based dictionary. During real-time malware detection, the semantic representation for any given API in the call sequence of a sample is retrieved via a constant-time lookup operation. Consequently, the per-sample latency and query costs attributable to the LLM are effectively eliminated at detection time. The only incurred cost is a one-time setup overhead required to construct the API dictionary. This design allows the LLM-SGCF framework to scale to the high-throughput and low-latency requirements of real-world malware detection deployment.
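The offline pre-computation and online lookup described above can be sketched as follows. The functions `fake_describe` and `fake_encode` are hypothetical stand-ins for the LLM and the BERT encoder (they perform no real inference), and mapping unseen APIs to a zero vector is an assumption made for illustration.

```python
EMBED_DIM = 768


def build_api_cache(api_names, describe, encode):
    """Offline step: one description and one embedding per unique API name."""
    return {name: encode(describe(name)) for name in sorted(set(api_names))}


def embed_sequence(api_calls, cache, unk=None):
    """Online step: constant-time dictionary lookups, no model calls."""
    unk = unk if unk is not None else [0.0] * EMBED_DIM
    return [cache.get(name, unk) for name in api_calls]


# Stand-ins for the LLM and the encoder; the counter tracks LLM queries.
calls = {"describe": 0}


def fake_describe(name):
    calls["describe"] += 1
    return f"{name}: generic functional description"


def fake_encode(text):
    return [float(len(text))] * EMBED_DIM


trace = ["VirtualAllocEx", "WriteFile", "VirtualAllocEx",
         "CreateProcess", "WriteFile"]
cache = build_api_cache(trace, fake_describe, fake_encode)
matrix = embed_sequence(trace, cache)  # 5 rows of 768 values each
```

In this sketch the description function is invoked three times for a five-call trace, once per unique API, and repeated calls in the trace reuse the same cached vector.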

5. Conclusions

In summary, we propose LLM-SGCF, a robust malware detection framework that addresses the vulnerabilities of existing methods by jointly modeling semantic completion and spatial structure. To overcome the inherent semantic sparsity of traditional API analysis, we leverage generative Large Language Models combined with BERT to construct deep, contextually enriched behavioral representations. Building upon this semantic foundation, our novel Spatially Guided Convolution module effectively localizes critical malicious segments and captures complex cross-position dependencies. By further integrating a multi-scale convolution architecture, the framework seamlessly fuses behavioral features across multiple granularities.

Extensive experiments on the public Aliyun and Catak datasets demonstrate that LLM-SGCF consistently outperforms state-of-the-art baselines, including mainstream RNN-, CNN-, and Transformer-based architectures, in both multi-class and binary-classification scenarios. Most notably, the evaluation and ablation studies confirm that the synergistic integration of LLM-driven semantic enhancement and spatial structural modeling grants our framework superior robustness and generalization capabilities against in-the-wild structural mutations and complex behavioral variations. This research study elucidates the critical role of multi-dimensional feature fusion in sequence analysis, providing a highly reliable and promising technical pathway for the development of next-generation intelligent malware defense systems.

Despite the promising results, we acknowledge certain limitations in the current evaluation scope. While the utilized datasets encompass a wide array of naturally occurring structural variations, our experiments do not systematically evaluate the framework against synthetically generated adversarial attacks, such as GAN-based evasions, or strictly controlled noise injection, such as manual dummy API insertion. Furthermore, while strict stratified sampling is employed to stabilize the evaluation of extreme minority classes under the current 8:1:1 split, conducting exhaustive K-fold cross-validation on larger-scale imbalanced datasets is necessary to fully map the partitioning sensitivity of the model. Consequently, exploring the extreme boundary robustness and spatial stability of the framework against targeted adversarial mutations remains a critical avenue for our future work.

Author Contributions

L.Z. conducted the experiments and wrote the manuscript. H.H. and N.L. proofread the manuscript. Y.W. handled data preprocessing. M.L. worked on the tables and figures. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the Science and Technology Project of State Grid Shandong Electric Power Company: “Research on Key Technologies of Attack Mechanism Mining, Analysis and Decision-Making for Power System Cross-Domain Security Boundary Protection” under Grant 520627250002.

Data Availability Statement

The datasets used and analyzed during the current study are publicly available from the corresponding official repositories.

Conflicts of Interest

Authors Lina Zhao, Hua Huang, Ning Li, Yunxiao Wang and Ming Li were employed by the company Information and Telecommunications Company, State Grid Shandong Electric Power Company. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. This study received funding from the Science and Technology Project of State Grid Shandong Electric Power Company (Grant No. 520627250002). The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

Cui, Z.; Zhao, Y.; Cao, Y.; Cai, X.; Zhang, W.; Chen, J. Malicious code detection under 5G HetNets based on a multi-objective RBM model. IEEE Netw. 2021, 35, 82–87.
Kim, J.Y.; Cho, S.B. Obfuscated malware detection using deep generative model based on global/local features. Comput. Secur. 2022, 112, 102501.
Yan, K.; Zhang, Y.; Tang, H.; Ren, C.; Zhang, J.; Wang, G.; Wang, H. Signature detection, restoration, and verification: A novel chinese document signature forgery detection benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 5163–5172.
Wang, K.; Jiang, Q.; Wu, Y.; Wang, B.; Zhang, H. STATGRAPH: Effective in-Vehicle Intrusion Detection via Multi-View Statistical Graph Learning. IEEE Trans. Mob. Comput. 2025, 25, 6335–6351.
Wang, Z.; Wang, W.; Yang, Y.; Han, Z.; Xu, D.; Su, C. CNN-and GAN-based classification of malicious code families: A code visualization approach. Int. J. Intell. Syst. 2022, 37, 12472–12489.
Guan, Z.; Wang, J.; Wang, X.; Xin, W.; Cui, J.; Jing, X. A comparative study of RNN-based methods for web malicious code detection. In Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS), Chengdu, China, 23–26 April 2021; pp. 769–773.
Alshomrani, M.; Albeshri, A.; Alturki, B.; Alallah, F.S.; Alsulami, A.A. Survey of Transformer-Based Malicious Software Detection Systems. Electronics 2024, 13, 4677.
Yang, H.; Wang, Y.; Zhang, L.; Cheng, X.; Hu, Z. A novel Android malware detection method with API semantics extraction. Comput. Secur. 2024, 137, 103651.
Kamalloo, E.; Zhang, X.; Ogundepo, O.; Thakur, N.; Alfonso-Hermelo, D.; Rezagholizadeh, M.; Lin, J. Evaluating embedding APIs for information retrieval. arXiv 2023, arXiv:2305.06300.
Chen, T.; Zeng, H.; Lv, M.; Zhu, T. CTIMD: Cyber threat intelligence enhanced malware detection using API call sequences with parameters. Comput. Secur. 2024, 136, 103518.
Kouliaridis, V.; Karopoulos, G.; Kambourakis, G. Assessing the effectiveness of llms in android application vulnerability analysis. In Proceedings of the International Conference on Attacks and Defenses for Internet-of-Things, Hangzhou, China, 13–14 December 2024; pp. 139–154.
Cheng, Y.; Shar, L.K.; Zhang, T.; Yang, S.; Dong, C.; Lo, D.; Lv, S.; Shi, Z.; Sun, L. Llm-enhanced static analysis for precise identification of vulnerable oss versions. arXiv 2024, arXiv:2408.07321.
Nam, D.; Macvean, A.; Hellendoorn, V.; Vasilescu, B.; Myers, B. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13.
Hoseini, S.; Burgdorf, A.; Paulus, A.; Meisen, T.; Quix, C.; Pomp, A. Challenges and Opportunities of LLM-Augmented Semantic Model Creation for Dataspaces. In Proceedings of the European Semantic Web Conference, Hersonissos, Greece, 26–30 May 2024; pp. 183–200.
Cloud, A. Alibaba Cloud Malware Detection Based on Behaviors. 2018. Available online: https://tianchi.aliyun.com/getStart/information.htm?raceId=231694 (accessed on 11 November 2018).
Catak, F.O.; Yazı, A.F.; Elezaj, O.; Ahmed, J. Deep learning based Sequential model for malware analysis using Windows exe API Calls. PeerJ Comput. Sci. 2020, 6, e285.
Sun, L.; Wang, Y.; Ren, Y.; Xia, F. Path signature-based xai-enabled network time series classification. Sci. China Inf. Sci. 2024, 67, 170305.
Mourtaji, Y.; Bouhorma, M.; Alghazzawi, D.; Aldabbagh, G.; Alghamdi, A. Hybrid Rule-Based Solution for Phishing URL Detection Using Convolutional Neural Network. Wirel. Commun. Mob. Comput. 2021, 2021, 8241104.
Kouli, M.; Rasoolzadegan, A. A feature-based method for detecting design patterns in source code. Symmetry 2022, 14, 1491.
Bhadra, T.; Mallik, S.; Hasan, N.; Zhao, Z. Comparison of five supervised feature selection algorithms leading to top features and gene signatures from multi-omics data in cancer. BMC Bioinform. 2022, 23, 153.
Rabbani, M.; Wang, Y.; Khoshkangini, R.; Jelodar, H.; Zhao, R.; Bagheri Baba Ahmadi, S.; Ayobi, S. A review on machine learning approaches for network malicious behavior detection in emerging technologies. Entropy 2021, 23, 529.
Yan, A.; Chen, Z.; Zhang, H.; Peng, L.; Yan, Q.; Hassan, M.U.; Zhao, C.; Yang, B. Effective detection of mobile malware behavior based on explainable deep neural network. Neurocomputing 2021, 453, 482–492.
Gonzalez, D.; Zimmermann, T.; Godefroid, P.; Schäfer, M. Anomalicious: Automated detection of anomalous and potentially malicious commits on github. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Madrid, Spain, 25–28 May 2021; pp. 258–267.
Hong, Y.; Li, Q.; Yang, Y.; Shen, M. Graph based encrypted malicious traffic detection with hybrid analysis of multi-view features. Inf. Sci. 2023, 644, 119229. [Google Scholar] [CrossRef]
Chen, Z.; Xu, J.; Peng, T.; Yang, C. Graph convolutional network-based method for fault diagnosis using a hybrid of measurement and prior knowledge. IEEE Trans. Cybern. 2021, 52, 9157–9169. [Google Scholar] [CrossRef] [PubMed]
Liu, R.; Wang, Y.; Guo, Z.; Xu, H.; Qin, Z.; Ma, W.; Zhang, F. TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features. Comput. Netw. 2024, 253, 110707. [Google Scholar] [CrossRef]
Wang, Y.; Shi, Y.; Yang, T.; Wang, W.; Sun, Z.; Zhang, Y. Structural performance warning based on computer intelligent monitoring and fractional-order multi-rate Kalman fusion method. Fractal Fract. 2026, 10, 186. [Google Scholar] [CrossRef]
Chen, X.; Li, C.; Wang, D.; Wen, S.; Zhang, J.; Nepal, S.; Xiang, Y.; Ren, K. Android HIV: A study of repackaging malware for evading machine-learning detection. IEEE Trans. Inf. Forensics Secur. 2019, 15, 987–1001. [Google Scholar] [CrossRef]
Hossain, A.A.; PK, M.K.; Zhang, J.; Amsaad, F. Malicious code detection using llm. In Proceedings of the NAECON 2024—IEEE National Aerospace and Electronics Conference, Fairborn, OH, USA, 15–18 July 2024; pp. 414–416. [Google Scholar]
Deng, Z.; Ma, W.; Han, Q.; Zhou, W.; Zhu, X.; Wen, S.; Xiang, Y. Exploring DeepSeek: A Survey on Advances, Applications, Challenges and Future Directions. IEEE/CAA J. Autom. Sin. 2025, 12, 872–893. [Google Scholar] [CrossRef]
Lu, G.; Ju, X.; Chen, X.; Pei, W.; Cai, Z. GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning. J. Syst. Softw. 2024, 212, 112031. [Google Scholar] [CrossRef]
Zhu, X.; Zhou, W.; Han, Q.L.; Ma, W.; Wen, S.; Xiang, Y. When Software Security Meets Large Language Models: A Survey. IEEE/CAA J. Autom. Sin. 2025, 12, 317–334. [Google Scholar] [CrossRef]
Cheng, W.; Sun, K.; Zhang, X.; Wang, W. Security attacks on llm-based code completion tools. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 23669–23677. [Google Scholar]
Zhong, A.; Mo, D.; Liu, G.; Liu, J.; Lu, Q.; Zhou, Q.; Wu, J.; Li, Q.; Wen, Q. Logparser-llm: Advancing efficient log parsing with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 4559–4570. [Google Scholar]
Zhou, W.; Zhu, X.; Han, Q.L.; Li, L.; Chen, X.; Wen, S.; Xiang, Y. The Security of Using Large Language Models—A Survey with Emphasis on ChatGPT. IEEE/CAA J. Autom. Sin. 2025, 12, 1–26. [Google Scholar] [CrossRef]
Zhan, X.; Carrillo, J.C.; Seymour, W.; Such, J. Malicious LLM-Based Conversational AI Makes Users Reveal Personal Information. arXiv 2025, arXiv:2506.11680. [Google Scholar] [CrossRef]
Chen, J.; Zhong, Q.; Wang, Y.; Ning, K.; Liu, Y.; Xu, Z.; Zhao, Z.; Chen, T.; Zheng, Z. Rmcbench: Benchmarking large language models’ resistance to malicious code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 995–1006. [Google Scholar]
Deng, Z.; Sun, R.; Xue, M.; Ma, W.; Wen, S.; Nepal, S.; Yang, X. Hardening LLM Fine-Tuning: From Differentially Private Data Selection to Trustworthy Model Quantization. IEEE Trans. Inf. Forensics Secur. 2025, 20, 7211–7226. [Google Scholar] [CrossRef]
Dang, D.; Di Troia, F.; Stamp, M. Malware classification using long short-term memory models. arXiv 2021, arXiv:2103.02746. [Google Scholar] [CrossRef]
Yuan, L.; Zeng, Z.; Lu, Y.; Ou, X.; Feng, T. A character-level BiGRU-attention for phishing classification. In Proceedings of the International Conference on Information and Communications Security, Beijing, China, 15–17 December 2019; pp. 746–762. [Google Scholar]
Zhang, Z.; Qi, P.; Wang, W. Dynamic malware analysis with feature engineering and feature learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 1210–1217. [Google Scholar]
Qin, B.; Wang, Y.; Ma, C. API call based ransomware dynamic detection approach using textCNN. In Proceedings of the 2020 International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Virtual, 12–14 June 2020; pp. 162–166. [Google Scholar]
Kolosnjaji, B.; Zarras, A.; Webster, G.; Eckert, C. Deep learning for classification of malware system call sequences. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Hobart, Australia, 30 November–4 December 2016; pp. 137–149. [Google Scholar]
Li, C.; Lv, Q.; Li, N.; Wang, Y.; Sun, D.; Qiao, Y. A novel deep framework for dynamic malware detection based on API sequence intrinsic features. Comput. Secur. 2022, 116, 102686. [Google Scholar] [CrossRef]
Zhang, S.; Wu, J.; Zhang, M.; Yang, W. Dynamic malware analysis based on API sequence semantic fusion. Appl. Sci. 2023, 13, 6526. [Google Scholar] [CrossRef]
Demirkıran, F.; Çayır, A.; Ünal, U.; Dağ, H. An ensemble of pre-trained transformer models for imbalanced multiclass malware classification. Comput. Secur. 2022, 121, 102846. [Google Scholar] [CrossRef]
Trizna, D.; Demetrio, L.; Biggio, B.; Roli, F. Nebula: Self-attention for dynamic malware analysis. IEEE Trans. Inf. Forensics Secur. 2024, 19, 6155–6167. [Google Scholar] [CrossRef]
Xu, Z.; Fang, X.; Yang, G. Malbert: A novel pre-training method for malware detection. Comput. Secur. 2021, 111, 102458. [Google Scholar] [CrossRef]

Figure 1. Overview of the proposed LLM-SGCF network architecture. The framework consists of two main phases: Representation Generation and Representation Learning. In the Representation Generation phase, raw API calls are processed by a Large Language Model to produce explanatory text, which is subsequently encoded by BERT into a sequence-level semantic tensor $S$ of dimension $L \times 768$. During the Representation Learning phase, the Spatially Guided Convolution module reshapes this tensor and employs parallel bottleneck branches, denoted by $H_{1}$, $H_{2}$, and $H_{3}$, with 96 channels to extract deep local features. A spatial attention map $A$ is computed to highlight critical semantic and positional features. These components are fused via element-wise multiplication and refined into the output tensor $Y$. Finally, multi-scale convolutions with varying kernel sizes capture temporal dependencies across different receptive fields. The extracted features are aggregated via adaptive average pooling and concatenated to feed a fully connected layer for malware categorization.
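The last stage named in the Figure 1 caption (multi-scale convolution, average pooling, concatenation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The kernel sizes (3, 4, 5) and the sequence length of 100 follow Tables 1 and 4; the 4-dimensional toy embeddings, the fixed uniform filter weights, the absence of padding, and all function names are assumptions made for illustration. A trained model would learn its filters and operate on 768-dimensional BERT features.

```python
from typing import List, Sequence


def conv1d_valid(seq: Sequence[Sequence[float]],
                 kernel: Sequence[Sequence[float]]) -> List[float]:
    """Slide one filter of shape (k, d) along a (length, d) sequence, no padding."""
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        acc = 0.0
        for offset in range(k):
            acc += sum(x * w for x, w in zip(seq[start + offset], kernel[offset]))
        out.append(acc)
    return out


def global_avg_pool(values: Sequence[float]) -> float:
    """Collapse a variable-length feature map to a single value."""
    return sum(values) / len(values)


def multi_scale_features(seq: Sequence[Sequence[float]],
                         kernel_sizes: Sequence[int] = (3, 4, 5)) -> List[float]:
    """One pooled feature per kernel size, concatenated into a flat list."""
    dim = len(seq[0])
    features = []
    for k in kernel_sizes:
        # Hypothetical fixed filter (uniform averaging); real filters are learned.
        kernel = [[1.0 / (k * dim)] * dim for _ in range(k)]
        features.append(global_avg_pool(conv1d_valid(seq, kernel)))
    return features


# Toy input: 100 positions, 4 dimensions per position (assumed, not 768).
toy = [[float(t)] * 4 for t in range(100)]
features = multi_scale_features(toy)
branch_lengths = [len(toy) - k + 1 for k in (3, 4, 5)]
print(branch_lengths, len(features))
```

Under these assumptions the three branches produce feature maps of different lengths before pooling, and pooling reduces each branch to one value per filter, so the size of the vector passed to the fully connected layer does not depend on the sequence length.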

Figure 2. The distribution of malware categories in the Aliyun dataset.

Figure 3. The distribution of malware categories in the Catak dataset.

Figure 4. Learning curves of the LLM-SGCF framework over 100 training epochs. The parallel stabilization of the training and validation curves indicates robust convergence without overfitting.

Figure 5. Confusion matrix of the LLM-SGCF framework on the Aliyun multi-class dataset. Labels 1–8 correspond to the following categories: 1: Infectious Virus, 2: Mining, 3: Benign, 4: Ransomware, 5: Trojan, 6: DDoS Trojan, 7: Backdoor, and 8: Worm. The high diagonal concentration demonstrates the model’s precision across diverse behavioral patterns.

Figure 6. Two-dimensional marginal parameter sensitivity analysis on accuracy, precision, and recall.

Figure 7. Three-dimensional joint parameter sensitivity analysis for Step Size and Gamma in malware detection.

Figure 8. The impact of the SGC module on model performance on the Aliyun dataset. Left and Right represent binary and multi-class tasks, respectively.

Figure 9. The impact of the SGC module on model performance on the Catak dataset. Left and Right represent binary and multi-class tasks, respectively.

Table 1. Hyperparameter settings for the proposed model.

Parameter Name	Value
Epochs	100
Training/Validation Batch Size	8
Test Batch Size	12
Learning Rate	0.001
Random Seed	42
Optimizer Type	Adam
Weight Decay	$6 \times 10^{-5}$
Loss Function	Cross-Entropy Loss
Learning Rate Scheduler	StepLR (step size = 20, $\gamma = 0.9$)
Embedding Dimension	768
Dropout Rate	0.3
Maximum API Sequence Length	100
Explanation Text Word Limit	300
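
As a worked illustration of the scheduler row in Table 1, the following sketch computes the learning rate that a step decay with step size 20 and decay factor 0.9 would give over 100 epochs, starting from 0.001. It assumes the standard step-decay rule (multiply the rate by the decay factor once every 20 epochs, with the scheduler stepped once per epoch); the function name `step_lr` is hypothetical and the paper does not state how the scheduler was invoked.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 20, gamma: float = 0.9) -> float:
    """Learning rate at a given 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Rates at the start of training, around the first decay, and at the final epoch.
schedule = {epoch: step_lr(0.001, epoch) for epoch in (0, 19, 20, 40, 60, 80, 99)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:3d}: lr = {lr:.7f}")
```

Under this assumption the rate decays four times within 100 epochs, ending at $0.001 \times 0.9^{4} \approx 6.56 \times 10^{-4}$, so the schedule is a gentle decay rather than an aggressive one.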

Table 2. Model performance on different tasks.

Method	Source	Type	Aliyun (Multi-ACC)	Aliyun (Binary-ACC)	Catak (Multi-ACC)
Kolosnjaji [43]	SIP’16	CNN+RNN-based	81.57%	93.38%	45.15%
BiGRU [40]	ICICS’19	RNN-based	81.43%	93.52%	49.65%
TextCNN [42]	ICBAIE’20	CNN-based	83.44%	94.53%	47.96%
CatakNet [16]	PCS’20	RNN-based	82.22%	93.45%	49.09%
ZhangNet [41]	AAAI’20	RNN-based	77.75%	89.85%	40.79%
BiLSTM [39]	arXiv’21	RNN-based	82.65%	93.38%	49.51%
MalBERT [48]	CS’21	Transformer-based	77.83%	89.99%	38.82%
LiNet [44]	CS’22	CNN+RNN-based	79.12%	93.74%	48.10%
Transformer [46]	CS’22	Transformer-based	75.95%	91.07%	37.83%
Mal-ASSF [45]	AS’23	CNN+RNN-based	82.36%	93.81%	48.66%
Nebula [47]	TIFS’24	Transformer-based	77.83%	90.50%	46.13%
Ours	-	CNN-based	84.88%	95.82%	63.15%
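
To make the margins in Table 2 explicit, the short sketch below finds the strongest baseline in each column and the gap to the proposed model in percentage points. The accuracy values are transcribed from Table 2; the variable and function names are illustrative only.

```python
# Accuracy (%) from Table 2: (Aliyun multi-class, Aliyun binary, Catak multi-class).
BASELINES = {
    "Kolosnjaji": (81.57, 93.38, 45.15),
    "BiGRU": (81.43, 93.52, 49.65),
    "TextCNN": (83.44, 94.53, 47.96),
    "CatakNet": (82.22, 93.45, 49.09),
    "ZhangNet": (77.75, 89.85, 40.79),
    "BiLSTM": (82.65, 93.38, 49.51),
    "MalBERT": (77.83, 89.99, 38.82),
    "LiNet": (79.12, 93.74, 48.10),
    "Transformer": (75.95, 91.07, 37.83),
    "Mal-ASSF": (82.36, 93.81, 48.66),
    "Nebula": (77.83, 90.50, 46.13),
}
OURS = (84.88, 95.82, 63.15)
TASKS = ("Aliyun multi-class", "Aliyun binary", "Catak multi-class")


def margins(baselines, ours):
    """Map each task to (best baseline name, gain in percentage points)."""
    result = {}
    for i, task in enumerate(TASKS):
        best_name, best_scores = max(baselines.items(), key=lambda item: item[1][i])
        result[task] = (best_name, round(ours[i] - best_scores[i], 2))
    return result


summary = margins(BASELINES, OURS)
for task, (name, gain) in summary.items():
    print(f"{task}: +{gain:.2f} points over {name}")
```

The gains on the Aliyun tasks are under 1.5 percentage points, while the Catak multi-class gain is an order of magnitude larger, which is where the reported improvement is concentrated.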

Table 3. The impact of different sequence length settings on model performance on the Aliyun dataset.

Sequence Length	Aliyun Binary Precision	Aliyun Binary Recall	Aliyun Binary ACC	Aliyun Multi Precision	Aliyun Multi Recall	Aliyun Multi ACC
50	92.35%	94.46%	93.52%	65.78%	66.19%	83.59%
100	95.02%	96.13%	95.82%	66.23%	67.10%	84.88%
150	94.46%	95.41%	95.38%	65.89%	66.53%	84.03%

Table 4. The impact of different convolution kernel size settings on model performance on the Aliyun dataset.

Grouped Convolution Kernels	Aliyun (Binary) Precision	Aliyun (Binary) Recall	Aliyun (Binary) ACC	Aliyun (Multi) Precision	Aliyun (Multi) Recall	Aliyun (Multi) ACC
1, 3, 5	94.89%	95.16%	95.75%	64.94%	65.54%	84.67%
3, 3, 3	94.41%	96.01%	95.39%	64.22%	66.54%	84.46%
3, 4, 5	95.02%	96.13%	95.82%	66.23%	67.10%	84.88%
3, 5, 7	94.75%	95.23%	95.68%	61.67%	65.12%	84.18%
5, 7, 9	94.39%	96.12%	95.04%	59.00%	63.91%	84.03%
