AD-CapsFPN: An Asymmetric Dilated Convolutional Capsule Network with Feature Pyramid for Malware Classification

Wang, Longcheng; Li, Jin; Song, Yafei; Ren, Yanbing; Xu, Yunfei

doi:10.3390/electronics15112355


by Longcheng Wang, Jin Li, Yafei Song *, Yanbing Ren and Yunfei Xu

School of Air Defense and Antimissile, Air Force Engineering University, Xi’an 710051, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2355; https://doi.org/10.3390/electronics15112355

Submission received: 17 April 2026 / Revised: 20 May 2026 / Accepted: 27 May 2026 / Published: 29 May 2026

(This article belongs to the Special Issue AI in Cybersecurity, 3rd Edition)


Abstract

Existing CNN-based visual malware classification methods are often constrained by an inductive bias mismatch: standard isotropic convolution kernels and global pooling operations neglect the inherent structural anisotropy of malware images, and they struggle with the spatial rearrangement of code blocks caused by obfuscation, which we term the “Malware Picasso Problem”. To overcome these limitations, we propose AD-CapsFPN, an end-to-end framework built on a synergistic “Rectification–Fusion–Inference” mechanism that favors spatial reasoning over texture memorization. The approach rectifies anisotropic inductive biases in the feature extraction stage, dynamically aggregates cross-scale discriminative features in the intermediate layers, injects row-aware spatial biases, and adopts a spatial routing strategy free of global pooling in the classification stage, thereby reconstructing logical associations between obfuscated and scattered code blocks. Experiments on the large-scale Fusion dataset and the obfuscated Androdex dataset show clear performance improvements: our method improves the macro F1-score by 16.22% over the MobileNetV4 baseline on the Fusion dataset (reaching 97.98%) and achieves a 92.45% macro F1-score on the highly challenging Androdex-Set1, outperforming state-of-the-art methods such as MDC-RepNet (88.97%) and TAEfficientNet (88.15%). These results suggest that embedding malware domain priors into architecture design is key to robust malware classification.

Keywords:

malware classification; structural anisotropy; capsule network; feature pyramid network; asymmetric dilated convolution

1. Introduction

Alongside the rapid expansion of the internet and connected cyber-physical systems, cybersecurity threats have become increasingly diverse, including malware attacks and privacy risks in cyberspace [1,2,3]. According to the Kaspersky Security Statistics Report 2025 [4], approximately 500,000 new malicious files were intercepted daily between November 2024 and October 2025, representing a year-on-year increase of about 7%. This finding indicates that malware remains a persistent and rapidly evolving threat in modern cyberspace. Against this backdrop, deep learning methods, capable of autonomously learning high-dimensional features from massive amounts of data, have emerged as the cornerstone for building next-generation malware detection schemes [5]. More broadly, recent deep learning studies suggest that reliable classification models should incorporate task-specific structural priors, robust feature representation, and adaptation to complex or evolving data distributions [6,7,8,9,10].

Among numerous research avenues, malware classification via visualization—which converts malware binaries into images—has become an important direction since its inception by Nataraj et al. [11], primarily due to its ability to directly transfer cutting-edge advancements from computer vision. Malware images are not an independent data modality but an alternative representation of the binary byte stream. This means they can be easily derived from existing conventional static features and dynamic behavior traces. However, existing work suffers from a critical theoretical oversight: these methods generally equate malware images with natural images, directly adopting architectures designed for the latter (e.g., CNNs) while neglecting the inherent structural differences between the two [12,13,14]. Natural images typically exhibit strong local correlation and isotropy. For instance, as shown in Figure 1, pixels within an image of an apple are highly correlated in all directions. In contrast, malware images are formed by folding 1D instruction streams; their representation in the 2D space is essentially a folded mapping of a 1D sequence. This mapping gives rise to a highly anisotropic structure: pixels within the same row represent a continuous instruction stream with strong sequential correlation, whereas the adjacency of pixels between adjacent rows is merely a physical folding effect of fixed image width, leading to a significant reduction in semantic correlation. To address this structural flaw, a recent study by Yu et al. [15] attempted to remedy the semantic truncation caused by line breaks through a complex dual-graph splicing and interleaved coding strategy. This preprocessing-based solution suggests that the row-wise continuity and cross-row discontinuity of malware images are difficult to model using conventional isotropic convolution alone.
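The anisotropy argument above can be made concrete with a small sketch. It folds a byte stream into fixed-width rows and measures how far apart two neighbouring pixels are in the original stream. The helper names (`fold_bytes`, `stream_offset`) and the width of 256 are illustrative assumptions, not part of the paper.

```python
def fold_bytes(data: bytes, width: int) -> list[list[int]]:
    """Fold a 1D byte stream into rows of `width` pixels (last row zero-padded)."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # zero-pad a final partial row
        rows.append(row)
    return rows


def stream_offset(pos_a: tuple[int, int], pos_b: tuple[int, int], width: int) -> int:
    """Distance in the original byte stream between two (row, col) pixels."""
    (ra, ca), (rb, cb) = pos_a, pos_b
    return abs((rb * width + cb) - (ra * width + ca))


WIDTH = 256  # hypothetical image width, chosen only for illustration
image = fold_bytes(bytes(range(256)) * 4, WIDTH)

horizontal_gap = stream_offset((1, 10), (1, 11), WIDTH)  # same row, next column
vertical_gap = stream_offset((1, 10), (2, 10), WIDTH)    # same column, next row
print(horizontal_gap, vertical_gap)
```

Horizontally adjacent pixels are one byte apart in the stream, whereas vertically adjacent pixels are a full image width apart, which is the asymmetry that an isotropic kernel ignores.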

Exacerbating this challenge, malware authors widely adopt polymorphic variant techniques based on obfuscation to evade signature-based detection. Through code block rearrangement, instruction substitution, and register renaming, these techniques cause different samples from the same malware family to exhibit significant byte-level differences yet execute the same malicious logic [16,17,18,19]. When mapped to grayscale images, these transformations may produce a phenomenon analogous to the “Picasso Problem” in computer vision. As illustrated in Figure 2, critical code blocks, such as encryption/decryption loops and API-call sequences, may remain locally recognizable but become displaced, rearranged, or separated across distant image regions, while the malicious functionality is preserved. Similarly to Picasso’s Cubist portraits, where identifiable components are spatially distorted but still form a recognizable whole, malware images after obfuscation require the model to capture both local discriminative patterns and their altered spatial relationships. We define this phenomenon as the “Malware Picasso Problem,” a structure-preserving but layout-distorting challenge involving block displacement, spatial permutation, logical-physical mapping decoupling, and inter-region dependency loss. Rather than being completely independent from spatial permutation invariance, long-range dependency modeling, or obfuscation robustness, this formulation emphasizes their coupled occurrence in malware image classification. However, due to their inherent local receptive fields and the global pooling operations that largely destroy spatial information, traditional CNNs can often only identify the presence of features without reasoning about their spatial combinations, making it difficult to capture long-range dependencies that span across the spatial domain [20].

To address these issues, we propose a novel malware classification framework named AD-CapsFPN. First, the framework uses MobileNetV4 [21] as the backbone to balance performance and efficiency. Into this backbone we integrate an Asymmetric Dilated Convolution (ADC) module, which employs an asymmetric sampling strategy: it maintains compact sampling horizontally to preserve instruction stream integrity while expanding the receptive field vertically to capture cross-row long-range dependencies, thereby overcoming the limitations of isotropic processing in standard kernels. Second, we construct an Adaptive Feature Pyramid Network (AFPN) to fuse the multi-scale features output by the backbone. Through a top-down information pathway, AFPN injects high-level semantics into high-resolution low-level features, akin to providing “detailed annotations” for precise spatial information, and adaptively fuses features from different scales into a unified representation. Finally, we integrate a Malware-aware Capsule Network (MC-Caps) at the output of the AFPN, replacing the traditional global pooling layer and classification head that destroy spatial relationships. With the assistance of our row-aware spatial bias, this design lets the model directly capture the “pose” of features on the spatially rich feature maps output by the AFPN, thereby inferring part-whole spatial hierarchical relationships. This approach aims to alleviate the “Malware Picasso Problem” and makes end-to-end use of spatial information based on the structural characteristics of malware images, with the goal of improving classification performance.

The main contributions of this paper are as follows:

We systematically analyze the structural anisotropy of malware images, and propose an Asymmetric Dilated Convolution (ADC) module to correct the inductive bias mismatch of existing models. The proposed ADC-UIB built on MobileNetV4 improves the model’s perception of long-range dependencies in instruction streams.
We define the “Malware Picasso Problem” caused by malware polymorphic obfuscation, and propose an end-to-end classification framework that preserves spatial information. We introduce a capsule network to replace the global pooling layer in traditional models, and realize spatial reasoning on fused multi-scale feature maps to effectively mitigate the structural information loss problem.
We build and validate an effective malware classification framework AD-CapsFPN, which integrates MobileNetV4, ADC, adaptive feature pyramid network and capsule network. Extensive experiments show that our method achieves state-of-the-art performance on multiple mainstream malware classification benchmarks.

2. Related Work

The evolution of malware detection technology is fundamentally a quest for more efficient feature representation and more robust structural understanding. Early research primarily relied on static feature matching and heuristic-based detection [22,23,24], subsequently evolving into traditional machine learning methods based on manual feature engineering (e.g., Random Forest, SVM, KNN) [25,26,27]. However, these methods rely heavily on expert knowledge and struggle to cope with the explosive growth of malware volume and the increasing complexity of obfuscation techniques [28]. The seminal work by Nataraj et al. [29], which maps malware binaries to grayscale images, represented a significant methodological advance in the field. This cross-domain transfer enabled researchers to leverage advanced achievements in Computer Vision (CV), with Convolutional Neural Networks (CNNs) subsequently becoming the mainstream tool for automatic feature extraction [30,31,32]. Recent studies have introduced more complex architectures. For instance, Belal et al. [33] utilized Vision Transformers (ViT) to capture global dependencies, while Yang et al. [34] combined Graph Convolutional Networks (GCN) to mine bytecode structural associations. Although these methods involve higher parameter counts and computational overheads, they have achieved moderate performance improvements on benchmark datasets.

Despite the superior numerical performance of existing deep learning methods, they are generally built upon an unexamined assumption: malware images share the same spatial isotropy as natural images. This implicit assumption neglects the essential structural anisotropy of malware images: pixels in the same row have strong sequential correlation from continuous instruction streams, while adjacent rows only have weak semantic correlation caused by physical folding of fixed image width. Existing studies [35] have shown that the image width used for malware visualization can significantly affect classification performance, even when conventional CNN-based models are employed. This finding indicates that malware images are sensitive to the way byte sequences are folded into two-dimensional layouts, and that their spatial structure cannot be treated in the same way as natural images. More critically, in pursuit of compact representation and translation invariance, many existing models adopt Global Average Pooling (GAP) or class-token aggregation as the endpoint of feature extraction. Although these operations reduce the dimensionality of feature representations and help control model complexity, they also compress spatially structured feature maps into global descriptors, substantially reducing explicit positional information. Consequently, such models may be less effective in modeling the spatial dislocation and altered inter-region dependencies caused by polymorphic obfuscation, which are key aspects of the proposed “Malware Picasso Problem.”
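The effect of global pooling on rearranged content can be illustrated with a toy example. The sketch below swaps two blocks of a small single-channel feature map and shows that global average pooling returns the same descriptor for both layouts. The map values and the helper `swap_corner_blocks` are hypothetical and serve only to illustrate the argument.

```python
def global_average_pool(feature_map):
    """Collapse an H x W map into a single scalar descriptor."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def swap_corner_blocks(feature_map, block):
    """Return a copy with the top-left and bottom-right block x block regions swapped."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    out = [row[:] for row in feature_map]
    for i in range(block):
        for j in range(block):
            r, c = n_rows - block + i, n_cols - block + j
            out[i][j] = feature_map[r][c]
            out[r][c] = feature_map[i][j]
    return out


original = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
rearranged = swap_corner_blocks(original, 2)

print(original != rearranged)  # the layouts differ
print(global_average_pool(original) == global_average_pool(rearranged))  # the descriptors do not
```

Because the pooled value depends only on which activations are present and not on where they sit, a classifier that sees only this descriptor cannot distinguish the two layouts.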

To address these structural adaptation issues, particularly capturing long-range dependencies spanning thousands of bytes, researchers have attempted to introduce receptive field expansion techniques and multi-scale feature fusion mechanisms. Regarding receptive field expansion, Ding et al. [36] proposed RepLKNet, demonstrating the effectiveness of extremely large convolution kernels. However, its massive computational overhead struggles to meet the efficiency requirements and multi-endpoint deployment needs of malware detection. As an alternative, Dilated Convolution [37] was introduced to efficiently expand the receptive field. However, standard dilated convolution is spatially symmetric, applying the same dilation rate horizontally and vertically. This design is ill-suited for malware images because horizontal skip-sampling disrupts the continuity of the instruction stream. Since each row corresponds to continuous bytecode containing instruction opcodes and operands, horizontal sparse sampling leads to fragmented instruction semantics and the loss of critical local feature fingerprints. Furthermore, isotropic processing ignores physical dimensional differences. Standard methods treat horizontal and vertical neighborhoods equally, whereas vertical adjacency in malware images implies a folded truncation of the binary stream. Currently, there are limited efforts to improve dilated convolution specifically for malware; what is needed is an asymmetric mechanism that preserves continuity horizontally and extends the receptive field vertically, rather than simple uniform dilation. In terms of feature fusion, although Feature Pyramid Networks (FPN) [38] can effectively address the challenges of scattered malware features and cross-scale distribution, their application in classification tasks faces the problem of spatial information loss in the backend. 
If the frontend utilizes FPN to recover multi-scale spatial details but the backend classifier still employs a global pooling layer, the recovery of spatial information may not be fully exploited for final classification.

Facing these challenges, Capsule Network [39], with its ability to explicitly model part-whole spatial hierarchical relationships between entities, theoretically presents an ideal architecture for structured malware classification. However, existing applications have not fully exploited their potential and even exhibit inconsistent architectural designs. For instance, Wang et al. [40] and Shelar et al. [41] retained global pooling operations before or after the capsule layers, which may weaken the spatial routing advantage of capsule networks and reduce their ability to model part-whole relationships. On the other hand, improvements to capsule networks in the CV domain (e.g., adaptive bias, residual routing) [42,43] are all based on the isotropy assumption of natural images. Directly transferring them to the malware domain causes routing algorithms to overemphasize inter-row physical adjacency while underrepresenting intra-row logical continuity. Furthermore, if the model architecture fully adopts capsule networks, its computational complexity and model parameter count will increase significantly, making it difficult to meet the efficiency requirements for real-time performance and multi-endpoint deployment in malware detection. Therefore, it needs to cooperate synergistically with efficient front-end modules [44].

In summary, the core contradiction currently facing the malware classification field is twofold: traditional CNN/ViT architectures tend to lose critical spatial structural information, while existing capsule network applications not only fail to effectively model malware’s anisotropic features due to the isotropy assumption and flawed pooling-retaining designs, but suffer from excessive computational complexity that hinders practical deployment. This situation necessitates a novel end-to-end framework that integrates row-aware spatial modeling mechanisms, multi-scale feature fusion, complete pooling elimination, and an efficient front-end backbone.

3. Methodology

To solve the structural anisotropy and “Malware Picasso Problem” in malware image classification, we propose an end-to-end framework AD-CapsFPN, as shown in Figure 3. The framework follows a three-stage synergistic pipeline: structure-aware feature extraction, multi-scale semantic fusion, and spatial hierarchical reasoning, which are implemented by three core modules: ADC-UIB, AFPN, and MC-Caps, respectively.

3.1. Asymmetric Dilated Convolutional UIB Module

As a backbone network, MobileNetV4 is lightweight, efficient, and suitable for multi-endpoint deployment. Its core innovation, the Universal Inverted Bottleneck (UIB) module, adopts depthwise separable convolutions determined via Neural Architecture Search (NAS) [45]. The model structure, optimized on massive natural image data, lays the foundation for its superior performance across multiple public natural image datasets. The structure of the UIB module is shown in Figure 4, where activation and normalization layers between pointwise and depthwise convolution layers are omitted for brevity. However, directly transferring this architecture to malware classification faces a significant inductive bias mismatch.

Limited Receptive Field: Standard UIBs employ fixed small convolution kernels (3 × 3 or 5 × 5), capable of capturing only local byte patterns but failing to cover cross-scale dependencies in malware, ranging from basic instructions to cross-function calls.

Isotropy Defect: Its symmetric receptive field assumes spatial information is equivalent in all directions, failing to adapt to the anisotropic structure of malware images, which are continuous within rows (instruction streams) but discrete between rows (physical truncation).

Noise Sensitivity: The UIB module treats all feature channels with equal weight, lacking noise suppression mechanisms for obfuscated code (e.g., code padded with junk bytes). This deficiency may cause noise to be amplified during the subsequent feature pyramid fusion process, thereby compromising the precision of spatial reasoning in the final capsule network.

Therefore, to address these issues while inheriting the efficiency of MobileNetV4, we designed the Asymmetric Dilated Convolutional UIB (ADC-UIB). This module enhances the model’s perception of malware spatial structures in a lightweight manner. Its core lies in enabling precise perception through “Asymmetric Dilated Sampling + Multi-Scale Parallel Branches + Adaptive Gating,” as shown in Figure 5.

The design objective of ADC-UIB is explicitly defined as maintaining continuity in the row direction while expanding the receptive field in the column direction. By capturing local, mid-range, and long-range dependencies through parallel branches to achieve multi-scale coverage, and finally adaptively fusing features from different channels, it reinforces discriminative features while suppressing noise. We detail each component below.

First is the Asymmetric Dilated Convolution. For an input feature map X ∈ R^(B × C × H × W) (where B is batch size, C is channel count, and H, W are height and width), the calculation is defined as shown in Equation (1):

\mathrm{ADC}(X; k, d_{h}, d_{w}, p) = \mathrm{Conv2d}(X, k, \mathrm{stride} = 1, \mathrm{dilation} = (d_{h}, d_{w}), \mathrm{padding} = p)

(1)

Here, k = 3 or 5, keeping the kernel size unchanged, and d_{w} = 1, meaning no dilation in the row direction to maintain continuity. This implies we maintain standard compact sampling within rows to ensure the complete capture of basic instruction semantics formed by continuous bytes, mitigating the fragmentation of instruction stream information caused by skip-sampling. In contrast, the vertical dilation rate is set as d_{h} \in \{2, 5, 9\} in the three parallel branches, corresponding to local, mid-range, and long-range vertical contexts. This multi-branch setting is motivated by the structural prior that malware images exhibit strong horizontal continuity but weaker vertical adjacency under fixed-width folding. Specifically, horizontal dilation is avoided to preserve row-wise byte continuity, whereas vertical dilation is introduced to expand the receptive field along the folded sequence direction. The selected dilation rates provide progressively enlarged vertical receptive fields while keeping the horizontal receptive field compact. The influence of different vertical dilation configurations is further evaluated in the sensitivity analysis in Section 5.2.3. Padding p = (p_{h}, p_{w}) is set to match the output resolution with the input, as shown in Equations (2) and (3), to avoid structural information loss.

p_{h} = \left\lfloor \frac{(k - 1) \cdot d_{h}}{2} \right\rfloor

(2)

p_{w} = \left\lfloor \frac{(k - 1) \cdot d_{w}}{2} \right\rfloor

(3)
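As a minimal sketch of Equations (1) to (3), the following pure-Python routine applies a single-channel, stride-1 dilated convolution (in the cross-correlation form used by deep learning libraries) with separate vertical and horizontal dilation rates and the padding rule above. The function name `adc2d`, the single-channel simplification, and the toy input are assumptions made for illustration; the paper's module operates on multi-channel tensors inside the UIB block.

```python
def adc2d(x, kernel, d_h, d_w=1):
    """Single-channel asymmetric dilated convolution with zero padding.

    x: H x W list of numbers; kernel: k x k list of weights.
    Padding follows p = floor((k - 1) * d / 2) per axis, so the output is H x W.
    """
    k = len(kernel)
    h, w = len(x), len(x[0])
    p_h = ((k - 1) * d_h) // 2
    p_w = ((k - 1) * d_w) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    r = i - p_h + u * d_h  # vertical taps are d_h rows apart
                    c = j - p_w + v * d_w  # horizontal taps stay adjacent when d_w = 1
                    if 0 <= r < h and 0 <= c < w:  # positions outside the map are zero padding
                        acc += kernel[u][v] * x[r][c]
            out[i][j] = acc
    return out


# Toy input whose value encodes its position: x[r][c] = 10 * r + c.
x = [[10 * r + c for c in range(4)] for r in range(6)]
up_tap = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]    # reads the top-centre tap only
left_tap = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]  # reads the middle-left tap only

from_above = adc2d(x, up_tap, d_h=2)
from_left = adc2d(x, left_tap, d_h=2)
print(from_above[2][1], from_left[2][1])
```

With d_{h} = 2, the top tap reads the pixel two rows up, while the left tap still reads the immediately adjacent pixel in the same row, so row-wise continuity is untouched.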

To provide multi-range contextual coverage for malware images, we design three parallel ADC branches. The dilation rates, receptive fields (RF), and their correspondence to malware structures for each branch are presented in Table 1. We embed multi-scale parallel perception into the building blocks of the backbone, maintaining sensitivity to fine-grained structures during feature extraction. Notably, the vertical RF reaches up to 19 (more than six times that of the original UIB), efficiently capturing function call relations spanning hundreds of pixels, while the horizontal RF remains at 3, preserving the continuity of instruction sequences.
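The receptive-field figures quoted above follow from the standard extent of a dilated kernel, (k - 1) · d + 1. The short check below assumes k = 3 and the three vertical dilation rates from the text, and uses a hypothetical input height of 64 to confirm that the padding rule preserves resolution.

```python
def receptive_field(k, d):
    """Extent covered along one axis by a single k-tap kernel with dilation d."""
    return (k - 1) * d + 1


def same_padding(k, d):
    """Padding rule of Equations (2) and (3)."""
    return ((k - 1) * d) // 2


def output_size(size, k, d, p, stride=1):
    """Standard convolution output-size formula along one axis."""
    return (size + 2 * p - d * (k - 1) - 1) // stride + 1


K = 3
VERTICAL_DILATIONS = (2, 5, 9)
HEIGHT = 64  # hypothetical feature-map height

summary = []
for d_h in VERTICAL_DILATIONS:
    p_h = same_padding(K, d_h)
    summary.append({
        "d_h": d_h,
        "p_h": p_h,
        "rf_vertical": receptive_field(K, d_h),
        "rf_horizontal": receptive_field(K, 1),
        "out_height": output_size(HEIGHT, K, d_h, p_h),
    })

for row in summary:
    print(row)
```

Under these assumptions the three branches cover 5, 11, and 19 rows respectively, while the horizontal extent stays at 3.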

Subsequently, the output features of the three branches are first concatenated along the channel dimension to obtain F_{cat}, and then compressed to fuse characteristics, as shown in Equation (4), where BN denotes Batch Normalization, followed by a ReLU6 activation function to stabilize training. This ensures the output dimension is consistent with the original UIB, maintaining compatibility with the MobileNetV4 data flow.

F_{fus} = \mathrm{BN}(\mathrm{Conv2d}(F_{cat}, C, 1))

(4)

To dynamically screen discriminative features of malware images and filter out noise, we introduce channel-wise adaptive gating inspired by the SE (Squeeze-and-Excitation) module [46]. First, global information is compressed to obtain the channel statistic vector Z_{C}, as shown in Equation (5).

Z_{C} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} F_{fus}(B, C, i, j), \quad Z \in \mathbb{R}^{B \times C \times 1 \times 1}

(5)

Then, channel weights S are learned through two pointwise convolution layers to highlight effective features and suppress noise channels, as shown in Equation (6), where r is the reduction ratio (set to 8) used to balance computation and representational capacity, and σ is the Sigmoid function.

S = σ(\mathrm{Conv2d}(\mathrm{ReLU6}(\mathrm{Conv2d}(Z, \frac{C}{r}, 1)), C, 1))

(6)

Finally, the weights S are multiplied element-wise with F_{fus} across channels to obtain the final output F_{out}, as shown in Equation (7):

F_{out}(B, C, i, j) = F_{fus}(B, C, i, j) \times S(B, C)

(7)
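The gating in Equations (5) to (7) can be sketched for a single sample in plain Python. On a 1 × 1 spatial map a pointwise convolution reduces to a matrix-vector product, which is how it is written here. The function `channel_gate`, the omission of bias terms, and the toy weights with C = 2 and r = 2 are assumptions for illustration; the paper sets r = 8 and learns the weights.

```python
import math


def relu6(value):
    return min(max(value, 0.0), 6.0)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(weight, vec):
    """Pointwise convolution on a 1 x 1 map, written as a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def channel_gate(f_fus, w_reduce, w_expand):
    """f_fus: C channels of H x W; w_reduce: (C/r) x C; w_expand: C x (C/r). Biases omitted."""
    # Equation (5): squeeze each channel to its spatial mean.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in f_fus]
    # Equation (6): reduce, ReLU6, expand, sigmoid.
    hidden = [relu6(h) for h in matvec(w_reduce, z)]
    s = [sigmoid(v) for v in matvec(w_expand, hidden)]
    # Equation (7): rescale every position of channel c by s[c].
    f_out = [[[v * s_c for v in row] for row in ch] for ch, s_c in zip(f_fus, s)]
    return f_out, s


f_fus = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0
    [[4.0, 4.0], [4.0, 4.0]],  # channel 1
]
w_reduce = [[1.0, 0.0]]        # hypothetical weights, C = 2 -> C/r = 1
w_expand = [[2.0], [-2.0]]     # hypothetical weights, C/r = 1 -> C = 2

f_out, s = channel_gate(f_fus, w_reduce, w_expand)
print(s)
```

In this toy setting channel 0 receives a gate above 0.5 and channel 1 a gate below 0.5, so the second channel is attenuated in the output.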

In the MobileNetV4 backbone, we do not replace all UIB blocks. Instead, we only replace the depthwise separable convolution layers in the UIB blocks within the last 60% of the deep network stages with the ADC-UIB modules we designed. This is because shallow network layers (the first 40%) are primarily responsible for learning low-level features such as edges, corners, and textures. In malware images, this corresponds to basic byte-level patterns, for which the original efficient UIB blocks are sufficient. As the network deepens, the semantic information of feature maps becomes richer, representing higher-level, more abstract concepts. For malware, this equates to recognizing entire functions, algorithmic structures (e.g., encryption routines), and control flow logic. It is in these stages that the directional awareness and multi-scale capabilities of ADC-UIB are most effective, enabling the interpretation of complex structural relationships from abstracted features. This “front-light, back-heavy” hybrid design intelligently allocates computational resources to the network stages that most require structured analysis, ensuring that the deep feature maps input to the subsequent Adaptive Feature Pyramid Network (AFPN) are already enriched with anisotropy-corrected multi-scale spatial information. Based on this, the next subsection will introduce how to integrate these rectified features distributed across different semantic levels into a unified representation that possesses both local details and a global perspective, addressing the challenge of dispersed malware discriminative information.

3.2. Adaptive Feature Pyramid Network

Following the processing by the ADC-UIB module, we extract anisotropy-rectified multi-scale features from the deep layers of the backbone network. However, discriminative features of malware—ranging from minute API call sequences (local details) to large-scale code block structures (global structural features)—are often scattered across different scales and locations in the image. Traditional FPNs typically take the shallowest features as input. Yet, the raw byte-level pixel details of malware images contain substantial invalid padding bytes and noise inserted via obfuscation. Crucially, the core of the downstream Capsule Network lies in inferring part-whole structural relationships (e.g., Instruction Block → Function → Malicious Behavior), which requires structured component features rather than pixel-level noise. If the Capsule Network erroneously learns low-level textures or even noise as components, structural inference accuracy will be severely compromised. Furthermore, traditional multi-scale feature fusion often employs simple concatenation or equal-weight summation. As previously mentioned, discriminative features of malware exhibit a non-uniform cross-scale distribution. Standard fusion methods dilute critical features and ignore the structural disparities among different malware families. To address these challenges, we designed the Adaptive Feature Pyramid Network (AFPN), the structure of which is shown in Figure 6.

We first select feature maps from three distinct stages of the modified MobileNetV4 backbone (excluding the shallow layers), denoted as {C_{2}, C_{3}, C_{4}}. These feature maps possess varying spatial resolutions and semantic levels, collectively covering the “local-mid-long” range dependencies of malware. We explicitly discard C_{1} (the feature map from the first 40% of the backbone network), as it is redundant and its shallow features are incompatible with the downstream Capsule Network. Moreover, the receptive field of the C_{1} layer is extremely small, primarily capturing relationships between adjacent pixels. This hardly aids in understanding the semantics of instructions composed of multiple bytes and is prone to introducing high-frequency noise. Through a top-down pathway and lateral connections, we progressively upsample the deep, strong-semantic feature map (C_{4}) and element-wise add it to the shallow feature map (C_{3}), which has undergone dimensionality reduction via pointwise convolution. This pointwise convolution ensures that the channel count of the shallow features matches that of the upsampled deep features, reducing feature conflicts caused by dimension mismatch, as shown in Equation (8). This process generates a new set of pyramid feature maps {P_{2}, P_{3}, P_{4}} that are rich in semantic information across all scales, with channel counts unified to C_{FPN}.

P_{i} = \mathrm{Conv2d}(C_{i}, C_{FPN}) + \mathrm{Upsample}(P_{i + 1})

(8)
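A minimal sketch of the top-down recursion in Equation (8) is given below for single-channel maps. Several details are assumptions because the text does not specify them: the top level is taken as the lateral output alone, upsampling is nearest-neighbour by a factor of 2, each level has half the resolution of the one below it, and the lateral pointwise convolution is assumed to have been applied already. The names `upsample_nearest_2x` and `top_down` are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Repeat every pixel twice along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def add_maps(a, b):
    """Element-wise sum of two maps with identical shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down(laterals):
    """laterals: dict level -> map after the lateral pointwise convolution.

    Assumes the top level has no coarser input, and P_i = lateral_i + Upsample(P_{i+1}) below it.
    """
    top = max(laterals)
    pyramid = {top: laterals[top]}
    for level in range(top - 1, min(laterals) - 1, -1):
        pyramid[level] = add_maps(laterals[level], upsample_nearest_2x(pyramid[level + 1]))
    return pyramid


# Toy laterals: each level carries a distinct constant so the sums are easy to trace.
laterals = {
    4: [[100]],
    3: [[10] * 2 for _ in range(2)],
    2: [[1] * 4 for _ in range(4)],
}
pyramid = top_down(laterals)
print(pyramid[3][0], pyramid[2][0])
```

Each finer level ends up carrying its own lateral signal plus everything propagated from the coarser levels, which is the semantic injection the pathway is meant to provide.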

Standard classification models typically use only the top-level feature of the pyramid, leading to the loss of valuable information from other scales. Conversely, simply concatenating or summing all pyramid layers ignores the fact that features at different scales may possess varying degrees of importance for specific malware samples. For instance, for a Trojan relying on specific API call sequences, the high-resolution P_{2} layer may contribute most, whereas for ransomware with a distinct packed structure, the low-resolution yet semantically rich P_{4} layer may be more critical. To allow the model to dynamically assess and fuse features from different scales based on the input data, we designed an Adaptive Weighted Fusion Module. This module accepts all pyramid layers output by the FPN as input, following the process below:

First, to resolve dimension mismatch, we upsample P_{3} and P_{4} via Bilinear Interpolation to the same spatial dimensions (H_{2}, W_{2}) as the highest-resolution P_{2}, obtaining a set of size-aligned feature maps, as shown in Equation (9), where i = 2, 3, 4 and P_{i}' represents the size-aligned feature of the i-th scale.

P_{i}' = \mathrm{Interpolate}(P_{i}, \mathrm{size} = (H_{2}, W_{2}), \mathrm{mode} = \mathrm{bilinear})

(9)
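The size alignment in Equation (9) can be sketched with a small pure-Python bilinear resampler for one channel. The half-pixel sampling convention used here, with source coordinates clamped at the border, is an assumption; the paper does not state which convention it uses. The names `_source_index` and `interpolate_bilinear` are hypothetical.

```python
import math


def _source_index(dst, in_size, out_size):
    """Map an output index to two neighbouring input indices and a blend factor."""
    src = (dst + 0.5) * in_size / out_size - 0.5  # half-pixel convention (assumed)
    src = max(src, 0.0)
    lo = min(int(math.floor(src)), in_size - 1)
    hi = min(lo + 1, in_size - 1)
    return lo, hi, src - lo


def interpolate_bilinear(fmap, size):
    """Resize a single-channel H x W map to `size` = (out_h, out_w)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out_h, out_w = size
    out = []
    for i in range(out_h):
        r0, r1, fr = _source_index(i, in_h, out_h)
        row = []
        for j in range(out_w):
            c0, c1, fc = _source_index(j, in_w, out_w)
            top = fmap[r0][c0] * (1 - fc) + fmap[r0][c1] * fc
            bottom = fmap[r1][c0] * (1 - fc) + fmap[r1][c1] * fc
            row.append(top * (1 - fr) + bottom * fr)
        out.append(row)
    return out


p4 = [[0.0, 1.0]]                      # toy low-resolution map
p4_aligned = interpolate_bilinear(p4, (1, 4))
constant = interpolate_bilinear([[3.0]], (2, 2))
print(p4_aligned, constant)
```

A constant map stays constant after resizing, and values between two source pixels are blended linearly, so the aligned maps can be combined position by position with P_{2}.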

Next, we introduce a learnable scale weight vector ω = [ω_{2}, ω_{3}, ω_{4}], where each element corresponds to the importance of a pyramid layer and is initialized to equal weights. To ensure the weights sum to 1 and remain positive, we normalize them using Equation (10), where α_{i} is the normalized weight for the i-th pyramid layer. During training, the model automatically learns the optimal weight distribution via backpropagation, thereby achieving adaptive selection of multi-scale features.

α_{i} = \frac{\exp(ω_{i})}{\sum_{j = 2}^{4} \exp(ω_{j})}

(10)

Then, we scale each upsampled feature map by its normalized weight and concatenate the results along the channel dimension, as shown in Equation (11).

F_{cat} = \mathrm{Concat}(α_{2} \cdot P_{2}', α_{3} \cdot P_{3}', α_{4} \cdot P_{4}')

(11)
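Equations (10) and (11) amount to a softmax over three scalars followed by a weighted channel concatenation. The sketch below shows this for toy size-aligned maps with one channel per scale. The function names and the example values of ω are illustrative assumptions; in the paper ω is learned by backpropagation.

```python
import math


def softmax(weights):
    """Equation (10), computed with the usual max-subtraction for numerical stability."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_concat(aligned_scales, omega):
    """Equation (11): scale every channel of scale i by alpha_i, then stack all channels.

    aligned_scales: list of scales, each a list of channels (H x W lists) of equal size.
    """
    alphas = softmax(omega)
    f_cat = []
    for alpha, channels in zip(alphas, aligned_scales):
        for ch in channels:
            f_cat.append([[alpha * v for v in row] for row in ch])
    return f_cat, alphas


ones = [[1.0, 1.0], [1.0, 1.0]]
aligned_scales = [[ones], [ones], [ones]]  # P2', P3', P4' with one channel each

_, equal_alphas = weighted_concat(aligned_scales, [0.0, 0.0, 0.0])         # equal initialization
f_cat, alphas = weighted_concat(aligned_scales, [math.log(2.0), 0.0, 0.0])  # hypothetical learned values
print(equal_alphas, alphas, len(f_cat))
```

With equal initialization every scale contributes one third. Raising ω_{2} shifts weight toward the highest-resolution map while the weights still sum to 1.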

Finally, to fuse the concatenated high-dimensional features into a more compact representation, we process them using a pointwise convolution layer to generate the final fused feature map, as shown in Equation (12). Here, C_{out} is the output channel count after fusion, and the pointwise convolution serves to integrate cross-scale channel information and eliminate channel redundancy.

F_{fusion} = \mathrm{ReLU6}(\mathrm{BN}(\mathrm{Conv2d}(F_{cat}, C_{out}, 1)))

(12)
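The pointwise fusion in Equation (12) mixes channels independently at every spatial position. The sketch below shows this channel mixing followed by ReLU6. Batch Normalization is treated as the identity and biases are omitted, both simplifications made only for illustration, and the weight matrix is hypothetical.

```python
def pointwise_conv(channels, weight):
    """1 x 1 convolution: channels is C_in maps of H x W, weight is C_out x C_in. Bias omitted."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [[sum(wc * ch[i][j] for wc, ch in zip(w_row, channels)) for j in range(w)]
         for i in range(h)]
        for w_row in weight
    ]


def relu6_map(channels):
    """Clamp every activation to the range [0, 6]."""
    return [[[min(max(v, 0.0), 6.0) for v in row] for row in ch] for ch in channels]


# Toy F_cat with three channels of constant value 1, 2, and 4.
f_cat = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
weight = [
    [1.0, 1.0, 1.0],   # sums all scales: 7 before activation
    [1.0, -1.0, 0.0],  # difference of the first two: -1 before activation
]

f_fusion = relu6_map(pointwise_conv(f_cat, weight))  # BN assumed to be the identity here
print(f_fusion[0][0][0], f_fusion[1][0][0])
```

The three input channels are compressed to two output channels, and ReLU6 bounds the activations, clamping 7 to 6 and -1 to 0 in this example.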

F_{fusion} not only synthesizes information from all scales but, through the adaptive weighting mechanism, highlights the feature scales most critical for the current classification task. The AFPN acts as a bridge, taking the anisotropy-corrected spatial feature maps unique to malware images from the ADC-UIB module as input, and outputting feature maps enriched with precise spatial layouts and powerful cross-scale semantic annotations. Thus, the next step is to design a classifier capable of directly assembling this structured information and performing advanced spatial relationship reasoning, ultimately solving the “Malware Picasso Problem” in malware image classification.

3.3. Malware-Aware Capsule Network

To avoid the irreversible spatial information loss caused by global average pooling (GAP) in traditional CNNs, and to realize end-to-end spatial relationship modeling for the “Malware Picasso Problem”, we design a Malware-Aware Capsule Network (MC-Caps) customized for malware images. It consists of a Malware-Aware Primary Capsule (MC-AC) layer and a Class Capsule layer with dynamic routing, as shown in Figure 7.

3.3.1. Malware-Aware Primary Capsule Layer

The task of the primary capsule layer is to convert the convolutional features from F_{fusion} into a set of primary capsules that represent basic components. Standard primary capsule layers face several major issues when processing malware images: (1) Locality Limitation: A single convolutional path struggles to capture the complex multi-scale dependencies in malware, ranging from local instructions to global logic. (2) Structure Agnosticism: The indiscriminate processing of spatial positions fundamentally conflicts with the anisotropic structure of malware images, easily leading to erroneous associations of unrelated features between adjacent rows of malware images. (3) Lack of Prior Guidance: Relying solely on data to learn spatial relationships results in slow convergence and vulnerability to noise interference. To address these issues and enable a deeper understanding of the unique structure of malware, we design the Malware-Aware Primary Capsule Layer (MC-AC).

To balance the model’s basic feature extraction capability with malware-specific targeting, MC-AC employs a dual-path parallel processing strategy for the input feature maps.

The main path aligns with conventional capsule networks, extracting generic local features through a standard convolutional layer. This preserves the robust feature representation capability of traditional CNNs, serving as the cornerstone of model stability, as shown in Equation (13), where N is the number of primary capsules and D is the capsule dimension.

F_{pri} = \mathrm{BN}(\mathrm{Conv2d}(F_{fusion}, N \times D, 3, \mathrm{stride} = 2, \mathrm{padding} = 1))

(13)

The Adaptive Path is a multi-scale adaptive network similar in structure to the ADC-UIB in Section 3.1. It utilizes parallel asymmetric dilated convolutions to target and extract fine-grained component features with varying distance dependencies from the fused multi-scale features, specifically for capsule assembly, as shown in Equation (14).

F_{ada} = \mathrm{Gate}(\mathrm{Fusion}(F_{local}, F_{medium}, F_{long}))

(14)

Here, Gate denotes the adaptive gating mechanism, which learns channel weights from global statistics, as shown in Equation (15), and Fusion represents a pointwise convolution for channel fusion. The formulas for F_{local}, F_{medium}, and F_{long} follow Equation (1) in Section 3.1.

\mathrm{Gate}(F) = σ(\mathrm{Conv2d}(\mathrm{ReLU6}(\mathrm{Conv2d}(\mathrm{GAP}(F), \frac{N \times D}{8}, 1)), N \times D, 1)) \otimes F

(15)
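The gating step in Equation (15) can be sketched in plain Python as follows: global average pooling produces one statistic per channel, two pointwise convolutions (here reduced to matrix-vector products on the pooled vector) squeeze and re-expand the channels, and a sigmoid yields per-channel weights that rescale the input. The helper names, the omission of convolution biases, and the toy weights are assumptions made for illustration only.

```python
import math

def relu6(x):
    return min(max(x, 0.0), 6.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weight, vec):
    """A 1x1 convolution applied to a pooled vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def gate(feature, w_reduce, w_expand):
    """feature: [channel][row][col]; returns the channel weights and the gated feature."""
    # Global average pooling: one statistic per channel.
    pooled = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feature]
    hidden = [relu6(h) for h in matvec(w_reduce, pooled)]      # squeeze to C/8 channels
    weights = [sigmoid(z) for z in matvec(w_expand, hidden)]   # expand back to C channels
    gated = [[[w * v for v in row] for row in ch] for w, ch in zip(weights, feature)]
    return weights, gated

# Toy input with C = 8 channels of size 1x1, squeezed to C/8 = 1 channel (hypothetical weights).
feature = [[[float(c)]] for c in range(8)]
w_reduce = [[0.1] * 8]
w_expand = [[0.5] for _ in range(8)]
weights, gated = gate(feature, w_reduce, w_expand)
```

Because the sigmoid output lies strictly between 0 and 1, the gate can only attenuate channels relative to one another; it never changes the sign of a feature.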

For these two paths, we balance their contributions using learnable weights, as shown in Equation (16). Here, ω = [ω_{2}, ω_{3}] is a learnable weight vector whose elements correspond to the contributions of the main path and the adaptive path, respectively. It is initialized to [0.7, 0.3] and adjusted during training to adapt to the feature distribution of malware images.

F_{cap} = \mathrm{softmax}(ω_{2}) \cdot F_{pri} + \mathrm{softmax}(ω_{3}) \cdot F_{ada}

(16)
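A softmax applied to a single scalar is identically 1, so the sketch below assumes that the softmax in Equation (16) is taken jointly over the two-element vector and that each path receives the corresponding component. Under that assumed reading, the stated initialization [0.7, 0.3] corresponds to effective mixing weights of roughly 0.599 and 0.401 rather than 0.7 and 0.3. The function names and toy feature values are hypothetical.

```python
import math

def path_weights(omega):
    """Joint softmax over the two path logits (an assumed reading of Equation (16))."""
    m = max(omega)
    exps = [math.exp(w - m) for w in omega]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_paths(f_pri, f_ada, omega):
    """Blend main-path and adaptive-path features given as flat lists of equal length."""
    w_pri, w_ada = path_weights(omega)
    return [w_pri * p + w_ada * a for p, a in zip(f_pri, f_ada)]

w_pri, w_ada = path_weights([0.7, 0.3])                   # initialization stated in the text
f_cap = fuse_paths([1.0, 2.0], [3.0, 4.0], [0.7, 0.3])    # toy features (hypothetical)
```

The two weights always sum to 1, so the fused feature is a convex combination of the two paths regardless of how the logits evolve during training.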

Furthermore, since standard primary capsule layers treat spatial positions indiscriminately, they tend to incorrectly associate unrelated features between adjacent rows in malware images (e.g., padding bytes in different rows). To address this issue, we design a Row-Aware Asymmetric Spatial Bias. Using non-learnable structured guidance, this mechanism ensures that capsules in the same row (corresponding to continuous instructions) possess similar biases, while capsules in different rows (corresponding to discrete code segments) have distinct biases. The bias initialization formula is shown in Equation (17), where b_{n,h,w,d} represents the bias value for the n-th primary capsule at spatial position (h, w) and dimension d, and H is the height of F_{fusion}. The sine function generates a periodic bias pattern: within a row (fixed h), the bias does not depend on the column w and therefore remains constant, reinforcing intra-row continuity; across rows (varying h), the bias fluctuates periodically, simulating the discrete inter-row structural properties of malware images. The coefficient 0.1 serves as a bias amplitude factor to prevent excessive bias from interfering with the features themselves.

b_{n,h,w,d} = 0.1 \times \sin\left(\frac{(n + 1) \times h \times π}{H} + 0.1 \times d\right)

(17)
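The following sketch evaluates Equation (17) on a toy grid to make the row-wise behaviour concrete. It assumes zero-based indices for n, h, and d, and the tensor sizes are arbitrary illustrative choices.

```python
import math

def row_aware_bias(num_caps, height, width, dim, amplitude=0.1):
    """Return b[n][h][w][d] following Equation (17); the value does not depend on w."""
    return [[[[amplitude * math.sin((n + 1) * h * math.pi / height + 0.1 * d)
               for d in range(dim)]
              for _ in range(width)]
             for h in range(height)]
            for n in range(num_caps)]

# Toy sizes (assumed): 2 capsule types on a 4x3 grid with 2-dimensional capsules.
bias = row_aware_bias(num_caps=2, height=4, width=3, dim=2)

same_row_left = bias[0][1][0]    # row 1, first column
same_row_right = bias[0][1][2]   # row 1, last column
next_row = bias[0][2][0]         # row 2, first column
```

Positions in the same row share an identical bias vector, whereas moving to another row changes it, and every entry is bounded in magnitude by the amplitude factor 0.1.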

Finally, we normalize the fused capsule features using the Squash activation function characteristic of capsule networks (where the vector length represents the confidence that a component exists, and the direction represents the component feature pattern), as shown in Equation (18). The Squash function is defined in Equation (19), where ϵ = 10^{-8} is introduced to avoid numerical instability.

v_{pri} = \mathrm{squash}(F_{cap} + b, \mathrm{dim} = -1)

(18)

\mathrm{squash}(x) = \left(\frac{‖x‖^{2}}{1 + ‖x‖^{2}}\right) \cdot \left(\frac{x}{‖x‖ + ϵ}\right)

(19)
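A short sketch of the Squash function in Equation (19) for a single capsule vector is given below. It illustrates the two properties the text relies on: the output length stays below 1, and the direction of the input is preserved. The example vectors are arbitrary.

```python
import math

def squash(x, eps=1e-8):
    """Squash a capsule vector (Equation (19)): keep direction, map length into [0, 1)."""
    norm_sq = sum(v * v for v in x)
    norm = math.sqrt(norm_sq)
    scale = (norm_sq / (1.0 + norm_sq)) / (norm + eps)
    return [scale * v for v in x]

def length(x):
    return math.sqrt(sum(v * v for v in x))

short = squash([0.1, 0.0])     # weak evidence: output length stays close to 0
strong = squash([30.0, 40.0])  # strong evidence: output length approaches 1
zero = squash([0.0, 0.0])      # eps keeps the all-zero capsule well defined
```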

For qualitative visualization, we compute a primary capsule activation map (Figure 8) by averaging the vector norms of all primary capsules at each spatial position and applying min-max normalization. The resulting heatmap reflects where MC-Caps assigns stronger component-level activation.

3.3.2. Class Capsule Layer and Dynamic Routing

The Class Capsule Layer aggregates the component-level features output by the MC-AC layer into whole-level features via the dynamic routing algorithm. Unlike the fixed weight mapping of traditional fully connected layers, dynamic routing refines the routing coefficients over multiple iterations, so that primary capsules (parts) pass their information to the best-matching class capsules (wholes). This mechanism suits the “Malware Picasso Problem,” where component positions are distorted but global semantics remain invariant: even if critical code blocks (e.g., encryption routines) are spatially rearranged, dynamic routing can still aggregate them into the corresponding family’s class capsule based on the agreement between components. The specific process is as follows:

Dynamic routing first maps each primary capsule to a prediction vector for every class capsule via a learnable weight matrix, representing the potential association of the current component with a specific malware family, as shown in Equation (20). Here, u_{i,j} is the prediction vector from primary capsule i to class capsule j, with its direction encoding the component-family association pattern and its length representing association strength. W_{j} is the weight matrix for the j-th class capsule, initialized with small random values to avoid initial mapping bias; v_{pri,i} is the i-th primary capsule, corresponding to a component-level feature.

u_{i,j} = W_{j} \cdot v_{pri,i}

(20)

Subsequently, over three routing iterations during the training phase, primary capsules whose prediction vectors align with the direction of the class capsule vector are assigned higher routing weights, thereby dynamically adjusting the routing coefficients from primary to class capsules. Before the first iteration, the routing logits b_{i,j} are initialized to zero, indicating equal initial contribution from all primary capsules; they are then carried over and updated from one iteration to the next. In each iteration, b_{i,j} is first normalized into routing coefficients c_{i,j}, ensuring that the sum of routing coefficients from all primary capsules to a specific class capsule equals 1, as shown in Equation (21).

c_{i,j} = \mathrm{softmax}(b_{i,j}, \mathrm{dim} = 1)

(21)

Next, based on c_{i,j}, the prediction vectors of all primary capsules are weighted and summed to obtain the initial aggregated feature s_{j} for class capsule j, realizing the dynamic assembly from parts to wholes, as shown in Equation (22).

s_{j} = \sum_{i = 1}^{N} c_{i, j} \cdot u_{i, j}

(22)

Finally, s_{j} is normalized via the squash function to generate the class capsule vector v_{j}. Its length represents the confidence that the sample belongs to the j-th malware class, while its direction encodes the core semantic features of the family. The routing logits are then updated with the consistency (dot product) between the prediction vector and the class capsule vector, completing the current iteration, as shown in Equation (23).

b_{i, j} = b_{i, j} + u_{i, j} \cdot v_{j}

(23)

A larger dot product implies higher consistency, resulting in higher weights in the next round. This mechanism ensures that highly matched components continuously reinforce their contribution to the corresponding whole, thereby achieving adaptive routing.
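The routing loop described by Equations (21) to (23) can be sketched as follows. The prediction vectors u are taken as given (the projection by W_{j} in Equation (20) is omitted for brevity), the logits are set to zero once before the loop, and the coefficients are normalized over the primary capsules for each class, as the text describes; the original CapsNet formulation instead normalizes across class capsules. Names, shapes, and toy values are assumptions for illustration.

```python
import math

def squash(x, eps=1e-8):
    norm_sq = sum(v * v for v in x)
    norm = math.sqrt(norm_sq)
    scale = (norm_sq / (1.0 + norm_sq)) / (norm + eps)
    return [scale * v for v in x]

def dynamic_routing(u, iterations=3):
    """u[i][j] is the prediction vector from primary capsule i to class capsule j."""
    n_pri, n_cls, dim = len(u), len(u[0]), len(u[0][0])
    b = [[0.0] * n_cls for _ in range(n_pri)]            # logits start at zero, once
    for _ in range(iterations):
        v = []
        c = [[0.0] * n_cls for _ in range(n_pri)]
        for j in range(n_cls):
            # Equation (21): normalize over primary capsules i, as the text describes.
            m = max(b[i][j] for i in range(n_pri))
            exps = [math.exp(b[i][j] - m) for i in range(n_pri)]
            total = sum(exps)
            for i in range(n_pri):
                c[i][j] = exps[i] / total
            # Equation (22): weighted sum of prediction vectors, followed by squash.
            s_j = [sum(c[i][j] * u[i][j][d] for i in range(n_pri)) for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_pri):
            for j in range(n_cls):
                # Equation (23): reward agreement between prediction and class capsule.
                b[i][j] += sum(a * z for a, z in zip(u[i][j], v[j]))
    return v, c

# Toy case: capsules 0 and 1 agree on class 0, capsule 2 points the opposite way.
u = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [0.0, 0.1]],
    [[-1.0, 0.0], [0.0, -0.1]],
]
v, c = dynamic_routing(u)
```

In the toy case the two agreeing capsules end up with larger coefficients for class 0 than the dissenting one, which is the reinforcement effect described above.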

3.3.3. Loss Function Design and Summary

Both real-world scenarios and malware datasets suffer from severe class imbalance. To address this, we adopt a weighted margin loss, as shown in Equation (24). Here, B is the batch size and K is the number of classes; y_{b,j} is the label of sample b for the j-th class; ‖v_{b,j}‖ is the length of the j-th class capsule for sample b; m_{1} = 0.9 and m_{2} = 0.1 are the margins for positive and negative classes; λ = 0.5 is the weighting coefficient for negative classes; and ω_{j} is the class weight for the j-th class, adaptively computed based on the dataset’s class distribution.

L = \frac{1}{B} \sum_{b=1}^{B} \sum_{j=1}^{K} \left[ y_{b,j} \cdot \max(0, m_{1} - ‖v_{b,j}‖)^{2} + λ \cdot (1 - y_{b,j}) \cdot \max(0, ‖v_{b,j}‖ - m_{2})^{2} \right] \cdot ω_{j}

(24)
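Equation (24) can be written out directly, as in the sketch below. The class weights are passed in as an argument because the text does not give the formula used to derive them from the class distribution; the weights and capsule lengths in the example are hypothetical.

```python
def weighted_margin_loss(lengths, labels, class_weights, m1=0.9, m2=0.1, lam=0.5):
    """lengths[b][j] = ||v_{b,j}||, labels[b][j] in {0, 1}; implements Equation (24)."""
    total = 0.0
    for v_row, y_row in zip(lengths, labels):
        for v, y, w in zip(v_row, y_row, class_weights):
            positive = y * max(0.0, m1 - v) ** 2               # true class too short
            negative = lam * (1 - y) * max(0.0, v - m2) ** 2   # wrong class too long
            total += w * (positive + negative)
    return total / len(lengths)

weights = [1.0, 2.0]  # hypothetical class weights; class 1 is treated as the rarer class

# One sample whose true class is 0.
confident = weighted_margin_loss([[0.95, 0.05]], [[1, 0]], weights)
mistaken = weighted_margin_loss([[0.05, 0.95]], [[1, 0]], weights)
```

A prediction that already satisfies both margins contributes no loss, while a confident mistake is penalized on both the positive and the negative term, with the negative term scaled by λ and the class weight.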

In summary, MC-Caps addresses the deficiencies of standard capsules in capturing long-range dependencies and adapting to anisotropy by introducing a dynamically fused dual path and a row-aware spatial bias in the primary capsule layer. It uses dynamic routing with unbiased (zero-initialized) routing logits to adapt to the component positional distortions caused by the “Malware Picasso Problem,” and employs a weighted margin loss with class weights to handle dataset imbalance.

Framework Synergy: The anisotropic features extracted by ADC-UIB are further reinforced by the row-aware bias of MC-AC. The multi-scale features fused by AFPN are aggregated from parts to wholes across scales via Dynamic Routing. The three components form a sequential network of “Feature Extraction (ADC-UIB) → Feature Fusion (AFPN) → Spatial Inference (MC-Caps),” thereby addressing the core problems of spatial information loss, structural mismatch, and poor robustness against polymorphic variants in malware classification.

4. Experimental Setup

4.1. Datasets

To comprehensively validate the classification performance, generalization ability and obfuscation-resistant robustness of AD-CapsFPN, we select two representative malware image datasets: the Fusion dataset and the Androdex dataset. They cover large-scale multi-source malware scenarios and advanced obfuscation evasion scenarios, and contain class imbalance, diverse visual features and structural distortion caused by obfuscation, which can fully evaluate the effectiveness of our framework in anisotropy adaptation, multi-scale fusion and spatial reasoning under distortion.

The Fusion Dataset [47] is constructed by merging malicious samples from three well-known public malware datasets on Kaggle: Microsoft Big 2015 [48], Malimg [49] and MaleVis [50]. It contains 32,601 malware images covering 59 malware families, supports both grayscale and RGB image formats, and integrates structural features from different malware sources. This dataset not only presents the classic challenge of severe class imbalance, but also provides diverse visual information, which can effectively verify the adaptability of our framework to malware with different distributions and formats. Any duplicate samples originally present within the same class were removed before our re-stratified splitting. In addition, we verified via MD5 checksum that no identical samples exist across different classes in the original dataset, ensuring that the Fusion dataset contains no cross-split duplicates or data leakage.

The Androdex Dataset [51] is an obfuscated dataset released in 2024, focusing on evaluating the model’s robustness against Android malware obfuscation techniques. Samples are derived from well-known Android malware datasets such as Drebin, and all images are RGB images generated based on DEX files, with a total of 21,133 images. The dataset is divided into two independent subsets: Set1 and Set2. Set1 is constructed based on AVPass obfuscation technology, containing three categories: benign software, malware, and obfuscated malware, focusing on structural distortions caused by a single obfuscation technique. Set2 is constructed based on Obfuscapk obfuscation technology, with an additional category of obfuscated benign software, forming a 4-class complex scenario to test the model’s ability to distinguish obfuscation noise from malicious features. The core value of this dataset is to simulate the scenario where malware evades detection through code block rearrangement, instruction substitution and other means. The generated images exhibit a significant “Malware Picasso Problem”, which allows us to test the framework’s ability to capture core structural relationships under obfuscation interference. In our experiments, we follow the original partition of the dataset and conduct experiments on Set1 and Set2 respectively.

4.2. Evaluation Metrics

We conduct a comprehensive evaluation of AD-CapsFPN from two core dimensions: classification performance and model computational efficiency. For classification performance, we adopt four standard metrics widely used in the malware detection field: Accuracy, Precision, Recall and F1-Score. Given the severe class imbalance in the Fusion dataset and intra-family feature variability caused by polymorphic obfuscation, we take macro-averaged metrics as the core evaluation basis to ensure fair and unbiased performance assessment across all malware families. For computational efficiency, we introduce quantitative complexity metrics covering both training and inference stages: total parameters, floating-point operations (FLOPs), single-image inference latency, and training time per epoch, to fully verify the model’s computational overhead and practical deployment value in real-world detection scenarios.
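To illustrate why macro-averaging matters under class imbalance, the sketch below computes the macro F1-score from scratch on a toy label set in which the minority class is never recognized. The labels are invented for illustration and are unrelated to the datasets used in this work.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so every family counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy labels: 8 majority samples and 2 minority samples; the minority is always missed.
y_true = ["major"] * 8 + ["minor"] * 2
y_pred = ["major"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here the accuracy is 0.8 although one of the two classes is never detected, whereas the macro F1-score falls to 4/9 because the missed class contributes an F1 of zero with equal weight.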

4.3. Baselines

We selected four representative state-of-the-art (SOTA) models from recent years for comparison. These models represent four mainstream technical routes in malware classification and together provide a comprehensive basis for evaluation:

VisMal [52]: Represents the image enhancement and preprocessing route. The core of this method lies in applying Contrast Limited Adaptive Histogram Equalization (CLAHE). By enhancing local texture contrast, it improves the visual similarity of images within the same malware family, thereby mitigating visual differences caused by intra-family variants.

MalSSL [53]: Represents the self-supervised contrastive learning route. This framework relies less on massive annotated data and instead designs data augmentation strategies specifically for malware images. By employing a Dual-stream Network for self-supervised contrastive training, it aims to mine robust latent feature representations.

MDC-RepNet [54]: Represents the large kernel convolution and re-parameterization route. Utilizing structural re-parameterization technology, this model introduces large-sized convolution kernels during the inference phase to drastically extend the effective receptive field. As one of the primary competitors to our work, it represents an attempt to capture long-range dependencies by “brute-forcing” the expansion of the receptive field.

TAEfficientNet [55]: Represents the transfer learning and multi-scale perception route. Based on the EfficientNet backbone for transfer learning, this method integrates the Atrous Spatial Pyramid Pooling (ASPP) module. It adopts the classic symmetric dilated sampling strategy to extract multi-scale contexts. This forms a distinct contrast to our proposed asymmetric design (ADC-UIB) and serves as a key baseline for verifying the necessity of the anisotropic design.

To ensure a fair comparison, we reimplemented or reproduced all baseline models under the same experimental environment, using the same dataset partitions, input resolution, batch size, optimizer, learning-rate schedule, and evaluation metrics whenever applicable. For baselines with method-specific preprocessing or augmentation strategies, such as the CLAHE-based enhancement in VisMal and the malware-specific contrastive augmentation in MalSSL, we retained their original preprocessing settings because these components constitute an essential part of the corresponding methods. Removing them would change the methodological assumptions of the baselines and lead to an unfair underestimation of their performance.

4.4. Implementation Details

We employed a training strategy combining long-cycle training with early stopping. The maximum number of epochs was set to 100, with early stopping triggered based on the validation set’s macro F1-Score with a patience of 10 epochs. The batch size was set to 32. We used the AdamW optimizer with a base learning rate of 1 × 10⁻⁴ and a Cosine Annealing scheduler. All datasets (Fusion, Androdex-Set1, and Androdex-Set2) were randomly split into training, testing, and validation sets in a ratio of 7:2:1. Before being input into the model, images were uniformly resized to 224 × 224 via bilinear interpolation and normalized. Experiments were conducted on the Ubuntu 24.04.2 operating system, utilizing Python 3.12 and PyTorch 2.3.0 (CUDA 12.1) to build the model, with training and testing completed on an NVIDIA A100 80GB PCIe Tensor Core GPU.

All performance measurements are conducted with the following settings. Training uses a batch size of 32, while inference is performed with batch size 1 to avoid batch-level parallelism affecting latency results. We run 50 warm-up iterations before inference to stabilize GPU performance, and average latency over 200 independent runs. Training metrics are averaged over 10 batch estimations. Automatic Mixed Precision (AMP) is used for both training and final evaluation, while inference runs in default FP32 precision. All reported latency values exclude image preprocessing steps. The GPU utilization during inference reaches 97%.
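The latency protocol above (untimed warm-up calls followed by an average over repeated single-sample runs) can be sketched with the standard library as follows. The `infer` callable and the stand-in workload are hypothetical placeholders for a model forward pass; on a GPU, asynchronous kernel launches would additionally require synchronizing the device before each clock reading, which this CPU-only sketch omits.

```python
import time

def measure_latency(infer, sample, warmup=50, runs=200):
    """Average wall-clock latency of `infer(sample)` in milliseconds."""
    for _ in range(warmup):          # warm-up calls are executed but not timed
        infer(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings), timings

# Stand-in workload (hypothetical); a real model call would replace this lambda.
mean_ms, timings = measure_latency(lambda x: sum(v * v for v in x), list(range(1000)))
```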

5. Results and Analysis

5.1. Comprehensive Performance Evaluation and Verification

To comprehensively evaluate the performance of the AD-CapsFPN framework, this section systematically compares our method with the MobileNetV4 baseline and four representative state-of-the-art (SOTA) methods from different technical routes, to answer three key research questions:

RQ1: Does the comprehensive performance of AD-CapsFPN significantly exceed the baseline and surpass existing SOTA methods?

RQ2: Does AD-CapsFPN maintain stable and competitive performance on a large-scale, multi-source, and highly class-imbalanced malware-image benchmark?

RQ3: How does AD-CapsFPN perform under two benchmark obfuscation settings with different degrees of structural distortion?

5.1.1. Overall Performance Comparison

We conducted comparative experiments on the Fusion and Androdex benchmark datasets. The full performance metrics of all models are presented in Table 2. To improve statistical reliability, each model was further evaluated over five independent runs, and the mean and standard deviation of Macro F1-score are reported in Figure 9.

According to the experimental results, AD-CapsFPN consistently outperforms all compared models across all metrics on all datasets, achieving SOTA performance. Compared with the MobileNetV4 baseline, AD-CapsFPN achieves a greater performance gain in obfuscated scenarios than in general scenarios: the Macro F1-Score increases from 81.76% to 97.98% (+16.22 percentage points) on the Fusion dataset (general scenario), and rises from 74.62% to 92.45% (+17.83 percentage points) on Androdex-Set1 (strong obfuscation scenario), so the gain is 1.61 percentage points larger in the obfuscated setting. This difference corroborates the value of our structure-aware design optimized for obfuscated scenarios, and preliminarily answers RQ1.

However, comparison with only the MobileNetV4 baseline is insufficient to fully answer RQ1. Existing CNN baseline architectures have been proven to be highly sensitive to elementary “salt noise” [56], while the obfuscation techniques used in this experiment are deeper semantic adversarial perturbations. Therefore, we further conduct a systematic comparison with multiple SOTA models to fully verify the performance advantage of AD-CapsFPN. Compared with other complex SOTA models with certain anti-obfuscation capabilities, AD-CapsFPN still shows significant advantages in class-imbalanced and obfuscated scenarios. This indicates that the design philosophy of deeply integrating malware domain prior knowledge into network architecture is more effective than simply increasing model complexity or relying on generic learning paradigms, which fully answers RQ1.

To further examine whether the relatively smaller improvement on Androdex-Set2 is statistically stable, we performed a two-sided Welch’s t-test between AD-CapsFPN and the strongest competing model on this dataset, TAEfficientNet, based on the five independent runs. The result shows a statistically significant improvement of AD-CapsFPN over TAEfficientNet (t = 24.04, p = 1.475 \times 10^{-6}). This indicates that the performance gain on Androdex-Set2 is not caused by random fluctuation, despite the smaller absolute margin compared with Androdex-Set1. A similar significant difference is also observed when comparing AD-CapsFPN with MDC-RepNet on Androdex-Set2 (t = 38.44, p = 9.597 \times 10^{-7}).
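For readers unfamiliar with Welch’s t-test, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom for two samples with unequal variances. The scores are placeholder numbers chosen for easy arithmetic and are NOT the per-run results of this study; converting the statistic into a p-value requires the Student t distribution and is left to a statistics library.

```python
import math

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / na, sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)   # unbiased variance
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se_a, se_b = var_a / na, var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

# Placeholder scores for two models over five runs; NOT the values from the paper.
runs_a = [10.0, 11.0, 12.0, 13.0, 14.0]
runs_b = [5.0, 6.0, 7.0, 8.0, 9.0]
t_stat, dof = welch_t(runs_a, runs_b)
```

With these placeholder samples the means differ by 5 and each sample has variance 2.5, which gives a statistic of 5 with 8 degrees of freedom.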

5.1.2. Benchmark-Level Generalization on Heterogeneous Malware Images

The Fusion dataset integrates three classic datasets, bringing a triple challenge: 59 malware families, heterogeneous sample sources, and severe class imbalance. The three smallest families, Simda, Skintrim.N, and Wintrim.BX, contain only 219 samples in total, with 9, 16, and 19 samples in the test set, respectively, whereas the three largest families contain 8369 samples in total. Under this long-tailed distribution, macro-averaged metrics are used to avoid masking the recognition difficulty of minority classes. Many existing studies [57,58,59] may obscure minority-class recognition defects through oversampling or non-macro metrics, which can lead to overfitting to specific data distributions. Therefore, in addition to Macro F1-score, we further examine these three minority families.

AD-CapsFPN correctly classifies all test samples from Simda, Skintrim.N, and Wintrim.BX, achieving 100% recall for each family. By contrast, MDC-RepNet misclassifies 4 out of 9 Simda samples, and TAEfficientNet misclassifies 3 out of 9 Simda samples. Most of these errors are absorbed by the majority family Ramnit, which contains 308 test samples, suggesting that competing models are more likely to bias minority Simda samples toward visually similar majority-class patterns.

In contrast, AD-CapsFPN achieves a Macro F1-Score of 97.98% without relying on any artificial balancing strategies, outperforming the runner-up MDC-RepNet (94.62%). These results provide benchmark-level evidence that AD-CapsFPN can maintain stable performance under heterogeneous sources and long-tailed class distributions, which answers RQ2 within the evaluated malware-image setting.

5.1.3. Robustness Analysis in Obfuscated Scenarios

The Androdex dataset is specifically designed to evaluate model performance under obfuscated malware-image scenarios. Set1 is based on AVPass obfuscation, which mainly introduces transformations such as inserting harmless APIs, using Java reflection to hide sensitive APIs, and injecting implicit flows. These operations can shift samples toward benign-like distributions and cause stronger disruption to spatially continuous code-related patterns. Set2 is based on Obfuscapk obfuscation, which focuses more on structural rewriting and noise addition, with a relatively weaker impact on the global image shape than Set1. Therefore, Set1 presents a higher classification difficulty than Set2 despite containing fewer categories. Together, the two subsets provide a comparative evaluation of model robustness under two representative obfuscation settings with different structural disruption patterns.

Under the more challenging Set1 scenario, the advantage of AD-CapsFPN becomes more evident. The model achieves Macro F1-scores of 92.45% and 99.09% on Androdex-Set1 and Set2, respectively, outperforming the compared methods under both AVPass-based and Obfuscapk-based obfuscation. In Set1, where the structural distortion is more severe, AD-CapsFPN exceeds TAEfficientNet (88.15%) by 4.30 percentage points and MDC-RepNet (88.97%), the runner-up on this subset, by 3.48 percentage points; both margins are larger than the 1.53 percentage-point margin over TAEfficientNet, the strongest competitor on Set2. This difference suggests that the proposed structure-aware design is particularly beneficial when obfuscation causes stronger spatial disruption. Therefore, these results provide a positive answer to RQ3, indicating that AD-CapsFPN exhibits stronger robustness than the compared methods under spatial-structure-oriented obfuscation scenarios.

This advantage may be attributed to the complementary roles of ADC-UIB, AFPN, and MC-Caps. ADC-UIB extracts anisotropy-aware structural features, AFPN aggregates discriminative feature fragments distributed across multiple scales, and MC-Caps further models spatial relationships on the fused features. Compared with traditional CNN-based models that rely on global pooling, MC-Caps retains more spatial layout information and provides a more suitable inductive bias for aggregating spatially distributed malware-image features. This design helps explain why AD-CapsFPN performs better under the AVPass-based Set1 setting, where the spatial continuity of code-related patterns is more strongly disrupted. A more fine-grained evaluation of individual obfuscation operations and adaptive obfuscation attacks remains an important direction for future work.

5.2. Ablation Studies

5.2.1. Component Analysis and Synergistic Effects

We conducted a series of module-wise ablation experiments with MobileNetV4 as the baseline. ADC-UIB, AFPN, and MC-Caps were individually enabled or disabled to isolate their independent contributions and analyze their synergistic effects in the complete framework. The detailed results are shown in Table 3.

First, the single-module results show that ADC-UIB, AFPN, and MC-Caps all bring clear improvements over the MobileNetV4 baseline, but their contributions are not identical across different scenarios. Compared with the baseline, ADC-UIB improves the Macro F1-score from 81.76% to 88.43% on Fusion, from 74.62% to 87.55% on Androdex-Set1, and from 86.36% to 94.59% on Androdex-Set2. These gains indicate that anisotropic structural correction is beneficial for malware image representation. AFPN achieves the best single-module result on the Fusion dataset (91.29%), mainly because its multi-scale fusion ability is more effective for handling class imbalance and multi-granularity visual patterns. MC-Caps achieves the best single-module result on Androdex-Set1 (88.81%), where AVPass obfuscation causes stronger spatial distortion. This suggests that part-whole relationship modeling is particularly important for structurally disturbed malware images. Therefore, no single component consistently dominates all datasets, which is consistent with the “No Free Lunch” principle: different modules have different scenario preferences.

The dual-component results further show that the synergy between modules is conditional rather than a simple additive superposition. Combinations containing MC-Caps generally produce stronger performance than the ADC-UIB+AFPN combination without MC-Caps. For example, ADC-UIB+AFPN only improves Fusion from 91.29% to 92.12% compared with standalone AFPN, with a limited gain of 0.83 percentage points. This is because this configuration still relies on the traditional GAP+FC classification head, which compresses spatially structured features into structureless vectors and fails to explicitly model part-whole relationships. More critically, ADC-UIB+AFPN even underperforms standalone MC-Caps on Androdex-Set1 (88.39% vs. 88.81%), indicating that back-end spatial reasoning is more decisive than feature enhancement alone when facing severe structural distortion.

To further identify the component most responsible for the final performance improvement, we compare the complete model with the corresponding dual-component variants. Removing ADC-UIB from the full framework reduces the Macro F1-score from 97.98% to 95.61% on Fusion, from 92.45% to 89.93% on Androdex-Set1, and from 99.09% to 96.87% on Androdex-Set2. Removing AFPN leads to larger drops, with the scores decreasing to 93.15%, 88.83%, and 96.21%, respectively. The largest degradation occurs when MC-Caps is removed: the performance decreases to 92.12% on Fusion, 88.39% on Androdex-Set1, and 95.96% on Androdex-Set2. These drops indicate that MC-Caps is the most critical component in the complete AD-CapsFPN framework. However, this does not mean that MC-Caps is always the best standalone module. Instead, its main contribution lies in converting the structural features extracted by ADC-UIB and fused by AFPN into more discriminative spatial relationship representations.

A qualitative performance leap occurs when all three modules work synergistically. The full AD-CapsFPN framework achieves the best results on all three datasets: 97.98% on Fusion, 92.45% on Androdex-Set1, and 99.09% on Androdex-Set2. Compared with the best dual-component results, the full model further improves the Macro F1-score by 2.37 percentage points on Fusion, 2.52 percentage points on Androdex-Set1, and 2.22 percentage points on Androdex-Set2. These results validate the necessity of the complete pipeline: ADC-UIB extracts anisotropic structural features, AFPN aggregates multi-scale information, and MC-Caps performs spatial relationship reasoning. Among them, MC-Caps plays the most decisive role in the final performance improvement, while ADC-UIB and AFPN provide the structural and multi-scale feature foundations required for MC-Caps to fully function.
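The ablation arithmetic quoted above can be re-derived with a few lines of Python. The scores are the Macro F1 values (%) stated in the text for the full model and for each variant with one component removed; the dictionary layout and function names are illustrative choices.

```python
# Macro F1-scores (%) quoted in the text for the full model and its dual-component variants.
full = {"Fusion": 97.98, "Androdex-Set1": 92.45, "Androdex-Set2": 99.09}
without = {
    "ADC-UIB": {"Fusion": 95.61, "Androdex-Set1": 89.93, "Androdex-Set2": 96.87},
    "AFPN": {"Fusion": 93.15, "Androdex-Set1": 88.83, "Androdex-Set2": 96.21},
    "MC-Caps": {"Fusion": 92.12, "Androdex-Set1": 88.39, "Androdex-Set2": 95.96},
}

def drops(removed):
    """Percentage-point drop on each dataset when `removed` is taken out."""
    return {ds: round(full[ds] - score, 2) for ds, score in without[removed].items()}

def most_critical(dataset):
    """Component whose removal hurts the given dataset the most."""
    return max(without, key=lambda comp: full[dataset] - without[comp][dataset])

# Gain of the full model over the strongest dual-component variant on each dataset.
gain_over_best_dual = {
    ds: round(full[ds] - max(v[ds] for v in without.values()), 2) for ds in full
}
```

The strongest dual-component variant on every dataset is the one without ADC-UIB, which is why the gains over the best dual-component result coincide with the drops caused by removing ADC-UIB, while removing MC-Caps produces the largest drop throughout.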

5.2.2. Classification Head Comparison and Summary

To further verify the effectiveness of MC-Caps, we conduct two groups of controlled experiments: (1) fixing the front-end ADC-UIB+AFPN modules and comparing GAP+FC, Common CapsNet, MC-Caps-1, and MC-Caps; (2) mounting the same classification heads on the original MobileNetV4 backbone. Here, Common CapsNet refers to the standard dynamic routing capsule network architecture proposed by Sabour et al. [39], and MC-Caps-1 denotes MC-Caps without the row-aware spatial bias. The results are shown in Table 4.

The results show clear differences among classification heads. When ADC-UIB+AFPN is fixed, Common CapsNet improves the Fusion dataset by 4.62 percentage points over GAP+FC, but only improves Androdex-Set1 by 1.83 percentage points. In contrast, MC-Caps further improves the performance to 97.98%, 92.45%, and 99.09% on the three datasets, respectively. This indicates that generic capsule routing is beneficial, but the malware-aware design of MC-Caps is more effective, especially under the strongly obfuscated Set1 scenario.

The row-aware spatial bias also contributes more clearly when MC-Caps is combined with the front-end structural modules. After removing this bias, the performance of ADC-UIB+AFPN+MC-Caps decreases from 97.98% to 97.21% on Fusion, from 92.45% to 91.96% on Androdex-Set1, and from 99.09% to 98.56% on Androdex-Set2. In contrast, when MC-Caps is directly mounted on the original MobileNetV4 backbone, the performance drop is relatively small. This suggests that the row-aware bias is not an isolated performance booster; instead, it becomes more effective when ADC-UIB and AFPN provide structurally corrected and multi-scale features.

Overall, Table 4 shows that MC-Caps outperforms both GAP+FC and Common CapsNet under the same front-end setting. The additional row-aware bias ablation further supports the necessity of injecting malware-specific spatial priors into capsule routing, while also showing that its benefit depends on the quality of the upstream feature representation.

5.2.3. Hyperparameter Sensitivity Analysis of ADC-UIB

To further examine the asymmetric dilation design in ADC-UIB, we conducted a hyperparameter sensitivity analysis on Androdex-Set1, which contains the most severe spatial structural distortion. AFPN and MC-Caps were kept unchanged, and only the dilation configuration of ADC-UIB was varied.

As shown in Table 5, all asymmetric dilation settings outperform the standard UIB baseline (89.93%), indicating that vertical receptive-field expansion is beneficial for structurally distorted malware images. Among the tested asymmetric configurations, (2, 5, 9) achieves the best Macro F1-score of 92.45%. The smaller setting of (1, 3, 5) reaches 91.66%, which we attribute to its relatively limited receptive field, while the larger setting of (3, 7, 11) brings no further improvement (92.21%), probably because overly sparse sampling weakens local structural details. Therefore, among the tested settings, (2, 5, 9) offers the best empirical trade-off between local detail preservation and long-range structural modeling.

The comparison between asymmetric and symmetric dilation further supports the anisotropic design of ADC-UIB. Under the same dilation scales, asymmetric dilation consistently outperforms symmetric dilation by 0.83, 2.16, and 3.18 percentage points for the (1, 3, 5), (2, 5, 9), and (3, 7, 11) settings, respectively. Moreover, the performance of symmetric dilation decreases as the dilation rate increases, and the (3, 7, 11) symmetric setting even falls below the standard UIB baseline. This suggests that enlarging the receptive field in both directions is not suitable for malware images, since horizontal dilation may disrupt intra-row byte continuity, whereas vertical dilation better captures inter-row structural dependencies. These results provide quantitative support for both the asymmetric convolution design and the selected (2, 5, 9) dilation configuration.
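The receptive-field values behind this comparison follow from the standard relation for a dilated kernel, k_eff = d(k - 1) + 1. A minimal sketch, assuming a single 3 × 3 kernel per branch as listed in Table 1 (the helper names are illustrative), compares the vertical and horizontal extents of the asymmetric (2, 5, 9) setting with those of its symmetric counterpart:

```python
def effective_extent(kernel_size, dilation):
    """Span (in pixels) covered by a 1-D dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def branch_receptive_fields(vertical_dilations, horizontal_dilations, kernel_size=3):
    """Return (vertical, horizontal) extents for each branch of a multi-branch block."""
    return [
        (effective_extent(kernel_size, dv), effective_extent(kernel_size, dh))
        for dv, dh in zip(vertical_dilations, horizontal_dilations)
    ]


asymmetric = branch_receptive_fields((2, 5, 9), (1, 1, 1))
symmetric = branch_receptive_fields((2, 5, 9), (2, 5, 9))
print(asymmetric)  # [(5, 3), (11, 3), (19, 3)], matching Table 1
print(symmetric)   # [(5, 5), (11, 11), (19, 19)]
```

In the symmetric case the three taps within a row are spaced 2, 5, or 9 pixels apart, which is consistent with the explanation above that horizontal dilation skips over neighbouring positions in the same row.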

5.3. Complexity Analysis

We quantitatively analyze the computational cost introduced by MC-Caps using a controlled GAP+FC baseline, as shown in Table 6. This baseline shares the same ADC-UIB and AFPN modules with AD-CapsFPN, but replaces the capsule-based classification head with a conventional GAP+FC head. Since capsule routing involves iterative computation, we further report an additional variant, AD-CapsFPN-3R, which uses three routing iterations during inference. By default, AD-CapsFPN uses one routing iteration during inference.

Compared with the GAP+FC baseline, AD-CapsFPN increases the number of parameters from 28.93 M to 43.21 M, corresponding to a 49.3% increase. The additional parameters mainly come from the capsule transformation matrices. However, the FLOPs increase only from 22.54 G to 25.56 G, a moderate overhead of 13.4%. The training time per epoch increases from 75.97 s to 96.54 s (about 27%), indicating that MC-Caps introduces additional training overhead. During inference, AD-CapsFPN takes 19.67 ms/img with one routing iteration, which is comparable to the 20.31 ms/img of the GAP+FC baseline, while AD-CapsFPN-3R requires 24.62 ms/img with three routing iterations. This indicates that reducing the number of routing iterations during inference is an important reason for the favorable latency of AD-CapsFPN.
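The relative overheads can be recomputed from Table 6. The sketch below assumes the values are copied from that table; the dictionary keys and the helper `relative_change` are illustrative names. Because the table entries are themselves rounded, the recomputed percentages may differ from the quoted ones in the last digit.

```python
# Values copied from Table 6 (parameters in M, FLOPs in G, latency in ms/img, epoch time in s).
BASELINE = {"params_M": 28.93, "flops_G": 22.54, "latency_ms": 20.31, "epoch_s": 75.97}
AD_CAPSFPN = {"params_M": 43.21, "flops_G": 25.56, "latency_ms": 19.67, "epoch_s": 96.54}
AD_CAPSFPN_3R = {"params_M": 43.21, "flops_G": 25.56, "latency_ms": 24.62, "epoch_s": 95.26}


def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Overhead of the capsule head relative to the GAP+FC head.
for key in BASELINE:
    print(key, f"{relative_change(AD_CAPSFPN[key], BASELINE[key]):+.1f}%")

# Extra inference latency of three routing iterations relative to one.
extra = relative_change(AD_CAPSFPN_3R["latency_ms"], AD_CAPSFPN["latency_ms"])
print(f"3 vs 1 routing iterations: {extra:+.1f}% latency")
```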

As shown in Table 7, using one routing iteration during inference does not cause obvious performance degradation. Compared with AD-CapsFPN-3R, the default AD-CapsFPN changes the Macro F1-score by only +0.02, −0.13, and +0.14 percentage points on Fusion, Androdex-Set1, and Androdex-Set2, respectively. These small differences indicate that, once the model has been trained with iterative routing, one-iteration inference produces classification results nearly identical to those of three-iteration inference.
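To make the role of the routing iteration count concrete, the following is a minimal NumPy sketch of the generic routing-by-agreement procedure of Sabour et al. [39]. It is not the MC-Caps implementation: it omits the row-aware spatial bias and every other malware-specific component, and the tensor shapes, random inputs, and function names are assumptions chosen for illustration. The `num_iters` argument corresponds to the one-iteration and three-iteration settings compared above.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: keeps direction, maps vector length into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def dynamic_routing(u_hat, num_iters):
    """Route prediction vectors u_hat[i, j, :] from input capsule i to class capsule j.

    u_hat has shape (num_in, num_classes, dim); the result has shape (num_classes, dim).
    num_iters must be at least 1.
    """
    num_in, num_classes, _ = u_hat.shape
    logits = np.zeros((num_in, num_classes))            # routing logits b_ij
    for it in range(num_iters):
        coupling = softmax(logits, axis=1)              # c_ij, sums to 1 over classes
        s = np.sum(coupling[:, :, None] * u_hat, axis=0)
        v = squash(s)                                   # class capsule outputs v_j
        if it < num_iters - 1:                          # no update needed after the last pass
            logits = logits + np.sum(u_hat * v[None, :, :], axis=-1)
    return v


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 5, 8))                     # hypothetical sizes
v1 = dynamic_routing(u_hat, num_iters=1)
v3 = dynamic_routing(u_hat, num_iters=3)
print(np.linalg.norm(v1, axis=-1).argmax(), np.linalg.norm(v3, axis=-1).argmax())
```

Each extra iteration adds one more weighted sum over all input capsules, which is why inference cost grows with `num_iters`. Whether the predicted class changes between the two settings depends on the learned prediction vectors, so the random inputs used here say nothing about the accuracy figures in Table 7.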

Overall, MC-Caps introduces additional parameters and training cost, but the inference overhead remains controllable. More importantly, AD-CapsFPN achieves clear performance gains over the GAP+FC baseline, improving the Macro F1-score by 5.86, 4.06, and 3.13 percentage points on Fusion, Androdex-Set1, and Androdex-Set2, respectively. Therefore, the proposed framework provides a reasonable trade-off between capsule-based spatial reasoning and inference efficiency, and the use of one routing iteration during inference is explicitly treated as a deployment-oriented efficiency setting rather than a hidden advantage.

6. Conclusions

This paper investigates an inductive-bias mismatch between conventional deep learning architectures and malware image classification. In malware images generated from byte sequences, horizontal and vertical directions carry different structural meanings. Standard CNN-based models, which usually rely on isotropic convolution and global pooling, may be less suited to preserving such anisotropic structure, especially when obfuscation disrupts the spatial continuity of code-related patterns. To address this problem, we propose AD-CapsFPN, a malware-image-oriented framework that integrates asymmetric structural feature extraction, adaptive multi-scale fusion, and capsule-based spatial reasoning.

Experiments on the Fusion and Androdex benchmarks show that AD-CapsFPN achieves the highest Macro F1-score among the compared methods on all three datasets under the evaluated settings. The improvement is especially evident in the highly obfuscated Androdex-Set1 scenario, where AD-CapsFPN outperforms TAEfficientNet by 4.30 percentage points and MDC-RepNet by 3.48 percentage points in Macro F1-score. Ablation studies further confirm that the performance gain does not come from a single isolated module, but from the coordinated Rectification–Fusion–Inference pipeline. Among the three components, MC-Caps plays the most decisive role in the final performance improvement, while ADC-UIB and AFPN provide the structural and multi-scale feature foundations required for effective spatial reasoning.

These results suggest that incorporating malware-specific structural priors into image-based classification models is a promising direction for improving robustness against obfuscation-induced structural distortion. Future work will focus on reducing the computational cost of capsule routing, evaluating the framework on larger and more heterogeneous malware datasets, and further validating its robustness and scalability under practical deployment conditions, such as unseen malware families, evolving obfuscation strategies, and resource-constrained environments.

Author Contributions

Conceptualization, L.W. and J.L.; methodology, L.W.; software, L.W.; validation, L.W.; formal analysis, L.W.; investigation, L.W.; resources, J.L.; writing—original draft preparation, L.W.; writing—review and editing, J.L., Y.S., Y.R. and Y.X.; visualization, L.W.; supervision, J.L.; project administration, Y.R.; funding acquisition, Y.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62402521 and 62531018) and the Shaanxi Provincial Natural Science Basic Research Program of China (Grant No. 2021JM-226).

Data Availability Statement

The datasets supporting the conclusions of this article are available in public repositories. The Fusion dataset is available in the Kaggle repository at https://doi.org/10.34740/kaggle/dsv/8189053 (accessed on 21 May 2026). The Androdex dataset is available in the Figshare repository at https://figshare.com/articles/dataset/Androdex_Images/23931204?file=41969277 (accessed on 21 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AD-CapsFPN	Asymmetric Dilated Convolutional Capsule Network with Feature Pyramid
ADC	Asymmetric Dilated Convolution
UIB	Universal Inverted Bottleneck
AFPN	Adaptive Feature Pyramid Network
MC-Caps	Malware-Aware Capsule Network

References

1. Wang, H.; Cui, B.; Yuan, Q.; Shi, R.; Huang, M. A Review of Deep Learning Based Malware Detection Techniques. Neurocomputing 2024, 598, 128010.
2. Tan, H.; Wang, M.; Shen, J.; Vijayakumar, P.; Moh, S.; Wu, Q.J. Blockchain-Assisted Conditional Anonymous Authentication and Adaptive Tree-Based Group Key Agreement for VANETs. IEEE Trans. Dependable Secur. Comput. 2025, 23, 2664–2679.
3. Sutrala, A.K.; Bagga, P.; Das, A.K.; Kumar, N.; Rodrigues, J.J.; Lorenz, P. On the Design of Conditional Privacy Preserving Batch Verification-Based Authentication Scheme for Internet of Vehicles Deployment. IEEE Trans. Veh. Technol. 2020, 69, 5535–5548.
4. Kaspersky. 2025 Security Statistics Report. Available online: https://lp.kaspersky.com/global/ksb2025-number-of-the-year/#toc-2 (accessed on 21 May 2026).
5. Song, Y.; Zhang, D.; Wang, J.; Wang, Y.; Wang, Y.; Ding, P. Application of Deep Learning in Malware Detection: A Review. J. Big Data 2025, 12, 99.
6. Zhang, R.; Cao, Z.; Yang, S.; Si, L.; Sun, H.; Xu, L.; Sun, F. Cognition-Driven Structural Prior for Instance-Dependent Label Transition Matrix Estimation. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 3730–3743.
7. Zhang, R.; Tan, J.; Cao, Z.; Xu, L.; Liu, Y.; Si, L.; Sun, F. Part-Aware Correlation Networks for Few-Shot Learning. IEEE Trans. Multimed. 2024, 26, 9527–9538.
8. Zhang, R.; Yang, B.; Xu, L.; Huang, Y.; Xu, X.; Zhang, Q.; Jiang, Z.; Liu, Y. A Benchmark and Frequency Compression Method for Infrared Few-Shot Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5001711.
9. Wang, P.; Song, Y.; Xiang, Q.; Wang, X. Lifelong Learning Method for Intrusion Detection of Internet of Things with Class-Attention Fusion Strategy. Knowl.-Based Syst. 2026, 334, 115181.
10. Wang, P.; Wu, X.; Song, Y.; Tian, D.; Wang, X. A Comprehensive Review of Time Series Classification: Traditional, Deep Learning, and Few-Shot Learning Methods. Comput. Sci. Rev. 2026, 61, 100953.
11. Maniriho, P.; Mahmood, A.N.; Chowdhury, M.J.M. A Systematic Literature Review on Windows Malware Detection: Techniques, Research Issues, and Future Directions. J. Syst. Softw. 2024, 209, 111921.
12. Wu, X.; Song, Y.; Hou, X.; Ma, Z.; Chen, C. Deep Learning Model with Sequential Features for Malware Classification. Appl. Sci. 2022, 12, 9994.
13. Ranjani, B.; Chinnadurai, M. Sparse Attention with Residual Pyramidal Depthwise Separable Convolutional Based Malware Detection with Optimization Mechanism. Sci. Rep. 2024, 14, 24414.
14. Du, Y.; Gao, C.; Chen, X.; Cui, M.; Xu, L.; Ning, A. Mobile Malware Detection Method Using Improved GhostNetV2 with Image Enhancement Technique. Sci. Rep. 2025, 15, 25019.
15. Yu, Y.; Cai, B.; Aziz, K.; Wang, X.; Luo, J.; Iqbal, M.S.; Chakrabarti, T. Semantic Lossless Encoded Image Representation for Malware Classification. Sci. Rep. 2025, 15, 7997.
16. Zhang, D.; Song, Y.; Xiang, Q.; Wang, Y. IMCMK-CNN: A Lightweight Convolutional Neural Network with Multi-Scale Kernels for Image-Based Malware Classification. Alex. Eng. J. 2025, 111, 203–220.
17. Bhusal, D.; Rastogi, N. Adversarial Patterns: Building Robust Android Malware Classifiers. ACM Comput. Surv. 2025, 57, 1–34.
18. Yan, S.; Ren, J.; Wang, W.; Sun, L.; Zhang, W.; Yu, Q. A Survey of Adversarial Attack and Defense Methods for Malware Classification in Cyber Security. IEEE Commun. Surv. Tutor. 2022, 25, 467–496.
19. Zhu, Y.; Zhao, Y.; Hu, Z.; Luo, T.; He, L. A Review of Black-Box Adversarial Attacks on Image Classification. Neurocomputing 2024, 610, 128512.
20. Chen, D.; Yan, H. Research on APT Groups Malware Classification Based on TCN-GAN. PLoS ONE 2025, 20, e0323377.
21. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Howard, A. MobileNetV4: Universal Models for the Mobile Ecosystem. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 78–96.
22. Han, K.S.; Lim, J.H.; Kang, B.; Im, E.G. Malware Analysis Using Visualized Images and Entropy Graphs. Int. J. Inf. Secur. 2015, 14, 1–14.
23. Ye, Y.; Li, T.; Adjeroh, D.; Iyengar, S.S. A Survey on Malware Detection Using Data Mining Techniques. ACM Comput. Surv. (CSUR) 2017, 50, 1–40.
24. Pan, Y.; Ge, X.; Fang, C.; Fan, Y. A Systematic Literature Review of Android Malware Detection Using Static Analysis. IEEE Access 2020, 8, 116363–116379.
25. Yang, H.; Li, S.; Wu, X.; Lu, H.; Han, W. A Novel Solutions for Malicious Code Detection and Family Clustering Based on Machine Learning. IEEE Access 2019, 7, 148853–148860.
26. Roseline, S.A.; Geetha, S.; Kadry, S.; Nam, Y. Intelligent Vision-Based Malware Detection and Classification Using Deep Random Forest Paradigm. IEEE Access 2020, 8, 206303–206324.
27. Li, J.; Xue, D.; Wu, W.; Wang, J. Incremental Learning for Malware Classification in Small Datasets. Secur. Commun. Netw. 2020, 2020, 6309243.
28. Abusitta, A.; Li, M.Q.; Fung, B.C. Malware Classification and Composition Analysis: A Survey of Recent Developments. J. Inf. Secur. Appl. 2021, 59, 102828.
29. Nataraj, L.; Yegneswaran, V.; Porras, P.; Zhang, J. A Comparative Assessment of Malware Classification Using Binary Texture Analysis and Dynamic Analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, Chicago, IL, USA, 21 October 2011; pp. 21–30.
30. Lin, W.C.; Yeh, Y.R. Efficient Malware Classification by Binary Sequences with One-Dimensional Convolutional Neural Networks. Mathematics 2022, 10, 608.
31. Guven, M. Leveraging Deep Learning and Image Conversion of Executable Files for Effective Malware Detection: A Static Malware Analysis Approach. AIMS Math. 2024, 9, 15223–15245.
32. Cen, M.; Deng, X.; Jiang, F.; Doss, R. Zero-Ran Sniff: A Zero-Day Ransomware Early Detection Method Based on Zero-Shot Learning. Comput. Secur. 2024, 142, 103849.
33. Belal, M.M.; Sundaram, D.M. Global-Local Attention-Based Butterfly Vision Transformer for Visualization-Based Malware Classification. IEEE Access 2023, 11, 69337–69355.
34. Yang, J.; Liang, H.; Ren, H.; Jia, D.; Wang, X. SAC: Collaborative Learning of Structure and Content Features for Android Malware Detection Framework. Neurocomputing 2025, 637, 130053.
35. Li, S.; Wang, J.; Wang, S.; Song, Y. PAFE: A lightweight visualization-based fast malware classification method. Heliyon 2024, 10, e35965.
36. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975.
37. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122.
38. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
39. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing between Capsules. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
40. Wang, S.W.; Zhou, G.; Lu, J.C.; Zhang, F.J. A Novel Malware Detection and Classification Method Based on Capsule Network. In Proceedings of the International Conference on Artificial Intelligence and Security, New York, NY, USA, 26–28 July 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 573–584.
41. Shelar, M.D.; Rao, S.S. Enhanced Capsule Network-Based Executable Files Malware Detection and Classification—Deep Learning Approach. Concurr. Comput. Pract. Exp. 2024, 36, e7928.
42. Tao, J.; Zhang, X.; Luo, X.; Wang, Y.; Song, C.; Sun, Y. Adaptive Capsule Network. Comput. Vis. Image Underst. 2022, 218, 103405.
43. Liu, Y.; Cheng, D.; Zhang, D.; Xu, S.; Han, J. Capsule Networks with Residual Pose Routing. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 2648–2661.
44. Zhang, X.; Wu, K.; Chen, Z.; Zhang, C. MalCaps: A Capsule Network Based Model for the Malware Classification. Processes 2021, 9, 929.
45. Zoph, B.; Le, Q.V. Neural Architecture Search with Reinforcement Learning. arXiv 2016, arXiv:1611.01578.
46. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
47. Salas, M.I.P. Fusion Malware Dataset: 59 Malware Families in PNG Format; Kaggle: San Francisco, CA, USA, 2024; Available online: https://www.kaggle.com/datasets/marcesalas/fusion-dataset-59-malware-families-in-png-format (accessed on 21 May 2026).
48. Ronen, R.; Radu, M.; Feuerstein, C.; Yom-Tov, E.; Ahmadi, M. Microsoft Malware Classification Challenge. arXiv 2018.
49. Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware Images: Visualization and Automatic Classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, Pittsburgh, PA, USA, 20 July 2011; pp. 1–7.
50. Bozkir, A.S.; Cankaya, A.O.; Aydos, M. Utilization and Comparision of Convolutional Neural Networks in Malware Recognition. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; IEEE: New York, NY, USA, 2019; pp. 1–4.
51. Aurangzeb, S.; Aleem, M.; Khan, M.T.; Loukas, G.; Sakellari, G. AndroDex: Android Dex Images of Obfuscated Malware. Sci. Data 2024, 11, 212.
52. Zhong, F.; Chen, Z.; Xu, M.; Zhang, G.; Yu, D.; Cheng, X. Malware-on-the-Brain: Illuminating Malware Byte Codes with Images for Malware Classification. IEEE Trans. Comput. 2022, 72, 438–451.
53. Ismail, S.J.I.; Rahardjo, B.; Juhana, T.; Musashi, Y. MalSSL–Self-Supervised Learning for Accurate and Label-Efficient Malware Classification. IEEE Access 2024, 12, 58823–58835.
54. Li, S.; Wang, J.; Song, Y.; Wang, S.; Wang, Y. A Lightweight Model for Malicious Code Classification Based on Structural Reparameterisation and Large Convolutional Kernels. Int. J. Comput. Intell. Syst. 2024, 17, 30.
55. Xuan, B.; Li, J.; Song, Y. BiTCN-TAEfficientNet Malware Classification Approach Based on Sequence and RGB Fusion. Comput. Secur. 2024, 139, 103734.
56. Roy, A.; Di Troia, F. Discriminative Regions and Adversarial Sensitivity in CNN-Based Malware Image Classification. Electronics 2025, 14, 3937.
57. Joshi, C.; Kumar, J.; Kumawat, G. Detection of Unseen Malware Threats Using Generative Adversarial Networks and Deep Learning Models. Sci. Rep. 2025, 15, 34804.
58. Chen, Z.; Xing, S.; Ren, X. Efficient Windows Malware Identification and Classification Scheme for Plant Protection Information Systems. Front. Plant Sci. 2023, 14, 1123696.
59. Ashawa, M.; McGregor, R.; Owoh, N.P.; Osamor, J.; Adejoh, J. Static and Dynamic Malware Analysis Using CycleGAN Data Augmentation and Deep Learning Techniques. Appl. Sci. 2025, 15, 9830.

Figure 1. Comparison of Structural Inductive Bias between Natural Images and Malware Images.

Figure 2. The Malware Picasso Problem.

Figure 3. Overall Architecture of AD-CapsFPN.

Figure 4. Structure of the UIB Block.

Figure 5. Structure of the ADC-UIB Block.

Figure 6. Structure of AFPN.

Figure 7. Structure of MC-Caps.

Figure 8. Visualization of MC-Caps Primary Capsule Activation.

Figure 9. Macro F1-score comparison with mean and standard deviation over five independent runs.

Table 1. ADC Branch Configuration and Correspondence to Malware Semantic Structures.

Branch Type	Dilation Rate	Horizontal RF	Vertical RF	Corresponding Malware Structure
Original UIB Block	(1, 1)	3	3	Local Byte Patterns
Local Branch	(2, 1)	3	5	Basic Instruction Units
Mid-range Branch	(5, 1)	3	11	Intra-function Logic Blocks
Long-range Branch	(9, 1)	3	19	Cross-function Dependencies

Table 2. Comprehensive Performance Comparison with SOTA Methods on Three Datasets (%).

Model	Dataset	Accuracy	Precision	Recall	F1-Score
AD-CapsFPN (Ours)	Fusion	98.75	98.24	97.86	97.98
AD-CapsFPN (Ours)	Androdex-Set1	93.81	93.23	91.86	92.45
AD-CapsFPN (Ours)	Androdex-Set2	99.09	99.08	99.09	99.09
MobileNetV4	Fusion	89.56	82.44	82.95	81.76
MobileNetV4	Androdex-Set1	79.93	74.49	74.79	74.62
MobileNetV4	Androdex-Set2	86.73	87.65	86.54	86.36
VisMal	Fusion	90.19	87.22	85.95	86.13
VisMal	Androdex-Set1	88.63	86.19	86.64	86.40
VisMal	Androdex-Set2	94.90	95.32	94.66	94.84
MalSSL	Fusion	95.36	93.63	93.57	93.51
MalSSL	Androdex-Set1	87.79	85.35	86.32	85.71
MalSSL	Androdex-Set2	96.31	96.39	96.06	96.10
TAEfficientNet	Fusion	96.53	93.04	93.36	92.62
TAEfficientNet	Androdex-Set1	89.63	87.68	89.49	88.15
TAEfficientNet	Androdex-Set2	97.57	97.55	97.57	97.56
MDC-RepNet	Fusion	97.33	95.04	94.96	94.62
MDC-RepNet	Androdex-Set1	90.72	88.77	89.21	88.97
MDC-RepNet	Androdex-Set2	94.87	95.13	94.93	95.01

Table 3. Results of Incremental Ablation Experiments (Macro F1-Score, %).

Number	ADC-UIB	AFPN	MC-Caps	Fusion	Androdex-Set1	Androdex-Set2
1	√	×	×	88.43	87.55	94.59
2	×	√	×	91.29	87.22	95.53
3	×	×	√	89.68	88.81	95.14
4	×	√	√	95.61	89.93	96.87
5	√	×	√	93.15	88.83	96.21
6	√	√	×	92.12	88.39	95.96
7	√	√	√	97.98	92.45	99.09
8	×	×	×	81.76	74.62	86.36

Table 4. Performance Comparison of Different Classification Heads (Macro F1-Score, %).

Model	Classifier Head	Fusion	Androdex-Set1	Androdex-Set2
ADC-UIB+AFPN	GAP+FC	92.12	88.39	95.96
ADC-UIB+AFPN	Common CapsNet	96.74	90.22	98.05
ADC-UIB+AFPN	MC-Caps	97.98	92.45	99.09
ADC-UIB+AFPN	MC-Caps-1	97.21	91.96	98.56
MobileNetV4	GAP+FC	81.76	74.62	86.36
MobileNetV4	Common CapsNet	80.15	75.82	86.81
MobileNetV4	MC-Caps	89.68	88.81	95.14
MobileNetV4	MC-Caps-1	89.37	88.54	94.78

Table 5. Hyperparameter Sensitivity Analysis of ADC-UIB on Androdex-Set1 (Macro F1-Score, %).

Design	Vertical Dilation	Horizontal Dilation	Macro F1-Score	Improvement over Standard UIB
Standard UIB	-	-	89.93	-
Asymmetric	(1, 3, 5)	(1, 1, 1)	91.66	+1.73
Asymmetric	(2, 5, 9)	(1, 1, 1)	92.45	+2.52
Asymmetric	(3, 7, 11)	(1, 1, 1)	92.21	+2.28
Symmetric	(1, 3, 5)	(1, 3, 5)	90.83	+0.90
Symmetric	(2, 5, 9)	(2, 5, 9)	90.29	+0.36
Symmetric	(3, 7, 11)	(3, 7, 11)	89.03	−0.90

Table 6. Complexity Comparison.

Model	Total Params	FLOPs	Inference Time	Training Time/Epoch
GAP+FC	28.93 M	22.54 G	20.31 ms/img	75.97 s
AD-CapsFPN	43.21 M	25.56 G	19.67 ms/img	96.54 s
AD-CapsFPN-3R	43.21 M	25.56 G	24.62 ms/img	95.26 s

Table 7. Performance Comparison (Macro F1-Score, %).

Model	Fusion	Androdex-Set1	Androdex-Set2
GAP+FC	92.12	88.39	95.96
AD-CapsFPN	97.98	92.45	99.09
AD-CapsFPN-3R	97.96	92.58	98.95

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

