3.1. Asymmetric Dilated Convolutional UIB Module
As a backbone network, MobileNetV4 is lightweight, efficient, and suitable for multi-endpoint deployment. Its core innovation, the Universal Inverted Bottleneck (UIB) module, adopts depthwise separable convolutions determined via Neural Architecture Search (NAS) [45]. The model structure, optimized on massive natural image data, lays the foundation for its superior performance across multiple public natural image datasets. The structure of the UIB module is shown in Figure 4, where activation and normalization layers between pointwise and depthwise convolution layers are omitted for brevity. However, directly transferring this architecture to malware classification faces a significant inductive bias mismatch.
Limited Receptive Field: Standard UIBs employ fixed small convolution kernels (3 × 3 or 5 × 5), capable of capturing only local byte patterns but failing to cover cross-scale dependencies in malware, ranging from basic instructions to cross-function calls.
Isotropy Defect: Its symmetric receptive field assumes spatial information is equivalent in all directions, failing to adapt to the anisotropic structure of malware images, which are continuous within rows (instruction streams) but discrete between rows (physical truncation).
Noise Sensitivity: The UIB module treats all feature channels with equal weight, lacking noise suppression mechanisms for obfuscated code (e.g., code padded with junk bytes). This deficiency may cause noise to be amplified during the subsequent feature pyramid fusion process, thereby compromising the precision of spatial reasoning in the final capsule network.
Therefore, to address these issues while inheriting the efficiency of MobileNetV4, we designed the Asymmetric Dilated Convolutional UIB (ADC-UIB). This module enhances the model’s perception of malware spatial structures in a lightweight manner. Its core lies in enabling precise perception through “Asymmetric Dilated Sampling + Multi-Scale Parallel Branches + Adaptive Gating,” as shown in Figure 5.
The design objective of ADC-UIB is to maintain continuity in the row direction while expanding the receptive field in the column direction. Parallel branches capture local, mid-range, and long-range dependencies to achieve multi-scale coverage, and adaptive fusion across channels then reinforces discriminative features while suppressing noise. We detail each component below.
First is the Asymmetric Dilated Convolution. For an input feature map X ∈ R^(B × C × H × W) (where B is batch size, C is channel count, and H, W are height and width), the calculation is defined as shown in Equation (1):
Here, k = 3 or 5, keeping the kernel size unchanged. The horizontal dilation rate is fixed at 1, meaning no dilation in the row direction. We thus maintain standard compact sampling within rows to ensure the complete capture of basic instruction semantics formed by continuous bytes, mitigating the fragmentation of instruction stream information caused by skip-sampling. In contrast, the vertical dilation rate takes a different value in each of the three parallel branches (listed in Table 1), corresponding to local, mid-range, and long-range vertical contexts. This multi-branch setting is motivated by the structural prior that malware images exhibit strong horizontal continuity but weaker vertical adjacency under fixed-width folding: vertical dilation expands the receptive field along the folded sequence direction, and the selected rates provide progressively enlarged vertical receptive fields while the horizontal receptive field stays compact. The influence of different vertical dilation configurations is further evaluated in the sensitivity analysis in Section 5.2.3. Padding is set so that the output resolution matches the input, as shown in Equations (2) and (3), to avoid structural information loss.
To provide multi-range contextual coverage for malware images, we design three parallel ADC branches. The dilation rates, receptive fields (RF), and their correspondence to malware structures for each branch are presented in Table 1. We embed multi-scale parallel perception into the building blocks of the backbone, maintaining sensitivity to fine-grained structures during feature extraction. Notably, the vertical RF reaches up to 19 (more than six times that of the original UIB), efficiently capturing function call relations spanning hundreds of pixels, while the horizontal RF remains at 3, preserving the continuity of instruction sequences.
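The receptive-field and padding arithmetic above can be checked with a short sketch. It assumes a single stride-1 dilated convolution, for which the extent covered along one axis is d(k − 1) + 1 and size-preserving padding is d(k − 1)/2 per side. The function names are hypothetical, and the vertical dilation rate of 9 used in the example is not quoted from the text: it is inferred from the stated vertical RF of 19 with k = 3.

```python
def effective_extent(kernel_size: int, dilation: int) -> int:
    """Span (in pixels) covered along one axis by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def same_padding(kernel_size: int, dilation: int) -> int:
    """Per-side zero padding that preserves the size (stride 1, odd kernel)."""
    return dilation * (kernel_size - 1) // 2


def output_size(size: int, kernel_size: int, dilation: int, padding: int) -> int:
    """Output length of a stride-1 convolution along one axis."""
    return size + 2 * padding - dilation * (kernel_size - 1)


# Assumed example: k = 3, horizontal dilation 1, vertical dilation 9
# (the value 9 is inferred from the stated RF of 19, not quoted).
k, d_h, d_w = 3, 9, 1
rf_vertical = effective_extent(k, d_h)
rf_horizontal = effective_extent(k, d_w)
pad_h, pad_w = same_padding(k, d_h), same_padding(k, d_w)
```

Under these assumptions the long-range branch spans 19 rows while still covering only 3 adjacent bytes within a row, and asymmetric padding of (9, 1) keeps the feature map size unchanged.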
Subsequently, the output features of the three branches are concatenated along the channel dimension and then compressed to fuse their characteristics, as shown in Equation (4), where BN denotes Batch Normalization, followed by a ReLU6 activation function to stabilize training. This ensures the output dimension is consistent with the original UIB, maintaining compatibility with the MobileNetV4 data flow.
To dynamically screen discriminative features of malware images and filter out noise, we introduce channel-wise adaptive gating inspired by the SE (Squeeze-and-Excitation) module [46]. First, global information is compressed to obtain the channel statistic vector, as shown in Equation (5).
Then, channel weights are learned through two pointwise convolution layers to highlight effective features and suppress noise channels, as shown in Equation (6). The reduction ratio between the two layers (set to 8) balances computation and representational capacity, and a Sigmoid function maps the result to the channel weights.
Finally, the channel weights are multiplied channel-wise with the fused feature map to obtain the final output of the module, as shown in Equation (7).
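The gating steps of Equations (5) to (7) can be illustrated with a minimal pure-Python sketch. It assumes global average pooling for the squeeze step and a ReLU between the two pointwise layers, as in the original SE design; the text does not state either choice. Biases are omitted, and the function name and the zero-valued toy weights are hypothetical.

```python
import math


def channel_gate(x, w_reduce, w_expand):
    """Apply SE-style channel gating to x, a list of C feature maps (each H x W).

    w_reduce: (C // r) x C weights of the first pointwise layer.
    w_expand: C x (C // r) weights of the second pointwise layer.
    """
    # Squeeze: one global average per channel (assumed pooling).
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excite: reduce, ReLU (assumed), expand, Sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w_reduce]
    logits = [sum(w * v for w, v in zip(row, hidden)) for row in w_expand]
    gates = [1.0 / (1.0 + math.exp(-a)) for a in logits]
    # Scale: multiply every value of a channel by that channel's gate.
    y = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]
    return y, gates


# Toy input: C = 8 channels of size 2 x 2; reduction ratio r = 8 gives 1 hidden unit.
C, r = 8, 8
x = [[[float(c), float(c)], [float(c), float(c)]] for c in range(C)]
w_reduce = [[0.0] * C for _ in range(C // r)]    # hypothetical weights
w_expand = [[0.0] * (C // r) for _ in range(C)]  # hypothetical weights
y, gates = channel_gate(x, w_reduce, w_expand)
```

With all-zero weights every gate equals 0.5, so each channel is simply halved; trained weights would instead push the gates of noisy channels toward 0 and those of informative channels toward 1.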
In the MobileNetV4 backbone, we do not replace all UIB blocks. Instead, we replace only the depthwise separable convolution layers of the UIB blocks in the deepest 60% of the network stages with the proposed ADC-UIB modules. The shallow layers (the first 40%) are primarily responsible for learning low-level features such as edges, corners, and textures. In malware images, these correspond to basic byte-level patterns, for which the original efficient UIB blocks are sufficient. As the network deepens, the feature maps carry richer semantic information and represent higher-level, more abstract concepts. For malware, this equates to recognizing entire functions, algorithmic structures (e.g., encryption routines), and control flow logic. It is in these stages that the directional awareness and multi-scale capabilities of ADC-UIB are most effective, enabling the interpretation of complex structural relationships from abstracted features. This “front-light, back-heavy” hybrid design allocates computational resources to the network stages that most require structured analysis, ensuring that the deep feature maps passed to the subsequent Adaptive Feature Pyramid Network (AFPN) are already enriched with anisotropy-corrected multi-scale spatial information. The next subsection describes how these rectified features, distributed across different semantic levels, are integrated into a unified representation that combines local detail with a global perspective, addressing the challenge of dispersed discriminative information in malware.
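The 40%/60% split described above can be sketched as a simple index partition. This is only an illustration: it assumes the boundary is measured in UIB blocks counted from the input and rounded down, which the text does not specify, and the helper name and the block count of 10 are hypothetical.

```python
def split_blocks(num_blocks: int, shallow_percent: int = 40):
    """Return (kept_uib_indices, adc_uib_indices) for num_blocks UIB blocks."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    first_adc = num_blocks * shallow_percent // 100
    kept = list(range(first_adc))                  # original UIB blocks
    replaced = list(range(first_adc, num_blocks))  # blocks swapped for ADC-UIB
    return kept, replaced


kept, replaced = split_blocks(10)
```

For a hypothetical 10-block backbone, blocks 0 to 3 keep the original UIB and blocks 4 to 9 use ADC-UIB.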
3.2. Adaptive Feature Pyramid Network
Following the processing by the ADC-UIB module, we extract anisotropy-rectified multi-scale features from the deep layers of the backbone network. However, discriminative features of malware, ranging from minute API call sequences (local details) to large-scale code block structures (global structural features), are often scattered across different scales and locations in the image. Traditional FPNs typically take the shallowest features as input. Yet, the raw byte-level pixel details of malware images contain substantial invalid padding bytes and noise inserted via obfuscation. Crucially, the core of the downstream Capsule Network lies in inferring part-whole structural relationships (e.g., Instruction Block → Function → Malicious Behavior), which requires structured component features rather than pixel-level noise. If the Capsule Network erroneously learns low-level textures or even noise as components, structural inference accuracy will be severely compromised. Furthermore, traditional multi-scale feature fusion often employs simple concatenation or equal-weight summation. As previously mentioned, discriminative features of malware exhibit a non-uniform cross-scale distribution. Standard fusion methods dilute critical features and ignore the structural disparities among different malware families. To address these challenges, we designed the Adaptive Feature Pyramid Network (AFPN), the structure of which is shown in Figure 6.
We first select feature maps from three distinct stages of the modified MobileNetV4 backbone, excluding the shallow layers. These feature maps possess varying spatial resolutions and semantic levels, collectively covering the “local-mid-long” range dependencies of malware. We explicitly discard the feature map from the first 40% of the backbone network, as it is redundant and its shallow features are incompatible with the downstream Capsule Network. Moreover, the receptive field of this shallow layer is extremely small, primarily capturing relationships between adjacent pixels. This hardly aids in understanding the semantics of instructions composed of multiple bytes and is prone to introducing high-frequency noise. Through a top-down pathway and lateral connections, we progressively upsample the deep, strong-semantic feature map and add it element-wise to the shallower feature map, which has first undergone dimensionality reduction via pointwise convolution. This pointwise convolution ensures that the channel count of the shallow features matches that of the upsampled deep features, reducing feature conflicts caused by dimension mismatch, as shown in Equation (8). This process generates a new set of pyramid feature maps that are rich in semantic information across all scales, with a unified channel count.
Standard classification models typically use only the top-level feature of the pyramid, leading to the loss of valuable information from other scales. Conversely, simply concatenating or summing all pyramid layers ignores the fact that features at different scales may possess varying degrees of importance for specific malware samples. For instance, for a Trojan relying on specific API call sequences, the high-resolution layer may contribute most; whereas for ransomware with a distinct packed structure, the low-resolution yet semantically rich layer may be more critical. To allow the model to dynamically assess and fuse features from different scales based on the input data, we designed an Adaptive Weighted Fusion Module. This module accepts all pyramid layers output by the FPN as input, following the process below:
First, to resolve the spatial dimension mismatch, we upsample the two lower-resolution pyramid layers via bilinear interpolation to the spatial dimensions of the highest-resolution layer, obtaining a set of size-aligned feature maps, one per scale, as shown in Equation (9).
Next, we introduce a learnable scale weight vector, where each element corresponds to the importance of a pyramid layer and is initialized to equal weights. To ensure the weights sum to 1 and remain positive, we normalize them using Equation (10), which yields one normalized weight per pyramid layer. During training, the model automatically learns the optimal weight distribution via backpropagation, thereby achieving adaptive selection of multi-scale features.
Then, we scale each size-aligned feature map by its normalized weight and concatenate the weighted maps along the channel dimension, as shown in Equation (11).
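The weighting and concatenation steps can be sketched as follows. The sketch assumes that the normalization in Equation (10) is a softmax, which is one common way to obtain positive weights that sum to 1; the text does not name the function. The helper names and the toy feature values are hypothetical.

```python
import math


def softmax(logits):
    """Map raw scale weights to positive weights that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_concat(features, logits):
    """Scale each size-aligned feature by its weight, then concatenate channels.

    features: list of S scales; each scale is a list of channels (H x W grids).
    """
    weights = softmax(logits)
    fused = []
    for w, scale in zip(weights, features):
        for channel in scale:
            fused.append([[w * v for v in row] for row in channel])
    return fused, weights


# Three scales, two channels each, 1 x 2 maps; equal initial raw weights (assumed zeros).
features = [[[[1.0, 2.0]], [[3.0, 4.0]]] for _ in range(3)]
fused, weights = weighted_concat(features, [0.0, 0.0, 0.0])
```

With equal initial weights each scale contributes one third; as training changes the raw weights, the scale that is most useful for a task receives a larger share before the pointwise fusion that follows.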
Finally, to fuse the concatenated high-dimensional features into a more compact representation, we process them using a pointwise convolution layer to generate the final fused feature map, as shown in Equation (12). The pointwise convolution sets the output channel count after fusion, integrates cross-scale channel information, and eliminates channel redundancy. The fused feature map not only synthesizes information from all scales but, through the adaptive weighting mechanism, also highlights the feature scales most critical for the current classification task. The AFPN acts as a bridge, taking the anisotropy-corrected spatial feature maps unique to malware images from the ADC-UIB module as input, and outputting feature maps enriched with precise spatial layouts and powerful cross-scale semantic annotations. Thus, the next step is to design a classifier capable of directly assembling this structured information and performing advanced spatial relationship reasoning, ultimately solving the “Malware Picasso Problem” in malware image classification.
3.3. Malware-Aware Capsule Network
To avoid irreversible spatial information loss caused by global average pooling (GAP) in traditional CNNs, and to realize end-to-end spatial relationship modeling for the “Malware Picasso Problem”, we design a Malware-Aware Capsule Network (MC-Caps) customized for malware images. It consists of a Malware-Aware Primary Capsule (MC-AC) layer and a Class Capsule layer with dynamic routing, as shown in Figure 7.
3.3.1. Malware-Aware Primary Capsule Layer
The task of the primary capsule layer is to convert the convolutional features from the AFPN into a set of primary capsules that represent basic components. Standard primary capsule layers face several major issues when processing malware images: (1) Locality Limitation: A single convolutional path struggles to capture the complex multi-scale dependencies in malware, ranging from local instructions to global logic. (2) Structure Agnosticism: The indiscriminate processing of spatial positions fundamentally conflicts with the anisotropic structure of malware images, easily leading to erroneous associations of unrelated features between adjacent rows. (3) Lack of Prior Guidance: Relying solely on data to learn spatial relationships results in slow convergence and vulnerability to noise interference. To address these issues and better capture the unique structure of malware, we design the Malware-Aware Primary Capsule Layer (MC-AC).
To balance the model’s basic feature extraction capability with malware-specific targeting, MC-AC employs a dual-path parallel processing strategy for the input feature maps.
The main path aligns with conventional capsule networks, extracting generic local features through a standard convolutional layer. This preserves the robust feature representation capability of traditional CNNs, serving as the cornerstone of model stability, as shown in Equation (13), whose output is organized by the number of primary capsules and the capsule dimension.
The Adaptive Path is a multi-scale adaptive network similar in structure to the ADC-UIB in Section 3.1. It utilizes parallel asymmetric dilated convolutions to target and extract fine-grained component features with varying distance dependencies from the fused multi-scale features, specifically for capsule assembly, as shown in Equation (14). Here, the gating term denotes the adaptive gating mechanism, which learns channel weights via global statistics, as shown in Equation (15), and Fusion represents a pointwise convolution for channel fusion. The asymmetric dilated convolutions follow Equation (1) in Section 3.1.
For these two paths, we balance their contributions using learnable weights, as shown in Equation (16). The two weights form a learnable vector whose initial values, [0.7, 0.3], set the initial contributions of the main path and the adaptive path, respectively; the vector is dynamically adjusted during training to adapt to the feature distribution of malware images.
Furthermore, since standard primary capsule layers treat spatial positions indiscriminately, they tend to incorrectly associate unrelated features between adjacent rows in malware (e.g., padding bytes in different rows). To address this issue, we design a Row-Aware Asymmetric Spatial Bias. Utilizing non-learnable structured guidance, this mechanism ensures that capsules in the same row (corresponding to continuous instructions) possess similar biases, while capsules in different rows (corresponding to discrete code segments) have distinct biases. The bias initialization formula is shown in Equation (17), which gives the bias value b for the n-th primary capsule at each spatial position and capsule dimension as a function of the row index and the feature-map height. The sine function generates a periodic bias pattern: for the same row (fixed row index), the sine value remains constant, reinforcing intra-row continuity; for different rows (varying row index), the b value fluctuates periodically, simulating the discrete inter-row structural properties of malware images. The coefficient 0.1 serves as a bias amplitude factor to prevent excessive bias from interfering with the features themselves.
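A minimal sketch of such a row-dependent sine bias is given below. Only the amplitude 0.1, the sine function, and the dependence on the row index and height come from the text; the phase term 2πh/H is an assumption, and any dependence on the capsule index or capsule dimension in Equation (17) is ignored here. The function name is hypothetical.

```python
import math


def row_aware_bias(height: int, width: int, amplitude: float = 0.1):
    """Return an H x W grid of biases that depend on the row index only."""
    bias = []
    for h in range(height):
        # Assumed phase term: one full sine period over the feature-map height.
        value = amplitude * math.sin(2.0 * math.pi * h / height)
        bias.append([value] * width)  # identical value for every column of the row
    return bias


bias = row_aware_bias(height=8, width=4)
```

All positions in a row share one bias value, neighbouring rows receive different values, and no value exceeds the amplitude 0.1 in magnitude.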
Finally, we normalize the fused capsule features using the signature Squash activation function of capsule networks (where vector length represents the confidence of component existence, and direction represents the component feature pattern), as shown in Equation (18). The Squash function is defined in Equation (19), which includes a small constant to avoid numerical instability.
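For reference, the widely used form of the squash function can be sketched as below. It assumes the standard capsule-network definition, with the stabilizing constant added to the norm in the denominator; the exact placement and value of that constant in Equation (19) are not given in the text, so `eps = 1e-8` is an assumption.

```python
import math


def squash(s, eps: float = 1e-8):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(v * v for v in s)
    norm = math.sqrt(sq_norm)
    # ||s||^2 / (1 + ||s||^2) sets the output length; s / (||s|| + eps) keeps the direction.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * v for v in s]


v = squash([3.0, 4.0])
length = math.sqrt(sum(c * c for c in v))
zero = squash([0.0, 0.0])
```

A long input vector is mapped to a length close to 1, a short one to a length close to 0, and the constant keeps the all-zero vector from causing a division by zero.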
For qualitative visualization, we compute a primary capsule activation map (Figure 8) by averaging the vector norms of all primary capsules at each spatial position and applying min-max normalization. The resulting heatmap reflects where MC-Caps assigns stronger component-level activation.
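The activation map computation can be sketched in a few lines. The nested-list data layout, the function name, and the handling of a constant map (returning zeros) are assumptions made for this illustration.

```python
import math


def activation_map(capsules):
    """capsules[h][w] is a list of capsule vectors; returns an H x W map in [0, 1]."""
    # Mean capsule norm at each spatial position.
    raw = [
        [sum(math.sqrt(sum(c * c for c in cap)) for cap in cell) / len(cell)
         for cell in row]
        for row in capsules
    ]
    lo = min(min(row) for row in raw)
    hi = max(max(row) for row in raw)
    span = hi - lo
    if span == 0.0:  # constant map: avoid division by zero
        return [[0.0 for _ in row] for row in raw]
    # Min-max normalization.
    return [[(v - lo) / span for v in row] for row in raw]


# Toy 1 x 3 grid with two 2-D capsules per position.
caps = [[[[0.0, 0.0], [0.0, 0.0]],
         [[3.0, 4.0], [0.0, 0.0]],
         [[3.0, 4.0], [3.0, 4.0]]]]
heat = activation_map(caps)
```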
3.3.2. Class Capsule Layer and Dynamic Routing
The Class Capsule Layer aggregates the component-level features output by the MC-AC layer into whole-level features via the dynamic routing algorithm. Unlike the fixed weight mapping mechanism of traditional fully connected layers, dynamic routing optimizes routing coefficients through multiple iterations. This allows primary capsules (parts) to transmit information preferentially to the best-matching class capsules (wholes). This mechanism adapts to the “Malware Picasso Problem,” where component positions are distorted but global semantics remain invariant. Even if critical code blocks (e.g., encryption routines) are spatially rearranged, dynamic routing can correctly aggregate them into the corresponding family’s class capsule features based on the relational associations between components. The specific process is as follows:
Dynamic routing first maps each primary capsule to a prediction vector for every class capsule via a learnable weight matrix, representing the potential association of the current component with a specific malware family, as shown in Equation (20). Here, the direction of the prediction vector from primary capsule i to a given class capsule encodes the component-family association pattern, and its length represents the association strength. Each class capsule has its own weight matrix, initialized with small random values to avoid initial mapping bias, and the i-th primary capsule corresponds to a component-level feature.
Subsequently, through three iterations during the training phase, primary capsules whose prediction vectors align with the direction of the class capsule vector are assigned higher routing weights, thereby dynamically adjusting the routing coefficients from primary to class capsules. Before the first iteration, the routing logits are initialized to zero, indicating equal initial contribution from all primary capsules; re-initializing them in every iteration would discard the agreement update. In each iteration, the logits are normalized into routing coefficients, ensuring that the sum of routing coefficients from all primary capsules to a specific class capsule equals 1, as shown in Equation (21).
Next, based on the routing coefficients, the prediction vectors of all primary capsules are weighted and summed to obtain the initial aggregated feature for each class capsule, realizing the dynamic assembly from parts to wholes, as shown in Equation (22).
Finally, the aggregated feature is normalized via the squash function to generate the final class capsule vector. Its length represents the confidence that the sample belongs to the corresponding malware class, while its direction encodes the core semantic features of the family. The routing logits are updated by calculating the consistency (dot product) between the prediction vector and the class capsule vector, completing the current iteration, as shown in Equation (23).
A larger dot product implies higher consistency, resulting in higher weights in the next round. This mechanism ensures that highly matched components continuously reinforce their contribution to the corresponding whole, thereby achieving adaptive routing.
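The routing loop described above can be sketched in pure Python. The sketch follows the normalization as stated in the text, over primary capsules for each class capsule; note that the original dynamic routing formulation of Sabour et al. instead normalizes over class capsules for each primary capsule. The prediction vectors are taken as given, so the learnable weight matrices of Equation (20) are not modeled, and the function names, `eps`, and the toy vectors are assumptions.

```python
import math


def squash(s, eps=1e-8):
    sq = sum(v * v for v in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * v for v in s]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from primary capsule i to class capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, zero-initialized once
    for _ in range(iterations):
        # Softmax over primary capsules i for each class capsule j (as the text states).
        c = [[0.0] * n_out for _ in range(n_in)]
        for j in range(n_out):
            m = max(b[i][j] for i in range(n_in))
            exps = [math.exp(b[i][j] - m) for i in range(n_in)]
            total = sum(exps)
            for i in range(n_in):
                c[i][j] = exps[i] / total
        # Weighted sum of prediction vectors, then squash.
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: dot product of prediction and class capsule vectors.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c


# Two primary capsules agree on class 0 and contradict each other on class 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
v, c = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
```

In this toy case the class capsule on which both parts agree ends up with a clearly larger length than the one on which they disagree, which is the behaviour the agreement update is meant to produce.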
3.3.3. Loss Function Design and Summary
Both real-world scenarios and malware datasets suffer from severe class imbalance. To address this, we adopt a weighted margin loss, as shown in Equation (24). The loss combines, for each sample and each class, the class label, separate margins for the positive and negative classes, a weighting coefficient for negative classes, and a per-class weight that is adaptively computed based on the dataset’s class distribution.
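A minimal sketch of a class-weighted margin loss for one sample is given below. The margins 0.9 and 0.1 and the negative-class coefficient 0.5 are the defaults of the original capsule network paper, not values stated in this text; the inverse-frequency class weights and the choice to apply the class weight to both the positive and negative terms are likewise assumptions, as are the function names and class counts.

```python
def class_weights(counts):
    """Assumed scheme: inverse class frequency, total / (num_classes * count)."""
    total, n = sum(counts), len(counts)
    return [total / (n * c) for c in counts]


def weighted_margin_loss(lengths, label, weights, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss for one sample; lengths[k] is the length of class capsule k."""
    loss = 0.0
    for k, (length, w) in enumerate(zip(lengths, weights)):
        if k == label:
            # True class: penalize capsule lengths below the positive margin.
            term = max(0.0, m_pos - length) ** 2
        else:
            # Other classes: penalize lengths above the negative margin.
            term = lam * max(0.0, length - m_neg) ** 2
        loss += w * term
    return loss


weights = class_weights([900, 100])  # hypothetical class counts
confident = weighted_margin_loss([0.95, 0.05], label=0, weights=[1.0, 1.0])
uncertain = weighted_margin_loss([0.5, 0.5], label=0, weights=[1.0, 1.0])
```

Under this weighting the minority class receives the larger weight, so errors on rare families contribute more to the loss than errors on common ones.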
In summary, MC-Caps addresses the deficiencies of standard capsules in capturing long-range dependencies and adapting to anisotropy by introducing a dynamically fused dual-path structure and a row-aware spatial bias in the primary capsule layer. It utilizes dynamic routing with zero-initialized (unbiased) routing logits to adapt to component positional distortions caused by the “Malware Picasso Problem,” and employs a weighted margin loss with class weights to handle dataset imbalance.
Framework Synergy: The anisotropic features extracted by ADC-UIB are further reinforced by the row-aware bias of MC-AC. The multi-scale features fused by AFPN are aggregated from parts to wholes across scales via Dynamic Routing. The three components form a sequential network of “Feature Extraction (ADC-UIB) → Feature Fusion (AFPN) → Spatial Inference (MC-Caps),” thereby addressing the core problems of spatial information loss, structural mismatch, and poor robustness against polymorphic variants in malware classification.