3.1. Overall Architecture
The overall architecture of GDUFormer is shown in Figure 2. Inspired by GeleNet [8] and DKETFormer [5], we adopt PVT v2-b2 [10] as the encoder for salient features in GDUFormer. As a hierarchical Transformer, PVT v2 encodes global cues from the feature map at four different scales, with the output of the i-th stage denoted as $P_i$ ($i \in \{1, 2, 3, 4\}$). Specifically, these feature maps are downsampled to 1/4, 1/8, 1/16, and 1/32 of the input image resolution, with 64, 128, 320, and 512 channels, respectively.
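As a quick illustration of the resulting feature pyramid, the snippet below lists the tensor shapes one would expect for a 256×256 input; the resolution and variable names are illustrative only and not taken from the paper.

```python
import torch

# Illustrative shapes of the four PVT v2-b2 encoder outputs P1..P4 for a
# hypothetical 256x256 RGB input: strides 4/8/16/32, channels 64/128/320/512.
x = torch.randn(1, 3, 256, 256)
strides, channels = [4, 8, 16, 32], [64, 128, 320, 512]
P = [torch.randn(1, c, x.shape[-2] // s, x.shape[-1] // s)
     for c, s in zip(channels, strides)]
for i, p in enumerate(P, start=1):
    print(f"P{i}: {tuple(p.shape)}")   # P1: (1, 64, 64, 64) ... P4: (1, 512, 8, 8)
```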
The encoder features $P_i$ are subsequently fed into the GDU-based decoder to generate scale-specific outputs. To enhance the interaction between adjacent scales, both $P_i$ and $P_{i+1}$ are passed to the decoder layer corresponding to $P_i$. Each decoder layer is composed of a fixed number of GDUs, within which the upsampled adjacent-scale inputs are iteratively processed to obtain the decoder output at that scale. Since the highest layer ($P_4$) has no higher-level neighbor to interact with, we apply an additional convolution to process it and obtain its decoder output. Notably, all decoder outputs share identical dimensions before the final convolution is performed; after it, the number of channels of every decoder output is uniformly reduced to 32. The GDU suppresses noise, enhances key features, improves foreground-background distinction, and captures fine-grained details. Its structure is shown in Figure 3. Our two main contributions, Full-Dimensional Gated Attention and Hierarchical Differential Dynamic Convolution, are both integrated into the first half of the GDU and are the key factors behind its strong performance. The second half of the GDU is composed of ConvFFN [10] and handles the splitting and iterative processing of the feature maps from the upper and lower layers.
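To make the decoder wiring concrete, the following PyTorch-style sketch shows one plausible way the adjacent-scale inputs could be routed through per-level GDU stacks; the placeholder GDU body, the number of GDUs per layer, and the kernel sizes are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDU(nn.Module):
    """Placeholder for the Gated Differential Unit (FGA + HDDC + ConvFFN)."""
    def __init__(self, ch_cur, ch_up):
        super().__init__()
        self.fuse = nn.Conv2d(ch_cur + ch_up, ch_cur, kernel_size=1)  # stand-in only
    def forward(self, p_cur, p_up):
        p_up = F.interpolate(p_up, size=p_cur.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([p_cur, p_up], dim=1))

class GDUDecoder(nn.Module):
    def __init__(self, chs=(64, 128, 320, 512), gdus_per_layer=2, out_ch=32):
        super().__init__()
        # Levels 1-3 interact with their coarser neighbor; the top level gets a plain conv.
        self.layers = nn.ModuleList([
            nn.ModuleList([GDU(chs[i], chs[i + 1]) for _ in range(gdus_per_layer)])
            for i in range(3)
        ])
        self.top = nn.Conv2d(chs[3], chs[3], kernel_size=3, padding=1)
        self.reduce = nn.ModuleList([nn.Conv2d(c, out_ch, kernel_size=1) for c in chs])
    def forward(self, P):                      # P = [P1, P2, P3, P4]
        outs = []
        for i in range(3):
            d = P[i]
            for gdu in self.layers[i]:
                d = gdu(d, P[i + 1])           # iterative processing of the adjacent pair
            outs.append(self.reduce[i](d))
        outs.append(self.reduce[3](self.top(P[3])))
        return outs                             # four maps, each with 32 channels
```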
The four decoder outputs are then fed into the predictor to obtain the predicted result for ORSI-SOD. Similar to previous work [5,8], we select the effective partial decoder [40] as the prediction head and, like DKETFormer [5], add an extra branch to accommodate the four scale inputs from the decoder. The predictor fuses the multi-scale feature maps through dense element-wise multiplication and concatenation operations, ultimately yielding a single-channel prediction map at the resolution of the input image, referred to as $S$ in this paper. Finally, $S$ is guided by the ground truth map under the joint supervision of the balanced cross-entropy (BCE) loss function and the intersection-over-union (IoU) loss function. This process can be formulated as

$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}}(S, G) + \mathcal{L}_{\mathrm{IoU}}(S, G), \quad (1)$

where $G$ denotes the ground truth map with binary annotations for the salient objects. This completes the description of the overall structure of GDUFormer; a brief sketch of the joint supervision is given below, and the next subsections focus on the FGA and HDDC components within the GDU.
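A minimal sketch of this BCE + IoU supervision, assuming the standard (unweighted) binary cross-entropy for the BCE term, logits as the predictor output, and an unweighted sum of the two losses (all three are our assumptions), could look as follows:

```python
import torch
import torch.nn.functional as F

def iou_loss(logits: torch.Tensor, gt: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft IoU loss between a predicted saliency map (logits) and a binary GT map."""
    prob = torch.sigmoid(logits)
    inter = (prob * gt).sum(dim=(1, 2, 3))
    union = (prob + gt - prob * gt).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def saliency_loss(logits: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Joint BCE + IoU supervision of the prediction map S against the ground truth G."""
    bce = F.binary_cross_entropy_with_logits(logits, gt)
    return bce + iou_loss(logits, gt)
```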
3.2. Full-Dimensional Gated Attention
Given the large object scale variations and noise susceptibility of ORSIs, dedicated modules are essential. Previous methods attempted to address these problems by introducing ViT encoders to extract global cues; however, this approach neither encodes the spatial and channel dimensions simultaneously nor filters and strengthens full-dimensional information. To this end, we propose Full-Dimensional Gated Attention (FGA) within the GDU, with its detailed structure illustrated in Figure 4.
FGA is primarily composed of two branches, responsible for filtering spatial local information and channel relationships, respectively. The left branch first mines spatial local cues through multiple depthwise convolutions with different receptive fields and then uses a grouping mechanism to filter the local features. This helps the model identify noisy information and determine the key cues that should be retained for ORSI-SOD by performing a weighted summation over the stacked receptive fields. Specifically, the receptive field stacking process is formulated in Equation (2), where $\mathrm{Cat}(\cdot)$ represents the concatenation operation, $\delta(\cdot)$ denotes the activation function, $\mathrm{DWConv}_{k\times k}(\cdot)$ stands for the $k\times k$ depthwise convolution, and $X$ is the result obtained after concatenating and normalizing $P_i$ and $P_{i+1}$. The stacked output contains abundant local spatial features. Combined with the fixed-scale global features carried in $P_i$ and $P_{i+1}$, the spatial information is already highly sufficient. Nevertheless, PVT inherently lacks a filtering mechanism; it only regulates the richness of semantic information by adjusting scales. Meanwhile, depthwise convolutions merely extract receptive field information within individual channels. Consequently, direct combination would leave these spatial features cluttered, as validated in our ablation experiments. Fortunately, for local regions the stacked receptive fields partition the noise still present in the global features into multiple candidate windows, and under the influence of the parameter matrix the noise is appropriately stripped away. The focus therefore shifts from suppressing noise to better integrating and summarizing the candidate windows. This transition not only filters out noisy pixels to some extent but also integrates the global features from the encoder with the local information mined by the depthwise convolutions, thereby mitigating the large spatial scale differences in ORSIs; a single construction thus addresses both concerns. We realize this idea through an operation called the grouping mechanism. Specifically, before the concatenation in Equation (2), each feature map is expanded with an extra dimension and the maps are concatenated along this new dimension. A pointwise convolution then expands the parameter space of the grouped tensor; the result is activated with Softmax to obtain weights, which are applied back to the grouped tensor through weighted summation. This process is formulated in Equation (3),
where $\mathrm{PWConv}(\cdot)$ stands for the pointwise convolution and $\mathrm{Sum}_{d=1}(\cdot)$ denotes summation along dimension 1 with that dimension subsequently compressed. Notably, the dimension expansion and reconstruction allow the grouping mechanism to exploit the Softmax-normalized weights along the extra dimension, thereby aggregating the feature representations of the diverse receptive fields. These weights can be interpreted as the share of the model's total confidence resources (summing to 1) assigned to each candidate according to its competitiveness. Through the element-wise multiplication in Equation (3), the feature space expanded by the extra pointwise convolution is applied back to the grouped tensor via these weights, and the summation along dimension 1 lets the weights complete the feature integration, achieving the objectives analyzed above. Through this operation, the left branch integrates candidate cues from different local receptive fields without introducing a large number of parameters, thereby suppressing noise and alleviating large scale differences. It is worth noting that the gating mechanism, as an attention mechanism, should not introduce excessive parameters or computational burden, which is why the structure of the left branch is kept relatively simple; the preceding analysis of noise suppression and scale-difference alleviation justifies the rationality and innovation of this design.
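The following sketch illustrates one plausible implementation of the left branch's receptive-field stacking and grouping mechanism; the kernel sizes, the use of GELU, and the tensor layout are assumptions made for illustration rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class GroupedReceptiveFieldGate(nn.Module):
    """Stacks multi-scale depthwise responses and fuses them with Softmax weights."""
    def __init__(self, ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch) for k in kernel_sizes
        )
        self.act = nn.GELU()
        # Pointwise convolution expanding the parameter space of the grouped tensor.
        self.pw = nn.Conv2d(len(kernel_sizes) * ch, len(kernel_sizes) * ch, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W), the normalized Cat(Pi, Pi+1)
        feats = [self.act(dw(x)) for dw in self.dwconvs]
        g = torch.stack(feats, dim=1)           # (B, G, C, H, W): extra group dimension
        b, G, c, h, w = g.shape
        logits = self.pw(g.reshape(b, G * c, h, w)).reshape(b, G, c, h, w)
        weights = torch.softmax(logits, dim=1)  # normalized weights over the group dimension
        return (weights * g).sum(dim=1)         # weighted summation; dimension 1 is compressed
```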
For the right branch, we focus primarily on channel selection. Since different channels can be viewed as copies of the feature map, removing weak channels greatly enhances the effectiveness of the key features in the map. The specific structure of this operation is shown on the right side of Figure 4.
To be specific, the right branch of FGA is built on a multi-head linear self-attention mechanism, which is theoretically linked to Vision Mamba (ViM) via the linear attention equivalence proposed in MLLA [41]. As demonstrated by Han et al. [41], ViM's selective scanning mechanism can be approximated by linear self-attention with iterative feature updates, where the state transition of ViM corresponds to the key-value interaction in linear attention. To explain this process, we formulate the selective scanning mechanism of ViM as

$h_t = \mathbf{A} h_{t-1} + \mathbf{B} x_t, \qquad y_t = \mathbf{C} h_t + \mathbf{D} x_t, \quad (4)$

where $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ are parameter matrices that control the input $x_t$ and output $y_t$, and $h_t$ is the hidden state. The first half of Equation (4) describes how the state evolves (via matrix $\mathbf{A}$) according to the impact of the input on the state (via matrix $\mathbf{B}$), while the second half illustrates how the state is converted into the output (via matrix $\mathbf{C}$) and how the input directly influences the output (via matrix $\mathbf{D}$). MLLA [41] argues that $h_{t-1}$ and $h_t$ can be regarded as the upper and lower iterative feature maps of linear self-attention, that $\mathbf{B}$ and $x_t$ can be abstracted as the keys and values for matrix multiplication, and that $\mathbf{C}$ can serve as the query vectors. In addition, the original MLLA [41] uses multiple positional encodings to replace the traditional forget gate mechanism (matrix $\mathbf{A}$), aiming to make the ViM-inspired linear attention more suitable for visual tasks.
However, we argue that applying the gating mechanism to the lower-layer iterative feature map is equivalent to implementing a controlled residual connection, which is not conducive to filtering channel relationships and makes it harder for the reasoning and abstraction capabilities brought by ViM to take effect. For this reason, we place the gating mechanism on the outcome of the matrix multiplication between the key and value vectors in the linear self-attention mechanism. This choice not only helps overcome the representational bottleneck of linear models but also promotes a dynamic balance between memory and forgetting. For ease of understanding, the following description shifts the perspective from the ViM interpretation back to multi-head linear attention. We denote the result of the matrix multiplication between the key and value vectors as $KV$, where the number of attention heads is $h$. In the last two dimensions of $KV$, the relationship between each pair of channels is explicitly encoded, which makes it well suited for deploying the gating mechanism for channel filtering. To this end, we apply an internal gating attention to the similarity matrix $KV$. To avoid disrupting channel relationships, the internal gating attention employs a pointwise convolution to mine the complementary relationships between heads and the GELU activation function to filter out poor channel relationship pairs, as formulated in Equation (5), where GELU suppresses most negative values and thereby ensures that active channel relationships are preserved. Following this operation, the gated $KV$ is activated with Softmax. After matrix multiplication with the query and a residual connection with the value vector, the result of channel relationship filtering is obtained. To the best of our knowledge, our work is one of the early efforts in computer vision to perform convolution and activation on the similarity matrix and feed the output back into it. This procedure improves the abstract reasoning, cue representation, and dynamic balancing abilities of the ViM-inspired linear attention embedded in the module. Notably, for the residual connection with the value vector we adopt a pointwise convolution to implement a simple gate, which in turn strengthens the role of the parameter matrix $\mathbf{D}$ in Equation (4).
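The sketch below gives one plausible PyTorch reading of this internally gated multi-head linear attention; the multiplicative feedback of the gate onto $KV$, the normalization of the key-value product by the token count, and the form of the value-side gate are our assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearChannelAttention(nn.Module):
    """Multi-head linear attention with an internal gate on the key-value product."""
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.h, self.d = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.head_mix = nn.Conv2d(heads, heads, kernel_size=1)   # pointwise conv across heads
        self.value_gate = nn.Conv1d(dim, dim, kernel_size=1)     # simple gate on the value path
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, C) flattened tokens
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, N, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        kv = torch.einsum("bhnd,bhne->bhde", k, v) / N       # (B, h, d, d): channel-pair matrix
        gate = F.gelu(self.head_mix(kv))                     # mine inter-head complementarity
        kv = torch.softmax(kv * gate, dim=-1)                # assumed multiplicative feedback
        out = torch.einsum("bhnd,bhde->bhne", q, kv)         # apply the query
        out = out.transpose(1, 2).reshape(B, N, C)
        v_flat = v.transpose(1, 2).reshape(B, N, C)
        gated_v = self.value_gate(v_flat.transpose(1, 2)).transpose(1, 2)
        return self.proj(out + gated_v)                      # residual with the (gated) value
```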
41], the scanning process of the ViM-inspired linear attention structure achieves the function of the selective scanning mechanism of the original Mamba architecture to a certain extent. Nevertheless, there are notable gaps between them in the detailed structures, including tensor operations, weight generation and feature application, which is why we define it as a ViM-inspired linear attention architecture. In comparison with the standard linear attention, the structural modifications we have made are relatively limited, and our main contributions focus on the innovative thinking and interpretation of the ViM-inspired linear attention method. By applying the gating mechanism to the Key-Value product, the rank of the similarity matrix can be raised to a certain extent, which further changes the representational strength of the linear model and enhances the dynamic balance of memory and forgetting. The operations we adopted have not been tried in prior ViM-inspired linear attention methods, and the extra interpretable processing of the similarity matrix itself makes certain theoretical and structural contributions. To sum up, the proposed ViM-inspired linear attention structure is distinctly different from the original Mamba architecture and the standard linear attention. Thanks to the internal gating attention in Equation (
5), weak channel relationships have been eliminated, and key object features have been strengthened across multiple functional layers of the selective scanning mechanism. The combination of
and
can be formulated as
where
denotes the batch normalization mechanism, and
is the activation function. Via the operations in Equation (
6), the filtered local spatial information and channel relationships are integrated. Pixels and channels with high noise, repetition, redundancy, or misidentification are also excluded from FGA’s information flow. At the same time, the module’s ability to represent, abstract, and balance key salient object features is further enhanced. Based on this, the obtained
can be transformed into an external-oriented overall gating mechanism via
and
, applying high-standard full-dimensional information to HDDC as attention. This design strictly suppresses the majority of abnormal cues in HDDC’s output, thereby ensuring the efficiency and reliability of GDU during decoding.
In summary, FGA’s dual-branch design, which distinguishes itself from single-dimensional attention mechanisms, achieves full-dimensional feature purification and lays the groundwork for subsequent feature enhancement based on HDDC. It should be noted that FGA is actually an auxiliary component of HDDC, whose main purpose is to control the results of HDDC. Therefore, HDDC functions as the core component responsible for information flow processing within GDU, underscoring its critical importance. Further details of HDDC will be provided in
Section 3.3.
3.3. Hierarchical Differential Dynamic Convolution
Before formally introducing HDDC, it is necessary to clarify its critical role within the GDU and, more broadly, within the entire decoder. For HDDC, its input consists of global features from adjacent scales of the encoder. Although global cues have already been extracted, these feature maps generally lack foreground-background distinction and tend to overlook the local details of small salient objects. Applying only conventional convolutions to capture local context may impair the salient cues embedded in the global features. Therefore, it is necessary to design a module structure that can effectively model long-range dependencies and, at the same time, focus on key positions in the feature map through weight parameter control.
OverLoCK [
42] proposes a dynamic convolution with context mixing capability. Its core idea is to characterize the correlation between a single token at the center of a region and its context by leveraging the token itself and the affinity values between it and all other tokens. Subsequently, these affinity values are aggregated, and a token-based dynamic convolution kernel is constructed through a learnable mechanism, thereby integrating context information into each weight of the convolution. While this is indeed an excellent and efficient method for context mixing, we must point out that its structural design still has some shortcomings for our needs. For one thing, although it constructs a token-based dynamic convolution kernel to meet the need for modeling long-range dependencies, the implementation of this operation still relies on encoding pairwise relationships between pixel pairs without introducing any differentiation, exploration, or recognition mechanisms. This easily leads to similar foregrounds and backgrounds being mistakenly identified as a single category (salient or non-salient). For another, despite the structure being called a convolution kernel, the foundation for deploying kernel weights is essentially token relationships. Under such circumstances, how to strengthen attention to specific key pixel positions, excavate potential salient objects, and improve the accurate recognition of details like small salient objects and salient object outlines has become another key aspect requiring consideration.
According to the above analysis, we impose attention weights based on distance decay and hierarchical intensity difference capture (HIDC) on the kernel space of ContMix [42]. The goal is to capture pixel differences in local regions and to strengthen the model's feature discrimination and mining capabilities. The overall structure of HDDC with the applied weights is illustrated in Figure 5. Specifically, similar to ContMix [42], we first re-split the concatenated encoder global features along the channel dimension, letting the upper-layer features store context information and the lower-layer features enhance the weight attributes. The rationale is that the decoder outputs results at the current level, so it is more appropriate for the lower layer (i.e., the current layer) to amplify salient features. Subsequently, we perform matrix multiplication on the upper- and lower-layer global features after they are processed by the weight matrices. After a scale transformation, the result serves as the initial ContMix convolution kernel with a shape of $G \times K_0^2 \times N$, where $K_0$ stands for the initial kernel size, $G$ is the number of heads of the multi-head self-attention mechanism, and $N$ denotes the product of the image width and height. Next, we generate two independent kernels from it to characterize the correlation between tokens and their context, enabling the subsequent aggregation of affinity values. This process is formulated in Equation (7),
where $K_1$ and $K_2$ are the generated independent kernels, the split operation divides the result along the kernel dimension, and the accompanying extension enlarges the original kernel dimension $K_0^2$ so that two kernels of different sizes can be produced. Through this operation, the two generated convolution kernels are relatively independent and capture local information under different receptive fields; since they are built from contextual relationships, they also facilitate wide-range context mixing. In our implementation, the kernel-space scales of $K_1$ and $K_2$ are configured to two different sizes. Afterwards, to balance contextual information against the kernel-space weights, we reduce the dimensionality of $K_1$ and $K_2$ by averaging across the spatial dimensions, yielding $\bar{K}_1$ and $\bar{K}_2$, which serve as the foundation for generating the kernel-space attention.
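The following heavily simplified sketch shows one way such context-conditioned kernels could be generated and then pooled into kernel-space descriptors; the projection layers, pooling to a region grid, head count, and kernel sizes are placeholder assumptions, and the actual ContMix formulation in [42] differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelGenerator(nn.Module):
    """Generates two context-conditioned dynamic kernels per spatial location (a sketch)."""
    def __init__(self, ch, heads=4, k0=7, k1=3, k2=5):
        super().__init__()
        self.g, self.k0, self.k1, self.k2 = heads, k0, k1, k2
        self.q_proj = nn.Conv2d(ch, ch, kernel_size=1)   # lower-layer (current-level) features
        self.r_proj = nn.Conv2d(ch, ch, kernel_size=1)   # upper-layer (context) features
        self.extend = nn.Linear(k0 * k0, k1 * k1 + k2 * k2)

    def forward(self, upper, lower):                     # both: (B, C, H, W)
        B, C, H, W = lower.shape
        q = self.q_proj(lower).reshape(B, self.g, C // self.g, H * W)
        # Pool the context map to a K0 x K0 grid of region tokens.
        r = F.adaptive_avg_pool2d(self.r_proj(upper), self.k0)
        r = r.reshape(B, self.g, C // self.g, self.k0 ** 2)
        # Affinity between every token and every region token: (B, G, K0^2, N).
        kernel0 = torch.einsum("bgck,bgcn->bgkn", r, q) / (C // self.g) ** 0.5
        # Extend K0^2 -> k1^2 + k2^2 along the kernel dimension and split into two kernels.
        kernels = self.extend(kernel0.transpose(-1, -2))            # (B, G, N, k1^2 + k2^2)
        K1, K2 = kernels.split([self.k1 ** 2, self.k2 ** 2], dim=-1)
        K1_bar, K2_bar = K1.mean(dim=2), K2.mean(dim=2)             # spatial averaging
        return K1, K2, K1_bar, K2_bar
```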
Specifically, inspired by RMT [43], we adopt a distance paradigm to shape the weight distribution of the kernel space. Since $K_1$ and $K_2$ receive identical kernel-space attention, we only detail the weight application for $K_1$ for brevity. The spatial decay matrix based on the Manhattan distance proposed by RMT introduces explicit spatial information before self-attention, providing the attention mechanism with a clear spatial prior. Given that the query-key space of self-attention benefits from such a spatial prior, is it feasible to incorporate this prior into the kernel space? To answer this question, we first write RMT's 2D spatial decay matrix as

$\mathbf{D}^{2d}_{nm} = \gamma^{\,|x_n - x_m| + |y_n - y_m|}, \quad (8)$

where $(x_n, y_n)$ and $(x_m, y_m)$ denote the coordinates of any two points in the 2D space and $\gamma \in (0, 1)$ is the decay rate. Based on these coordinates, the spatial decay matrix represents the Manhattan distance between each pair of tokens via $|x_n - x_m| + |y_n - y_m|$ and ultimately acts on the self-attention mechanism. The critical reason this approach is feasible is that the $N \times N$ similarity matrix guarantees that the relationship between any two tokens is encoded. In the kernel space, however, no such $N \times N$ similarity matrix exists, and constructing one from scratch could not act on the initial attention $\bar{K}_1$, which has already collapsed to the kernel size. For this reason, we convert the spatial decay matrix into a distance decay mechanism targeting the diagonals of the kernel space. We first construct a correlation matrix over the kernel window and let the 2D spatial decay matrix compute the Manhattan distance between its elements. Under this operation, the spatial prior places the highest weights on the diagonal of the kernel space, and within each row the weight decays more sharply the farther an element lies from the diagonal element of that row. This grants the distance decay directional characteristics, strengthening the capture of discriminative information and detailed cues along the diagonals. To balance the decay, we apply the same operation to the other diagonal of the kernel space as well. This process is formulated in Equation (9),
where the transpose operation flips the decay matrix along the row dimension so that the anti-diagonal is covered as well. A small numerical sketch of this construction is given below.
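Assuming the decay follows RMT's exponential form and that the two diagonal terms are simply summed (both are our illustrative choices rather than details stated above), the decay prior for a k×k kernel can be built as follows:

```python
import torch

def diagonal_distance_decay(k: int, gamma: float = 0.9) -> torch.Tensor:
    """Distance-decay weights for a k x k kernel space, peaking on both diagonals."""
    idx = torch.arange(k)
    row, col = torch.meshgrid(idx, idx, indexing="ij")
    main = gamma ** (col - row).abs().float()              # decay w.r.t. the main diagonal
    anti = gamma ** (col - (k - 1 - row)).abs().float()    # decay w.r.t. the anti-diagonal
    return main + anti                                     # (k, k) kernel-space prior

print(diagonal_distance_decay(5))
```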
Although the distance relationships on both diagonals of the kernel space are now encoded, the module still lacks an enhancement of the kernel space as a whole. For this reason, motivated by PiDiNet [44], we propose a hierarchical intensity difference capture (HIDC) mechanism. It adaptively extracts the pixel intensity difference at each level within the kernel window and constructs global kernel-space weights accordingly. Specifically, drawing on the angular pixel difference convolution (APDC) of [44], the mechanism takes the kernel center as the origin and one pixel as the hierarchical division distance, computing the intensity difference between each pixel and its clockwise-adjacent neighbor. In subsequent steps, the per-level difference results are made hierarchically heterogeneous and further highlighted, so as to perceive the overall relational state of the kernel space. The pixel intensity difference calculation for a specific level is formulated as

$y = \sum_{n=1}^{N} w_n \cdot (x_n - x_{n'}), \quad (10)$
where $N$ stands for the number of pixels in the current window level, $x_n$ denotes the pixel value of the $n$-th point, $x_{n'}$ is the pixel connected clockwise to $x_n$ within the hierarchical window, and $w_n$ is the corresponding kernel weight. Similarly to APDC [44], we transform the equation from a pixel-dominated form into a kernel-weight-dominated form, i.e., the differences are taken between kernel weights rather than pixel values, which enables intensity difference capture at each window level without operating on the feature map pixels (a toy example of this weight-side rewriting is given below).
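The sketch below illustrates this weight-side reformulation for the innermost 3×3 ring of a kernel: instead of differencing pixels, the kernel weights themselves are differenced against their clockwise neighbors. The clockwise ordering, the single-ring scope, and the scaling factor are our simplifications for illustration.

```python
import torch

# Clockwise ordering of the 8 positions of the innermost ring, given as
# (row, col) offsets relative to the kernel centre.
CLOCKWISE_RING1 = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def ring1_difference_weights(kernel: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Rewrites w_n -> alpha * (w_n - w_{n'}) on the innermost ring, with w_{n'} the
    clockwise-adjacent weight, so the convolution responds to intensity differences."""
    k = kernel.shape[-1]
    c = k // 2
    out = kernel.clone()
    ring = [(c + dr, c + dc) for dr, dc in CLOCKWISE_RING1]
    for i, (r, col) in enumerate(ring):
        r_next, c_next = ring[(i + 1) % len(ring)]           # clockwise-adjacent position
        out[..., r, col] = alpha * (kernel[..., r, col] - kernel[..., r_next, c_next])
    return out

w = torch.randn(1, 1, 7, 7)        # e.g. a 7x7 dynamic kernel
print(ring1_difference_weights(w)[0, 0])
```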
Taking a 7×7 window as an example, we first remove the central pixel of the window and divide the remaining positions into three levels containing 8, 16, and 24 pixels, respectively. We then calculate the pixel intensity differences within each level and convert the calculation into a kernel operation. Up to this point, the weight-difference expression of each level is still linear: regardless of the level, the base of the intensity difference calculation is identical. However, as analyzed above, introducing a spatial prior is beneficial, and what the module still lacks is precisely an enhancement of the overall kernel-space relationships; this enhancement must pay particular attention to foreground-background discrimination and the capture of local details. To achieve this, we introduce an adaptive hierarchical operator (AHO) and an intensity difference highlighting mechanism (IDHM), formulated in Equation (11),
where $\alpha_\ell$ denotes the adaptive hierarchical operator, and each level of the kernel-space window, from inner to outer, is assigned a different $\alpha_\ell$. In practice, still taking the 7×7 window as an example, the initial values of $\alpha_\ell$ for the levels from inside to outside are set to 2, 1, and 0.75, respectively. The rationale is that, when the window moves to the current central point, accurate foreground-background discrimination requires the most attention and vigilance for the pixels closest to the center, while the importance of more distant pixels naturally decreases level by level. This agrees with the core idea of the distance decay mechanism analyzed above, and it is also why the kernel space was given a hierarchical structure in the first place. It is worth mentioning that $\alpha_\ell$ is learnable, allowing it to adapt to the iteration and optimization of the convolution kernel. $W_\ell$ is the outcome of the weight-dominated calculation and is numerically identical to the per-level difference result of Equation (10). The IDHM term further exposes the important encodings in the kernel space and generates correspondingly larger inner products; used as weights, these produce higher activation responses, thereby promoting the discovery, excavation, and capture of local details. After the hierarchical construction of the convolution kernel, the adaptive hierarchical heterogeneity, and the intensity difference highlighting, the weight that characterizes, activates, and enhances the overall kernel-space relationships is ready. Note that in Equation (11), $W_\ell$ represents the difference weight corresponding to the $\ell$-th level. Since the hierarchical construction does not affect how the weights are ultimately used, we integrate the per-level weight results after the above operations into a single weight, denoted $W_{\mathrm{HIDC}}$. This result is first added to the distance decay weights of the corresponding kernel size, then combined with $\bar{K}_1$ through a residual connection, and finally applied to $K_1$ after activation. This process is formulated in Equation (12),
where Sigmoid denotes the Sigmoid activation function and $\hat{K}_1$ represents the dynamic convolution kernel after the kernel-space weights have been applied. Under the combined action of the distance decay and HIDC weights, this dynamic kernel not only achieves fine-grained recognition of objects and textures and accurate perception of local details, but also urges the convolution to focus additionally on the directional cues and discriminative information along the kernel window diagonals. As a result, the ContMix mechanism gains stronger feature aggregation capabilities and becomes better suited to the ORSI-SOD task. We refer to the entire process of weight extraction, construction, application, and convolutional aggregation as HDDC, which serves as the key information-processing component of the GDU. Notably, $K_2$ undergoes the same operations as $K_1$, and their results are concatenated in subsequent steps to enrich the GDU's feature reserve under different local receptive fields.
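Putting the pieces together, the following sketch shows one plausible assembly of the kernel-space weights and their application to a dynamic kernel; the additive combination inside the Sigmoid and the multiplicative application to the kernel follow our reading of the description above and should not be taken as the reference implementation.

```python
import torch

def apply_kernel_space_weights(K: torch.Tensor,       # (B, G, N, k*k) dynamic kernel
                               K_bar: torch.Tensor,   # (B, G, k*k) spatially averaged kernel
                               decay: torch.Tensor,   # (k*k,) diagonal distance-decay weights
                               W_hidc: torch.Tensor   # (k*k,) integrated HIDC weights
                               ) -> torch.Tensor:
    """K_hat = K * Sigmoid(W_hidc + decay + K_bar), broadcast over batch, head, position."""
    gate = torch.sigmoid(W_hidc + decay + K_bar)       # (B, G, k*k)
    return K * gate.unsqueeze(2)                       # (B, G, N, k*k)
```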
The innovation of HDDC centers on task-tailored kernel space optimization, bridging long-range dependency modeling associated with Transformers and local detail capture typical of convolution, effectively tackling a major limitation faced by existing hybrid methods.