3.1. Wavelet–Edge Collaboration
Under degraded underwater conditions, key structural cues such as object contours, boundaries, and weak textures are prone to progressive attenuation during deep feature extraction, while complex background clutter can easily induce false responses unrelated to the target [11]. Although directly embedding frequency-domain transformation or edge guidance into each basic unit may strengthen local structural responses, it also tends to entangle structural enhancement with basic representation learning, thereby increasing the complexity of the internal information flow and reducing optimization stability [1]. The proposed WEC module therefore introduces structural enhancement only after the main feature transformation and fusion stages. This design preserves the integrity of the original representation learning process while enabling a progressive refinement of structure-sensitive information through frequency-domain compensation followed by spatial correction.
As shown in Figure 2, WEC follows a CSP-style design [40] with a shortcut branch and a transformation branch. The transformation branch is responsible for basic feature extraction and subsequent structural enhancement, whereas the shortcut branch preserves the original information flow. In this way, degraded textures and high-frequency details are first compensated in the frequency domain, and the resulting fused feature is then refined in the spatial domain to sharpen geometric boundaries.
The input feature is first projected by a convolution and then split into two paths. In the transformation branch, $n$ lightweight local detail extraction (LDE) units are stacked to learn the basic representation. Each LDE unit consists of a convolution for channel reduction, two parallel convolutions with dilation rates of 1 and 2 for local detail and limited context modeling, respectively, and a subsequent fusion path composed of two convolutions for feature integration and channel restoration. This design provides a reliable feature basis for subsequent structural enhancement at relatively low computational cost.
To recover degraded high-frequency details in the frequency domain, a Subband Cooperative Guidance (SCG) module is introduced at the end of the transformation branch. This module does not alter the spatial resolution of the feature map. Instead, it injects frequency-domain priors into the main branch in a residual manner, producing an intermediate feature representation with enhanced texture compensation. The compensated main branch feature is then fused with the shortcut branch feature and passed to an Edge-guided Spatial Correction (ESG) module for further refinement. In ESG, the gradient response is computed on the fused feature representation after SCG-based subband compensation, enabling the spatial gating process to further refine wavelet-enhanced structural cues. Using explicit gradient responses as spatial gating signals, ESG sharpens geometric boundaries, suppresses background noise, and produces the enhanced output feature.
Through this serial design, WEC refines the representation from basic feature extraction to frequency-domain compensation and then to spatial correction. Unlike a parallel use of frequency and edge cues, ESG operates on the feature representation already compensated by SCG, allowing edge-guided refinement to act on more reliable structural responses.
Let the input feature tensor be $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$, and $W$ denote the batch size, channel number, feature-map height, and feature-map width, respectively. WEC first applies a convolution, denoted by $\mathrm{Conv}_{\mathrm{in}}(\cdot)$, to perform channel projection and obtain the transformed feature $X'$. The resulting feature is then divided into a transformation branch and a shortcut branch. The transformation branch, denoted by $\mathcal{T}(\cdot)$, extracts a basic representation $F_b$, while the shortcut branch, denoted by $\mathcal{S}(\cdot)$, preserves the original information flow as $F_s$:

$$X' = \mathrm{Conv}_{\mathrm{in}}(X), \qquad F_b = \mathcal{T}(X'), \qquad F_s = \mathcal{S}(X')$$
Because underwater imaging severely degrades high-frequency textures while leaving low-frequency contour information relatively stable, we introduce the SCG module at the end of the transformation branch. Implemented in a residual manner, SCG injects frequency-domain priors into the main branch to compensate for the structural details missing from the basic representation $F_b$:

$$F_{scg} = \mathrm{SCG}(F_b)$$

where $F_{scg}$ is the frequency-compensated feature and $\mathrm{SCG}(\cdot)$ denotes the proposed SCG module, which explicitly decouples the frequency bands of the feature representation through the discrete wavelet transform (DWT).
Specifically, given the input feature $F_b$, a 2D DWT is first applied to decompose it into one low-frequency subband ($X_{LL}$), which preserves the main structural information, and three high-frequency subbands ($X_{LH}$, $X_{HL}$, and $X_{HH}$), which encode edge and texture details:

$$\{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \mathrm{DWT}(F_b)$$
In this implementation, DWT is fixed as a single-level 2D Haar wavelet decomposition. The decomposition is applied independently to each channel of the feature map, producing one low-frequency subband and three directional high-frequency subbands. The Haar filters are non-learnable, and no multi-level DWT or alternative wavelet family is used in the reported experiments. Since single-level DWT reduces the spatial resolution of each subband by a factor of two, the generated frequency attention maps are resized to the spatial size of the input feature before feature modulation.
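The single-level Haar decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `haar_dwt2` and `upsample_nearest2` are hypothetical, the orthonormal scaling factor and the LH/HL naming convention are assumptions, and nearest-neighbour resizing is only one possible way to bring half-resolution maps back to the input size (the text does not state the interpolation mode).

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT, applied independently to every channel.

    x: array of shape (B, C, H, W) with even H and W.
    Returns (ll, lh, hl, hh), each of shape (B, C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right sample
    c = x[..., 1::2, 0::2]  # bottom-left sample
    d = x[..., 1::2, 1::2]  # bottom-right sample
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a + b - c - d) / 2.0  # top-minus-bottom difference (naming assumed)
    hl = (a - b + c - d) / 2.0  # left-minus-right difference (naming assumed)
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def upsample_nearest2(m):
    """Resize a half-resolution map back to the input size.

    Nearest-neighbour repetition is an assumption made for this sketch.
    """
    return m.repeat(2, axis=-2).repeat(2, axis=-1)
```

Because the filters are fixed, the transform has no trainable parameters; with the orthonormal scaling assumed here, the four subbands together carry the same energy as the input.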
The decomposed subbands provide complementary frequency-domain descriptions of the input feature. The $X_{LL}$ subband mainly retains coarse structural context and region-level appearance information, whereas the high-frequency subbands $X_{LH}$, $X_{HL}$, and $X_{HH}$ capture directional edge and fine-texture responses that are more sensitive to scattering and blur. Accordingly, SCG performs separate channel recalibration for the low-frequency and high-frequency components. This frequency-specific recalibration helps preserve the distinction between coarse structural context and degradation-sensitive detail responses during feature reintegration. In the high-frequency branch, the three directional subbands are first aggregated by a convolution and then passed through a channel attention module to generate the weight map $W_h$, which is used to enhance texture details and boundary responses weakened by degradation. In parallel, the low-frequency branch applies a convolution and a channel attention module to the $X_{LL}$ subband to produce $W_l$, which is used to preserve coarse structural context while reducing background interference:

$$W_h = \sigma\big(\mathrm{CA}(\mathrm{Conv}([X_{LH}, X_{HL}, X_{HH}]))\big), \qquad W_l = \sigma\big(\mathrm{CA}(\mathrm{Conv}(X_{LL}))\big)$$

where $\mathrm{CA}(\cdot)$ denotes the channel attention operation, $\sigma(\cdot)$ denotes the Sigmoid activation function, $[\cdot]$ denotes channel-wise concatenation, and all convolution kernels in this stage share the same size.
To incorporate the generated frequency-domain priors into the spatial representation, the input feature $F_b$ is projected into three components, namely an identity component, a high-frequency-guided component, and a low-frequency-guided component, which are modulated by the corresponding weights, as described in Equation (5). The identity component preserves the original response, while the other two components introduce frequency-specific modulation for detail enhancement and structural context adjustment. The three components are then concatenated and fused by a convolution, yielding the frequency-compensated feature $F_{scg}$ in Equation (6).

$$F_{id} = \psi_{id}(F_b), \qquad F_h = W_h \odot \psi_h(F_b), \qquad F_l = W_l \odot \psi_l(F_b) \tag{5}$$

$$F_{scg} = \mathrm{Conv}\big(\mathrm{Concat}(F_{id}, F_h, F_l)\big) \tag{6}$$

Here, $F_{id}$, $F_h$, and $F_l$ denote the identity component, the high-frequency-guided component, and the low-frequency-guided component, respectively. $\psi_{id}$, $\psi_h$, and $\psi_l$ denote the convolution-based channel-alignment mappings, $W_h$ and $W_l$ are the high-frequency and low-frequency weight maps, and $\odot$ denotes element-wise multiplication.
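After the main branch is compensated, WEC fuses the compensated main branch feature $F_{scg}$ with the shortcut branch feature $F_s$ to obtain the fused representation $y$, which is formulated as follows:

$$y = \mathrm{Fuse}(F_{scg}, F_s)$$

where $\mathrm{Fuse}(\cdot)$ denotes the fusion of the two branches.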
To further strengthen geometric boundaries and fine-grained structural information, WEC introduces ESG at the fusion output stage. Given the fused feature $y$, ESG uses explicit gradient responses as spatial gating signals to perform boundary-aware spatial refinement. Specifically, $y$ is fed into two depthwise convolution branches initialized with horizontal and vertical Sobel kernels, respectively, to extract the directional gradient responses along the $x$- and $y$-directions [41]. The resulting responses are further combined to obtain the gradient magnitude:

$$G_x = \mathrm{DWConv}_x(y), \qquad G_y = \mathrm{DWConv}_y(y), \qquad G = \sqrt{G_x^2 + G_y^2}$$

where $\mathrm{DWConv}_x(\cdot)$ and $\mathrm{DWConv}_y(\cdot)$ denote $3 \times 3$ depthwise convolution operators initialized with horizontal and vertical Sobel kernels, respectively. They are implemented with stride 1, padding 1, and the number of groups equal to the number of input channels. The Sobel kernels are used only for initialization and are updated during training. $G$ denotes the aggregated gradient magnitude response.
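The gradient extraction step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the module itself: the kernels are kept fixed at their Sobel initialization (in the text they are updated during training), the Euclidean form of the magnitude is assumed, and the helper names `depthwise_conv3x3` and `gradient_magnitude` are hypothetical.

```python
import numpy as np

# Sobel kernels; in the described module they only serve as initial values.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def depthwise_conv3x3(y, kernel):
    """Apply the same 3x3 kernel to every channel (stride 1, zero padding 1).

    y: array of shape (B, C, H, W). Uses cross-correlation, as is usual
    for convolution layers in deep learning frameworks.
    """
    h, w = y.shape[-2:]
    padded = np.pad(y, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros(y.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[..., i:i + h, j:j + w]
    return out

def gradient_magnitude(y):
    """Combine horizontal and vertical responses (Euclidean norm assumed)."""
    gx = depthwise_conv3x3(y, SOBEL_X)
    gy = depthwise_conv3x3(y, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)
```

With stride 1 and padding 1 the response keeps the spatial size of the input, so it can gate the fused feature element-wise without resizing.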
The edge response map $m$ is then used to modulate the fused feature $y$ in an element-wise manner, yielding the ESG-refined output:

$$\hat{y} = m \odot y$$
The frequency-domain and edge-guided components in WEC complement each other. Subband decomposition introduces structure-aware information from different frequency ranges, while ESG further sharpens geometric boundaries on the fused feature map through explicit gradient-based spatial refinement. As a result, the features delivered to the neck preserve clearer contours and boundary structures, which is beneficial for subsequent multi-scale aggregation.
3.2. Scale-Selective Fusion
Although WEC improves single-level structural representation, multi-scale fusion is still needed to handle small objects and large shape variations in underwater scenes. However, features from different levels do not contribute equally: shallow features preserve more details, whereas deeper features provide stronger semantics and broader context. If they are fused in a fixed or uniform manner, useful target information and background noise are likely to be propagated together. To address this issue, SSF first selects responses from branches with different receptive fields according to the input feature and then refines the fused result through channel and spatial recalibration. As illustrated in Figure 3, the core of SSF lies in scale-selective aggregation, while a lightweight channel–spatial refinement is applied afterward to stabilize the fused representation.
In the partition-based modeling stage, let the input feature tensor be $x \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$, and $W$ denote the batch size, channel number, feature map height, and feature map width, respectively. The input feature is first transformed by a convolutional mapping and then evenly split along the channel dimension into two parts, $x_1$ and $x_2$, each holding half of the channels. The $x_1$ branch is sent to the pyramid aggregation path to extract features at multiple receptive fields, while $x_2$ is retained to preserve a fine-grained feature path.
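The next stage performs scale-selective aggregation. SSF first applies a convolution to $x_1$ to obtain an intermediate representation $z$. Based on $z$, SSF then constructs four parallel branches, including one identity branch and three depthwise-separable dilated-convolution branches with different receptive fields:

$$b_0 = z, \qquad b_i = \mathrm{DSConv}_{d_i}(z), \quad i = 1, 2, 3$$

where $\mathrm{DSConv}_{d_i}(\cdot)$ denotes a depthwise-separable convolution with dilation rate $d_i$ ($d_1 < d_2 < d_3$). The identity branch $b_0$ preserves local detail information, while the other branches capture contextual responses over progressively enlarged spatial ranges.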
To enable adaptive branch selection, a global descriptor is first extracted from $z$ by global average pooling, denoted by $\mathrm{GAP}(\cdot)$, and then mapped by a lightweight multilayer perceptron, denoted by $\mathrm{MLP}(\cdot)$, to a four-dimensional branch-weight vector. After Softmax normalization, the branch weights are obtained as

$$[\alpha_0, \alpha_1, \alpha_2, \alpha_3] = \mathrm{Softmax}\big(\mathrm{MLP}(\mathrm{GAP}(z))\big)$$

where $\alpha_i$ denotes the normalized weight assigned to the $i$-th branch. Accordingly, the aggregated feature is computed by weighted summation over the four parallel branches, followed by a convolution:

$$F_{agg} = \mathrm{Conv}\Big(\sum_{i=0}^{3} \alpha_i \, b_i\Big)$$
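The branch selection described above can be sketched in NumPy as follows. The sketch makes several assumptions that the text does not confirm: the MLP is taken to be a two-layer network with a ReLU hidden layer, the weight matrices `w1` and `w2` and the function name `select_and_aggregate` are hypothetical, and the convolution applied after the weighted sum is omitted.

```python
import numpy as np

def softmax(v, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_and_aggregate(branches, z, w1, w2):
    """Weight four branch outputs with input-dependent softmax weights.

    branches: list of four arrays of shape (B, C, H, W).
    z:        intermediate feature of shape (B, C, H, W).
    w1, w2:   hypothetical MLP weights of shape (C, hidden) and (hidden, 4).
    Returns the weighted sum (B, C, H, W) and the weights (B, 4).
    """
    g = z.mean(axis=(2, 3))                # global average pooling -> (B, C)
    hidden = np.maximum(g @ w1, 0.0)       # ReLU hidden layer (assumed)
    alpha = softmax(hidden @ w2, axis=-1)  # one weight per branch, per sample
    stacked = np.stack(branches, axis=1)   # (B, 4, C, H, W)
    fused = (alpha[:, :, None, None, None] * stacked).sum(axis=1)
    return fused, alpha
```

Because the weights are normalized per sample, each input decides how much of the identity branch and of each dilated branch enters the sum, which is the difference from fixed or uniform aggregation.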
To further refine the selected multi-scale representation, SSF first combines the scale-aggregated feature $F_{agg}$ and the retained component $x_2$ through concatenation followed by a convolution, yielding a joint representation $F$:

$$F = \mathrm{Conv}\big(\mathrm{Concat}(F_{agg}, x_2)\big)$$
A channel–spatial dual recalibration fusion module (CDRF) is then applied to $F$ to enhance informative responses and suppress irrelevant interference. Specifically, the channel attention map is generated from the global descriptor of $F$ by global average pooling, followed by a convolution and a Sigmoid activation:

$$M_c = \sigma\big(\mathrm{Conv}(\mathrm{GAP}(F))\big)$$

Accordingly, the channel-refined feature is obtained as

$$F_c = M_c \odot F$$

On this basis, spatial recalibration is further applied to $F_c$. A spatial response map is generated by a depthwise convolution followed by a convolution and a Sigmoid activation:

$$M_s = \sigma\big(\mathrm{Conv}(\mathrm{DWConv}(F_c))\big)$$

The spatially recalibrated feature produced by CDRF is formulated as

$$F_{cs} = M_s \odot F_c$$
For residual preservation, the original SSF input $x$ is added back through an identity shortcut. The final output of SSF is therefore given by

$$F_{out} = F_{cs} + x$$

where $F_{cs}$ is the spatially recalibrated feature produced by CDRF and $x$ denotes the input feature used for residual preservation.
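The order of operations in the recalibration stage (channel gate, then spatial gate, then identity shortcut) can be sketched in NumPy. This is a simplified illustration, not the described module: the channel-mixing matrix `w_c` and the channel-projection vector `w_s` are hypothetical stand-ins for the convolutions, the depthwise convolution in the spatial path is omitted, and a single-channel spatial map is assumed.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_gate(f, w_c):
    """Channel attention from the globally pooled descriptor.

    f: (B, C, H, W); w_c: (C, C) hypothetical channel-mixing weights.
    Returns a gate of shape (B, C, 1, 1).
    """
    g = f.mean(axis=(2, 3))                 # global average pooling -> (B, C)
    return sigmoid(g @ w_c)[:, :, None, None]

def spatial_gate(f_c, w_s):
    """Spatial response map from a per-pixel channel projection.

    w_s: (C,) hypothetical projection weights; the depthwise convolution
    of the described module is left out of this sketch.
    Returns a gate of shape (B, 1, H, W).
    """
    logits = np.einsum('bchw,c->bhw', f_c, w_s)
    return sigmoid(logits)[:, None, :, :]

def cdrf_with_residual(f, x, w_c, w_s):
    """Channel recalibration, spatial recalibration, then identity shortcut."""
    f_c = channel_gate(f, w_c) * f          # channel-refined feature
    f_cs = spatial_gate(f_c, w_s) * f_c     # spatially recalibrated feature
    return f_cs + x                         # residual preservation
```

Since both gates lie in (0, 1), recalibration can only attenuate responses; the identity shortcut keeps the original input available regardless of how strongly the gates suppress the fused feature.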
Overall, SSF performs adaptive branch selection before channel–spatial recalibration and residual preservation. Compared with fixed concatenation or uniform aggregation, this selection-before-recalibration design allows the module to emphasize useful receptive-field responses while suppressing background interference during multi-scale fusion.