3.2. Cross-Spatial Global Perceptual Attention
Convolutional neural networks, with their inductive biases, are well suited to capturing geometric and topological information. However, this also results in a limited receptive field: convolution attends only to local information and lacks the ability to model long-range dependencies [30]. Our aim is therefore to enable the model to focus its attention on the region of interest when objects cannot be clearly distinguished, as in underwater images with low contrast and complex backgrounds. To this end, we propose Cross-Spatial Global Perceptual Attention (CSGPA), whose structure is shown in Figure 2; the detailed structures of its two parts are shown in Figure 3.
First, we reconstruct the structure of self-attention by mapping the input feature map into N sets of Q, K, and V. The self-attention formula proposed by the Transformer architecture is as follows:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \quad (1)
where Q, K, and V are the query, key, and value obtained by mapping the input X through parameter matrices, and d_k is the key dimensionality. B is the batch size and C represents the number of channels; the width and height of the feature map are denoted by w and h. For multihead attention, so that each head learns an independent attentional representation, the shapes of Q, K, and V are reshaped according to (3), where C/N is the number of channels each head is responsible for.
Similarly, let I be the input feature map, where (m, n) denotes a pixel on the feature map, with m and n ranging over its spatial dimensions. We treat the mapping of the input feature map to Q, K, and V through the parameter matrices as equivalent to a mapping through 1 × 1 convolutions and reconstruct the self-attention formula as follows:
where W_Q, W_K, and W_V are the parameter matrices. The feature map then travels down two separate paths: one into the self-attention mechanism and the other into the Efficient Multi-Scale Attention (EMA) mechanism. For the self-attention path, combining (1) and (2), the expression of self-attention on a feature map can be written as follows:
where O is the output feature map of self-attention. Then, combining (4) and (5), the output of the self-attention part of a multihead system with N heads can be formulated as follows:
Equation (6) corresponds to the stream in Figure 3a, and this reconstruction of the self-attention mechanism makes it possible to implement the structure shown in Figure 3b.
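To make the reconstructed formulation concrete, here is a minimal PyTorch sketch of multi-head self-attention in which Q, K, and V are produced by 1 × 1 convolutions on a B × C × h × w feature map. It is an illustration of the idea only; the class name and head count are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ConvSelfAttention(nn.Module):
    """Multi-head self-attention over a feature map, with Q/K/V from 1x1 convs.
    A sketch of the reconstruction described above, not the paper's exact code."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        # 1x1 convolutions play the role of the parameter matrices W_Q, W_K, W_V.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Map to Q, K, V and reshape to (B, N, C/N, h*w) so each head works independently.
        q = self.to_q(x).view(b, self.num_heads, self.head_dim, h * w)
        k = self.to_k(x).view(b, self.num_heads, self.head_dim, h * w)
        v = self.to_v(x).view(b, self.num_heads, self.head_dim, h * w)
        # Scaled dot-product attention per head over the spatial positions.
        attn = torch.softmax(q.transpose(-2, -1) @ k / self.head_dim ** 0.5, dim=-1)
        out = v @ attn.transpose(-2, -1)          # (B, N, C/N, h*w)
        return out.reshape(b, c, h, w)            # back to a feature map


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(ConvSelfAttention(64, num_heads=4)(feat).shape)  # torch.Size([2, 64, 32, 32])
```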
The other path allows the model to obtain a richer feature representation while avoiding the loss of semantic information caused by the reduction of channel dimensions when processing deep visual features. To this end, Q, K, and V are concatenated along the N dimension and linearly transformed by a convolution to generate the feature map that is input to the EMA module; its channel dimension corresponds to the number of weights after convolution-kernel expansion.
As shown in Figure 3b, the EMA module is mainly composed of four parts: Feature Grouping, Channel Processing, Spatial Processing, and Cross-Spatial Learning. First, in the Feature Grouping stage, we group the feature maps fed into the EMA according to the number of heads N, which enables collaboration between EMA and self-attention and maintains consistency of computation; that is, the number of channels each head is responsible for processing is C/N.
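As a small illustration of the grouping step (a sketch under our own naming, not the authors' code), folding the N groups into the batch dimension lets each group of C/N channels be processed independently:

```python
import torch


def group_features(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Split channels into num_groups groups and fold them into the batch dim,
    so each group of C // num_groups channels is processed independently."""
    b, c, h, w = x.shape
    return x.reshape(b * num_groups, c // num_groups, h, w)


x = torch.randn(2, 64, 32, 32)
print(group_features(x, num_groups=4).shape)  # torch.Size([8, 16, 32, 32])
```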
In the channel-processing stage, two one-dimensional global average poolings encode channel information along the horizontal and vertical directions, respectively. As a result, channel relationships can be captured in one direction while precise position information is retained in the other. The two outputs are then concatenated along the h dimension and linearly transformed with a 1 × 1 convolution; the formula for this process is as follows:
Next, the result is reconstructed into two one-dimensional vectors representing the channel information in the horizontal and vertical directions. The channel attention weights in the two spatial directions are obtained with a sigmoid function and applied to the input feature map. The output of the channel-processing stage is as follows:
After that, a 3 × 3 convolution is used to capture multiscale spatial feature representations; it can be formulated as follows:
In the cross-space learning part, the outputs of the channel-processing stage and the spatial-processing stage are each global-average pooled and then passed into the softmax function to obtain the attention matrix; the formula for two-dimensional global average pooling is expressed as follows:
Afterwards, each attention matrix is applied as a weight to the other branch's output feature map, so that local features guide the global semantic information and enhance local feature representation, while global channel features guide local feature learning and mitigate the loss of global information. The resulting operation is described by the following equation:
Finally, the output of the CSGPA is obtained as follows:
where the two learnable parameters dynamically adjust the weighting with which the two paths are fused. As demonstrated in the CSGPA module, EMA and self-attention work in parallel and reinforce each other through effective integration and streamlined processing. CSGPA not only models long-range feature dependencies to capture cross-regional information and better discriminate between objects and backgrounds; it also aggregates cross-spatial information from different spatial dimensions, enriching feature representations and enhancing feature learning in semantic regions, which improves the model's focus on target regions in degraded underwater images.
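The following PyTorch sketch renders the second path and the final two-path fusion as we read the description above. It is a simplified, hypothetical implementation: the group count, normalization choice, and the names of the two learnable fusion scalars (alpha, beta) are assumptions, not the published CSGPA code.

```python
import torch
import torch.nn as nn


class CrossSpatialBranch(nn.Module):
    """EMA-style branch: directional pooling, 1x1/3x3 convs, cross-spatial learning.
    Simplified sketch of the description above, not the authors' implementation."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along the height
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = x.reshape(b * self.groups, c // self.groups, h, w)

        # Channel processing: 1-D poolings in two directions, concat along h, 1x1 conv.
        ph = self.pool_h(g)                                   # (b*g, cg, h, 1)
        pw = self.pool_w(g).permute(0, 1, 3, 2)               # (b*g, cg, w, 1)
        yh, yw = torch.split(self.conv1x1(torch.cat([ph, pw], dim=2)), [h, w], dim=2)
        ch_out = self.gn(g * yh.sigmoid() * yw.permute(0, 1, 3, 2).sigmoid())

        # Spatial processing: 3x3 conv for multiscale spatial features.
        sp_out = self.conv3x3(g)

        # Cross-spatial learning: each branch's pooled softmax weights the other branch.
        bg, cg = ch_out.shape[:2]
        w1 = torch.softmax(ch_out.mean(dim=(2, 3)), dim=1).reshape(bg, 1, cg)
        w2 = torch.softmax(sp_out.mean(dim=(2, 3)), dim=1).reshape(bg, 1, cg)
        a1 = (w1 @ sp_out.reshape(bg, cg, h * w)).reshape(bg, 1, h, w)
        a2 = (w2 @ ch_out.reshape(bg, cg, h * w)).reshape(bg, 1, h, w)
        out = g * (a1 + a2).sigmoid()
        return out.reshape(b, c, h, w)


class TwoPathFusion(nn.Module):
    """Fuses the self-attention path and the cross-spatial path with two learnable scalars."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # weight of the self-attention path
        self.beta = nn.Parameter(torch.ones(1))    # weight of the cross-spatial path

    def forward(self, sa_out: torch.Tensor, ema_out: torch.Tensor) -> torch.Tensor:
        return self.alpha * sa_out + self.beta * ema_out
```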
3.3. Efficient Multi-Scale Weighting Feature Pyramid Network
In CNN-based object-detection tasks, shallow feature maps contain extensive fine-grained information such as texture and edges owing to their higher spatial resolution, whereas deep features carry more abstract semantic information, such as the overall contour of the target. For this reason, feature information at different scales must be fused effectively to improve the performance of the object-detection network [31]. The Feature Pyramid Network (FPN) fuses deep feature maps with shallow feature maps through simple up-sampling and concatenation operations to generate multiscale feature maps. The PANet in Figure 4b adds a bottom-up path to the top-down pathway to integrate richer features. More cross-connections are introduced in the structure of Figure 4c; this approach resembles a residual network in concept, but the architecture may impose some constraints on the network. Although all of the above methods fuse feature information from different layers to some extent, a common shortcoming is that none of them incorporates feature information from the P2 layer into the network.
However, this layer holds the feature map with the highest spatial resolution in the backbone network, making it essential for detecting small and densely clustered objects. Thus, we propose the EMWFPN shown in Figure 4d. In the initial stage of this network, we propagate shallow features into the deeper layers to help the model learn them; in the second stage of feature fusion, we adopt as many connections as possible so that the feature maps of the bottom-up and top-down channels undergo more feature interactions, enhancing the network's capacity for feature representation. EMWFPN enriches the fusion of multiscale feature maps, enables the network to autonomously strengthen its attention to the more important scale features, improves the model's capacity for feature expression, and reduces computational redundancy.
Weighted Fusion. The conventional concatenation operation merely links two feature maps without distinguishing the relative importance of different features, which can lead to unbalanced information. To enhance the effectiveness of feature fusion, we propose to replace concatenation with weighted fusion. To illustrate this approach, consider the up-sampling stage and down-sampling stage of the P4 layer in
Figure 4d. The fusion process is outlined as follows:
where the upsampling and downsampling operations bring the neighboring feature maps to the resolution of P4. The four learnable parameters dynamically adjust the proportions in which the feature maps at different scales are fused, and ε is a small constant that prevents the denominator from being zero; furthermore, the fusion weights are constrained to be non-negative.
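As a concrete illustration of the weighted fusion, the sketch below assumes BiFPN-style normalized weights with a ReLU to keep them non-negative; the module and parameter names are ours, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedFusion(nn.Module):
    """Fuse several same-channel feature maps with learnable, normalized weights.
    Sketch of the weighted-fusion idea described above (BiFPN-style normalization assumed)."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # Keep the learnable weights non-negative, then normalize them.
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)          # eps avoids a zero denominator
        return sum(wi * fi for wi, fi in zip(w, feats))


# Example: fusing an upsampled deep map and a lateral map at the P4 resolution.
fuse = WeightedFusion(num_inputs=2)
p5_up = F.interpolate(torch.randn(1, 128, 20, 20), scale_factor=2, mode="nearest")
p4_lat = torch.randn(1, 128, 40, 40)
print(fuse([p5_up, p4_lat]).shape)  # torch.Size([1, 128, 40, 40])
```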
Lightweight Feature-Extraction Module. As a convolutional neural network becomes deeper, a massive inference-computation problem arises because the network contains too much repetitive gradient information, which produces severe computational redundancy and a loss of efficiency [32]. Therefore, in EMWFPN we adopt CSPStage [33] as the feature-extraction module. It effectively reduces gradient redundancy, improves computational efficiency, and lowers model complexity. The structure of CSPStage is shown in Figure 5. In order to minimize repetitive gradient computation, the input is first divided into two parts along the channel dimension, each mapped by a 1 × 1 convolution. One part then undergoes feature extraction through repeated applications of RepConv and 3 × 3 convolutions; RepConv incorporates a parallel structure of 3 × 3 convolution, 1 × 1 convolution, and a cross-connection, enabling multiscale feature extraction. The other part bypasses this processing to avoid gradient redundancy. Finally, the output is obtained by concatenating the two parts and passing them through a single 1 × 1 convolution for channel compression.
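A minimal sketch of a CSP-style stage following the description above; the kernel sizes and the simplified RepConv (without batch normalization or re-parameterization) are assumptions for illustration rather than the exact CSPStage of [33].

```python
import torch
import torch.nn as nn


class RepConvBlock(nn.Module):
    """Training-time RepConv: parallel 3x3 conv, 1x1 conv, and identity (cross-connection)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv3(x) + self.conv1(x) + x)


class CSPStageSketch(nn.Module):
    """CSP-style stage: split channels, process one branch with RepConv blocks,
    keep the other as a shortcut, then concatenate and compress with a 1x1 conv."""

    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        self.split_a = nn.Conv2d(channels, half, 1)   # shortcut branch
        self.split_b = nn.Conv2d(channels, half, 1)   # feature-extraction branch
        self.blocks = nn.Sequential(*[RepConvBlock(half) for _ in range(num_blocks)])
        self.fuse = nn.Conv2d(half * 2, channels, 1)  # channel compression after concat

    def forward(self, x):
        a = self.split_a(x)
        b = self.blocks(self.split_b(x))
        return self.fuse(torch.cat([a, b], dim=1))


print(CSPStageSketch(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```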
3.4. Occlusion-Robust Wavelet Network-Based Head
It is notable that underwater organisms frequently exhibit collective behavior and that the intricate nature of the underwater environment often results in object occlusion within underwater images. This phenomenon can lead to local aliasing and missing features, which in turn degrade detection performance [34]. To address these problems, processing the feature map in separate frequency sub-bands is an effective approach [35]. Therefore, we construct a frequency-sensitive detection head for occluded objects. This head divides the feature information into separate frequency bands for processing, which significantly reduces the computational load while maintaining a large receptive field, and it achieves better results for the detection of occluded targets. The network structure is illustrated in Figure 6.
First, owing to the properties of the wavelet transform, applying small convolution kernels to the wavelet-transformed feature maps is equivalent to applying much larger kernels to the original feature maps, which greatly enlarges the receptive field of the network. The convolution process is shown in Figure 6, with the equivalent formula as follows:
where W is the kernel weight, I is the input feature, and k represents the kernel size of the convolution. The wavelet convolution process begins with the wavelet transform of the input feature map, which produces four sub-bands: the low-frequency component (LL), the horizontal high-frequency component (LH), the vertical high-frequency component (HL), and the diagonal high-frequency component (HH). These components can be represented as follows:
where the wavelet coefficients are indexed by a direction index, the frequency scale j, and two position indexes, and they are computed with respect to the wavelet basis function. The following formula is employed for the wavelet basis functions:
Subsequently, to ensure the stability of the feature-extraction flow and to improve its hierarchical nature, convolution kernels of two different sizes are used for feature extraction at different scales, which enhances the expressiveness of the model. In addition, the inverse wavelet transform is applied to restore the feature map, as given by the following equation:
where the inverse wavelet basis function is used for reconstruction and the result is the output of the wavelet convolution. Finally, a global average pooling layer is applied and its result is passed through two fully connected layers to obtain attention weights that adjust the model's focus toward the important features. The attention weights are then exponentially scaled to further enhance the salience of these features, and the output is obtained by weighting them onto the input feature map. The formula is as follows:
where the two operators denote the fully connected layers and the exponential scaling operation, respectively. As a result, ORWNet can process features across multiple frequency-domain scales, which substantially reduces computational redundancy while expanding the receptive field. The model is therefore better able to understand the contextual relationship between occluded objects and the surrounding features, and its ability to adapt to occluded targets is enhanced.
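The sketch below illustrates the overall flow of this subsection under the assumption of a Haar wavelet basis and hypothetical layer sizes; it is our reading of the description, not the ORWNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_filters(channels: int) -> torch.Tensor:
    """Depthwise Haar analysis filters producing LL, LH, HL, HH sub-bands."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh]).unsqueeze(1)          # (4, 1, 2, 2)
    return bank.repeat(channels, 1, 1, 1)                       # (4*C, 1, 2, 2)


class WaveletHeadSketch(nn.Module):
    """Haar DWT -> small conv per sub-band -> inverse DWT -> GAP/FC/exp attention."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channels = channels
        self.register_buffer("dwt", haar_filters(channels))
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, 3,
                                      padding=1, groups=4 * channels)
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Wavelet transform: each channel is split into 4 half-resolution sub-bands.
        sub = F.conv2d(x, self.dwt, stride=2, groups=c)
        sub = self.subband_conv(sub)                             # cheap conv, large effective receptive field
        # Inverse wavelet transform back to the original resolution.
        rec = F.conv_transpose2d(sub, self.dwt, stride=2, groups=c)
        # GAP -> two FC layers -> exponential scaling -> channel re-weighting.
        attn = self.fc2(torch.relu(self.fc1(rec.mean(dim=(2, 3)))))
        attn = torch.exp(torch.sigmoid(attn)).view(b, c, 1, 1)
        return rec * attn


print(WaveletHeadSketch(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```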
3.5. Loss Function
There is a critical imbalance in the category distribution of underwater datasets. If equal attention is given to hard and easy samples during training, the model will focus more on the features of the easy samples and less on those of the hard samples, which harms its final performance. In this paper, we introduce an improved Slideloss function based on the Exponential Moving Average (EMA) [36] approach to dynamically modify the model's focus on hard versus easy samples and thus optimize performance.
Slideloss is a loss function that assigns greater weights to hard samples by setting a weighting function that allows the model to pay more attention to learning hard sample features. It can be written as follows:
where x represents the Intersection over Union (IoU) value of the current predicted sample and μ refers to the IoU threshold. By comparing x with μ, a sample can be classified as hard or easy. In Slideloss, μ is set to the average IoU of all samples and is held fixed.
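For illustration, the following sketch shows a Slide-style weighting function consistent with the description above (weight 1 for samples far below the threshold, larger weights near and above it); the exact breakpoints and constants are assumptions.

```python
import math
import torch


def slide_weight(iou: torch.Tensor, mu: float) -> torch.Tensor:
    """Slide weighting: easy samples far below mu keep weight 1, while samples
    near and above the threshold mu are up-weighted (hard, borderline cases)."""
    w = torch.ones_like(iou)
    near = (iou > mu - 0.1) & (iou < mu)     # just below the threshold
    hard = iou >= mu                          # at or above the threshold
    w[near] = math.exp(1.0 - mu)
    w[hard] = torch.exp(1.0 - iou[hard])
    return w


iou = torch.tensor([0.20, 0.45, 0.55, 0.80])
print(slide_weight(iou, mu=0.5))  # tensor([1.0000, 1.6487, 1.5683, 1.2214])
```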
It is noticeable that not all hard samples receive higher weights. We consider that samples whose IoU values are significantly lower than μ probably represent noise or background information, or are simply negative samples that are relatively easy to classify; paying more attention to them would not further optimize the model. In contrast, samples whose IoU values lie around μ are the ones most prone to misclassification and can be critical to the performance of the model.
However, as mentioned above, the threshold separating hard and easy samples in Slideloss is set to the average IoU of all samples, whereas the mean IoU of the samples in a given training phase is generally not the same as the mean IoU over all samples. This approach may thus lead to biased classification of hard and easy samples. Therefore, we adopt EMA to dynamically adjust the IoU threshold and achieve a more accurate classification of hard and easy samples. The formulation for automatically updating the threshold with EMA is as follows:
where d is a decay factor that depends on the number of loss updates, i is the index of the current loss update, and the decay parameter and time constant control the exponential movement and ensure that the IoU threshold is updated smoothly. Then, in (23), using the mean IoU of the i-th training round together with the mean IoU of the previous round and incorporating the decay factor d, an exponential moving average of the IoU is computed. With this method, EMASlideloss effectively mitigates the sample-imbalance issue in underwater object detection.
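Finally, a sketch of how an EMA-updated IoU threshold could be maintained during training; the decay schedule d = decay · (1 − e^(−i/τ)) and the constants are assumptions drawn from common EMA implementations, not the paper's exact Formula (23).

```python
import math


class EMAThreshold:
    """Keeps an exponential moving average of the per-iteration mean IoU,
    used as the hard/easy threshold mu for the Slide weighting above."""

    def __init__(self, decay: float = 0.999, tau: float = 2000.0, init_mu: float = 0.5):
        self.decay = decay        # decay parameter
        self.tau = tau            # time constant controlling the exponential movement
        self.mu = init_mu         # current IoU threshold
        self.updates = 0          # number of loss updates i

    def update(self, batch_mean_iou: float) -> float:
        self.updates += 1
        # Decay factor d grows from 0 toward `decay` as training progresses.
        d = self.decay * (1.0 - math.exp(-self.updates / self.tau))
        self.mu = d * self.mu + (1.0 - d) * batch_mean_iou
        return self.mu


ema_mu = EMAThreshold()
for mean_iou in (0.35, 0.42, 0.47):
    mu = ema_mu.update(mean_iou)
# mu can now be passed to slide_weight(iou, mu) at each iteration.
```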