Article

Feature Fusion Network with Local Information Exchange for Underwater Object Detection

1 College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
2 Department of Medical Physics and Biomedical Engineering, University College London, London W1W 7TY, UK
* Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 587; https://doi.org/10.3390/electronics14030587
Submission received: 5 January 2025 / Revised: 29 January 2025 / Accepted: 31 January 2025 / Published: 1 February 2025

Abstract

When enhanced images are used for underwater object detection, issues such as detail loss and increased noise often arise, leading to decreased detection performance. To address these issues, we propose the Feature Fusion Network with Local Information Exchange (FFNLIE) for underwater object detection. We feed raw and enhanced images into the Swin Transformer in parallel for feature extraction, and we propose a local information exchange module to enhance the feature extraction capability of the Swin Transformer. To fully utilize the complementary information of the two images, our feature fusion module consists of two core components: the Discrepancy Information Addition Block (DIAB) and the Common Information Addition Block (CIAB). The DIAB and CIAB are built on modified cross-attention mechanisms and readily extract the discrepancy information and common information between the images. Finally, the fused features are fed into the object detector to perform object detection. Experimental results demonstrate that the FFNLIE achieves excellent performance across four underwater datasets.

1. Introduction

As the economy and technology advance, land resources are increasingly insufficient to meet the demands of human production and daily life, so the development of marine resources has received widespread attention [1]. In order to better utilize marine resources, various underwater tasks must be carried out, such as target localization, biometric identification, underwater archaeology, and environmental monitoring. In this context, underwater object detection technology plays a crucial role [2,3]. Nevertheless, underwater images frequently suffer from issues such as color distortion, artifacts, and loss of detail due to the challenging underwater environment, which negatively impact the precision of underwater object detection.
To mitigate the impact of underwater environments, recent methods have employed existing enhancement algorithms to preprocess underwater images [4,5]. Subsequently, the enhanced image is input into a detection network for object detection. While enhanced images show improved visual appeal, they do not necessarily yield superior results in object detection tasks. As shown in Figure 1, we input both raw and enhanced underwater images into the FFNLIE for object detection, but the enhanced underwater images do not perform better. This suggests that the visual quality of enhanced images does not significantly influence the accuracy of object detection. Research [6,7] has shown that enhanced images might introduce subtle noise or unintentionally remove important details, potentially impairing the effectiveness of detection models. Certain studies [8,9,10] integrate image enhancement and object detection into a unified framework, directing enhancement processes to boost detection accuracy. Nonetheless, these methods often rely on paired datasets of underwater and clear images for training UIE models, which are challenging to acquire in practical scenarios.
Unlike previous methods, we acknowledge that both raw and enhanced underwater images provide unique and complementary benefits. Enhanced images improve visual clarity, sharpening object boundaries for better distinction, while raw images maintain the scene’s natural characteristics, preserving intricate texture details. Thus, this paper focuses on combining features from both raw and enhanced images to improve the precision of underwater object detection.
Specifically, we first extract features from both raw and enhanced underwater images using Swin Transformer [11]. However, the attention mechanism in Swin Transformer is limited to capturing global dependencies within local windows and lacks the capacity for information exchange across different regions [12]. We design a local information exchange module to address this issue, which converts sequence features into image features and applies depthwise convolution to the image features to exchange information within local regions, thereby improving the feature extraction capability of Swin Transformer. Additionally, we design a feature fusion module to integrate both the discrepancy and common information between raw and enhanced underwater images; integrating both types of information maximizes information utilization while reducing information loss, thereby improving underwater object detection performance. Considering the impressive object detection ability of AutoAssign [13], we finally feed the fused features into the AutoAssign detector to perform detection. The main contributions of this paper are outlined below:
  • We propose the Feature Fusion Network with Local Information Exchange, named FFNLIE, which improves object detection efficiency by fusing raw image features with enhanced image features.
  • We propose a local information exchange module that enhances the information exchange capability within the local region of Swin Transformer.
  • We introduce a feature fusion module designed to extract both the discrepancy and common information between raw and enhanced underwater images, enabling effective feature integration.
  • Comprehensive experiments validate the efficacy of the proposed method.

2. Related Work

2.1. Underwater Object Detection

In recent years, underwater object detection (UOD) has received widespread attention in fields such as ocean technology, deep-sea exploration, and environmental protection. The accuracy of underwater object detection is affected by the inherent difficulties of underwater environments, including low contrast, diverse lighting conditions, and variations in water quality. Recently, a popular strategy has been to preprocess underwater images with existing enhancement algorithms [4,5] and then feed the enhanced image into a detection network for object detection. While enhanced images often exhibit improved visual quality, they do not necessarily improve object detection accuracy. Recent research [14] has further indicated the lack of a significant correlation between image visual quality and detection performance. We argue that this discrepancy arises because the objectives of enhancement and detection differ, potentially creating conflicts: enhancement models might introduce subtle noise or inadvertently remove critical details, which can negatively impact the effectiveness of detection models.
Another approach involves jointly optimizing underwater image enhancement and object detection [8,9,10], aiming to improve detection accuracy through enhanced images while balancing visual restoration and detection performance. However, this method relies on paired datasets of underwater and clear images for training the UIE model, which are challenging to acquire in real-world scenarios. In contrast to these methods, we propose extracting and fusing features from both raw and enhanced underwater images, leveraging the natural characteristics of raw images and the enhanced object boundary details to boost underwater object detection accuracy.

2.2. Swin Transformer

Attention-based transformer mechanisms have shown exceptional performance in computer vision, leveraging their ability to capture global dependencies and enable parallelized computation [15]. ViT [16] was a pioneering effort to apply transformers to image processing. It divides an image into fixed-size patches, linearly embeds them, and processes the sequence through a Transformer encoder to capture global dependencies, replacing convolutional operations with self-attention mechanisms. Building on ViT, DeiT [17] introduces a distinct data augmentation strategy and integrates a distillation token, enabling the model to acquire knowledge from both the raw data and a pre-trained teacher network. While ViT excels at capturing global context, its self-attention mechanism becomes computationally intensive with high-resolution images, reducing model efficiency. To mitigate this, Swin Transformer [11] employs a hierarchical structure and a local window-based self-attention mechanism, which divides the input image into multiple windows and calculates self-attention within each window, greatly reducing computational complexity and improving processing efficiency. Furthermore, Swin Transformer excels at capturing fine-grained details in images, enhancing the model’s precision in object detection tasks [18,19]. However, Swin Transformer primarily captures global dependencies within each window and lacks mechanisms for information exchange between local regions. To overcome this limitation, in this work we propose a local information exchange module to facilitate communication between local regions.

2.3. Feature Fusion

Feature fusion involves creating fused features that incorporate complementary information from various data sources [20,21,22]. In earlier approaches, convolutional neural networks (CNNs) were commonly employed for this purpose. For instance, Li et al. [23] introduced the DenseFuse framework to merge image features while preserving deep feature information. Similarly, Ma et al. [24] developed a model that leverages a salient target mask to guide network training for feature fusion. Subsequently, generative adversarial network (GAN)-based feature fusion methods emerged. Ma et al. [25] pioneered the application of GANs to infrared and visible image fusion (IV-IF) tasks, whereas Li et al. [26] utilized a coupled generative adversarial network for feature fusion. Despite the success of these methods, we observed that they primarily concentrate on capturing the common information between data sources, leading to under-utilization of their discrepancy information and resulting in wasted insights. Thus, our feature fusion module is designed to extract both the common and discrepancy information from the data, enabling more effective feature integration.

3. The Proposed Method

3.1. Overview

In this paper, we propose the Feature Fusion Network with Local Information Exchange (FFNLIE) for underwater object detection, a new underwater object detection model that improves detection accuracy by integrating complementary information between the raw image and the enhanced underwater image. As shown in Figure 2, our model has two inputs: the raw image and the enhanced image. To feed an image into the Transformer, we perform block partitioning and linear transformation through patch partition and linear embedding, transforming the image into the appropriate dimensions. Considering that Swin Transformer can effectively extract and utilize hierarchical features at different levels of an image, we use Swin Transformer as our feature extractor. The FFNLIE has four stages, each containing a different number of Swin Transformer blocks. Except for Stage 1, each of the remaining three stages first downsamples its input through patch merging (PM) before its blocks are applied. To enhance the ability of Swin Transformer to exchange information within local regions, we add a local information exchange module after each Swin Transformer block. Then, in each stage, we feed the extracted features of the raw image (F1) and the enhanced image (F2) into the feature fusion module, where the discrepancy and common information between the two are fused. Finally, we input the fused features obtained from the four stages into the object detector, which produces the localization and classification results.
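A minimal PyTorch-style skeleton of this dual-branch pipeline is sketched below. It is an illustrative sketch only: the module containers (stages_raw, stages_enh, fusion_modules, detector_head) are hypothetical placeholders standing in for the Swin stages with LIE blocks, the DIAB/CIAB fusion modules, and the AutoAssign head described later, not the authors' released implementation.

```python
import torch.nn as nn

class FFNLIE(nn.Module):
    """Illustrative skeleton of the dual-branch pipeline in Figure 2.
    Each element of stages_raw / stages_enh is assumed to be one Swin stage
    with its LIE blocks; each fusion module implements the DIAB + CIAB
    combination of Section 3.4."""
    def __init__(self, stages_raw, stages_enh, fusion_modules, detector_head):
        super().__init__()
        self.stages_raw = nn.ModuleList(stages_raw)   # raw-image branch
        self.stages_enh = nn.ModuleList(stages_enh)   # enhanced-image branch
        self.fusions = nn.ModuleList(fusion_modules)  # one fusion module per stage
        self.detector_head = detector_head            # e.g., an AutoAssign-style head

    def forward(self, raw_img, enh_img):
        f1, f2 = raw_img, enh_img
        fused_pyramid = []
        for stage_r, stage_e, fuse in zip(self.stages_raw, self.stages_enh, self.fusions):
            f1 = stage_r(f1)                      # raw-image features F1 of this stage
            f2 = stage_e(f2)                      # enhanced-image features F2 of this stage
            fused_pyramid.append(fuse(f1, f2))    # discrepancy + common information
        return self.detector_head(fused_pyramid)  # localization and classification
```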

3.2. Swin Transformer

Enhanced underwater images improve visual quality by offering sharper and more distinct object boundaries, while raw images retain the scene’s natural characteristics and provide detailed texture information of the objects. In order to link these two domains and utilize complementary information, we first used Swin Transformer for feature extraction. As shown in Figure 3, Swin Transformer has two inputs: the raw image $I_1 \in \mathbb{R}^{H \times W \times C}$ and the enhanced image $I_2 \in \mathbb{R}^{H \times W \times C}$. Here, H, W, and C represent the height, width, and channel count of the input image. The input is initially split into $M \times M$ non-overlapping patches through a patch partition layer, whose shape becomes $\frac{HW}{M^2} \times (M^2 \cdot C)$, where $\frac{HW}{M^2}$ is the number of patches.
Next, the linear embedding layer maps these features to the target dimensions, enabling their processing within the Swin Transformer block for feature extraction. Swin Transformer divides image feature extraction into four stages, with varying numbers of Swin Transformer blocks in each stage. Except for Stage 1, which first uses a linear embedding layer, the remaining three stages are downsampled using a patch merging (PM) layer. After passing through the patch merging layer, the height and width of the feature map will be halved, and the depth will be doubled.
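As a concrete illustration of the patch merging step, the sketch below follows the standard Swin-style implementation (concatenate each 2 × 2 group of neighboring patches and reduce the resulting 4C channels to 2C with a linear layer). It is an assumed reference implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Standard Swin-style patch merging: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

# Example: a 56x56 feature map with 96 channels becomes 28x28 with 192 channels.
merge = PatchMerging(dim=96)
out = merge(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```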
The structure of the Swin Transformer block is shown in Figure 3. In the first block, the input features pass through a LayerNorm layer and a Window Multi-head Self-Attention (W-MSA) module, with a skip connection around these two steps. The features then pass through another LayerNorm layer and a Multilayer Perceptron (MLP) module, again with a skip connection around this path. At this point, the features have been processed by the first block and are passed to the second block. The second block is structurally similar to the first, except that it replaces W-MSA with the Shifted Window Multi-head Self-Attention (SW-MSA) module. The entire calculation process is as follows:
$$\begin{aligned}
\hat{z}^{l} &= \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1},\\
z^{l} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l},\\
\hat{z}^{l+1} &= \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l},\\
z^{l+1} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1},
\end{aligned}$$
where $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and the MLP module of block $l$, respectively.
In order to process images of different sizes and reduce computational complexity, Swin Transformer partitions the input features into multiple non-overlapping small regions, called windows, in W-MSA. As shown in Figure 4, W-MSA splits the feature map into separate $M \times M$ windows ($M = 4$) and then performs self-attention within each window independently. However, because self-attention is computed only inside each window, information cannot be exchanged between windows. To address this issue, Swin Transformer adds SW-MSA. As shown in Figure 4, the windows are shifted by $\frac{M}{2}$ pixels toward the right and the bottom. After the shift, the new layout is first divided evenly into four parts, and regions that were previously adjacent are grouped back into the same window, forming nine new windows in total; self-attention is then performed within each of these nine windows. Building on W-MSA, this translation operation places some originally non-overlapping window regions into the same window, so more positional information can be considered during the self-attention computation. SW-MSA thus breaks the boundaries between windows to a certain extent and increases the model’s ability to capture global information.
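The window partition and its shifted variant can be illustrated with the following sketch. It assumes the common cyclic-shift implementation of SW-MSA (shifting the feature map with torch.roll before partitioning, with the attention mask for wrapped-around positions omitted for brevity), which may differ in detail from the authors' implementation.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # (num_windows*B, M*M, C)

M = 4
x = torch.randn(1, 8, 8, 96)

# W-MSA: self-attention is computed independently inside each MxM window.
windows = window_partition(x, M)              # (4, 16, 96) for an 8x8 map

# SW-MSA: cyclically shift the map by M/2 before partitioning, so pixels that
# lay on former window borders now share a window.
x_shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
shifted_windows = window_partition(x_shifted, M)
```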

3.3. Local Information Exchange

Although SW-MSA enhances the window-level information exchange capability of Swin Transformer, due to the shape and size of the windows, Swin Transformer still has certain shortcomings in its local information exchange capability. Convolutional neural networks are built on locality: convolutional filters only perceive local regions of the input image, i.e., the receptive field, and by stacking multiple layers the perceived area gradually expands to capture the global features of the image. Taking inspiration from this, we add a local information exchange module to the Swin Transformer, whose structure is shown in Figure 2. In order to incorporate convolution operations into the module, we first add a Seq2Img step to transform the sequence into a feature map. Next, the feature map is processed by a 1 × 1 convolutional layer to expand its channel dimensionality, allowing the model to learn more diverse feature representations and thereby enhancing its understanding and representation of complex image content. To maintain the channel independence of the feature map, we then use a 3 × 3 depthwise convolution, which convolves each input channel independently and exchanges information between adjacent pixels without interference from other channels. This process can be expressed as follows:
$$z_{s} = f\big(\mathrm{Seq2Img}(z_{p})\, W_{1}\big)\, W_{d}$$
where $W_{d}$ denotes the 3 × 3 depthwise convolution, $W_{1}$ denotes the 1 × 1 convolution, and $z_{p}$ represents the input sequence.
In order to obtain representative feature maps, we multiply the corresponding weight coefficients of the channels with the input features. The input features first pass through an average pooling layer. This operation compresses the spatial dimensions (height and width) of the features, retaining only the information in the channel dimension. In this way, each channel is assigned a global statistical value. Next, the features pass through a fully connected (FC) layer, which compresses the number of channels C to a smaller value (such as C / r , where r is the compression ratio) to reduce the number of parameters. Then, we apply the ReLU activation function to the compressed features to introduce non-linear characteristics into the network. The features activated by ReLU are fed into another fully connected layer to restore the channel count to its original value C. We use the Sigmoid activation function on the features, scaling the weights of each channel to a range between 0 and 1, and obtain the corresponding weights for each channel. Finally, each channel of the original feature map is multiplied by its corresponding weight to achieve feature weighting. In this way, important channels are enhanced while unimportant channels are suppressed. The calculation process can be expressed as follows:
$$\begin{aligned}
z_{t1} &= \mathrm{FC}_{1}\big(\mathrm{AP}(z_{s})\big),\\
z_{t2} &= \mathrm{FC}_{2}\big(\mathrm{ReLU}(z_{t1})\big),\\
z_{t3} &= \mathrm{Sigmoid}(z_{t2}) \times z_{s},
\end{aligned}$$
where $\mathrm{AP}$ represents average pooling, $\mathrm{FC}_{1}$ and $\mathrm{FC}_{2}$ represent fully connected layers, $z_{s}$ represents the output feature of the depthwise convolution, $z_{t3}$ represents the feature obtained by channel-weighting $z_{s}$, $z_{t1}$ and $z_{t2}$ are intermediate variables in this process, and $\mathrm{ReLU}$ and $\mathrm{Sigmoid}$ represent activation functions.
Finally, we use a 1 × 1 convolutional layer to restore the number of channels in the feature map to its initial value and transform the feature map back into a sequence through Img2Seq so that it can participate in subsequent operations. To minimize information loss and improve model stability, we add the output sequence to the input sequence $z_{p}$ as the final output. This process can be expressed as follows:
$$z_{u} = \mathrm{Img2Seq}(z_{t3}\, W_{1}) + z_{p}$$
where $W_{1}$ represents a 1 × 1 convolution and $z_{u}$ is the final output of the local information exchange module.
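A minimal PyTorch sketch of the local information exchange module, following the steps described above (Seq2Img, 1 × 1 expansion, 3 × 3 depthwise convolution, SE-style channel weighting, 1 × 1 reduction, Img2Seq, and the residual connection), is given below. The channel expansion ratio, the reduction ratio r, and the use of GELU for the unspecified activation f are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class LocalInformationExchange(nn.Module):
    """Sketch of the LIE module (Section 3.3). Input and output are patch
    sequences of shape (B, H*W, C), where H and W are the spatial resolution
    of the current stage."""
    def __init__(self, dim, expand=2, r=4):  # expand and r are illustrative choices
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)   # 1x1 conv W_1: expand channels
        self.act = nn.GELU()                                  # stands in for the activation f
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)     # 3x3 depthwise conv W_d
        # Channel weighting: AP -> FC1 -> ReLU -> FC2 -> Sigmoid
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(hidden, hidden // r)
        self.fc2 = nn.Linear(hidden // r, hidden)
        self.reduce = nn.Conv2d(hidden, dim, kernel_size=1)   # 1x1 conv: restore channel count

    def forward(self, z_p, H, W):
        B, L, C = z_p.shape
        x = z_p.transpose(1, 2).reshape(B, C, H, W)           # Seq2Img
        z_s = self.dwconv(self.act(self.expand(x)))           # z_s = f(Seq2Img(z_p) W_1) W_d
        w = self.pool(z_s).flatten(1)                         # global average pooling, (B, hidden)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))  # per-channel weights in (0, 1)
        z_t3 = z_s * w[:, :, None, None]                      # channel-weighted features z_t3
        out = self.reduce(z_t3).flatten(2).transpose(1, 2)    # Img2Seq
        return out + z_p                                      # residual connection, z_u
```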

3.4. Feature Fusion

During the earlier steps, features are independently extracted from both the raw and enhanced images. In the fusion phase, these features from the two domains are combined into a unified feature map. Our goal is to obtain a comprehensive feature map that includes clear object boundaries while preserving rich texture details. Therefore, effectively leveraging the discrepancy and common information within the features is crucial for optimizing fusion performance. The attention mechanism computes correlation scores between different elements of the input data and dynamically adjusts the weight of each element accordingly, thereby effectively extracting key or common information. We therefore propose the Discrepancy Information Addition Block (DIAB) and the Common Information Addition Block (CIAB). As shown in Figure 2, we first use fully connected layers to transform the raw image features and enhanced image features into the query Q, key K, and value V, the three fundamental components of the attention mechanism. This linear projection can be expressed as follows:
$$Q = \mathrm{FC}(F_{2}), \quad K = \mathrm{FC}(F_{1}), \quad V = \mathrm{FC}(F_{1})$$
where $\mathrm{FC}$ represents the fully connected layer, and $F_{1}$ and $F_{2}$ denote the features of the raw image and the enhanced image, respectively.
To investigate the common information between the raw and enhanced images, the FFNLIE uses a dot-product operation to calculate the similarity between Q and K; the purpose of this step is to quantify the degree of correlation between the two. Next, we use the softmax function to transform these similarities into corresponding weight coefficients. The FFNLIE then uses these weight coefficients to perform a weighted average over V, giving greater weight to the entries of V that are more similar to Q so that the output vector reflects more of the common information between the two. This process can be expressed as
$$\mathrm{COM}_{QV} = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$
where $d_{k}$ is a scaling factor that stabilizes the gradient of the softmax function by scaling the dot product of Q and K, preventing the gradient from vanishing or exploding. We can then easily obtain the discrepancy information between Q and V through a subtraction operation. This process can be expressed as:
$$\mathrm{DIS}_{QV} = \mathrm{FC}\big(V - \mathrm{COM}_{QV}\big)$$
In order to obtain complementary information from the raw image and the enhanced image, we add discrepancy information to Q, which can be expressed as:
$$\begin{aligned}
z_{comp} &= \mathrm{DIS}_{QV} + Q,\\
z_{diab} &= \mathrm{MLP}\big(\mathrm{LN}(z_{comp})\big) + z_{comp},
\end{aligned}$$
where $\mathrm{LN}$ represents the LayerNorm layer, $\mathrm{MLP}$ represents the Multilayer Perceptron, and $z_{diab}$ is the output of the DIAB.
After the DIAB, we design the CIAB module to extract the common information between the raw and enhanced images; its structure is shown in Figure 2. The structures of the CIAB and DIAB are very similar, the only difference being that the CIAB does not perform the subtraction operation but instead adds the common information between Q and V to Q. Finally, we sum the discrepancy information and common information obtained through the DIAB and CIAB as the output features of each stage. Integrating the common and discrepancy information of the two images can capture and utilize the intrinsic associations and differences between the images more accurately than directly adding the features. Not only does it avoid the feature confusion and information redundancy that direct feature addition may cause, but it also highlights the key similarities and differences, providing richer, more targeted, and more valuable information for the subsequent object detection task.
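A compact sketch of how the DIAB and CIAB could be realized from the equations above is given below. The single-head attention, the projection dimensions, and the exact layer arrangement are illustrative assumptions; the authors' blocks may differ in these details.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch of DIAB/CIAB (Section 3.4). With mode='diab' the block adds the
    discrepancy information DIS_QV to Q; with mode='ciab' it adds the common
    information COM_QV instead. Inputs F1 (raw) and F2 (enhanced): (B, N, C)."""
    def __init__(self, dim, mode='diab'):
        super().__init__()
        assert mode in ('diab', 'ciab')
        self.mode = mode
        self.q_proj = nn.Linear(dim, dim)    # Q from enhanced-image features F2
        self.k_proj = nn.Linear(dim, dim)    # K from raw-image features F1
        self.v_proj = nn.Linear(dim, dim)    # V from raw-image features F1
        self.dis_proj = nn.Linear(dim, dim)  # FC in DIS_QV = FC(V - COM_QV)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f1, f2):
        q, k, v = self.q_proj(f2), self.k_proj(f1), self.v_proj(f1)
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        com = attn @ v                        # common information COM_QV
        if self.mode == 'diab':
            z = self.dis_proj(v - com) + q    # add discrepancy information to Q
        else:
            z = com + q                       # add common information to Q
        return self.mlp(self.norm(z)) + z     # MLP + residual

# The stage output is the sum of the two block outputs, per the description above.
def fuse(f1, f2, diab, ciab):
    return diab(f1, f2) + ciab(f1, f2)
```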

4. Experiments

4.1. Implementation Details, Evaluation Metrics and Datasets

Implementation Details: All experiments were conducted on the MMDetection [27] platform, utilizing AutoAssign [13] as the detector and the Water-MSR [18] model for underwater image enhancement. The model was optimized using the AdamW method with a weight decay of 0.0001 and a momentum of 0.9. The initial learning rate was $2.5 \times 10^{-3}$ and was decayed by a factor of 0.1 at the 27th and 35th epochs. Training was performed with a batch size of 2 for 39 epochs on a single GeForce RTX 3090 GPU. Only horizontal flipping was used for data augmentation.
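For reference, the stated optimizer and schedule can be reproduced with a sketch like the following. The model is a placeholder, the "momentum of 0.9" is read as the AdamW beta1, and the decay at the 27th and 35th epochs is interpreted as a step decay by a factor of 0.1; these readings are assumptions rather than details confirmed by the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model; in practice this would be the FFNLIE built in MMDetection.
model = torch.nn.Conv2d(3, 3, 3)

optimizer = AdamW(model.parameters(), lr=2.5e-3,
                  betas=(0.9, 0.999),      # "momentum of 0.9" read as beta1
                  weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[27, 35], gamma=0.1)

for epoch in range(39):
    # ... one training epoch over batches of size 2 with horizontal flipping ...
    scheduler.step()
```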
Evaluation Metrics: We adopted average precision (AP) as the primary metric for model accuracy evaluation. The AP, AP50, and AP75 metrics measure detection accuracy at different IoU thresholds: {0.5:0.05:0.95}, 0.5, and 0.75, respectively. Additionally, APS, APM, and APL were used to evaluate detection effectiveness for small (area < 32²), medium (32² < area < 96²), and large (area > 96²) objects, respectively.
Datasets: We evaluated our method on four challenging underwater datasets: UODD [28], UTDAC2020 [29], RUOD [30], and UDD [31]. UODD contains 3194 real underwater images belonging to three categories (sea cucumber, sea urchin, and scallop), split into 2560 training, 128 validation, and 506 test images. UTDAC2020, from the 2020 Underwater Object Detection Algorithm Competition, includes 5168 training and 1293 validation images across four classes (echinus, holothurian, starfish, and scallop). RUOD contains 14,000 high-resolution images (9800 training and 4200 test) and ten object categories (diver, scallop, fish, etc.). UDD includes 2227 4K HD images (1827 training and 400 test) belonging to three categories (sea cucumber, sea urchin, and scallop).

4.2. Quantitative Results

We evaluated the FFNLIE against several existing methods across the four underwater datasets, with detailed results presented in Table 1, Table 2 and Table 3.
Results on UODD: Based on the UODD dataset, we compared our method with other detectors, including UIE+UOD and methods based on CNN or Transformer architectures. For UIE+UOD, we applied underwater enhancement techniques such as FunIE GAN [32] and ERH [33] as preprocessing steps, followed by detectors like Grid RCNN [34], CSAM [28], and Swin [11]. Among CNN-based methods, we evaluated FoveaBox [35], CSAM [28], AquaNet [31], and RFTM [5]. For Transformer-based methods, we included Pvtv1 [36], Pvtv2 [37], and GCC-Net [18] in our comparisons.
From Table 1, UIE+UOD performed poorly in detection tasks, suggesting that the improved visual quality from image enhancement does not directly translate to higher detection accuracy. This is likely due to the introduction of noise during enhancement, which negatively impacts detection performance. CNN- and Transformer-based detectors are currently popular methods, and our method still surpasses both types of detectors. According to the analysis in Table 1, the FFNLIE demonstrates exceptional performance, consistently ranking first or second across all six evaluation metrics. Notably, it achieves 51.0% on the AP metric. Compared with the Transformer-based GCC-Net, the FFNLIE achieves a 2.1% AP improvement (48.9% AP vs. 51.0% AP). In addition, the FFNLIE has 0.2% more AP than the best-performing CNN detector, RFTM (50.8% AP vs. 51.0% AP).
Results on UTDAC2020: As demonstrated in Table 2, we conducted an in-depth comparison of the FFNLIE with other two-stage object detection methods, such as DetectoRS [38], ERL [39], D2Det [40], and SABL [41], on the UTDAC2020 dataset. The FFNLIE is clearly superior to the existing methods. Compared with the best-performing method, ERL, the FFNLIE has 0.4% more AP (48.3% AP vs. 48.7% AP). These results indicate that the FFNLIE offers more precise localization and classification in underwater object detection than the other methods. We also evaluated the FFNLIE against single-stage object detection methods, including FSAF [42], RetinaNet [43], NAS-FCOS [44], and SSD [45]. Compared with the best of these, NAS-FCOS, the FFNLIE has 2.9% more AP (45.8% AP vs. 48.7% AP).
Results on RUOD and UDD: In addition to evaluating against popular object detection methods, we also compared our method with specialized underwater object detection techniques, including RoIMix [46], Boosting [47], ERL [39], RoIAttn [48], GCC-Net [18], and RFTM [5], on the UDD and RUOD datasets. The results are presented in Table 3. The FFNLIE shows outstanding performance, ranking first in five of the six metrics across the two datasets. Specifically, on the UDD dataset, the FFNLIE achieved 29.7% AP, surpassing ERL by 0.1% (29.6% AP vs. 29.7% AP). It is worth noting that on the RUOD dataset, the FFNLIE achieved the best results on every metric. Compared with the leading underwater detector GCC-Net, the FFNLIE achieved a 0.6% AP improvement (56.1% AP vs. 56.7% AP).

4.3. Ablation Studies

In this section, we perform ablation studies on the UODD dataset to assess the impact of our proposed method.
Analysis of Underwater Image Enhancement: To evaluate the effectiveness of the UIE method in UOD tasks, we tested three configurations: using only the raw image as the input (Raw + Null), using two raw images as the input (Raw + Raw), and combining the raw image with the enhanced image (Raw + Water-MSR). The results, presented in Table 4, show that fusing raw and enhanced image features (Raw + Water-MSR) significantly improves detection accuracy compared to Raw + Null. Additionally, comparing Raw + Raw and Raw + Water-MSR reveals that performance gains stem from effective feature fusion rather than simply increasing the number of input images or the network parameters.
Our method is highly adaptable and can be seamlessly integrated with various underwater image enhancement models. To validate this, we tested it with models such as GDCP [49], Five A+ [50], and SyreaNet [51]. As shown in Table 5, Five A+ achieved performance comparable to the Water-MSR model, while other models also delivered strong results. This highlights the scalability of our method, bridging underwater image enhancement and object detection. Its plug-and-play nature makes it a versatile and practical module for future research.
Analysis of Local Information Exchange Module: In this section, we explore the role of the Local Information Exchange (LIE) module. As shown in Table 6, “Base” denotes using Swin Transformer to extract features from the raw and enhanced images and then simply adding the two feature sets together as the fused features; “LIE” stands for the Local Information Exchange module. The table clearly shows that the results with the LIE module are better. This is because Swin Transformer often captures global dependencies only within each window, whereas the LIE module enables information exchange within local feature regions, thereby improving the feature extraction capability of Swin Transformer. This is of great significance for fusing the features of the raw image and the enhanced underwater image.
To better understand the impact of the LIE module, we visualized the backbone network’s feature maps with and without the LIE module using Grad-CAM [52], as shown in Figure 5. In Grad-CAM, darker colors indicate higher model attention to specific regions. The visualization reveals that adding the LIE module directs the model’s focus more effectively toward the target objects, enhancing detection accuracy.
Analysis of Feature Fusion Module: This section investigates the functionality of our proposed Feature Fusion (FF) module, which represents the core contribution of our work. As shown in Table 6, “FF” represents the Feature Fusion module. “DIAB” and “CIAB” represent the Discrepancy Information Addition Block and the Common Information Addition Block, respectively. The inclusion of the FF module significantly improves the experimental results, demonstrating that effective feature fusion is highly beneficial for underwater object detection tasks. Enhanced images offer clearer object boundaries, while raw images retain natural scene features and detailed texture information. By combining these complementary features, the FF module enhances detection accuracy.
To further illustrate the principle of our FF module, we also validated the effectiveness of the Discrepancy Information Addition Block (DIAB) and the Common Information Addition Block (CIAB). From Table 6, it can be seen that in the process of feature fusion, features from different sources often contain unique and differentiated information. However, there may also be some common information between these features, which is equally important for subsequent analysis and processing. The integration of the two elements enriches the model’s input and boosts its capacity to represent intricate data, ultimately leading to a substantial enhancement in the model’s overall performance in object detection tasks.

4.4. Qualitative Results

In this section, we showcase qualitative results of the FFNLIE based on the UODD, UTDAC2020, and RUOD datasets. To highlight the FFNLIE’s performance, we compare it with the second-best detection method on each dataset.
Figure 6 displays qualitative results of the FFNLIE and RFTM based on the UODD dataset. The FFNLIE demonstrates high accuracy in challenging scenarios, with fewer omissions and errors compared to the second-best RFTM method. Its predicted bounding boxes align more precisely with the ground truth in terms of position and shape.
Figure 7 presents qualitative results of the FFNLIE and ERL based on the UTDAC2020 dataset. Both methods perform well on underwater images with diverse backgrounds. However, the FFNLIE outperforms ERL in challenging scenarios (third and fifth columns), exhibiting fewer omissions and false detections, which underscores its superior stability and robustness.
Figure 8 shows qualitative results of the FFNLIE and GCC-Net on the RUOD dataset. The FFNLIE demonstrates superior performance in underwater object detection. Notably, it excels on blurry images due to its integration of enhanced image features. For example, in the second and fifth columns of images, the GCC-Net method suffers from missed and false detections. However, benefiting from complementary feature fusion, our method performs better and eliminates false detections.

4.5. Error Analysis

To further investigate error types, we employed error analysis [53]. Figure 9 and Figure 10 display error analysis plots for the UODD and RUOD datasets. Each subplot includes precision-recall curves under various evaluation settings.
As illustrated in Figure 9, under the strict IoU = 0.75 criterion, our method achieves 53.9%, surpassing the base method by 5.7%. At IoU = 0.50, our AP50 reaches 90.1%, compared to the base method’s 87.3%. When ignoring localization errors (Loc), our AP is 92.4%, 2.3% higher than the base method’s 90.1%. For the Oth metric, our AP improves to 93.5%, outperforming the base method’s 91.6%. Finally, after removing all background and category confusion errors (BG), our AP reaches 98.0%, 1.0% higher than the base method.
Based on the RUOD dataset, as illustrated in Figure 10, our method achieves 62.0% for AP75, outperforming the base method by 1.5%. For AP50, our method reaches 84.2%, a 1.0% improvement over the base method. Additionally, our method shows improvements of 0.8%, 0.8%, and 0.07% for the Loc, Sim, and Oth metrics, respectively, compared to the base method.
The results demonstrate that our method excels in UOD tasks, performing consistently across diverse datasets. It exhibits strong generalization to various underwater environments and effectively mitigates the negative impact of image degradation on detection accuracy.

5. Conclusions

This paper proposes the Feature Fusion Network with Local Information Exchange (FFNLIE) for underwater object detection, which addresses the significant impact of harsh underwater environments on detection performance. The core of the method lies in feeding the raw image and the enhanced image into the Swin Transformer framework in parallel, so as to simultaneously capture the natural characteristics of the underwater scene and the clearer object boundary information. To further enhance the feature learning ability of the network, we designed a local information exchange module, which significantly enhances the feature extraction capability of Swin Transformer. Finally, through a carefully designed feature fusion module, the FFNLIE achieves deep integration of the raw image features and the enhanced image features: it not only fuses the common effective information between the two but also exploits the discrepancy information between them, greatly improving the accuracy and robustness of object detection. On four challenging underwater datasets (UODD [28], UTDAC2020 [29], UDD [31], and RUOD [30]), the FFNLIE outperforms the second-best methods by 1.1%, 1.6%, 3.4%, and 1.0%, respectively, in terms of the AP50 metric. This indicates that the FFNLIE is highly competitive in underwater object detection.

Author Contributions

Conceptualization, Project administration and Methodology, X.L.; Investigation, Data Curation and Writing—Original Draft, P.M.; Writing—Review and Editing and Supervision, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the China Postdoctoral Science Foundation under Grant (2019M652433), the Fundamental Research Funds for the Central Universities (30923011022), the Natural Science Foundation of Jiangsu Province (BK20231454), the National Natural Science Foundation of China (62176183), and Jiangsu Planned Projects for Postdoctoral Research Funds.

Data Availability Statement

The public datasets used in this paper are available at https://github.com/LehiChiang/Underwater-object-detection-dataset (UODD, accessed on 1 February 2025), https://aistudio.baidu.com/datasetdetail/215376 (UTDAC2020, accessed on 1 February 2025), https://github.com/dlut-dimt/RUOD (RUOD, accessed on 1 February 2025), and https://github.com/chongweiliu/UDD_Official (UDD, accessed on 1 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krishna, S.; Lemmen, C.; Örey, S.; Rehren, J.; Pane, J.D.; Mathis, M.; Püts, M.; Hokamp, S.; Pradhan, H.K.; Hasenbein, M.; et al. Interactive effects of multiple stressors in coastal ecosystems. Front. Mar. Sci. 2025, 11, 1481734. [Google Scholar] [CrossRef]
  2. Wu, D.; Luo, L. SVGS-DSGAT: An IoT-enabled innovation in underwater robotic object detection technology. Alex. Eng. J. 2024, 108, 694–705. [Google Scholar] [CrossRef]
  3. Hasan, M.J.; Kannan, S.; Rohan, A.; Shah, M.A. Exploring the Feasibility of Affordable Sonar Technology: Object Detection in Underwater Environments Using the Ping 360. arXiv 2024, arXiv:2411.05863. [Google Scholar]
  4. Zhou, J.; He, Z.; Lam, K.M.; Wang, Y.; Zhang, W.; Guo, C.; Li, C. AMSP-UOD: When vortex convolution and stochastic perturbation meet underwater object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7659–7667. [Google Scholar]
  5. Fu, C.; Fan, X.; Xiao, J.; Yuan, W.; Liu, R.; Luo, Z. Learning heavily-degraded prior for underwater object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6887–6896. [Google Scholar] [CrossRef]
  6. Li, Q.; Zhang, Y.; Fang, L.; Kang, Y.; Li, S.; Zhu, X.X. DREB-Net: Dual-stream Restoration Embedding Blur-feature Fusion Network for High-mobility UAV Object Detection. arXiv 2024, arXiv:2410.17822. [Google Scholar]
  7. Wu, J.; Jin, Z. Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 214–231. [Google Scholar]
  8. Zhang, Y.; Wu, Y.; Liu, Y.; Peng, X. CPA-Enhancer: Chain-of-Thought Prompted Adaptive Enhancer for Object Detection under Unknown Degradations. arXiv 2024, arXiv:2403.11220. [Google Scholar]
  9. Fan, Y.; Wang, Y.; Wei, M.; Wang, F.L.; Xie, H. FriendNet: Detection-Friendly Dehazing Network. arXiv 2024, arXiv:2403.04443. [Google Scholar]
  10. Wang, B.; Wang, Z.; Guo, W.; Wang, Y. A dual-branch joint learning network for underwater object detection. Knowl.-Based Syst. 2024, 293, 111672. [Google Scholar] [CrossRef]
  11. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  12. Pan, L.; Chen, G.; Liu, W.; Xu, L.; Liu, X.; Peng, S. LDCSF: Local depth convolution-based Swim framework for classifying multi-label histopathology images. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 1368–1373. [Google Scholar]
  13. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. Autoassign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496. [Google Scholar]
  14. Li, C.; Zhou, H.; Liu, Y.; Yang, C.; Xie, Y.; Li, Z.; Zhu, L. Detection-friendly dehazing: Object detection in real-world hazy scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8284–8295. [Google Scholar] [CrossRef]
  15. Gao, Z.; Hong, B.; Zhang, X.; Li, Y.; Jia, C.; Wu, J.; Wang, C.; Meng, D.; Li, C. Instance-based vision transformer for subtyping of papillary renal cell carcinoma in histopathological image. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part VIII 24. Springer: Berlin/Heidelberg, Germany, 2021; pp. 299–308. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  18. Dai, L.; Liu, H.; Song, P.; Liu, M. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222. [Google Scholar] [CrossRef]
  19. Zhang, Y.; Li, T.; Li, C.; Zhou, X. A Novel Driver Distraction Behavior Detection Method Based on Self-Supervised Learning with Masked Image Modeling. IEEE Internet Things J. 2023, 11, 6056–6071. [Google Scholar] [CrossRef]
  20. Jian, L.; Xiong, S.; Yan, H.; Niu, X.; Wu, S.; Zhang, D. Rethinking Cross-Attention for Infrared and Visible Image Fusion. arXiv 2024, arXiv:2401.11675. [Google Scholar]
  21. Tang, M.; Cai, S.; Lau, V.K. Radix-partition-based over-the-air aggregation and low-complexity state estimation for IoT systems over wireless fading channels. IEEE Trans. Signal Process. 2022, 70, 1464–1477. [Google Scholar] [CrossRef]
  22. Zhu, J.; Shi, Y.; Zhou, Y.; Jiang, C.; Chen, W.; Letaief, K.B. Over-the-Air Federated Learning and Optimization. IEEE Internet Things J. 2024, 11, 16996–17020. [Google Scholar] [CrossRef]
  23. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef]
  24. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [Google Scholar] [CrossRef]
  25. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  26. Li, Q.; Lu, L.; Li, Z.; Wu, W.; Liu, Z.; Jeon, G.; Yang, X. Coupled GAN with relativistic discriminators for infrared and visible images fusion. IEEE Sens. J. 2019, 21, 7458–7467. [Google Scholar] [CrossRef]
  27. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  28. Jiang, L.; Wang, Y.; Jia, Q.; Xu, S.; Liu, Y.; Fan, X.; Li, H.; Liu, R.; Xue, X.; Wang, R. Underwater species detection using channel sharpening attention. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4259–4267. [Google Scholar]
  29. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  30. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
  31. Liu, C.; Wang, Z.; Wang, S.; Tang, T.; Tao, Y.; Yang, C.; Li, H.; Liu, X.; Fan, X. A new dataset, Poisson GAN and AquaNet for underwater object grabbing. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2831–2844. [Google Scholar] [CrossRef]
  32. Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234. [Google Scholar] [CrossRef]
  33. Song, H.; Chang, L.; Chen, Z.; Ren, P. Enhancement-registration-homogenization (ERH): A comprehensive underwater visual reconstruction paradigm. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6953–6967. [Google Scholar] [CrossRef] [PubMed]
  34. Lu, X.; Li, B.; Yue, Y.; Li, Q.; Yan, J. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7363–7372. [Google Scholar]
  35. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
  36. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  37. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  38. Qiao, S.; Chen, L.C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10213–10224. [Google Scholar]
  39. Dai, L.; Liu, H.; Song, P.; Tang, H.; Ding, R.; Li, S. Edge-guided representation learning for underwater object detection. CAAI Trans. Intell. Technol. 2024. [CrossRef]
  40. Cao, J.; Cholakkal, H.; Anwer, R.M.; Khan, F.S.; Pang, Y.; Shao, L. D2det: Towards high quality object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11485–11494. [Google Scholar]
  41. Wang, J.; Zhang, W.; Cao, Y.; Chen, K.; Pang, J.; Gong, T.; Shi, J.; Loy, C.C.; Lin, D. Side-aware boundary localization for more precise object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 403–419. [Google Scholar]
  42. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  43. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  44. Wang, N.; Gao, Y.; Chen, H.; Wang, P.; Tian, Z.; Shen, C.; Zhang, Y. NAS-FCOS: Fast neural architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11943–11951. [Google Scholar]
  45. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  46. Lin, W.H.; Zhong, J.X.; Liu, S.; Li, T.; Li, G. Roimix: Proposal-fusion among multiple images for underwater object detection. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2588–2592. [Google Scholar]
  47. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164. [Google Scholar] [CrossRef]
  48. Liang, X.; Song, P. Excavating roi attention for underwater object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 2651–2655. [Google Scholar]
  49. Peng, Y.T.; Cao, K.; Cosman, P.C. Generalization of the dark channel prior for single image restoration. IEEE Trans. Image Process. 2018, 27, 2856–2868. [Google Scholar] [CrossRef]
  50. Jiang, J.; Ye, T.; Bai, J.; Chen, S.; Chai, W.; Jun, S.; Liu, Y.; Chen, E. Five A + Network: You Only Need 9K Parameters for Underwater Image Enhancement. arXiv 2023, arXiv:2305.08824. [Google Scholar]
  51. Wen, J.; Cui, J.; Zhao, Z.; Yan, R.; Gao, Z.; Dou, L.; Chen, B.M. Syreanet: A physically guided underwater image enhancement framework integrating synthetic and real images. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 5177–5183. [Google Scholar]
  52. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  53. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Figure 1. The influence of various preprocessing methods on object detection results for the UODD dataset.
Figure 2. The overall framework of the FFNLIE.
Figure 3. The network of the Swin Transformer block.
Figure 4. The workflow of the window partition and shifted window partition.
Figure 5. Visual comparison of feature maps using Grad-CAM with and without (second row) the LIE module. The darker the color in the image, the stronger the model’s attention to the current area.
Figure 6. Qualitative comparison results on the UODD dataset.
Figure 7. Qualitative comparison results on the UTDAC2020 dataset.
Figure 8. Qualitative comparison results based on the RUOD dataset.
Figure 9. Error analysis plots comparing the base method and the proposed FFNLIE on the UODD dataset. The first figure presents the plot for the basic method (not using feature fusion module and local information exchange module) and the second figure presents that for our method.
Figure 10. Error analysis plots comparing the base method and the proposed FFNLIE on the RUOD dataset. The first figure presents the plot for the basic method (not using feature fusion module and local information exchange module) and the second figure presents that for our method.
Table 1. Comparison results with other methods on the UODD dataset. The red and green markings indicate the best and second-best results for each column, respectively.

Methods | Backbone | AP | AP50 | AP75 | APS | APM | APL
UIE+UOD:
FunIE+Grid RCNN | ResNetXT101 | 36.2 | 73.1 | 31.7 | 18.4 | 35.5 | 55.6
FunIE+CSAM | DarkNet-53 | 45.3 | 80.4 | 47.8 | 32.2 | 45.6 | 54.3
FunIE+Swin | Swin-B | 48.3 | 85.1 | 49.7 | 32.2 | 45.2 | 58.2
ERH+Swin | Swin-B | 48.8 | 86.0 | 48.5 | 30.6 | 48.6 | 61.3
CNN:
FoveaBox | ResNet101 | 45.6 | 85.1 | 43.5 | 32.4 | 45.5 | 57.2
CSAM | DarkNet-53 | 49.1 | 88.4 | 48.3 | 34.0 | 49.9 | 61.2
AquaNet | MNet | 45.2 | 84.3 | 44.0 | 30.5 | 44.2 | 55.2
RFTM | ResNet50 | 50.8 | 89.0 | 53.6 | 33.6 | 50.9 | 62.8
Transformer:
PVTv1 | PVT-Medium | 45.8 | 85.4 | 43.6 | 32.0 | 45.8 | 56.9
PVTv2 | PVTv2-B4 | 47.5 | 88.1 | 46.6 | 31.6 | 48.1 | 54.5
GCC-Net | Swin-B | 48.9 | 88.8 | 48.1 | 35.1 | 48.8 | 58.6
Ours:
FFNLIE | Swin-B | 51.0 | 90.1 | 53.9 | 36.9 | 50.6 | 61.4
Table 2. Comparison results with other methods on the UTDAC2020 dataset. The red and green markings indicate the best and second-best results for each column, respectively.

Methods | AP | AP50 | AP75 | APS | APM | APL
Two-Stage Detector:
DetectoRS | 46.5 | 81.7 | 49.0 | 22.6 | 41.0 | 52.5
ERL | 48.3 | 83.2 | 51.3 | 25.1 | 43.0 | 54.2
D2Det | 43.5 | 80.2 | 42.6 | 18.3 | 37.8 | 49.5
SABL | 47.8 | 81.4 | 50.7 | 21.4 | 41.5 | 54.4
Single-Stage Detector:
RetinaNet | 41.4 | 76.6 | 39.6 | 16.9 | 34.8 | 48.0
SSD | 40.0 | 77.5 | 36.5 | 14.7 | 36.1 | 45.1
FSAF | 39.7 | 75.8 | 37.9 | 16.8 | 33.6 | 46.4
NAS-FCOS | 45.8 | 82.8 | 46.1 | 21.7 | 40.1 | 52.4
Ours:
FFNLIE | 48.7 | 84.8 | 51.5 | 19.9 | 43.2 | 55.2
Table 3. Comparison results with other methods on the UDD and RUOD datasets. The red and green markings indicate the best and second-best results for each column, respectively.

Methods | UDD AP | UDD AP50 | UDD AP75 | RUOD AP | RUOD AP50 | RUOD AP75
RoIMix | 28.0 | 64.9 | 18.8 | 54.6 | 81.3 | 60.3
Boosting | 28.4 | 64.3 | 18.8 | 53.9 | 80.6 | 59.5
ERL | 29.6 | 67.4 | 20.1 | 54.8 | 83.1 | 60.9
RoIAttn | 28.3 | 64.5 | 16.9 | 52.9 | 81.7 | 57.3
GCC-Net | 26.3 | 64.6 | 14.1 | 56.1 | 83.2 | 60.5
RFTM | 28.8 | 66.1 | 19.4 | 53.3 | 80.2 | 57.7
FFNLIE | 29.7 | 70.8 | 19.2 | 56.7 | 84.2 | 62.0
Table 4. Results on the model using different input data.

Domain 1 | Domain 2 | AP | AP50 | AP75
Raw | Null | 48.9 | 88.2 | 49.1
Raw | Raw | 49.1 | 88.6 | 49.5
Raw | Water-MSR | 51.0 | 90.1 | 53.9
Table 5. Results of using different underwater enhanced images in the model.

Domain 1 | Domain 2 | AP | AP50 | AP75
Raw | GDCP | 49.8 | 89.0 | 50.9
Raw | Five A+ | 51.1 | 90.0 | 53.1
Raw | SyreaNet | 50.0 | 88.7 | 51.1
Raw | Water-MSR | 51.0 | 90.1 | 53.9
Table 6. Ablation study of the UODD test set. The results marked in red represent the best results for each column.

Base | FF (DIAB, CIAB) | LIE | AP | AP50 | AP75
48.2 | 87.3 | 48.2
50.0 | 89.3 | 50.3
49.1 | 88.6 | 48.6
50.6 | 90.0 | 51.8
50.0 | 89.4 | 50.9
51.0 | 90.1 | 53.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
