1. Introduction
Marine exploration is of great significance to humanity. It serves as an indispensable technical foundation for both marine resource exploitation and marine ecosystem conservation. With the rapid advancement of marine robotics, vision-based underwater object detection has emerged as a cost-effective and promising approach to marine exploration, drawing extensive attention from the marine research and engineering communities [1,2]. Nevertheless, the complex underwater environment frequently causes low image contrast and severe color distortion, which greatly hinders the use and analysis of underwater images [3]; moreover, the small proportion of target pixels and complex background interference also pose considerable challenges to underwater object detection [4]. As illustrated in Figure 1 [5], the Jaffe–McGlamery model, built upon linear superposition and aquatic medium modeling, indicates that underwater imaging consists of reflected and scattered components [6]. The scattering induced by suspended particles in water blurs the image, obscures critical details such as target edges and textures, and degrades the recognition and localization accuracy of detection algorithms.
The selective absorption of light at different wavelengths by water causes obvious color deviation in underwater images. This distorts the color features of both targets and backgrounds and interferes with accurate feature extraction by detection algorithms. The light attenuation characteristics in water are presented in Figure 2. Within the visible spectrum, absorption by water increases with wavelength, and its effect becomes more prominent as water depth increases. Red light, which has the longest wavelength and lowest frequency, is absorbed first, so most captured underwater images exhibit an overall blue-green tone [7].
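The depth-dependent color cast described above can be illustrated with a simple exponential (Beer-Lambert) attenuation model. The sketch below is illustrative only: the per-channel coefficients in `ATTENUATION` are hypothetical placeholder values, chosen so that red attenuates fastest, and are not measured data from this paper.

```python
import math

# Hypothetical per-channel attenuation coefficients (1/m). They are chosen only
# so that red > green > blue, mirroring the qualitative behavior in the text.
ATTENUATION = {"red": 0.40, "green": 0.07, "blue": 0.03}

def attenuate(rgb, distance_m):
    """Apply exponential attenuation I = I0 * exp(-c * d) to each channel."""
    return {
        ch: value * math.exp(-ATTENUATION[ch] * distance_m)
        for ch, value in rgb.items()
    }

white = {"red": 1.0, "green": 1.0, "blue": 1.0}
for d in (1, 5, 10):
    out = attenuate(white, d)
    print(d, {ch: round(v, 3) for ch, v in out.items()})
```

With these placeholder coefficients, the red channel fades much faster than green and blue as the path length grows, which is the blue-green shift the paragraph refers to.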
Enhancing underwater images to correct color deviation can reduce feature matching errors caused by color distortion in object detection tasks. Improvements in image contrast, saturation and sharpness strengthen the edge, texture and structural features of targets, making small and weak-textured objects more distinguishable and mitigating missed detections of tiny underwater targets. Image enhancement as a preprocessing step can therefore lay a solid foundation for subsequent object detection. Traditional image enhancement methods such as histogram equalization [8] and wavelet transform [9] are derived from visual prior assumptions. Although these methods can improve image definition and contrast to some extent, they cannot cope effectively with illumination variations and severe color distortion against complex underwater backgrounds [10]. In recent years, deep learning-based image enhancement methods have learned scene-adaptive features from ample training data. Such methods can restore fine details, including image textures and edges, and alleviate the color deviation of underwater images [11]. However, these methods pay little attention to key targets such as sea cucumbers and sea urchins during enhancement and ignore the detailed edge features of objects, which is unfavorable to downstream tasks [12]. This paper proposes a ResNet Block Enhanced-CycleGAN (RBE-CycleGAN). Built on the basic framework of CycleGAN, the proposed network embeds the Channel Attention and Spatial Attention Block (CASAB) into its residual blocks. This design enables the model to concentrate on the feature representation of key target regions, retain valid feature information and suppress redundant noise. By learning the mapping between low-quality underwater images and high-quality clear ones from sufficient data, the method realizes end-to-end underwater image enhancement.
Most objects of interest in underwater detection tasks, such as sea cucumbers, sea urchins, scallops and starfish, are physically small and occupy only a tiny proportion of image pixels. In the image domain, targets smaller than 32 × 32 pixels are generally defined as small objects [13]. When detecting small objects, repeated convolution and downsampling operations gradually reduce the spatial size of feature maps and weaken the characteristic information of targets. Insufficient visual cues further hinder the extraction of discriminative features for object classification and localization [14]. The complex underwater environment also severely interferes with small object detection. For instance, similar tones between targets and backgrounds make small objects hard to identify, and similar texture distributions cause targets to be confused with background regions. Meanwhile, occlusion in complicated underwater scenes further increases the difficulty of accurate object detection [15].
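As a rough illustration of why downsampling hurts small objects, the sketch below interprets the 32 × 32 criterion as an area threshold (an assumption about how the definition is applied) and tracks how many feature-map cells an object spans after repeated stride-2 downsampling. The function names are hypothetical.

```python
def is_small_object(width_px, height_px, threshold=32):
    """Small-object test, reading the 32 x 32 criterion as an area threshold."""
    return width_px * height_px < threshold * threshold

def footprint_after_downsampling(size_px, num_stride2_layers):
    """Approximate side length, in feature-map cells, after stride-2 layers."""
    return size_px / (2 ** num_stride2_layers)

# A 24-pixel-wide target shrinks below one cell after five stride-2 stages.
for n in range(6):
    print(n, footprint_after_downsampling(24, n))
```

After five stride-2 stages, a 24-pixel target covers less than one feature-map cell, which is the loss of spatial evidence that the paragraph describes.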
Due to the low proportion of effective pixels of small targets, single-stage detectors are prone to missed detection and false detection. Therefore, this paper adopts the two-stage detection algorithm Faster R-CNN with MobileNetV2 as the backbone network for underwater small object detection. To address the loss of small target information caused by multiple downsampling operations, a multi-scale feature fusion strategy is adopted. The original backbone network only feeds the final-layer feature map into the detection head for proposal generation. Nevertheless, the excessive spatial shrinkage of the last feature map seriously limits the effective feature extraction of small targets. Considering that the edge and texture features of small targets mainly reside in middle-level features, the proposed Multi-Scale Feature Dilated Convolution Network (MSFDC-Net) fuses the third, fourth and fifth inverted residual features extracted from the backbone network. These layers contain abundant edge and texture clues of targets, which can supplement feature information for tiny objects and alleviate feature loss. The fused multi-scale features are subsequently delivered to the detection head for final detection.
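The fusion of middle-level and deep features can be sketched on toy data. The example below assumes an FPN-style top-down scheme (nearest-neighbor upsampling followed by element-wise addition) on single-channel maps whose sizes halve at each level; the exact fusion operator of MSFDC-Net is not specified in this paragraph, so this is only one plausible reading.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a 2D list by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse_top_down(shallow, middle, deep):
    """Fuse three single-channel maps whose sizes halve at each level."""
    middle_fused = add_maps(middle, upsample_nearest(deep, 2))
    return add_maps(shallow, upsample_nearest(middle_fused, 2))

shallow = [[1] * 4 for _ in range(4)]   # 4 x 4 map, fine detail
middle = [[10] * 2 for _ in range(2)]   # 2 x 2 map
deep = [[100]]                          # 1 x 1 map, coarse semantics
fused = fuse_top_down(shallow, middle, deep)
print(fused[0])
```

Every cell of the fused map carries contributions from all three levels, so fine-resolution positions keep their detail while receiving deeper semantic context.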
To tackle the occlusion and background interference faced by small targets, as well as the limited receptive field of standard convolution in feature extraction, this paper designs a parallel dilated convolution module integrated with coordinate attention. This module applies convolutions with different dilation rates in parallel to extract multi-scale surrounding features, which expands the receptive field and establishes an effective spatial dependency between targets and their surrounding context. Meanwhile, the Coordinate Attention (CA) [16] mechanism is introduced into the module. Through pooling operations along the horizontal and vertical dimensions, CA encodes positional information across the feature map and computes channel-wise attention weights for each pixel. This enables the network to adaptively focus on small target regions, suppress background interference and redundant features, and alleviate the problem that small targets are easily occluded by complex surroundings and submerged in irrelevant background information. The main innovations of this paper are summarized as follows:
Aiming at the blurring degradation and color deviation of small objects induced by complex underwater environment, this paper proposes an underwater image enhancement algorithm named RBE-CycleGAN. On the basis of CycleGAN, the generator architecture is optimally redesigned. By embedding the CASAB attention mechanism into residual connection blocks, the proposed method achieves effective sharpness enhancement and color restoration for key target objects.
To solve the problem of small target information loss caused by repeated downsampling, this paper proposes an underwater object detection algorithm called MSFDC-Net. It optimizes the feature extraction structure of the backbone network and fuses partial middle-layer and deep-layer features. This approach effectively compensates for the lost detailed information of small targets and improves the model’s perception capability for tiny underwater objects.
Aiming at the limited receptive field of single-scale convolution in the Region Proposal Network (RPN) Head, this paper proposes a parallel dilated convolution module integrated with coordinate attention. It enables effective capture of spatial context around targets and accurately models the spatial dependency between objects and their surrounding environment.
This paper is organized in accordance with the sequential order of two major tasks: underwater image enhancement and object detection. It first introduces the basic theories and improved strategies of the underwater image enhancement algorithm. Subsequently, the relevant theories and improvement schemes of underwater object detection are elaborated. Finally, comparative experiments and ablation experiments are conducted to fully verify the effectiveness of the proposed RBE-CycleGAN in correcting color deviation and improving image clarity, as well as the performance improvement effect of MSFDC-Net on underwater object detection.
2. Related Works
2.1. Underwater Image Enhancement Algorithm
Underwater image enhancement mainly aims to correct image blurring and color distortion caused by light scattering from suspended particles and selective light absorption in water. With preprocessing operations such as deblurring and enhancement of edge and texture details, the color distribution of underwater images can be restored to be closer to that of images captured in air. Such preprocessing improves image clarity and visual quality, facilitates the extraction of object feature information, and further boosts the performance of subsequent underwater object detection tasks [17].
Most traditional underwater image enhancement methods rely on fixed mathematical and physical models to directly correct and process image pixel values. Huang et al. [18] proposed a relative global histogram stretching algorithm with adaptive parameter updating. Aiming at the low contrast and color shift problems of underwater images, the method adaptively adjusts the stretching range of the histogram to improve visual visibility and restore scene details, while effectively avoiding artifacts and extra noise introduced during the enhancement process. Restricted by the processing logic of global transformation, this algorithm can hardly take into account the differences among local image regions, and presents limited refinement and enhancement effects on weak-texture areas and small-sized targets.
Drews et al. [19] put forward an improved underwater dark channel prior algorithm to address the inaccurate transmittance estimation of the conventional dark channel prior method in underwater scenes. This method removes the interfering red channel and builds the dark channel prior using only the blue and green channels, which greatly improves the accuracy of transmittance estimation in underwater scenes. However, the algorithm depends heavily on the inherent texture information of the captured scene and thus exhibits unsatisfactory robustness in complex and dynamically changing underwater environments.
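The green-blue dark channel idea can be sketched in a few lines. This is a simplified illustration of the dark channel computation only, written from the description above; it omits the transmittance estimation and restoration steps, and the function name and data layout (`image[y][x] = (r, g, b)`) are assumptions made for the example.

```python
def underwater_dark_channel(image, patch=3):
    """Dark channel from green and blue only; image[y][x] = (r, g, b)."""
    h, w = len(image), len(image[0])
    r = patch // 2
    # Per-pixel minimum over the green and blue channels (red is ignored).
    gb_min = [[min(px[1], px[2]) for px in row] for row in image]
    dark = []
    for y in range(h):
        row = []
        for x in range(w):
            # Minimum over the local window, clipped at the image border.
            window = [
                gb_min[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(min(window))
        dark.append(row)
    return dark

image = [
    [(0.0, 0.5, 0.6), (0.0, 0.4, 0.9)],
    [(0.0, 0.8, 0.7), (0.0, 0.9, 0.3)],
]
print(underwater_dark_channel(image, patch=3))
```

Because the red values are never read, a strongly attenuated red channel cannot drive the dark channel to zero, which is the motivation stated for excluding it.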
Chiang et al. [20] presented a wavelength compensation and dehazing method. According to the differential attenuation rule of light wavelengths underwater, the approach compensates and restores severely degraded color channels via wavelength difference correction. Combined with the atmospheric scattering dehazing model, it inversely deduces the scattering components in water, so as to rectify the image quality degradation caused by turbid underwater conditions.
Li et al. [21] proposed a GBR-based dehazing and red channel correction method. The scheme performs transmittance refinement and dehazing on the blue and green channels, and applies independent color gain compensation to the red channel, which suffers severe attenuation and distortion. This approach effectively alleviates low visibility and insufficient contrast in underwater images. Nevertheless, the method differentiates its processing only along the channel dimension and applies a single global correction in the spatial dimension, which limits its enhancement performance for local image regions.
Deep learning-based image enhancement methods learn feature representations of diverse scenes from massive datasets. They can restore texture, edge and other fine details, correct the color deviation of underwater images, and further improve their visual quality and practical applicability for subsequent object detection tasks. Li et al. [22] proposed the UWCNN algorithm for underwater image enhancement to address severe light scattering, an overall greenish tone and reduced clarity in underwater images. Built on underwater scene priors, the convolutional processing module of this method can effectively suppress color distortion and improve the sharpness of underwater images. However, the model is trained mainly on synthetic datasets, which leads to pixel over-compensation when dealing with real underwater images under severe attenuation conditions.
Wang et al. [23] proposed UIEC^2-Net, a convolutional network with joint optimization of dual color spaces RGB and HSV. With a multi-branch convolutional structure, the network separately fulfills the tasks of image denoising, color deviation correction, as well as brightness and saturation adjustment. Among them, the RGB module is responsible for noise suppression and color bias removal; the global HSV adjustment module tunes image attributes through neural curve layers; and the attention mapping module fuses RGB and HSV outputs to generate visually clear images. Even so, this method delivers unsatisfactory enhancement results for underwater images with yellowish and bluish tones. When coping with severely color-shifted underwater images under extreme conditions, the H channel is prone to local color distortion.
Zhu et al. [24] put forward the Cycle-Consistent Generative Adversarial Network (CycleGAN). Adopting a dual-generator and dual-discriminator architecture, the model leverages bidirectional adversarial learning together with cycle consistency constraints. It is capable of modeling both image degradation and enhancement processes with unpaired data, providing a viable unsupervised learning paradigm for underwater image enhancement tasks. Li et al. [25] designed an unsupervised generative adversarial network named WaterGAN. The generator of this method is built upon the physical propagation model of underwater light. By modeling three key processes including light attenuation, scattering and camera imaging effect, it achieves color correction for underwater images under the condition of unpaired training data.
Liu et al. [26] proposed a multi-scale feature fusion algorithm named MLFcGAN to address the unsatisfactory enhancement performance of networks that rely merely on local features. In the encoder stage, the algorithm adopts residual structures to extract multi-scale local features and acquires global semantic information through gradual downsampling. By fusing global semantics with multi-scale local features, the method improves the accuracy of color deviation correction and better preserves the texture details of underwater images.
In summary, the underwater image enhancement algorithms reviewed above share two common limitations: insufficient feature extraction capability and a lack of effective attention to key target regions. Most network structures designed merely on underwater scene priors do not include in-depth feature mining modules, making it difficult to explore the deep semantic features of underwater images accurately. Although UIEC^2-Net incorporates an attention mechanism, it applies attention only to the global adjustment of color, brightness and saturation; it does not focus on salient target areas or enhance object details in a targeted manner, which is unfavorable to downstream detection tasks. To enhance the feature extraction ability of the enhancement network and strengthen its focus on key target regions, this paper builds the model on cycle-consistent generative adversarial networks. The CASAB attention mechanism is embedded into the residual connection blocks. Through joint modeling of channel and spatial attention, the semantic feature channels related to targets are enhanced, and key object regions are located automatically while background interference is suppressed. Meanwhile, global consistency constraints are maintained to achieve high-quality underwater image enhancement.
2.2. Underwater Target Detection Algorithm
Compared with traditional methods that manually design and extract low-level visual features, deep learning maps raw image pixel information into high-level and abstract semantic features layer by layer. It exhibits superior feature representation ability and strong generalization robustness in underwater object detection tasks.
Zhang et al. [27] proposed a lightweight underwater object detection algorithm based on YOLOv4 and multi-scale attention feature fusion. The algorithm adopts MobileNetV2 as the backbone network, and reconstructs the Neck and Head modules with depthwise separable convolution to reduce network parameters. Meanwhile, the Mish activation function is used to replace ReLU6, which enhances the nonlinear representation capability of the model. Li et al. [28] presented a novel detection model named CME-YOLOv5. By replacing the original C3 module with the improved C3CA module and adding an extra detection branch, the model strengthens the feature extraction capability for small targets and achieves better detection performance. In addition, EIOU loss is adopted to replace GIOU loss, which improves the bounding box regression speed and positioning accuracy. Nevertheless, this algorithm is optimized only for dense small fish targets, and thus shows poor adaptability when transferred to other underwater detection scenarios. Wang et al. [29] developed a lightweight underwater object detection algorithm named B-YOLOX-S. At the network input, Poisson blending is used to balance the quantity of training samples, and wavelet transform is introduced to restore image quality. In the Neck part, BiFPN-S is embedded to realize multi-scale feature fusion and raise the utilization of multi-scale information, which further improves the detection performance for tiny underwater targets. Based on YOLOv5s, Hua et al. [30] proposed an underwater object detection method with feature enhancement and progressive dynamic aggregation. This method designs a feature enhancement gating module. Combined with the attention mechanism, it can selectively strengthen valid feature information and suppress background noise. Meanwhile, dynamic weighted fusion is conducted between adjacent network layers to avoid the loss of feature details of small targets. Yang et al. [31] proposed a novel underwater detection network based on structural reparameterization. The method designs a cone-rod module that integrates depthwise convolution and standard convolution to simulate the visual perception mechanism of the human eye. It combines channel and spatial attention to suppress the interference of underwater optical noise, and adopts WIoU loss to optimize the regression accuracy of bounding boxes. Ouyang et al. [32] put forward a lightweight underwater object detection method with deformable upsampling. This work constructs a hybrid backbone that integrates CNN and Transformer structures, striking a balance between global feature extraction and network lightweighting. A deformable upsampling module is designed to replace the traditional fixed spatial upsampling scheme. This design lowers deviation in multi-scale feature fusion and better accommodates the geometric deformation of underwater targets. On this basis, the method achieves clear performance gains in the detection of tiny marine objects.
By summarizing existing research on small underwater target detection, two major challenges remain to be solved. Frequent downsampling operations easily erase feature details of tiny targets. It becomes difficult for networks to retain their fine-grained information. Standard convolutions also have narrow receptive fields. They cannot build effective spatial location modeling for small targets. As a result, models fail to lock key target areas and filter background noise. Localization accuracy drops significantly under such constraints. To address these problems, this work takes MobileNetV2 as the backbone and introduces a multi-scale feature fusion scheme. It compensates missing critical features of small targets from diverse scale dimensions. We further design an attention-embedded multi-branch parallel convolution module for the detection head. The module precisely locates underwater small targets, suppresses complicated background disturbances, and remedies the weak spatial representation ability of miniature underwater objects.
2.3. Underwater Image Feature Analysis
Suspended sediment and planktonic impurities in water damage the quality of underwater images. Backscattering from suspended particles blurs image details, while the selective light absorption of water bodies causes severe color distortion. As water depth increases, the overall image tone gradually changes from yellow to green and finally to blue. The complex image degradation restricts the performance of subsequent object detection tasks. Image blurring weakens the network’s ability to extract edge and texture features of targets. Color deviation makes foreground objects visually similar to the background, which makes it hard for the detection model to distinguish targets from background interference and ultimately causes a high missed detection rate of underwater objects.
Classical underwater image enhancement methods have limitations. Histogram equalization only adjusts the gray distribution of image pixels; wavelet transform is mainly used for simple noise reduction; and dark channel prior achieves image defogging based on physical imaging rules. Although these methods can improve image quality to some degree, they cannot effectively solve the inherent color deviation problem of underwater images. Moreover, all of them adopt global pixel correction schemes, without targeted optimization for edge contours and texture details of detection targets.
Most conventional algorithms for underwater small target detection still adopt a single-path feature extraction architecture, including Faster R-CNN, YOLOv4 and YOLOv5. Such methods cannot address the inherent problems of small targets, such as low proportion of pixels and feature loss induced by repeated downsampling operations. In addition, standard convolution has a constrained receptive field. When detecting small targets, it fails to fully capture spatial context information around target regions, making it difficult to separate foreground objects from complex background interference.
Most existing methods fail to address the above defects in underwater image enhancement and object detection. To tackle the problems of color distortion and untargeted enhancement, this paper proposes RBE-CycleGAN, which embeds the CASAB mechanism into residual blocks. The model realizes end-to-end color correction of degraded underwater images using unpaired training data. To address the low pixel proportion of small targets and the feature loss caused by repeated downsampling, this paper improves the classic Faster R-CNN framework. A multi-scale feature fusion strategy is introduced into the backbone network to make up for the lost detail information of small targets. Furthermore, the RPN detection head is optimized by adopting parallel dilated convolution combined with a coordinate attention module, which overcomes the limited receptive field of the conventional 3 × 3 convolution kernel. The overall technical framework of this paper is illustrated in Figure 3.
3. Underwater Image Enhancement and Small Target Detection Algorithm Based on RBE-CycleGAN and MSFDC-Net
3.1. Underwater Image Enhancement Algorithm Based on RBE-CycleGAN
Aiming at the pervasive color distortion problem in underwater object detection tasks, this paper proposes the RBE-CycleGAN algorithm, which enhances the residual block by introducing the CASAB attention mechanism. In the channel dimension, different weights are assigned to each feature channel to suppress noise interference and strengthen the network’s ability to extract features from key image regions. In the spatial dimension, the feature receptive field is expanded to capture richer contextual information. By learning the bidirectional nonlinear mapping between the low-quality domain and high-quality domain with sufficient unpaired underwater data, the proposed method achieves end-to-end underwater image enhancement.
3.1.1. Image Enhancement Basic Theory
Generative adversarial networks mainly consist of three core modules: generator, discriminator and loss function [24]. The generator learns the degradation mechanism of underwater images from massive image data, and constructs a nonlinear mapping model from the low-quality domain to the high-quality target domain. It takes original low-quality underwater images as input and outputs enhanced high-quality results.
The discriminator is responsible for judging the domain attribution of samples, so as to distinguish real high-quality underwater images from synthetic samples generated by the generator. During the training process, the generator and discriminator form an adversarial iterative relationship. As the classification criteria of the discriminator become increasingly stringent, it reversely drives the generator to optimize the mapping model, making the generated samples continuously approximate the expected distribution of high-quality images, and finally realizing effective enhancement of low-quality underwater images.
The loss function measures the distribution difference between generated data and real data to evaluate the quality of synthesized images. It updates the parameters of the generator and discriminator through the back propagation algorithm, serves as the evaluation criterion for their performance, and guides the two modules to achieve continuous iterative optimization in the adversarial training process.
3.1.2. CASAB Attention Mechanism
The attention mechanism can focus on key information and filter redundant information during image feature extraction, thereby improving the efficiency of feature representation. Reasonable structural design of attention modules is of great significance for image generation tasks. Inspired by the Convolutional Block Attention Module (CBAM) [33], this paper introduces a dual-branch attention mechanism integrating channel attention and spatial attention to strengthen critical feature channels and important spatial regions.
Considering that the attention mechanism is mainly used for feature selection, and residual connections should keep the identity features intact to prevent the loss of original information, the CASAB module is embedded into each residual block of the generator. It is placed after two convolutional layers and before the addition of residual skip connections, acting as a selective gate for the learned residual features. This attention mechanism enhances informative feature channels via the channel attention module and highlights critical spatial regions through the spatial attention module. The resulting attention-weighted features are then incorporated into the residual connection. During feature propagation, the skip connection forms a direct gradient highway in backpropagation, while the CASAB branch delivers attention-weighted gradients. Such a layout of the attention mechanism enables effective screening of specific channels and regions, and meanwhile guarantees stable model training.
In the CASAB channel attention branch, global average pooling is adopted to capture contextual information of feature maps, and global maximum pooling is used to acquire salient regional features. The pooled features are processed through fully connected (FC) layers and the Sigmoid activation function to generate channel attention weights.
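A minimal sketch of the channel attention branch follows. It assumes a CBAM-style combination in which the average-pooled and max-pooled vectors pass through the same FC layer and are summed before the Sigmoid; the text does not state how the two pooled vectors are merged, and the single bias-free FC layer and its identity weights are hypothetical simplifications.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def global_pool(channels):
    """Return per-channel (average, maximum) over all spatial positions."""
    avgs, maxs = [], []
    for ch in channels:
        flat = [v for row in ch for v in row]
        avgs.append(sum(flat) / len(flat))
        maxs.append(max(flat))
    return avgs, maxs

def fc(vector, fc_weights):
    """Fully connected layer without bias: fc_weights[out][in]."""
    return [sum(w * v for w, v in zip(row, vector)) for row in fc_weights]

def channel_attention(channels, fc_weights):
    """Assumed combination: sigmoid(FC(avg) + FC(max)), one weight per channel."""
    avgs, maxs = global_pool(channels)
    return [
        sigmoid(a + m)
        for a, m in zip(fc(avgs, fc_weights), fc(maxs, fc_weights))
    ]

features = [
    [[0.0, 0.0], [0.0, 0.0]],   # uninformative channel
    [[1.0, 3.0], [2.0, 2.0]],   # active channel
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical FC weights
ca_weights = channel_attention(features, identity)
print(ca_weights)
```

The active channel receives a weight close to 1, while the all-zero channel stays at the neutral value 0.5, showing how the branch ranks channels by their pooled response.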
For spatial attention, mean pooling, maximum pooling, minimum pooling and sum pooling are performed on each channel of the feature map to extract spatial feature information. All pooled results are concatenated along the channel dimension. Large-kernel depthwise convolution is applied to reduce computational complexity while expanding the receptive field, so as to capture richer spatial context information. Spatial feature weights are finally obtained through the activation function.
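The four pooling operations of the spatial branch can be sketched as follows. The example assumes that pooling is taken across the channel dimension at every pixel, so each operation yields one H × W map and the four maps are stacked along the channel axis; the large-kernel depthwise convolution and the activation are left out, and the function name is hypothetical.

```python
def spatial_pool_maps(channels):
    """Pool across channels at every pixel; returns [mean, max, min, sum] maps."""
    h, w = len(channels[0]), len(channels[0][0])

    def pooled(reduce_fn):
        return [
            [reduce_fn([ch[y][x] for ch in channels]) for x in range(w)]
            for y in range(h)
        ]

    def mean(values):
        return sum(values) / len(values)

    # The returned list is the channel-wise concatenation of the four maps.
    return [pooled(mean), pooled(max), pooled(min), pooled(sum)]

channels = [
    [[1.0, 2.0]],   # channel 0, a 1 x 2 map
    [[3.0, 6.0]],   # channel 1
]
stacked = spatial_pool_maps(channels)
print(stacked)
```

Under this reading, the input to the depthwise convolution always has four channels regardless of how many channels the feature map has.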
The original feature map is multiplied with channel attention weights and spatial attention weights respectively to generate channel-weighted and spatially weighted feature maps. The two weighted feature maps are then added element-wise to obtain the final output of the CASAB attention mechanism. The network structure of CASAB is shown in Figure 4 [34].
3.1.3. RBE-CycleGAN Network Structure
As shown in Figure 5, the generator adopts a ResNet-based residual structure.
The input image is first subjected to edge padding to avoid the loss of edge information caused by convolution operations. Subsequently, a large-kernel convolution is applied to extract large-scale background and texture features of underwater images. Two downsampling operations with a convolution kernel size of 3 and a stride of 2 are performed to reduce the spatial size of the feature map and further extract middle-level features such as local edges, corner points and texture details. Nine ResNet residual blocks are stacked to capture high-level deep features including complete target contours and semantic structural information. Two deconvolution operations are used to restore the original image resolution and map the extracted high-level semantic features back to pixel space. Finally, a large-kernel convolution is adopted to compress the channel dimension of the feature map, and the Tanh activation function is employed to constrain the range of pixel values, yielding the final enhanced generated image.
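The resolution changes through the generator can be checked with the standard convolution size formulas. The sketch below assumes a 256 × 256 input, a 7 × 7 large kernel with padding 3, padding 1 for the stride-2 convolutions, and transposed convolutions with output padding 1; only the kernel size 3 and stride 2 of the downsampling layers come from the text, the other values are assumptions borrowed from common CycleGAN implementations.

```python
def conv_out(n, k, s=1, p=0):
    """Output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def deconv_out(n, k, s=1, p=0, output_padding=0):
    """Output size of a transposed convolution."""
    return (n - 1) * s - 2 * p + k + output_padding

def generator_sizes(n):
    """Spatial size after each resolution-changing stage of the generator."""
    sizes = [n]
    n = conv_out(n, k=7, p=3)                  # large-kernel stem (assumed 7 x 7)
    sizes.append(n)
    for _ in range(2):                         # two stride-2 downsampling convs
        n = conv_out(n, k=3, s=2, p=1)
        sizes.append(n)
    # The nine residual blocks keep the spatial size unchanged.
    for _ in range(2):                         # two transposed convs
        n = deconv_out(n, k=3, s=2, p=1, output_padding=1)
        sizes.append(n)
    n = conv_out(n, k=7, p=3)                  # large-kernel output conv
    sizes.append(n)
    return sizes

print(generator_sizes(256))
```

Under these assumptions the residual blocks operate at one quarter of the input resolution, and the two transposed convolutions restore the original size exactly.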
As shown in Figure 6, the discriminator follows the PatchGAN [24] paradigm to expand the receptive field and maintain global consistency in patch-level discrimination. Instead of directly judging the authenticity of the entire image, PatchGAN applies multi-layer convolution and downsampling to extract deep features from the input image. Each element of the output feature map gives a real-fake prediction for one local patch of the image. The overall discrimination result is obtained by averaging the prediction scores of all patches.
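The patch-level behavior can be made concrete with a receptive field calculation and the score averaging step. The layer list below is the widely used 70 × 70 PatchGAN configuration and is an assumption, since the text does not list the discriminator layers; the function names are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of one output unit; layers = [(kernel, stride), ...]."""
    rf = 1
    # Walk from the output back to the input.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed layer configuration (common 70 x 70 PatchGAN), not taken from the text.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

def image_score(patch_scores):
    """Average the per-patch predictions into one image-level score."""
    flat = [s for row in patch_scores for s in row]
    return sum(flat) / len(flat)

print(receptive_field(PATCHGAN_LAYERS))
print(image_score([[1.0, 0.0], [1.0, 1.0]]))
```

With this configuration each output element judges a 70 × 70 input patch, and the image-level decision is simply the mean of those local judgments.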
3.1.4. RBE-CycleGAN Loss Function
The loss function consists of the adversarial cycle loss, the identity consistency loss and the cycle consistency loss. The adversarial cycle loss drives the adversarial training between the generator and discriminator by measuring the difference between original images and generated images, so as to optimize the two networks. The identity consistency loss constrains the difference between original images and the outputs produced when they are fed to the generator of their own domain, which ensures that the generator learns the correct image mapping and does not over-modify images that already belong to the target domain. The cycle consistency loss compares original images with the images reconstructed by the generators, ensuring that the generator does not excessively alter image content and preserves the original image structure. The total loss function is the weighted sum of the above three loss functions, as shown in Equation (1):
where
and
denote the generators of the high-quality domain and low-quality domain, respectively;
represents the discriminator of the high-quality domain; and
denotes the discriminator of the low-quality domain.
is the adversarial cycle loss,
represents the cycle consistency loss, and
stands for the identity consistency loss.
and
are the weight coefficients corresponding to the above loss functions, which are set to
and
in the equation.
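The weighted summation of the three losses can be sketched as follows. This is an illustrative example only: the L1 distance for the cycle and identity terms and the default weights of 10 and 5 are assumptions taken from common CycleGAN practice, not values confirmed by the text, and the function and argument names are hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def total_loss(adv_loss, real, reconstructed, identity_out,
               lambda_cyc=10.0, lambda_idt=5.0):
    """Adversarial term plus weighted cycle and identity terms.

    lambda_cyc and lambda_idt are assumed defaults, not the paper's settings.
    """
    cyc = l1_distance(real, reconstructed)    # original vs. image restored after a full cycle
    idt = l1_distance(real, identity_out)     # original vs. output of the intra-domain generator
    return adv_loss + lambda_cyc * cyc + lambda_idt * idt


# Toy two-pixel "images" with a precomputed adversarial term of 0.5.
loss = total_loss(adv_loss=0.5,
                  real=[0.0, 1.0],
                  reconstructed=[0.1, 0.9],
                  identity_out=[0.0, 0.8])
print(round(loss, 3))
```

A larger cycle weight penalizes any change that cannot be undone by the reverse generator, which is what keeps image content and structure intact during enhancement.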
3.2. Underwater Target Detection Based on MSFDC-Net
To address the small pixel proportion of underwater small targets and the frequent missed detections of single-stage detectors, this paper adopts a two-stage detection framework. Based on Faster R-CNN, an improved network named MSFDC-Net is proposed. It redesigns the feature extraction mode of the backbone network and adopts the FPN strategy. Middle-level features containing target edges and texture structures are selected from the inverted residual blocks and fused with high-level features that carry global semantic information of targets. The fused multi-scale features are fed into the RPN Head to generate candidate detection boxes. Furthermore, the structure of the RPN Head is optimized to alleviate the limited receptive field caused by single-scale convolution: a parallel dilated convolution module embedded with coordinate attention is designed. The attention mechanism screens potential target regions, while dilated convolutions with different dilation rates extract target features in parallel. The resulting features are then fused to further expand the receptive field for small targets.
3.2.1. Object Detection Basic Theory
As a classic two-stage object detection network, Faster R-CNN [35] consists of four core components: backbone network, Region Proposal Network (RPN), ROI Pooling, and detection classification head. The backbone network encodes the input image to obtain deep feature maps with rich semantic information. The RPN generates a series of candidate regions containing potential targets on the feature maps, preliminarily distinguishes foreground from background, and completes coarse target localization. The ROI Pooling layer maps candidate regions of different scales to the feature maps and achieves accurate feature alignment, so as to retain effective feature information of small targets. Finally, the detection head performs category classification and precise bounding box regression on the features of candidate regions to output the final detection results.
3.2.2. Multi-Scale Feature Fusion Module
Shallow residual blocks extract corner points, edges and other detailed information of underwater targets; middle-level residual blocks capture local contours and shape fragments; and deep residual blocks learn complete target contours and global semantic features. To visualize the features extracted by convolutional layers at different levels, this paper adopts heat maps to mark the attention regions of multi-scale convolution layers, as illustrated in Figure 7.
The original MobileNetV2 backbone performs five downsampling operations, for an overall spatial reduction of 32×. After feature extraction via the backbone, a small object of 32 × 32 pixels is therefore compressed into a single pixel. At this extreme scale, almost all discriminative cues such as object edges, textures and geometric shapes vanish. If deep features are used alone, the network can barely confirm the presence of targets and delivers inferior localization accuracy, resulting in severe missed detections of small objects.
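The arithmetic behind this compression is simple enough to tabulate. The sketch below assumes each downsampling halves the spatial resolution, as in a stride-2 convolution. The helper name is illustrative.

```python
def footprint(object_size, num_downsamplings):
    """Side length, in feature-map cells, of an object after repeated stride-2 downsampling."""
    return object_size / (2 ** num_downsamplings)


# A 32 x 32-pixel object after 0 to 5 halvings of the spatial resolution.
for n in range(6):
    side = footprint(32, n)
    print(f"{n} downsamplings (stride {2 ** n:>2}): {side:g} x {side:g} cells")
```

After three, four and five halvings the object covers 4 × 4, 2 × 2 and 1 × 1 cells respectively, which is why features taken before the last downsampling stages still carry usable spatial detail.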
To address the loss of small target features caused by repeated downsampling, this paper fuses selected shallow and middle-level features with deep features to supplement the missing information of small targets. The Feature 6 layer corresponds to small targets mapped to a 4 × 4-pixel resolution, retaining shallow-level feature information such as target corners and edges; the Feature 10 layer matches small targets at 2 × 2 pixels, carrying local object contours and abundant semantic information; and the Feature 13 layer also corresponds to 2 × 2-pixel small targets, with high-level semantic features that enable superior category classification. For instance, element-wise addition between Feature 10 and Feature 6 via the MSF module injects semantic information into Feature 6. This allows Feature 6 to distinguish target edges from background textures within each 4 × 4-pixel region, thereby improving the network's overall target discrimination performance. The specific implementation process is shown in Figure 8.
First, a 1 × 1 convolution is applied to the deepest feature map, Feature 18, to adjust its channel dimension to 256. The processed feature map is then divided into two branches. One branch applies a 3 × 3 convolution for feature smoothing to obtain the output feature. The other branch applies nearest-neighbor upsampling to align its spatial size with the upper feature map, Feature 13. The two feature maps are added element-wise, followed by a 3 × 3 smoothing convolution to generate Fused Feature 13.
Deep features contain high-level semantic information of the image, while middle-level features retain texture, edge and contour details. This fusion strategy endows middle-level features with high-level semantic cues and facilitates the network in target category judgment.
The above operation is repeated for Feature 13. After channel adjustment via a 1 × 1 convolution and nearest-neighbor upsampling, Feature 13 is added element-wise to the upper feature map, Feature 10, to obtain Fused Feature 10. This process enriches Feature 10 with multi-scale edge and texture information and improves the ability of the network to identify underwater targets.
Similarly, Feature 10 is upsampled and fused with Feature 6. This fusion introduces structured semantic information into shallow features and expands their receptive field, enabling the network to effectively distinguish background noise from real targets.
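The top-down fusion described in the preceding paragraphs can be sketched on tiny single-channel maps. This is a simplified illustration, not the authors' module: the 1 × 1 channel-adjusting and 3 × 3 smoothing convolutions are omitted, all maps are assumed to share one channel, and each fused map is passed on to the next level as in a standard FPN, which is an assumption about the exact wiring.

```python
def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def add_maps(a, b):
    """Element-wise addition of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_fuse(deeper, shallower):
    """Upsample the deeper map to the shallower map's size, then add element-wise."""
    upsampled = upsample_nearest(deeper, len(shallower), len(shallower[0]))
    return add_maps(upsampled, shallower)


# Toy maps for one 32 x 32-pixel object: 1x1, 2x2, 2x2 and 4x4 cells.
feature18 = [[1.0]]
feature13 = [[0.1, 0.2], [0.3, 0.4]]
feature10 = [[0.0, 1.0], [1.0, 0.0]]
feature6 = [[0.0] * 4 for _ in range(4)]

fused13 = top_down_fuse(feature18, feature13)
fused10 = top_down_fuse(fused13, feature10)
fused6 = top_down_fuse(fused10, feature6)
print(fused6)
```

Because Feature 13 and Feature 10 already share a resolution in this toy setup, the upsampling step between them leaves the size unchanged and only the addition has an effect.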
3.2.3. Parallel Dilated Convolution Fused with CA Module
CA is a hybrid attention mechanism that models attention in both channel and spatial dimensions. It explicitly embeds spatial coordinate information into channel attention, and simultaneously captures channel dependencies and long-range spatial dependencies of feature information.
Standard channel attention methods such as SENet perform global average pooling on input feature maps of size C × H × W to produce C × 1 × 1 feature vectors. This operation completely discards spatial position information of features and only learns the importance of individual channels, failing to inform the network of the spatial locations of target objects. In contrast, the coordinate attention assigns weights for each pixel by aggregating global statistical information from its entire corresponding row and column. Such a globally statistics-driven weighting mechanism behaves like a low-pass filter, which effectively suppresses local and discontinuous high-frequency noise.
The network structure of the CA attention mechanism is illustrated in Figure 9a. Global average pooling is performed along the vertical and horizontal directions to encode directional feature information, which captures long-range dependencies in a single direction while retaining precise positional information in the other direction.
The pooled feature maps from the two directions are concatenated along the spatial dimension and passed through a dimensionality-reducing convolution. The result is split back into the two directions, and each part is passed through a dimensionality-restoring convolution and a Sigmoid activation to generate the coordinate attention weights. Finally, the generated attention weights are multiplied element-wise with the original input features to obtain the enhanced feature representation.
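The directional pooling and re-weighting can be illustrated on a single-channel map. The sketch below is deliberately simplified: the learned convolutions between pooling and Sigmoid are replaced by the identity, so it shows only how row and column statistics turn into per-pixel weights, not the trained module.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_attention(fmap):
    """Re-weight an H x W single-channel map by its row-wise and column-wise averages.

    The learned convolutions of the real module are omitted (identity assumed).
    """
    h, w = len(fmap), len(fmap[0])
    row_mean = [sum(row) / w for row in fmap]                                # pool along the width
    col_mean = [sum(fmap[r][c] for r in range(h)) / h for c in range(w)]    # pool along the height
    row_weight = [sigmoid(v) for v in row_mean]
    col_weight = [sigmoid(v) for v in col_mean]
    # Each pixel is scaled by the weight of its row and the weight of its column.
    return [[fmap[r][c] * row_weight[r] * col_weight[c] for c in range(w)]
            for r in range(h)]


# One strong response at row 1, column 2, surrounded by an empty background.
fmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0],
    [0.0, 0.0, 0.0],
]
for row in coordinate_attention(fmap):
    print([round(v, 3) for v in row])
```

Each weight depends on an entire row and an entire column, so an isolated response is scaled by statistics that also cover its empty neighbours, which matches the low-pass behaviour described above.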
The Region Proposal Network (RPN) is a core component of Faster R-CNN and is responsible for generating preliminary candidate regions on the shared feature map. The RPN Head undertakes feature extraction as well as the classification and bounding box regression tasks. The conventional RPN Head adopts a single 3 × 3 convolution to extract image features. However, underwater small targets occupy only a small proportion of pixels, and the fixed receptive field of a single-scale convolution makes it difficult to capture the contextual information around them. Consequently, small targets are easily disturbed by complex backgrounds, which eventually leads to missed detections.
To solve this issue, this paper designs a parallel dilated convolution module integrated with coordinate attention, whose structure is shown in Figure 9b. The features output from the backbone are first enhanced by the coordinate attention module, which assigns pixel-level weights to suppress background noise interference. Relying on the long-range modeling capability in the spatial dimension, the module achieves preliminary perception of potential target regions.
Subsequently, two parallel convolutions with different dilation rates extract features with different receptive fields at the same feature scale. The 3 × 3 dilated convolution with a dilation rate of 1 captures features of small targets and their adjacent backgrounds to preliminarily distinguish foreground targets from background clutter. The 3 × 3 dilated convolution with a dilation rate of 3 further expands the receptive field and captures long-range contextual information, which helps refine the separation between small targets and backgrounds and strengthens the perception of surrounding spatial information.
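The difference between the two branches can be quantified with the standard effective-kernel formula for dilated convolution, k + (k − 1)(d − 1). The sketch below also computes the padding that keeps both outputs the same size so they can be fused. That padding choice is an assumption, since the text does not state it.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel, dilation):
    """Padding that preserves the spatial size at stride 1 (odd kernels only)."""
    return dilation * (kernel - 1) // 2


for dilation in (1, 3):
    extent = effective_kernel(3, dilation)
    print(f"dilation {dilation}: covers {extent} x {extent}, padding {same_padding(3, dilation)}")
```

With a dilation rate of 3, the same nine weights span a 7 × 7 neighbourhood, so the second branch sees more context at no extra parameter cost.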
3.2.4. MSFDC-Net Network Structure
MSFDC-Net adopts the improved MobileNetV2 as the backbone feature extraction network. Through multi-scale feature fusion of middle-level and high-level features, the network achieves effective feature extraction for underwater small targets. The overall network structure is illustrated in Figure 10, and its workflow is described as follows.
The input image first passes through a 3 × 3 convolution to expand the channel dimension and extract shallow features such as corner points and texture details. The features are then fed into the inverted residual blocks to extract multi-scale information including edges, contours and global semantics. The fused shallow, middle and deep features are subsequently delivered to the detection head for target detection.
In the detection head, the coordinate attention mechanism assigns pixel-wise weights to the input feature maps. It focuses on key target information, suppresses background noise, highlights potential target regions, and completes the rough localization of small targets. Meanwhile, parallel dilated convolution is employed to extract image features and expand the receptive field, which fully captures spatial contextual information and assists the RPN in distinguishing foreground targets from complex backgrounds.
For the screened anchor boxes, non-maximum suppression (NMS) is adopted to eliminate redundant proposals. The remaining candidate boxes are mapped to the shared backbone feature maps, and ROI Pooling is used to unify the spatial size of different candidate regions. Finally, the regression branch and the classification branch jointly complete the category recognition and precise location regression of underwater targets.
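The NMS step mentioned above follows a well-known greedy procedure, sketched here in plain Python. The boxes, scores and IoU thresholds are hypothetical, and the paper does not state which threshold it uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping proposals and one separate proposal (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))
```

The threshold matters for densely packed targets: a stricter (lower) threshold removes more duplicates but can also suppress neighbouring objects that genuinely overlap.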
3.2.5. Object Detection Loss Function
The loss function of Faster R-CNN consists of two components: RPN loss and detection head loss. The overall loss is the weighted sum of these two parts, as shown in Equation (2):

$$L = L_{cls}^{RPN} + L_{reg}^{RPN} + L_{cls}^{ROI} + L_{reg}^{ROI} \quad (2)$$

where $L_{cls}^{RPN}$ and $L_{reg}^{RPN}$ are the classification loss and regression loss of the RPN network, which are used to distinguish foreground from background and generate preliminary candidate proposals. $L_{cls}^{ROI}$ and $L_{reg}^{ROI}$ denote the classification loss and regression loss of the ROI Head, which are responsible for completing the final category classification and refining the position of candidate boxes.
4. Experimental Results and Analysis
This paper verifies and analyzes the performance of the improved RBE-CycleGAN and MSFDC-Net from two dimensions: subjective visual evaluation and objective quantitative indicators.
4.1. Experimental Setup
The network models are trained on the Windows operating system using the PyTorch framework with CUDA 12.8. The CPU is an Intel Core i7-14700KF with 32 GB of RAM, and the graphics card is an NVIDIA GeForce RTX 5070 Ti with 16 GB of video memory.
4.2. Underwater Image Enhancement Results
The unpaired subset of the Enhancement of Underwater Visual Perception (EUVP) [36] dataset is selected as the training set, containing 3205 low-quality underwater images and 3140 high-quality underwater images. The datasets used for the downstream underwater object detection task serve as the test set, and the enhancement performance is validated from both subjective and objective perspectives. For objective evaluation, several unsupervised image enhancement metrics are employed, including the Underwater Image Quality Measure (UIQM), Underwater Color Image Quality Evaluation (UCIQE), and Natural Image Quality Evaluator (NIQE).
The Adam optimizer is adopted for model training, and the learning rate is dynamically adjusted with the training epochs to achieve stable model convergence. The parameter settings for the generator and discriminator are listed in Table 1. Specifically, the learning rate is set to 0.0002 during the first 100 training epochs, and then linearly decays to 0 in the subsequent 100 epochs. All other parameters use the default configuration.
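The schedule just described can be written as a small function. The numbers (0.0002, 100 constant epochs, 100 decay epochs) come from the text. The zero-based epoch index and the exact point at which the decay begins are assumptions about how the schedule is discretized.

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=100, decay_epochs=100):
    """Constant learning rate, followed by a linear decay to zero.

    `epoch` is zero-based; the discretization of the decay is an assumption.
    """
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(0.0, remaining / decay_epochs)


for epoch in (0, 99, 100, 150, 199, 200):
    print(epoch, f"{learning_rate(epoch):.6f}")
```

In a PyTorch training loop the same rule would typically be supplied as a multiplicative factor to a lambda-based scheduler, but the plain function is enough to show the shape of the schedule.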
The Underwater Robot Professional Contest (URPC) [37] and Detecting Underwater Objects (DUO) [38] datasets are adopted as the test set. Partial visual results of underwater image enhancement are shown in Figure 11.
The left column shows original underwater images covering typical degradation scenarios such as blue tone deviation, green tone deviation and blurriness, while the right column presents the corresponding enhanced results. As can be seen from Figure 11, the overall color distortion of enhanced underwater images is effectively corrected, which helps alleviate the background occlusion problem caused by color deviation.
For instance, in Group a and Group d, obvious improvements in color deviation can be observed. The images no longer exhibit a blue-green tint. In Group c, the enhanced image achieves improved clarity and contrast. The edge texture and detailed features of sea urchin targets become more distinguishable, providing strong support for accurate recognition of underwater small targets. In Group b, the original degraded image suffers from blurred starfish edges and similar color distribution between targets and background, which easily causes occlusion and missed detection. By contrast, the enhanced image exhibits clearer starfish contours and higher color contrast. It enables obvious separation between targets and background, and effectively improves the recall rate of underwater object detection.
The unsupervised image evaluation metrics of the URPC and DUO datasets are listed in Table 2.
UIQM is adopted to comprehensively reflect the color richness and clarity of underwater images, with a higher value corresponding to better overall visual quality. On the public URPC and DUO datasets, the images enhanced by RBE-CycleGAN obtain superior UIQM scores compared with the original images. This verifies that the proposed underwater image enhancement method can effectively enhance the expressive ability of target edge and texture features.
UCIQE mainly evaluates color saturation and tonal deviation, and acts as a core metric to quantify the correction performance of underwater color distortion. The results on two test datasets demonstrate that RBE-CycleGAN can effectively eliminate the inherent color deviation of raw underwater images and achieve balanced tonal restoration.
NIQE is a natural image quality evaluation indicator; a lower NIQE value indicates less image distortion. Experimental results show that the NIQE values of images processed by RBE-CycleGAN are significantly reduced. It is validated that the proposed method can greatly improve underwater image quality and alleviate visual distortion caused by complex underwater optical environment.
4.3. Underwater Target Detection Effect
The URPC dataset is selected for model training. It contains images of four categories of underwater targets (sea cucumber, sea urchin, starfish and scallop) captured in diverse underwater environments. The official bounding box annotations are provided in XML format and can be directly applied to the training of object detection networks.
The training set of URPC includes 5543 images and the validation set contains 1200 images. For model testing, the URPC and DUO datasets are adopted. The URPC test set consists of 800 images, while the DUO test set contains 1111 images.
To evaluate the detection performance of the proposed model, comparative experiments are conducted with mainstream algorithms including Mask R-CNN [39], SSD [40], RetinaNet [41], and the YOLO series.
Transfer learning is adopted in this work. MobileNetV2 pre-trained on the ImageNet dataset provides the initial weights of the backbone, and a freeze-then-unfreeze training strategy is applied: the backbone is frozen for the first 5 epochs and then unfrozen and trained for the next 200 epochs. The relevant trainable parameters are presented in Table 3.
Mean Average Precision (mAP) is employed to evaluate the detection accuracy of the model, which reflects the accuracy of target localization and classification. Mean Recall (mRecall) is used to measure the model's ability to detect targets and characterize its missed detection level. The detection results are shown in Table 4.
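The two metrics can be made concrete with a small sketch. The per-class counts and the precision-recall points below are invented for illustration and are not taken from Table 4. The all-point interpolated AP is one common protocol, and the text does not say which variant the authors use.

```python
def mean_recall(counts):
    """Average of per-class recall; counts maps class name -> (true positives, false negatives)."""
    recalls = [tp / (tp + fn) for tp, fn in counts.values() if tp + fn > 0]
    return sum(recalls) / len(recalls)


def average_precision(recalls, precisions):
    """Area under the precision-recall curve using a monotone precision envelope."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):        # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical detection counts for the four URPC categories.
counts = {
    "sea cucumber": (80, 20),
    "sea urchin": (90, 10),
    "starfish": (70, 30),
    "scallop": (60, 40),
}
print(round(mean_recall(counts), 3))
print(round(average_precision([0.5, 1.0], [1.0, 0.5]), 3))
```

mAP is then the mean of the per-class AP values, so a class with many missed small targets lowers both metrics, but mRecall isolates the missed-detection component.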
All comparative experiments share identical dataset partitioning for training and validation sets, with consistent input resolution, training epochs, optimizer and learning rate decay strategy. All models are trained via transfer learning.
As shown in Table 4, MSFDC-Net achieves the best mAP and mRecall among the compared methods on both the URPC and DUO datasets. This result supports the effectiveness of MSFDC-Net in enhancing the feature representation of underwater targets: the proposed improvements strengthen the model's ability to extract and perceive underwater object features and thereby boost detection performance.
The detection visualization results of several object detection algorithms on the URPC dataset are presented in Figure 12.
It can be observed from Figure 12 that the proposed MSFDC-Net detects and identifies more underwater targets with denser bounding boxes, which is consistent with the improved recall rate reflected in the quantitative experimental results.
To further verify the performance gain of the underwater image enhancement algorithm for the object detection task, comparative experiments are conducted in this paper. With the same detection model adopted, object detection is performed on original underwater images and enhanced images respectively.
The experimental results are presented in Table 5. Underwater image enhancement can effectively correct color deviation and enlarge the color difference between target regions and background regions, thereby improving the recall rate of object recognition. Meanwhile, the overall clarity of images is improved, which helps the detection model fully extract detailed features such as edges and textures of small objects and further boost detection accuracy. The comparative experiments demonstrate that applying underwater image enhancement as a preprocessing method can effectively facilitate the subsequent object detection task.
4.4. Ablation Experiment
To further verify the effectiveness of the improved strategies in this paper, ablation experiments are designed on the URPC dataset to individually validate the performance gain of each module for the object detection network. Four comparative schemes are set in the experiments:
- (1) Only introducing the multi-scale feature fusion strategy into the backbone network, denoted as FPN;
- (2) Only embedding the parallel dilated convolution and coordinate attention module into the RPN Head, denoted as RPN;
- (3) Only adopting the image enhancement preprocessing strategy, denoted as Enhanced;
- (4) Integrating image preprocessing, multi-scale feature fusion and parallel dilated convolution modules simultaneously, denoted as MSFDC-Net.
The results of each ablation experiment are listed in Table 6, where Base represents the baseline method of Faster R-CNN.
As can be seen from the experimental data in Table 6, the mAP and mRecall indicators of experimental groups (1), (2) and (3) are all superior to those of the baseline network. This indicates that all the proposed improvements can effectively enhance the model's ability to extract target features and optimize the performance of object recognition and localization.
Among them, the FPN experimental group provides multi-scale features for the detection head and effectively alleviates the pixel loss caused by repeated downsampling. It equips deep features with the edge information of mid-level features and enhances mid-level features with the semantic discriminative ability of deep features. Of the three single-module groups, it achieves the most prominent improvement in quantitative metrics: the mAP reaches 0.67, a relative improvement of 4.62% over the baseline model, and the mRecall reaches 0.74, a relative improvement of 5.71%.
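The relative improvements quoted above can be related to the underlying scores with a short calculation. The baseline values appear only in Table 6 and are not repeated in the text, so the sketch back-computes the baselines implied by the quoted figures. Because the reported scores are rounded to two decimals, these are approximations, and the helper names are illustrative.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


def implied_baseline(new, gain_percent):
    """Baseline score implied by a new score and its relative gain in percent."""
    return new / (1.0 + gain_percent / 100.0)


# Figures quoted in the text for the FPN ablation group.
print(round(implied_baseline(0.67, 4.62), 4))   # baseline mAP implied by the quoted gain
print(round(implied_baseline(0.74, 5.71), 4))   # baseline mRecall implied by the quoted gain
print(round(relative_gain(0.74, 0.70), 2))      # gain if the baseline mRecall were exactly 0.70
```

Reading the numbers this way makes clear that the percentages are relative to the baseline score rather than absolute differences in mAP or mRecall points.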
The RPN module expands the convolutional receptive field and enriches the contextual information of feature space. Meanwhile, the coordinate attention mechanism is adopted to weight and strengthen the feature maps, highlight key regions containing object information, and improve the accuracy of initial anchor box generation. The improved quality of anchor candidates reduces invalid screening of subsequent proposals and lowers target missing rates at the anchor generation stage.
The Enhanced module corrects the color deviation of underwater images, effectively alleviating missed detection caused by the fusion of targets and complex background. It also enhances detailed features such as image edges and textures, which facilitates object recognition and further improves the detection accuracy of the model.
5. Discussion
This work integrates the CASAB attention module into RBE-CycleGAN to strengthen the network's ability to focus on critical regions, thereby achieving targeted image enhancement. The performance gain of the improved model mainly stems from this embedded attention design. In the channel attention branch, global average pooling is applied to the feature maps to capture the average activation level of each channel, and global max pooling extracts the salient signals within individual channels. The two pooling outputs are then combined to generate adaptive weight coefficients for all channels. The spatial attention branch conducts max pooling, min pooling, mean pooling and sum pooling across all channels at each pixel position to calculate the corresponding pixel weight. The joint action of channel and spatial attention enables the network to accurately locate dominant feature channels and key object pixel regions. Consequently, the modified model delivers more focused and visually superior results in underwater image enhancement tasks.
This paper proposes MSFDC-Net as a novel network for underwater small object detection. The overall performance improvement of the model mainly relies on the specially designed multi-scale feature fusion module. Underwater small objects inherently occupy an extremely low proportion of image pixels. Repeated downsampling operations in conventional networks tend to severely discard critical feature details of tiny targets, which inevitably degrades detection accuracy. To address this issue, this work adopts middle-level features that retain edge contours and texture information of small targets for cross-layer feature fusion and optimization. The integrated and complete feature representations are then fed into the detection head for final inference, effectively compensating for the insufficient feature characterization of underwater small objects. Ablation experimental results fully demonstrate that the proposed multi-scale feature fusion strategy can markedly boost the detection capability of the network. Both mAP and mRecall, the two core evaluation metrics, achieve considerable improvements after adopting this optimization scheme.
This paper optimizes the way features are used in both the image enhancement network and the object detection network, and achieves a measurable improvement in overall performance. Nevertheless, the present study still has several limitations. Training the image enhancement network on the EUVP dataset effectively mitigates color distortion and improves the visual quality of underwater images. Even so, the scene coverage of EUVP is still limited for object detection tasks, leading to unsatisfactory generalization capability, especially under extreme underwater environmental conditions. Moreover, although the backbone of the object detection network adopts a multi-scale feature fusion strategy, it does not make targeted improvements to the convolution operation itself. The network is still constrained by an insufficient receptive field, which restricts further performance gains.
In follow-up research, we will explore datasets covering a wider range of underwater scenarios to further enhance the model’s environmental adaptability and generalization performance. Meanwhile, we will investigate feasible improvement schemes to expand the receptive field of convolution operations. This can enrich the feature information contained in each network layer and further strengthen the model’s recognition capability for underwater targets.
6. Conclusions
Aiming at the common problems of color deviation and blurriness in underwater small object detection tasks, RBE-CycleGAN is adopted to effectively alleviate color distortion, improve image clarity and the quality of edge and texture features. It lays a solid foundation for downstream object detection, and reduces missed detection and background interference caused by color distortion.
To address the challenges of small target pixel proportion and complex background interference, this paper improves the MobileNetV2 backbone of Faster R-CNN. A multi-scale feature fusion strategy is designed to fuse middle-level features containing edge and texture information of small targets, supplement object feature representation, and reduce the loss of small object information caused by downsampling.
Standard convolution has a limited receptive field in the region proposal network head. To solve this problem, we adopt the coordinate attention mechanism. It can assign adaptive weights to feature maps and restrain interference from background noise. This method also builds long-range spatial dependency for object information. Apart from that, the parallel dilated convolution strategy can further expand the spatial receptive field. It enriches spatial context information of targets, and boosts the model’s capability to recognize objects.
The original Faster R-CNN with the MobileNetV2 backbone has 82.37 million trainable parameters, with a weight file size of 628.64 MB and a peak GPU memory usage of 0.444 GB for single-image inference. In contrast, the improved MSFDC-Net has 20.21 million trainable parameters, a weight file size of 154.47 MB, and a peak GPU memory footprint of 0.213 GB during single-image inference. By optimizing the output channels of the backbone network, MSFDC-Net reduces the original 1280 output channels to 256 after feature fusion, which drastically cuts down the number of parameters in the RPN and ROI Head modules.
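The reported savings can be expressed as percentages, and the effect of the channel reduction can be illustrated with a parameter count for one 3 × 3 convolution. The assumption that the RPN Head contains a 3 × 3 convolution mapping C input channels to C output channels follows common Faster R-CNN implementations and is not stated in the text.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of trainable parameters in a single 2-D convolution layer."""
    return in_channels * out_channels * kernel * kernel + (out_channels if bias else 0)


# Figures reported in the text.
print(round(reduction_percent(82.37, 20.21), 2))     # trainable parameters (millions)
print(round(reduction_percent(628.64, 154.47), 2))   # weight file size (MB)
print(round(reduction_percent(0.444, 0.213), 2))     # peak GPU memory (GB)

# Assumed 3x3 convolution in the RPN Head with C -> C channels.
print(conv_params(1280, 1280), conv_params(256, 256))
```

Under this assumption a single convolution shrinks from about 14.7 million to about 0.59 million parameters, which illustrates why reducing the backbone output from 1280 to 256 channels has such a large effect on the heads.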
Experimental results show that RBE-CycleGAN improves underwater image quality and corrects the existing color deviation, and the ablation experiments further show that image enhancement contributes positively to object detection performance. The designed MSFDC-Net reduces the object information loss caused by continuous downsampling and enlarges the receptive field beyond that of standard convolution. The coordinate attention mechanism builds long-range spatial dependencies, supplements spatial location information for target regions, and helps distinguish foreground objects from background noise. Together, these designs boost the overall detection accuracy of the model. Quantitatively, RBE-CycleGAN improves UIQM, UCIQE and NIQE by 10.06%, 9.43% and 12.29%, respectively, on the URPC dataset, and by 12.44%, 9.62% and 15.69% on the DUO dataset. Meanwhile, the proposed MSFDC-Net achieves mAP and mRecall increments of 7.81% and 8.57% on URPC, and of 3.08% and 5.56% on DUO.
Although the above improvements effectively promote the detection performance, the proposed method still has certain limitations. The image enhancement network is trained based on unpaired datasets. In addition, complex underwater environment leads to diversified image degradation varying with water depth and water area. Since our model is only trained on the EUVP dataset under limited underwater degradation scenarios, the generalization of the enhancement module is restricted. In future work, we will construct richer training datasets tailored to underwater detection tasks to further improve the generalization capability of the overall model.