Figure 1.
The challenges of underwater object detection. (a) The underwater environment often features low contrast between objects and their backgrounds, leading to weak target perception. This lack of contrast makes it hard for detection models to distinguish weak targets, hindering accurate identification and localization. (b) The biodiversity of underwater organisms leads to significant scale differences: the bounding boxes of larger objects may be several times larger than those of smaller objects. This disparity presents substantial challenges for object detection.
Figure 2.
Overall architecture of SEANet. The whole detection model is divided into three parts: backbone, neck, and detection head. P2-P9 are the feature extraction layers of the network. D1, D2, and D3 refer to three detection heads. SE-FPN is the feature pyramid that we proposed.
Figure 3.
Demonstration of submodules: the architecture of our Multi-Scale Detail Amplification Module (MDAM) and its foundation, RFB [33]. MDAM includes five distinct branches, each equipped with different kernels, which enhances its ability to capture discriminative contexts at multiple scales. The five branches are denoted Br1 to Br5.
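To make the five-branch idea concrete, below is a minimal PyTorch sketch of a multi-branch context module. It is not the paper's MDAM implementation: only the use of five parallel branches with different kernel sizes (3/5/7 for Branches 2–4, following the best setting in Table 7) is taken from the text; the channel split, the pooling branch, and the 1×1 fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiBranchContext(nn.Module):
    """Hypothetical five-branch block with per-branch kernel sizes.

    Not the paper's MDAM; it only illustrates parallel branches with
    different receptive fields, fused by concatenation and a 1x1 conv.
    """

    def __init__(self, in_ch, out_ch, mid_kernels=(3, 5, 7)):
        super().__init__()
        mid = out_ch // 4
        # Branch 1: pointwise conv (smallest receptive field).
        self.br1 = nn.Conv2d(in_ch, mid, kernel_size=1)
        # Branches 2-4: increasing kernel sizes (3/5/7 as in Table 7's best row).
        self.br2 = nn.Conv2d(in_ch, mid, mid_kernels[0], padding=mid_kernels[0] // 2)
        self.br3 = nn.Conv2d(in_ch, mid, mid_kernels[1], padding=mid_kernels[1] // 2)
        self.br4 = nn.Conv2d(in_ch, mid, mid_kernels[2], padding=mid_kernels[2] // 2)
        # Branch 5: pooled global context, broadcast back to the feature map.
        self.br5 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, mid, 1))
        self.fuse = nn.Conv2d(mid * 5, out_ch, kernel_size=1)

    def forward(self, x):
        b5 = self.br5(x).expand(-1, -1, x.shape[2], x.shape[3])
        feats = [self.br1(x), self.br2(x), self.br3(x), self.br4(x), b5]
        return self.fuse(torch.cat(feats, dim=1))
```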
Figure 4.
Comparison of various feature pyramids. The input set denotes the features extracted at various scales, and the output set denotes the features generated by the constructed feature pyramid. (a) FPN introduces a top-down pathway; (b) BiFPN and AWBiFPN share the same fusion pathway, but AWBiFPN replaces the DWConv in BiFPN with an ordinary convolution; (c) SA-FPN proposes a scale-aware feature pyramid; (d) illustrates the feature fusion concept of our SE-FPN, where features from the current layer are concatenated with features from previous layers at different scales using a simple concatenation operation, without any weighting or complex fusion strategy.
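As a rough illustration of the difference between (d) and a weighted scheme such as BiFPN's fast normalized fusion, here is a hedged PyTorch sketch; it assumes the inputs have already been resized to a common spatial size and is not taken from the paper's code.

```python
import torch
import torch.nn as nn

def plain_concat_fusion(current, previous):
    """Figure 4d idea: concatenate current-layer features with previous-layer
    features (already resized to the same spatial size), with no learned weights."""
    return torch.cat([current, *previous], dim=1)

class WeightedSumFusion(nn.Module):
    """Simplified BiFPN-style fast normalized fusion: one learned weight per input."""

    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):  # feats: list of same-shape tensors
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * fi for wi, fi in zip(w, feats))
```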
Figure 5.
The detailed structure of SE-FPN. F represents the Contrast Enhancement Module (CEM). Each convolution block combines a convolution layer, a batch normalization layer, and the SiLU activation function. P4, P6, and P9 are the outputs of the 4th, 6th, and 9th feature extraction layers. D1, D2, and D3 are the three outputs produced by the neck network and are finally sent to the detection head for prediction.
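A minimal sketch of the convolution block described in the caption (convolution + batch normalization + SiLU); the helper name `conv_bn_silu` and the default kernel size are assumptions, not the paper's code.

```python
import torch.nn as nn

def conv_bn_silu(in_ch, out_ch, k=3, s=1):
    """Convolution + BatchNorm + SiLU block, as described in the Figure 5 caption."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )
```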
Figure 6.
Illustration of FBC. ⊗: matrix multiplication; ⊖: vector difference; ⊙: element-wise product.
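For readers unfamiliar with the symbols, a small PyTorch example of the three operators (notation only; the tensors and shapes here are arbitrary and do not reflect FBC's internal structure):

```python
import torch

x = torch.randn(4, 8)
y = torch.randn(8, 4)

mm   = x @ y        # ⊗: matrix multiplication -> shape (4, 4)
diff = x - y.t()    # ⊖: vector difference (element-wise subtraction)
had  = x * y.t()    # ⊙: element-wise (Hadamard) product
```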
Figure 7.
(a) presents the AP growth curves of several object detection algorithms during training on the RUOD validation set. (b) shows the P-R curves of SEANet on the RUOD and URPC2021 validation sets, illustrating SEANet's precision-recall performance for each category on both datasets.
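For reference, the standard definitions underlying the P-R curve and AP (general detection metrics, not specific to SEANet):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\, dR \quad \text{(area under the P-R curve)}
```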
Figure 8.
Visual comparison of heat maps. Column (a) presents the input image, column (b) shows the heat map of the baseline, and column (c) shows the heat map of our SEANet.
Figure 9.
Comparison of partial results from the RUOD dataset. The first row shows the ground truth labels, the second row presents the baseline detection results, and the third row displays the detection results from our SEANet.
Figure 10.
Comparison of partial results from the URPC2021 dataset. The first row shows the ground truth labels, the second row presents the baseline detection results, and the third row displays the detection results from our SEANet. The purple circles mark detections that are incorrectly identified as positive (false positives).
Figure 11.
Qualitative comparison of our method with others. Each annotation box color corresponds to a specific organism, and white circles mark erroneous detections. Missed detections are not marked with any special symbol in the image.
Figure 12.
Visualization of an image under different levels of Gaussian noise and motion blur. The variable S denotes the severity level of the disturbance, with higher values indicating more severe degradation.
Figure 13.
Performance comparison under different levels of Gaussian noise and motion blur. Subfigures (a–d) show the robustness of our method and the baseline under five levels of Gaussian noise, while (e–h) depict the robustness under five levels of motion blur. Specifically, (a,e) illustrate AP trend curves under Gaussian noise and motion blur, respectively. Subfigures (b–d) and (f–h) present bar charts comparing AP, AP50, and AP75 between our method and the baseline across five degradation levels (S = 1 to S = 5), corresponding to Gaussian noise standard deviations of 10, 20, 30, 40, and 50 and motion blur kernel sizes of 5, 9, 13, 17, and 21, respectively.
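A hedged sketch of how such degradations can be generated with NumPy/OpenCV; the noise standard deviations and kernel sizes follow the caption, but the horizontal blur direction and the exact implementation are assumptions rather than the paper's code.

```python
import numpy as np
import cv2

def gaussian_noise(img, sigma):
    """Add zero-mean Gaussian noise with standard deviation `sigma` (10-50)."""
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def motion_blur(img, ksize):
    """Apply horizontal motion blur with kernel size `ksize` (5-21); the
    direction is an assumption."""
    kernel = np.zeros((ksize, ksize), dtype=np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)

# Severity levels S = 1..5 as in Figure 13 (values taken from the caption).
sigmas = [10, 20, 30, 40, 50]
ksizes = [5, 9, 13, 17, 21]
```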
Table 1.
The parameter settings used in the experiments.
| Type | Setting | Type | Setting |
|---|---|---|---|
| Image size | 640 | Momentum | 0.937 |
| Batch size | 16 | Weight decay | 0.0005 |
| Optimizer | SGD | Initial learning rate | 0.01 |
| Epochs | 300 | Seed | 0 |
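For reproducibility, a minimal sketch of how these settings map onto an Ultralytics-style training call; the framework choice, model config, and dataset YAML path are assumptions, while the hyperparameter values come from Table 1.

```python
from ultralytics import YOLO

# Hypothetical training call; the Ultralytics API and file names are assumptions,
# but the hyperparameter values are those listed in Table 1.
model = YOLO("yolov8s.yaml")
model.train(
    data="ruod.yaml",        # dataset config (hypothetical path)
    imgsz=640,               # image size
    batch=16,                # batch size
    optimizer="SGD",
    epochs=300,
    seed=0,
    momentum=0.937,
    weight_decay=0.0005,
    lr0=0.01,                # initial learning rate
)
```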
Table 2.
Performance comparison between SEANet and other object detection methods on the RUOD dataset. The highest performance is indicated in bold, and the second highest is indicated with underlining.
| Methods | Year | AP ↑ | AP50 ↑ | AP75 ↑ | R ↑ | P ↑ | F1 ↑ |
|---|---|---|---|---|---|---|---|
| Generic Object Detector: | | | | | | | |
| SSD [41] | 2016 | 43.4 | 73.4 | 45.4 | 70.3 | 71.2 | 70.7 |
| Faster-RCNN [42] | 2016 | 52.8 | 81.8 | 57.5 | 78.7 | 79.4 | 79.0 |
| Cascade-RCNN [43] | 2018 | 54.8 | 81.1 | 59.7 | 78.5 | 79.2 | 78.8 |
| FreeAnchor [44] | 2019 | 55.0 | 82.4 | 59.8 | 79.2 | 80.1 | 79.6 |
| NAS-FPN [45] | 2019 | 51.4 | 78.9 | 55.2 | 75.7 | 76.6 | 76.1 |
| Libra-RCNN [28] | 2019 | 54.8 | 82.8 | 60.5 | 79.8 | 80.6 | 80.2 |
| RepPoints [46] | 2019 | 55.4 | 83.7 | 60.4 | 80.7 | 81.6 | 81.1 |
| Guided-Anchoring [47] | 2019 | 56.7 | 84.2 | 62.0 | 81.2 | 81.9 | 81.5 |
| ATSS [48] | 2020 | 52.9 | 80.3 | 56.9 | 77.5 | 78.1 | 77.8 |
| Dynamic-RCNN [49] | 2020 | 54.4 | 81.3 | 60.3 | 78.2 | 79.1 | 78.6 |
| FoveaBox [50] | 2020 | 52.1 | 81.4 | 56.0 | 78.2 | 79.0 | 78.6 |
| YOLOF [51] | 2021 | 50.1 | 80.0 | 53.8 | 77.0 | 77.9 | 77.4 |
| DetectoRS [38] | 2021 | 57.8 | 83.6 | 63.6 | 80.1 | 81.0 | 80.5 |
| YOLOv7 [39] | 2022 | 64.6 | 88.0 | 71.2 | 82.9 | 86.2 | 84.5 |
| YOLOv8s 1 | 2023 | 63.6 | 86.3 | 69.9 | 80.2 | 86.5 | 83.2 |
| YOLOv10s [40] | 2024 | 62.8 | 86.2 | 68.9 | 79.9 | 87.1 | 83.3 |
| Underwater Object Detector: | | | | | | | |
| RFTM [52] | 2023 | 53.3 | 80.2 | 57.7 | - | - | - |
| AMSP-UOD [22] | 2024 | 65.2 | 86.1 | 72.5 | 79.4 | 86.6 | 82.8 |
| DJL-Net [53] | 2024 | 57.5 | 83.7 | 62.5 | - | - | - |
| Dynamic YOLO [21] | 2024 | 63.7 | 87.0 | 69.8 | 81.1 | 86.1 | 83.5 |
| GCC-Net [8] | 2024 | 59.4 | 85.6 | 65.6 | 81.2 | 86.0 | 83.5 |
| SEANet (ours) | - | 67.0 | 88.4 | 73.9 | 82.0 | 87.6 | 84.7 |
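F1 is the harmonic mean of precision and recall; for example, plugging SEANet's P and R from the last row into the standard formula reproduces the reported value:

```latex
F1 = \frac{2PR}{P + R}
   = \frac{2 \times 87.6 \times 82.0}{87.6 + 82.0}
   \approx 84.7
```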
Table 3.
Performance comparison between SEANet and other object detection methods on the URPC2021 dataset. Ho, Ec, St, and Sc denote the per-category AP50 scores.
| Methods | Year | AP ↑ | AP50 ↑ | AP75 ↑ | Ho ↑ | Ec ↑ | St ↑ | Sc ↑ |
|---|---|---|---|---|---|---|---|---|
| FoveaBox [50] | 2020 | 45.6 | 81.7 | 45.9 | 74.9 | 90.7 | 87.8 | 73.6 |
| Dynamic-RCNN [49] | 2020 | 45.4 | 78.8 | 47.6 | 71.7 | 87.8 | 85.8 | 70.0 |
| Double-Head R-CNN [54] | 2020 | 45.8 | 81.0 | 47.3 | 74.1 | 90.0 | 87.2 | 72.5 |
| DetectoRS [38] | 2021 | 46.2 | 80.4 | 49.1 | 73.5 | 89.0 | 86.3 | 72.6 |
| TOOD [55] | 2021 | 47.8 | 82.3 | 51.0 | 76.4 | 88.3 | 88.4 | 76.1 |
| YOLOX [56] | 2021 | 43.8 | 80.2 | 43.3 | 69.3 | 90.0 | 86.3 | 75.0 |
| YOLOv7 [39] | 2022 | 49.7 | 85.2 | 53.1 | 78.4 | 92.3 | 90.1 | 79.9 |
| YOLOv8s | 2023 | 50.9 | 84.4 | 55.7 | 76.6 | 90.8 | 89.8 | 80.6 |
| AMSP-UOD [22] | 2024 | 49.8 | 82.8 | 54.6 | 72.7 | 90.3 | 87.4 | 80.8 |
| YOLOv10m [40] | 2024 | 51.2 | 84.8 | 56.5 | 76.3 | 91.2 | 89.7 | 81.8 |
| Dynamic YOLO [21] | 2024 | 52.7 | 85.5 | 59.8 | 77.5 | 92.0 | 90.2 | 82.1 |
| GCC-Net [8] | 2024 | 49.4 | 83.8 | 53.2 | 78.0 | 89.0 | 89.5 | 78.6 |
| SEANet (ours) | - | 53.0 | 85.5 | 60.3 | 77.8 | 91.5 | 90.3 | 82.3 |
Table 4.
Performance comparison between SEANet and other detection methods on the DUO dataset. Ho, Ec, Sc, and St denote the per-category AP scores; Params is the parameter count in millions (M).
| Methods | Params (M) | AP ↑ | AP50 ↑ | AP75 ↑ | Ho ↑ | Ec ↑ | Sc ↑ | St ↑ |
|---|---|---|---|---|---|---|---|---|
| Generic Object Detector: | | | | | | | | |
| Faster R-CNN [42] | 41.17 | 61.3 | 81.9 | 69.5 | 61.4 | 70.4 | 41.9 | 71.4 |
| Cascade R-CNN [43] | 68.94 | 61.2 | 82.1 | 69.2 | 61.9 | 69.0 | 41.9 | 72.0 |
| DetectoRS [38] | 123.23 | 64.8 | 83.5 | 72.4 | 65.8 | 73.5 | 45.7 | 74.3 |
| GFL [57] | 32.04 | 65.5 | 83.7 | 71.9 | 64.3 | 74.2 | 47.5 | 75.9 |
| YOLOv7 [39] | 37.25 | 66.3 | 85.8 | 73.9 | 66.3 | 73.7 | 50.8 | 74.5 |
| YOLO11m [58] | 20.1 | 71.3 | 86.8 | 78.4 | 70.7 | 77.9 | 56.8 | 79.7 |
| Underwater Object Detector: | | | | | | | | |
| RoIMix [13] | 68.94 | 61.9 | 81.3 | 69.9 | 63.0 | 70.7 | 41.7 | 72.4 |
| Boosting R-CNN [14] | 45.95 | 63.5 | 78.5 | 71.1 | 63.8 | 69.0 | 46.8 | 74.5 |
| ERL-Net [59] | 218.83 | 64.9 | 82.4 | 73.2 | 67.2 | 71.0 | 46.5 | 74.8 |
| GCC-Net [8] | 38.31 | 69.1 | 87.8 | 76.3 | 68.2 | 75.2 | 56.3 | 76.7 |
| RFTM [52] | 75.58 | 60.1 | 79.4 | 68.1 | - | - | - | - |
| AMSP-UOD [22] | 10.36 | 68.5 | 84.8 | 76.7 | 65.6 | 78.0 | 53.3 | 77.3 |
| DJL-Net [53] | 58.48 | 65.6 | 84.2 | 73.0 | - | - | - | - |
| SEANet (ours) | 24.9 | 71.5 | 87.8 | 79.1 | 71.4 | 77.6 | 57.8 | 79.0 |
Table 5.
Ablation study on the effectiveness of each module on the RUOD dataset.
| Methods | AP ↑ | AP50 ↑ | AP75 ↑ | R ↑ | P ↑ | F1 ↑ |
|---|---|---|---|---|---|---|
| base | 65.2 | 87.3 | 71.7 | 80.2 | 86.4 | 83.2 |
| base + MDAM | 65.8 | 87.6 | 72.2 | 80.8 | 86.7 | 83.6 |
| base + SE-FPN (all) | 66.1 | 87.7 | 72.7 | 80.8 | 87.3 | 83.9 |
| base + MDAM + SE-FPN (w/o SCAM) | 66.7 | 88.0 | 73.3 | 81.6 | 87.4 | 84.4 |
| base + MDAM + SE-FPN (w/o CEM) | 66.4 | 88.1 | 73.0 | 81.0 | 87.5 | 84.1 |
| base + MDAM + SE-FPN (all) | 67.0 | 88.4 | 73.9 | 82.0 | 87.6 | 84.7 |
| base + MDAM + BiFPN | 65.8 | 87.8 | 72.5 | 81.0 | 86.8 | 83.8 |
Table 6.
Ablation study on the effectiveness of each module on the DUO dataset.
| Methods | AP ↑ | AP50 ↑ | AP75 ↑ | Ho ↑ | Ec ↑ | Sc ↑ | St ↑ |
|---|---|---|---|---|---|---|---|
| base | 70.0 | 87.6 | 77.5 | 69.5 | 77.0 | 55.5 | 78.1 |
| base + MDAM | 70.9 | 87.9 | 78.6 | 71.1 | 77.5 | 56.4 | 78.8 |
| base + SE-FPN (all) | 71.0 | 87.5 | 78.7 | 70.9 | 77.4 | 56.7 | 78.9 |
| base + MDAM + SE-FPN (w/o SCAM) | 71.1 | 87.5 | 78.8 | 71.0 | 77.5 | 57.4 | 78.5 |
| base + MDAM + SE-FPN (all) | 71.5 | 87.8 | 79.1 | 71.4 | 77.6 | 57.8 | 79.0 |
Table 7.
Ablation study on the impact of different kernel-size settings (Branches 2–4) in MDAM.
| Branch 2 | Branch 3 | Branch 4 | AP ↑ | AP50 ↑ | AP75 ↑ | R ↑ | P ↑ | F1 ↑ |
|---|---|---|---|---|---|---|---|---|
| 3 | 7 | 9 | 67.0 | 88.3 | 74.0 | 82.5 | 86.4 | 84.4 |
| 3 | 5 | 7 | 67.0 | 88.4 | 73.9 | 82.0 | 87.6 | 84.7 |
| 3 | 5 | 9 | 66.6 | 88.1 | 73.1 | 82.0 | 87.1 | 84.5 |
| 5 | 7 | 9 | 66.7 | 87.9 | 73.5 | 81.5 | 87.3 | 84.3 |