Article

F3M: A Frequency-Domain Feature Fusion Module for Robust Underwater Object Detection

1 Yazhou Bay Innovation Institute, Hainan Tropical Ocean University, Sanya 572025, China
2 School of Computer Science and Technology, Hainan Tropical Ocean University, Sanya 572022, China
3 School of Information Science and Technology, Hainan Normal University, Haikou 571127, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(1), 20; https://doi.org/10.3390/jmse14010020
Submission received: 4 December 2025 / Revised: 16 December 2025 / Accepted: 17 December 2025 / Published: 22 December 2025
(This article belongs to the Section Ocean Engineering)

Abstract

In this study, we propose the Frequency-domain Feature Fusion Module (F3M) to address the challenges of underwater object detection, where optical degradation—particularly high-frequency attenuation and low-frequency color distortion—significantly compromises performance. We critically re-evaluate the need for strict invertibility in detection-oriented frequency modeling. Traditional wavelet-based methods incur high computational redundancy to maintain signal reconstruction, whereas F3M introduces a lightweight “Separate–Project–Fuse” paradigm. This mechanism decouples low-frequency illumination artifacts from high-frequency structural cues via spatial approximation, enabling the recovery of fine-scale details like coral textures and debris boundaries without the overhead of channel expansion. We validate F3M’s versatility by integrating it into both Convolutional Neural Networks (YOLO) and Transformer-based detectors (RT-DETR). Evaluations on the SCoralDet dataset show consistent improvements: F3M enhances the lightweight YOLO11n by 3.5% mAP50 and increases RT-DETR-n’s localization accuracy (mAP50–95) from 0.514 to 0.532. Additionally, cross-domain validation on the deep-sea TrashCan-Instance dataset shows F3M achieving comparable accuracy to the larger YOLOv8n while requiring 13% fewer parameters and 20% fewer GFLOPs. This study confirms that frequency-domain modulation provides an efficient and widely applicable enhancement for real-time underwater perception.

1. Introduction

1.1. Research Background and Significance

Coral reefs constitute one of the most biologically diverse ecosystems on Earth, harboring at least a quarter of all marine species despite occupying less than 0.1% of the ocean floor [1,2]. These ecosystems deliver immense ecological and economic value, ranging from fisheries support and tourism to critical coastal defense. Functioning as natural breakwaters, reefs dissipate wave energy by an average of ~97%, thereby significantly mitigating coastal erosion and storm surge impacts [3]. The ecosystem services provided by coral reefs—spanning food provision, livelihood support, and coastal protection—are estimated to yield global values in the tens of billions of dollars [4]. Consequently, over 500 million people rely on these systems for food security, economic income, and shoreline defense [2].
Despite their critical importance, coral reefs are facing a precipitous global decline driven by the nexus of local anthropogenic stressors and climate change. Widespread bleaching events (e.g., 1998 and 2015–2016) have resulted in the loss of over 15% of reef-building corals worldwide [5], while reef health continues to deteriorate due to rising ocean temperatures and acidification [6,7]. As heat stress becomes more frequent, reef assemblages are undergoing transformation, threatening the persistence of coral-dominated systems. Long-term longitudinal analyses indicate sustained regional regression; for instance, Indo–Pacific hard coral cover declined by approximately 1% annually even prior to recent mass bleaching events [8]. Given this urgent biodiversity crisis and economic imperative, efficient reef monitoring and conservation management are essential, including large-scale coral health assessment, long-term ecological surveillance, and data-driven decision-making to prioritize protection and restoration actions. In this context, automated coral detection and mapping offer a scalable solution, enabling large-scale and repeatable surveys of reef conditions that support conservation planning and management decisions [9,10].

1.2. Challenges of Underwater Object Detection

Automated in situ coral detection is impeded by the complex optical and environmental conditions inherent to the underwater domain [11]. Optical imagery suffers from severe absorption, scattering, and variable illumination, resulting in low-contrast, hazy images with significant color distortion. Detection is further complicated by dynamic background clutter; coral colonies often appear in dense, overlapping benthic arrangements, while transient noise from moving marine life (e.g., fish, algae) and suspended particulates (marine “snow”) introduces numerous false targets [12]. (See Figure 1 for an illustration of typical underwater degradation, where dense floating particulates, known as “marine snow,” reduce visibility and create numerous false targets. Additionally, low contrast and a blue–green color cast further degrade detection performance).
For instance, in turbid waters, small suspended particles and air bubbles are frequently misclassified as coral features [13]. A further challenge is domain shift: underwater imagery varies substantially across scenes in terms of water type, turbidity, illumination, and other environmental conditions, so models trained on specific datasets often generalize poorly to broader underwater scenarios [14]. Underwater environments exhibit substantial variability in visibility and color characteristics, posing challenges for vision-based analysis and motivating the development of robust models [13,14,15]. This issue is exacerbated by the scarcity of large-scale, high-quality annotated coral datasets, which predisposes models to overfitting. Additionally, class imbalance—where certain coral species are overrepresented while others are rare—can severely bias model predictions [16].
Object scale presents another significant hurdle: juvenile corals and recruits, often occupying only a few pixels, are prone to being overlooked by detectors optimized for larger targets [17]. Similarly, occlusions by other reef organisms (e.g., starfish, macroalgae) frequently lead to missed detections [18]. Conversely, monotonous benthic textures, such as sand or algae patches, often trigger false positives by mimicking coral micro-textures. (See Figure 2 for examples of monotonous benthic textures such as sand, rubble, and algae patches that often trigger false positives in coral detection. These uniform and repetitive patterns resemble coral micro-textures, causing detectors to misclassify these non-coral regions).
Finally, coral-reef monitoring increasingly relies on AI-driven analysis of imagery acquired from autonomous underwater vehicles (AUVs), remotely operated vehicles (ROVs), and diver-operated cameras [19]. In practice, many of these platforms are battery-powered embedded systems with restricted processing and memory budgets. High-precision architectures, such as two-stage CNNs (e.g., Faster R-CNN), are typically too computationally intensive for embedded applications, resulting in suboptimal real-time performance [20]. Furthermore, standard data augmentation techniques (e.g., geometric transformations, brightness shifts) are often insufficient to encompass the full diversity of underwater scenes, meaning models trained solely on augmented source data may fail in unseen target domains [21]. In summary, robust coral detection algorithms must simultaneously address image degradation, domain variability, class imbalance, and the hardware constraints inherent to underwater exploration.

1.3. Progress in Modern Detection Methods

The evolution of underwater object detection has advanced in tandem with strides in underwater imaging and artificial intelligence. Historically, methodologies progressed from heuristic, handcrafted feature extraction to sophisticated deep learning architectures. Early coral-classification pipelines relied heavily on manually designed texture, color, and geometric descriptors combined with conventional classifiers, but large within-class variations, complex inter-class boundaries, and inconsistent image clarity in underwater scenes made such handcrafted approaches difficult to generalize robustly [22]. The advent of deep learning precipitated a paradigm shift in coral monitoring. Convolutional Neural Networks (CNNs) and related deep architectures demonstrated the capability to automatically learn hierarchical representations directly from data, thereby substantially enhancing detection and classification performance. For instance, CNN-based models have been successfully applied to automated analysis of marine biota from ROV imagery in the Great Barrier Reef, achieving strong accuracy for deep-sea benthic and coral-associated taxa and enabling more scalable reef monitoring workflows [23].
In the realm of generic object detection, the YOLO series of one-stage detectors has gained prominence for its superior balance of real-time throughput and efficiency [24]. To adapt this general-purpose framework to the aquatic domain, researchers have implemented targeted architectural optimizations. Lu et al. proposed SCoralDet, a YOLO-based soft-coral detector that introduces a Multi-Path Fusion Block to enhance robustness to uneven illumination and blurring, together with an Adaptive Power Transformation label assignment strategy that improves anchor alignment, thereby outperforming YOLOv5/YOLOv8 baselines on the Soft-Coral dataset [25]. More recent YOLO-based underwater detectors integrate attention mechanisms and lightweight convolutions to better handle small objects under resource constraints. For example, a YOLOv8-based lightweight detector enhanced by attention mechanisms, GSConv, and WIoU improves the detection of overlapping and small underwater targets while reducing model parameters and computational complexity [26]. Similarly, Zhao et al. developed YOLOv7-CHS, which replaces the ELAN backbone with a high-order spatial interaction module and incorporates contextual and attention modules in the detection head; this design markedly cuts FLOPs while preserving or improving mAP, making it well suited for real-time underwater detection on resource-limited platforms [27].
Complementing CNN-based approaches, anchor-free paradigms and Transformer architectures have increasingly been explored for underwater object detection. Unlike traditional detectors that rely on predefined anchor boxes, anchor-free methods (e.g., FCOS, CenterNet) formulate detection as dense prediction from spatial locations by directly regressing object extents and categories, thereby removing the need for anchor design and hyperparameter tuning [28,29]. Such designs have demonstrated strong performance on benchmarks with large-scale variation and dense targets, making them attractive for coral scenes where colonies and recruits span a wide range of object sizes. Concurrently, Transformer-based detectors leveraging self-attention have emerged as powerful alternatives. Models such as DETR capture global context and long-range dependencies between objects and background [30], which is particularly appealing for reef imagery characterized by heavy occlusion and large size disparities. Notably, DETR-style models eschew post-processing steps such as Non-Maximum Suppression (NMS) by adopting set-based prediction with bipartite matching, thereby avoiding redundant candidate generation [30]. Advanced variants like Deformable DETR introduce deformable attention to focus on sparse sets of sampling points across multi-scale feature maps, achieving more accurate localization of small objects in cluttered scenes [31]. Building on these advances, recent real-time Transformer detectors such as RT-DETR further close the gap between accuracy and efficiency [32], creating a suitable backbone for specialized modules that explicitly address underwater degradations and coral-specific characteristics, as proposed in this work.

1.4. Motivation for Frequency-Domain Modeling in Underwater Detection

Underwater image degradation inherently manifests as frequency-dependent anomalies: high-frequency (HF) components, representing edges and fine textures, are severely attenuated by scattering, while low-frequency (LF) components are dominated by background illumination and color bias due to absorption. This physical characteristic suggests that object detectors should process HF and LF information through distinctive pathways, rather than relying on a single, undifferentiated feature stream. While prior literature supports this perspective, existing methods typically employ computationally intensive transforms or rigid decomposition schemes that lack detection-oriented adaptability.
For instance, Octave Convolution factorizes feature maps into “slow” (LF) and “fast” (HF) channel groups to reduce spatial redundancy and improve efficiency, but the HF/LF split ratio is architecturally prescribed rather than content-adaptive [33]. FcaNet reformulates channel attention using fixed discrete cosine transform (DCT) bases, injecting frequency priors through a multi-spectral channel attention framework, yet it does not explicitly enforce that HF components specialize in edge semantics while LF components model global style or illumination [34]. From a domain adaptation perspective, FDA swaps low-frequency amplitude spectra between source and target images via a global Fourier transform as a pre-processing step, which aligns style but prevents the detector itself from learning HF/LF cooperation within its feature hierarchy [35]. Similarly, Fast Fourier Convolution (FFC) introduces non-local receptive fields by manipulating features in the Fourier domain, but it does not explicitly impose clear “HF ≈ edges/LF ≈ illumination” semantics during detector training [36]. Wavelet-based designs such as WaveCNet and Adaptive Wavelet Pooling exploit frequency factorization to improve robustness and multi-scale representation [37,38], yet they typically rely on fixed wavelet filter banks that are not explicitly optimized for the discriminative demands of object detection.
Moreover, strict frequency decomposition often incurs prohibitive computational costs. For instance, the Wavelet Feature Upgrade (WFU) module [39] utilizes discrete wavelet transforms (DWT) to decompose features into four sub-bands. Although WFU demonstrates strong capability in recovering fine details for super-resolution tasks, its reliance on channel expansion (increasing channel dimensions by 4×) and complex inverse transformations results in significant GFLOPs and memory consumption. This computational burden makes such rigorous wavelet-based methods ill-suited for real-time detection on resource-constrained underwater robots, highlighting the need for a more efficient ‘spatial approximation’ strategy.
Consequently, a critical gap remains: how to implement efficient HF/LF separation and, more importantly, adaptive fusion inside modern detectors—ensuring that HF features recover structural details while LF features calibrate environmental context. This motivates the design of our Frequency-domain Feature Fusion Module (F3M), a plug-and-play unit that integrates lightweight frequency separation with learnable gated interaction, trained end-to-end under standard detection supervision.

2. Frequency-Domain Feature Fusion Module (F3M)

2.1. F3M Module: Design and Architecture

The proposed Frequency-domain Feature Fusion Module (F3M) follows a “Separate–Project–Fuse” paradigm, which is extended with an optional spatial attention mechanism.
As illustrated in Figure 3, the module consists of two cascaded stages. Stage 1, the Frequency-Fused Feature Module, performs learnable high/low-frequency decomposition, projection, and gated fusion to capture frequency-specific cues. Stage 2, the Spatial Attention Module, is an optional extension (forming F3MWithSA) that applies CBAM-style spatial attention to suppress background noise and highlight salient coral structures.
Formally, given an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, F3M explicitly decomposes the signal into low-frequency (LF) and high-frequency (HF) components, projects them into a compressed latent space, and adaptively fuses them back into the backbone stream via a gated residual connection.

2.1.1. Separate (Frequency Decomposition)

As depicted in Figure 4, the decomposition stage aims to isolate illumination cues from structural details. We employ a fixed 3 × 3 depthwise average pooling filter as a low-pass operator. This design relies on a common signal-processing assumption in vision tasks: spatially smooth variations mainly correspond to illumination and background components, whereas rapid local changes encode structural and edge-related information. Under this assumption, a local averaging operator effectively captures low-frequency illumination cues, while the residual emphasizes complementary high-frequency structural details. The low-frequency component $X_{lf}$ and the complementary high-frequency component $X_{hf}$ are computed as:
$$X_{lf} = \mathrm{AvgPool}_{3\times 3}(X), \qquad X_{hf} = X - X_{lf}$$
Concretely, the separation is implemented per channel using depthwise average pooling with a 3 × 3 kernel and stride 1, so that the spatial resolution of $X_{lf}$ matches that of $X$. The high-frequency content is characterized by the residual magnitude $|X_{hf}|$, where $X_{hf} = X - X_{lf}$; larger residuals correspond to stronger local spatial variations (e.g., edges and fine textures).
Here, $X_{lf}$ primarily preserves smooth variations such as background illumination and color cast, while the residual $X_{hf}$ captures high-frequency signals, including texture edges and fine coral structures, which are critical for discriminative object detection.
In Figure 5, the high-frequency response is visualized by overlaying a normalized version of the high-pass component $X_{hf}$ onto the input image. This highlights texture-rich coral boundaries and fine-scale morphological structures, demonstrating the module’s ability to emphasize key features that are essential for accurate detection.
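To make the decomposition concrete, the following minimal PyTorch sketch implements the low-pass/residual split described above (fixed 3 × 3 averaging, stride 1, resolution-preserving padding); the function name and tensor shapes are illustrative and are not taken from the released implementation.

```python
import torch
import torch.nn as nn

def separate(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a feature map into low- and high-frequency parts.

    Average pooling with kernel 3, stride 1, and padding 1 acts per channel
    as a fixed low-pass filter; the residual keeps edges and fine textures.
    """
    low_pass = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
    x_lf = low_pass(x)      # smooth illumination / color-cast component
    x_hf = x - x_lf         # complementary high-frequency residual
    return x_lf, x_hf

# A batch of two 16-channel feature maps at 80 x 80 resolution.
x = torch.randn(2, 16, 80, 80)
x_lf, x_hf = separate(x)
assert x_lf.shape == x_hf.shape == x.shape   # spatial resolution is preserved
```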

2.1.2. Project (Adaptive Feature Projection)

To facilitate efficient interaction between frequency bands, both X l f and X h f are projected into a lower-dimensional space. This is achieved via learnable pointwise convolutions:
$$\tilde{X}_{lf} = P_{lf}(X_{lf}), \qquad \tilde{X}_{hf} = P_{hf}(X_{hf})$$
where $P_{lf}$ and $P_{hf}$ are $1 \times 1$ convolutions mapping from $C$ channels to a smaller dimension $C' = \max(8, \lfloor rC \rfloor)$ with a reduction ratio $r \in (0, 1)$. This step allows the module to learn how LF and HF information should be compressed and represented before fusion, without modifying the fixed decomposition itself.
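The projection step admits an equally short sketch; the class name and default ratio below are our assumptions, while the two independent 1 × 1 convolutions and the $C' = \max(8, \lfloor rC \rfloor)$ rule follow the description above.

```python
import torch
import torch.nn as nn

class FrequencyProjection(nn.Module):
    """Project LF and HF components into a compressed latent space."""

    def __init__(self, channels: int, r: float = 0.25):
        super().__init__()
        c_reduced = max(8, int(r * channels))     # C' = max(8, floor(r * C))
        self.proj_lf = nn.Conv2d(channels, c_reduced, kernel_size=1)
        self.proj_hf = nn.Conv2d(channels, c_reduced, kernel_size=1)

    def forward(self, x_lf: torch.Tensor, x_hf: torch.Tensor):
        return self.proj_lf(x_lf), self.proj_hf(x_hf)
```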

2.1.3. Fuse (Feature Fusion + Gating)

The projected features are first aggregated to form a unified modulation signal. If the module operates on a downsampled scale (i.e., when the stride ds > 1), an upsampling operation is applied to align spatial resolutions. The intermediate fused feature, denoted as $Y_{mid}$ in Figure 3, is obtained by:
$$Y_{mid} = \mathrm{Conv}_{1\times 1}(\tilde{X}_{lf} + \tilde{X}_{hf})$$
To control the injection of these frequency cues into the main backbone stream, we design a content-aware gating mechanism. The final output Y is computed via a gated residual connection:
$$G = \sigma\!\left(\mathrm{Conv}_{1\times 1}([X, Y_{mid}])\right), \qquad Y = X + G \odot Y_{mid}$$
where $[X, Y_{mid}]$ denotes channel-wise concatenation, $\odot$ represents element-wise multiplication, and $\sigma$ is the sigmoid function. The gate map $G \in (0, 1)^{B \times C \times H \times W}$ enables the network to selectively emphasize F3M corrections in beneficial regions (e.g., coral boundaries) while suppressing them in homogeneous backgrounds. When the gate is disabled, the module reverts to a standard residual addition $Y = X + Y_{mid}$. Notably, F3M introduces no auxiliary loss functions and is optimized solely through the backbone’s detection objectives.
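Putting the three stages together, a plausible end-to-end forward pass is sketched below. This is not the released implementation: the class and argument names are ours, the default reduction ratio is arbitrary, and the optional downsample/upsample path used when ds > 1 is omitted for brevity.

```python
import torch
import torch.nn as nn

class F3M(nn.Module):
    """Separate–Project–Fuse with a gated residual connection (sketch)."""

    def __init__(self, channels: int, r: float = 0.25, use_gate: bool = True):
        super().__init__()
        c_reduced = max(8, int(r * channels))
        self.low_pass = nn.AvgPool2d(3, stride=1, padding=1)        # Separate
        self.proj_lf = nn.Conv2d(channels, c_reduced, 1)            # Project (LF)
        self.proj_hf = nn.Conv2d(channels, c_reduced, 1)            # Project (HF)
        self.fuse = nn.Conv2d(c_reduced, channels, 1)               # Fuse back to C
        self.use_gate = use_gate
        if use_gate:
            # The gate sees the input concatenated with the fused correction.
            self.gate = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_lf = self.low_pass(x)                                     # low-frequency part
        x_hf = x - x_lf                                             # high-frequency residual
        y_mid = self.fuse(self.proj_lf(x_lf) + self.proj_hf(x_hf))  # unified modulation signal
        if self.use_gate:
            g = torch.sigmoid(self.gate(torch.cat([x, y_mid], dim=1)))
            return x + g * y_mid                                    # gated residual injection
        return x + y_mid            # plain residual addition when the gate is off


x = torch.randn(2, 64, 80, 80)
assert F3M(64)(x).shape == x.shape  # shape-preserving, hence plug-and-play
```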

2.1.4. Spatial Attention Extension (F3MWithSA)

To further suppress background clutter and enhance the saliency of coral structures, the F3M module can be extended with a lightweight spatial attention branch, denoted as F3MWithSA. Given the fused output $Y$ from the F3M stage, the spatially refined output $\tilde{Y}$ is computed as:
$$\tilde{Y} = Y \odot M_{\mathrm{spatial}}$$
This CBAM-style spatial attention leverages both average- and max-pooling statistics to generate a spatial attention map $M_{\mathrm{spatial}}$, computed as [40]:
$$M_{\mathrm{spatial}} = \sigma\!\left(\mathrm{Conv}_{7\times 7}([\mathrm{AvgPool}(Y), \mathrm{MaxPool}(Y)])\right)$$
In our implementation, F3MWithSA is primarily deployed at the stem/early stages of the network where feature maps are high-resolution and rich in texture, whereas the standard F3M is used in deeper stages to maintain computational efficiency.
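The attention branch itself can be sketched as a small standalone module applied to the output $Y$ of the F3M stage; the class name below is ours, while the 7 × 7 convolution and the channel-wise average/max statistics follow the formula above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention branch used by F3MWithSA (sketch)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        avg_map = y.mean(dim=1, keepdim=True)        # channel-wise average pooling
        max_map = y.max(dim=1, keepdim=True).values  # channel-wise max pooling
        m = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return y * m                                 # spatially refined output Y ⊙ M
```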

2.2. Baseline Architectures and Integration Strategy

To evaluate the generalization capability of F3M under diverse underwater conditions, we select two paradigmatic architectures as baselines: YOLO11 (representing one-stage CNNs) and RT-DETR (representing real-time Transformers). These models bracket the design space of modern detection: YOLO relies on local convolutions and multi-scale concatenation, whereas RT-DETR leverages global self-attention. Demonstrating consistent gains across these fundamentally different paradigms verifies that F3M addresses intrinsic frequency-domain distortions (e.g., haze, edge attenuation) at the representation level, independent of the specific backbone architecture.

2.2.1. One-Stage CNN: YOLO11

1. Architecture Overview.
The YOLO series represents the state-of-the-art in real-time object detection [24]. We adopt YOLO11 as our primary CNN baseline. It employs a backbone–neck–head structure with CSP-based cross-stage partial networks to balance gradient flow and computation. For underwater scenes, YOLO’s strong multi-scale fusion (PANet) provides a solid substrate for detecting small marine organisms. However, standard convolutional features remain susceptible to the frequency bias of underwater degradation—where low-frequency illumination dominates and high-frequency edge details are suppressed—motivating the integration of our frequency-aware module.
2. Integration Strategy.
We introduce F3M at two strategic locations to probe its efficacy at different semantic levels:
  • Stem Stage (Early Insertion): We replace the second convolutional block (immediately following the first spatial downsampling) with F3MWithSA. At this high-resolution stage, feature maps are rich in primitive textures and edges. F3M acts here as a frequency-based “cleaner,” decoupling the dominant illumination haze (LF) from the attenuated structural details (HF) before they are compressed by subsequent layers.
  • Neck Stage (Late Insertion): We insert a lightweight F3M (without spatial attention to save GFLOPs) before the Spatial Pyramid Pooling (SPPF) module. Here, features are semantically rich but spatially coarse; F3M helps reinject high-frequency boundary cues that may have been smoothed out during deep aggregation.
To verify that the stem-stage F3M operates as intended, we visualize the feature transformation in Figure 6. We compute the absolute difference between the output ($F_{\mathrm{out}}$, corresponding to Figure 6c) and input ($F_{\mathrm{in}}$, corresponding to Figure 6b) feature maps, formulated as $|F_{\mathrm{out}} - F_{\mathrm{in}}|$, and filter significant changes using a Canny edge mask with a top-15% intensity threshold.
As shown in the difference maps (Figure 6d,e), the red-highlighted regions concentrate mainly on coral ridges, branch tips, and other texture-rich boundaries, whereas the homogeneous water-column background remains largely unaltered. This observation indicates that the stem-stage F3MWithSA tends to introduce stronger feature modulation in structurally informative areas while suppressing unnecessary responses in smooth regions. Such behavior is consistent with the design motivation of F3M, i.e., enhancing attenuated high-frequency structural cues while mitigating low-frequency background bias under underwater degradation.
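A sketch of how such a difference map can be produced is given below; the channel-averaging of the feature tensors, the resizing step, and the specific Canny thresholds are our assumptions, while the absolute difference and the top-15% intensity cut-off follow the description above.

```python
import cv2
import numpy as np
import torch

def f3m_change_map(f_in: torch.Tensor, f_out: torch.Tensor,
                   image_gray: np.ndarray, top_percent: float = 15.0) -> np.ndarray:
    """Binary mask of the strongest stem-stage feature changes on edges.

    |F_out - F_in| is averaged over channels, the top-15% responses are kept,
    and the result is intersected with a Canny edge mask of the input image.
    """
    diff = (f_out - f_in).abs().mean(dim=1)[0]              # (H, W) change magnitude
    diff = cv2.resize(diff.detach().cpu().numpy(),
                      (image_gray.shape[1], image_gray.shape[0]))
    strong = diff >= np.percentile(diff, 100.0 - top_percent)   # top-15% intensities
    edges = cv2.Canny(image_gray, 100, 200) > 0                 # thresholds assumed
    return (strong & edges).astype(np.uint8)                    # red overlay mask
```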
3. Overall Architecture Configuration.
To complete the system design, Figure 7 illustrates the full topology of the proposed YOLO11-F3M. We adopt an asymmetric configuration to balance performance and efficiency between the two insertion points:
  • Early Stage (Stem): We employ the full F3MWithSA module with a higher channel reduction ratio (r = 0.33) and the gating mechanism enabled (Gate = T). This maximizes the preservation of fine-grained textures when the feature map is large.
  • Late Stage (Pre-SPPF): We use the standard F3M (without spatial attention) and a lower reduction ratio (r = 0.125) with the gate disabled (Gate = F). This “lite” configuration reduces computational overhead (GFLOPs) while still providing necessary frequency-domain correction at the deepest layer (a brief instantiation sketch of both configurations follows this list).
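Using the module sketches from Section 2.1, the asymmetric configuration corresponds roughly to the following two instantiations; the channel counts shown are illustrative placeholders rather than values taken from the YOLO11n definition.

```python
import torch.nn as nn

# Early stem stage: full F3M plus spatial attention, r = 0.33, gate enabled.
stem_block = nn.Sequential(F3M(channels=64, r=0.33, use_gate=True),
                           SpatialAttention())           # i.e., F3MWithSA

# Late pre-SPPF stage: "lite" F3M, r = 0.125, gate disabled, no attention.
neck_block = F3M(channels=256, r=0.125, use_gate=False)
```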

2.2.2. Real-Time Transformer: RT-DETR

1. Architecture Overview
RT-DETR reframes detection as a set prediction problem, utilizing a hybrid encoder and query selection to eliminate anchors and Non-Maximum Suppression (NMS) [32]. Unlike CNNs, RT-DETR exploits global context via self-attention, which is advantageous for distinguishing camouflaged marine life in cluttered reefs. However, the attention mechanism itself relies on the quality of input tokens; if the input features are degraded by scattering blur, the attention map may fail to focus on correct object boundaries.
2. Integration Strategy
We integrate F3M into the ResNet/HGNetv2 backbone of RT-DETR, specifically focusing on the early high-resolution stages (P2, P3). As illustrated in Figure 8, we insert F3MWithSA blocks (purple modules) immediately after the High-level Gate (HGBlock) modules in Stage 1 and Stage 2, while the dashed lines indicate the original multi-scale feature paths.
  • Rationale: Small polyps and fine texture details are encoded earliest in the network. Once these high-frequency cues are lost due to downsampling, the Transformer encoder cannot recover them. By placing F3M early, we refine the “tokens” before they enter the heavy attention stack, improving input quality without altering the computational cost of the Transformer encoder or decoder.
  • Configuration: To maintain real-time performance, we avoid adding F3M to the deeper, low-resolution stages (P4, P5). The rest of the pipeline, including the hybrid encoder and the detection head, remains unmodified.

3. Experiments and Analysis

3.1. Dataset

To comprehensively validate the proposed method, we conduct evaluations on two distinct datasets that represent contrasting underwater environments: SCoralDet (controlled aquarium soft corals) and TrashCan-Instance (unconstrained deep-sea debris).

3.1.1. SCoralDet Dataset

We first evaluate our detectors on the SCoralDet dataset proposed by Lu et al. [25] for underwater soft-coral detection. The dataset contains 646 aquarium images collected at the Coral Germplasm Conservation and Breeding Center of Hainan Tropical Ocean University and annotated in COCO object-detection format with six species: Euphyllia ancora, Favosites sp., Platygyra sp., Sarcophyton sp., Sinularia sp., and Waving Hand coral (Xenia sp.). To train YOLO-style detectors, we convert the COCO bounding-box annotations into YOLO text files and keep an 8:1:1 split into training, validation, and test sets. We will release the converted labels and split lists together with our code to facilitate reproducibility, while the original images remain provided by the SCoralDet authors.
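The COCO-to-YOLO label conversion mentioned above amounts to normalizing box centers and sizes by the image dimensions; a minimal sketch follows, where the helper name and example values are illustrative only.

```python
def coco_box_to_yolo(box, img_w, img_h):
    """Convert a COCO [x_min, y_min, width, height] box in pixels to the
    normalized (cx, cy, w, h) tuple used in YOLO .txt label files."""
    x_min, y_min, w, h = box
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)

# One line per object: "<class_id> cx cy w h" with values in [0, 1].
cx, cy, w, h = coco_box_to_yolo([48, 30, 120, 90], img_w=640, img_h=480)
print(f"0 {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
```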
Figure 9 illustrates one representative example for each of the six species. Euphyllia ancora exhibits dense, anchor-shaped tentacles and strong blue–green color cast; Favosites shows tightly packed corallites with fine, repetitive textures; Platygyra has meandering “brain-like” valleys with bright fluorescent ridges; Sarcophyton forms large mushroom-shaped soft colonies with thousands of thin polyps; Sinularia displays branching, tree-like lobes under strong purple lighting and motion blur; and Waving Hand coral features highly dynamic, hand-like polyps that sway continuously in the current. These diverse morphologies, color casts and blur patterns result in complex backgrounds and subtle inter-class texture differences, making SCoralDet a challenging testbed for soft-coral detection.

3.1.2. TrashCan Dataset

To assess cross-domain generalization, we adopt the TrashCan-Instance 1.0 dataset, a large-scale benchmark for underwater debris detection sourced from the J-EDI deep-sea image library (JAMSTEC). Unlike the controlled environment of SCoralDet, TrashCan consists of 7212 images captured by ROVs in deep-sea scenarios, characterized by severe turbidity, variable lighting, and unconstrained backgrounds.
The dataset includes 22 instance-level categories, which are aggregated into four semantic super-classes in this study: (1) Marine Debris (e.g., bottles, plastic bags, nets, wreckage), (2) Biological Organisms (e.g., fish, crabs, starfish), (3) Plants (e.g., marine vegetation, floating biological fragments), and (4) ROV Components (e.g., ROV parts, man-made structures). Due to the absence of an official split, we strictly adhere to prior protocols by randomly partitioning the dataset into an 80/10/10 split (train/val/test) using a fixed random seed.
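Since no official split exists, the fixed-seed 80/10/10 partition can be reproduced along the following lines; the helper name is ours, and the seed value shown simply mirrors the experiment seed reported in Section 3.2.

```python
import random

def split_80_10_10(image_ids, seed=42):
    """Deterministic 80/10/10 train/val/test partition of image identifiers."""
    ids = sorted(image_ids)                 # fix the ordering before shuffling
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(0.8 * len(ids)), int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_80_10_10(range(7212))
assert len(train_ids) + len(val_ids) + len(test_ids) == 7212
```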
Figure 10 shows typical annotated examples from the TrashCan-Instance dataset, categorized by semantic groups. Figure 10a presents various animals, where deformable body shapes and low-contrast boundaries against the seafloor make detection difficult. Figure 10b shows sessile and filamentous plants that resemble background textures. Figure 10c depicts different types of marine litter, including cans, bags, and unknown fragments, often partially buried or camouflaged by sediments. Figure 10d illustrates the ROV platform, where strong specular highlights and on-screen telemetry further challenge robust localization.

3.2. Experimental Details

All experiments are conducted using Python 3.10.11, PyTorch 2.3.1 and CUDA 11.8 on a single NVIDIA RTX 4060 Ti GPU (16 GB VRAM). All detectors are trained under a unified protocol across both datasets: images are resized to 640 × 640 with a batch size of 16, and data augmentation includes a 50% probability of horizontal flipping, HSV color augmentation (hue 0.015, saturation 0.7, value 0.4), random scaling in the range of 0.5–1.0, translation up to 10% in each direction, mosaic augmentation in the early epochs, as well as RandAugment and random erasing with a probability of 0.4.
We use Adam-based optimization (optimizer = auto) with an initial learning rate of 0.01, momentum 0.937 and weight decay 0.0005, and adopt the default YOLO detection losses. To ensure reproducibility, all experiments are run with a fixed random seed of 42. Except for the number of training epochs, which is adjusted according to the scale of each dataset, all other hyper-parameters are kept identical on SCoralDet and TrashCan to guarantee fair comparison.
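For reference, a training run with these settings could be launched through the public Ultralytics API roughly as follows; the dataset YAML path and epoch count are placeholders, mosaic and RandAugment are left at their defaults, and mapping the stated scaling range onto the scale argument is an approximation on our part.

```python
from ultralytics import YOLO  # assumes the standard Ultralytics training interface

model = YOLO("yolo11n.yaml")  # an F3M variant would point to a modified model YAML
model.train(
    data="scoraldet.yaml",    # placeholder dataset configuration
    epochs=300,               # adjusted per dataset (see text)
    imgsz=640, batch=16, seed=42,
    optimizer="auto", lr0=0.01, momentum=0.937, weight_decay=0.0005,
    fliplr=0.5, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    translate=0.1, scale=0.5, erasing=0.4,
)
```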

3.3. Evaluation Metrics

This study focuses on F3M-based YOLO variants and RT-DETR detectors, and adopts standard metrics that balance accuracy and efficiency: GFLOPs, Precision (P), Recall (R), mAP50, and mAP50–95.
1. GFLOPs (floating-point operations per forward pass)
Used to measure the theoretical compute cost at a given input resolution:
$$\mathrm{GFLOPs} = 10^{-9} \times \sum_{l=1}^{L} f_l$$
where $f_l$ denotes the FLOPs of layer $l$. Here, FLOPs refer to the number of floating-point arithmetic operations (primarily multiplications and additions) required for a single forward pass. The total GFLOPs are obtained by summing the per-layer FLOPs across the network and normalizing by $10^9$. This is a standard practice for estimating theoretical computational complexity. GFLOPs directly reflects how different insertion points and width settings in F3M-YOLO and RT-DETR affect computational cost.
2. Precision and Recall
Precision (P) measures the proportion of true positive detections out of all positive predictions, computed as:
$$P = \frac{TP}{TP + FP}$$
Recall (R) measures the proportion of true positive detections out of all actual positives, computed as:
$$R = \frac{TP}{TP + FN}$$
where $TP$, $FP$, $TN$, and $FN$ denote true positives, false positives, true negatives, and false negatives, respectively.
3. mAP50 (mean Average Precision at IoU = 0.5)
Let the Intersection over Union (IoU) between a predicted box $b$ and a ground-truth box $g$ be defined as:
$$\mathrm{IoU} = \frac{|b \cap g|}{|b \cup g|}$$
For class $c$ at $\tau = 0.5$, AP is the integral of the non-increasing precision envelope $\hat{p}_c(r; \tau)$ over recall $r$:
$$AP_c(\tau = 0.5) = \int_0^1 \hat{p}_c(r; \tau)\, dr \approx \sum_{n=1}^{N} (R_n - R_{n-1})\, \hat{p}_c(R_n; 0.5)$$
and the mean across all $C$ classes is
$$mAP_{50} = \frac{1}{C} \sum_{c=1}^{C} AP_c(0.5)$$
4. mAP50–95 (mean AP across IoU thresholds)
To more strictly capture localization quality, we average over $\tau \in \mathcal{T} = \{0.50, 0.55, \ldots, 0.95\}$ (step = 0.05):
$$mAP_{50\text{–}95} = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \left( \frac{1}{C} \sum_{c=1}^{C} AP_c(\tau) \right)$$
mAP50–95 is more sensitive to overlap fidelity and is well-suited for evaluating fine-grained localization of F3M-YOLO and RT-DETR in underwater degradation scenarios.
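Both metrics can be computed from per-class precision–recall curves with a few lines of NumPy; the sketch below assumes detections have already been matched to ground truth at each IoU threshold and is intended only to make the envelope-and-sum formulation explicit, not to replace the standard COCO or Ultralytics evaluators.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the non-increasing precision envelope (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]       # envelope p_hat_c(r; tau)
    changed = np.where(r[1:] != r[:-1])[0]         # recall steps R_n - R_(n-1)
    return float(np.sum((r[changed + 1] - r[changed]) * p[changed + 1]))

def map_over_thresholds(ap_table: np.ndarray) -> float:
    """mAP50-95 given AP values of shape (num_IoU_thresholds, num_classes)."""
    return float(ap_table.mean())
```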

3.4. Comparison with Existing Detectors on the SCoralDet Dataset

To evaluate the effectiveness of the proposed F3M module under realistic underwater imaging conditions, we conduct a series of comparative experiments on the SCoralDet dataset using a diverse set of representative detectors. The evaluated baselines first include standard lightweight models (YOLO11n, YOLO12n, YOLOv8n) and the transformer-based RT-DETR-n, serving as general-purpose benchmarks.
  • Comparison with Specialized Underwater Architectures. Beyond general detectors, we benchmark against state-of-the-art architectures specifically engineered for the underwater domain to validate domain adaptability. LUOD-YOLO [41] enhances YOLOv8 with dynamic feature fusion and dual-path rearrangement to mitigate underwater distortion. Our previously published S2-YOLO [42] optimizes marine debris recognition through improved spatial-to-depth convolutions. We also compare against SCoralDet [25], a YOLOv10-based detector explicitly customized for soft-coral detection tasks.
  • Construction of the Frequency-Domain Baseline. Crucially, to strictly evaluate the proposed spatial approximation against rigorous frequency decomposition, we constructed a heavy-weight frequency baseline, denoted as YOLO11n-WFU. Drawing inspiration from the UGMB algorithm [43]—which proposes a multi-scale architecture by embedding defect-specific enhancement modules into dense residual blocks—we adopted a parallel integration strategy. Specifically, we replaced the standard Bottleneck unit within the YOLO11n’s C3k2 module with the Wavelet Feature Upgrade (WFU) block [39]. This design enables the network to perform explicit wavelet-based frequency decomposition during deep feature extraction, facilitating a direct comparison between “heavy-weight frequency transforms” and our “lightweight F3M” without altering the macro-topology of the host network.
All detectors are trained using an identical training pipeline, with unified hyper-parameters and augmentation strategies to ensure fair comparison. Performance is evaluated on the SCoralDet test split, and the results are summarized in Table 1.
Quantitative results indicate that integrating the proposed F3M module consistently improves accuracy across different architectures while introducing minimal computational overhead. Among all evaluated models, YOLO11n-F3M achieves the best overall performance, reaching 0.861 Precision, 0.797 mAP50, and 0.539 mAP50–95. Relative to the vanilla YOLO11n baseline, our method yields substantial gains of +3.5% mAP50 and +2.6% mAP50–95, with parameters increasing by only 0.03 M and GFLOPs by 0.2. This demonstrates that F3M effectively enhances the feature representation capability of the lightweight backbone without compromising its real-time efficiency.
Beyond the standard CNN baseline, the comparison with other architectural families highlights the robustness of our design. While the recently released YOLO12n achieves a high Precision of 0.859, it suffers from a noticeably lower Recall (0.634), suggesting a tendency to miss subtle targets. In contrast, YOLO11n-F3M maintains a balanced Recall of 0.708, indicating superior capability in detecting faint or obscured coral features. Furthermore, the benefits of F3M extend to transformer-based architectures. The F3M-enhanced RT-DETR-n improves mAP50–95 from 0.514 to 0.532 compared to its baseline. This suggests that the explicit frequency-domain modulation provided by F3M is complementary to the global semantic modeling of self-attention mechanisms, effectively mitigating the impact of input degradation on token quality.
Significantly, YOLO11n-F3M also outperforms specialized models specifically engineered for the underwater domain. As shown in Table 1, LUOD-YOLO and S2-YOLO achieve mAP50–95 scores of 0.518 and 0.516, respectively. These models leverage complex mechanisms, such as dual-path rearrangement or SPD-Conv, to handle underwater distortions. Additionally, we extended our comparison to the recent SCoralDet, which was explicitly tailored for this soft-coral dataset. Although the original study reports superior metrics, we encountered reproducibility challenges due to the omission of the specific Wasserstein loss definition in the official code release. Despite our efforts to reconstruct the model using a standard public implementation of this loss within our unified pipeline, the reproduced performance remained limited, yielding an mAP50 of 0.724 and mAP50–95 of 0.483. Specifically, YOLO11n-F3M outperforms the reproduced SCoralDet by approximately +5.6% in mAP50–95.
Most critically, within the frequency-domain paradigm, our method demonstrates superior efficacy compared to the rigorous wavelet-based baseline, YOLO11n-WFU. Despite incorporating the theoretically complete Discrete Wavelet Transform (DWT), the WFU-enhanced model achieves an mAP50–95 of only 0.475 and mAP50 of 0.761, falling short of the vanilla YOLO11n baseline (0.513 mAP50–95). We attribute this performance degradation to the rigid nature of fixed Haar wavelet bases, which lack the learnable adaptability required to capture the subtle, non-uniform texture blurring of soft corals. Furthermore, the WFU module incurs a significant computational penalty, elevating the model’s GFLOPs to 8.1 due to channel expansion. In stark contrast, F3M achieves a substantially higher mAP50–95 of 0.539 (+6.4% over WFU) while maintaining a lightweight footprint of 6.5 GFLOPs. This result empirically validates that for real-time underwater detection, our learnable spatial approximation strategy offers a far more efficient and robust inductive bias than heavy-weight mathematical frequency decompositions.
Consequently, our proposed method surpasses all these specialized baselines by a clear margin. This superiority implies that directly addressing the physical root of underwater degradation—specifically, the separation and enhancement of high-frequency structural cues via F3M—is more effective and robust than purely spatial or channel-based architectural modifications or complex loss formulations for recovering fine-grained coral morphologies.

3.5. Ablation Study of F3M Placement and Attention

To validate the architectural decisions behind F3M and isolate the specific contributions of its components, we conducted a systematic ablation study on the SCoralDet dataset using the YOLO11n backbone. As detailed in Table 2, we evaluate four distinct configurations:
  • Baseline: The vanilla YOLO11n model.
  • onlyF3M (Deep-only): Integrating the standard F3M (without spatial attention) solely at the neck stage (pre-SPPF) to enhance semantic aggregation.
  • onlyF3MWithSA (Stem-only): Integrating F3MWithSA solely at the early stem stage to denoise high-resolution features.
  • F3M(Dual-site): The proposed full architecture employing both the early attention-enhanced stem and the deep frequency-aware neck.
Starting from the plain YOLO11n, which attains 0.763 precision, 0.686 recall, 0.762 mAP50 and 0.513 mAP50–95, replacing selected convolutional blocks with F3M (onlyF3M) already yields a noticeable gain, improving precision to 0.843, recall to 0.719, mAP50 to 0.776 and mAP50–95 to 0.517. Further enabling the attention-enhanced stem (onlyF3MWithSA) slightly trades recall for better localization quality, achieving 0.850 precision, 0.678 recall, 0.769 mAP50 and 0.522 mAP50–95. The full YOLO11n-F3M configuration provides the best overall balance, reaching 0.861 precision, 0.708 recall, 0.797 mAP50 and 0.539 mAP50–95. Compared with the baseline YOLO11n, this corresponds to gains of +3.5% mAP50 and +2.6% mAP50–95 with only a modest increase in parameters (from 2.58 M to 2.61 M) and computation (from 6.3 to 6.5 GFLOPs).
To further investigate how frequency-domain modulation impacts specific coral morphologies, we analyze the per-class localization quality (mAP50–95) as detailed in Table 3.
The results indicate that the improvements are closely correlated with the structural and environmental characteristics of each species. For structurally complex species rich in fine-grained details, such as Euphyllia ancora and Sarcophyton sp., the proposed full F3M architecture achieves the most significant gains, improving mAP50–95 by +4.9% and +7.1% relative to the baseline, respectively. This substantial boost confirms that simultaneously preserving initial texture cues at the stem stage and refining semantic boundaries at the neck stage is crucial for recovering intricate morphologies. Furthermore, on the most challenging class, Sinularia sp., which is characterized by severe low-contrast blending with the background, the full F3M module effectively improves performance from 0.382 to 0.397. This demonstrates that the module’s “Separate–Project–Fuse” mechanism successfully decouples the object from the dominant low-frequency background haze, enhancing detection even when visual cues are ambiguous.
However, the ablation also reveals distinct sensitivities for specific coral types. Interestingly, Platygyra sp. (Brain Coral) achieves its peak performance (0.705) with the Stem-only configuration, outperforming the full F3M (0.685). This suggests that for its unique, high-contrast ridge patterns, preserving high-resolution edge information early in the network is more critical than deep semantic fusion. Conversely, for Waving Hand coral, which suffers from strong motion blur due to water currents, the Deep-only variant yields the best results (0.428). This indicates that when high-frequency texture details are corrupted by motion, the network benefits more from high-level semantic context aggregation than from early-stage texture enhancement. Despite these class-specific variations, the full F3M configuration offers the most robust overall performance across the dataset, effectively balancing the recovery of static high-frequency textures and the localization of dynamic or low-contrast targets.

3.6. General Applicability of F3M Across Detector Architectures

To further evaluate the generalization capability of F3M under diverse underwater conditions, we insert the module into three representative detector families—YOLOv8n, YOLO11n, and RT-DETR-n—and compare each baseline with its F3M-enhanced counterpart on the SCoralDet test set (Table 4). Across all three architectures, F3M brings consistent gains in precision and mAP50–95 at the cost of minimal additional parameters (around 1%) and at most a 7.6% increase in GFLOPs.
For the lightweight convolutional backbone, YOLO11n-F3M achieves the largest improvements: precision increases by +9.8% (from 0.763 to 0.861), recall by +2.2%, mAP50 by +3.5%, and mAP50–95 by +2.6%. These improvements come with only a small increase in parameters (by 0.03 M) and GFLOPs (by 0.2). This demonstrates that F3M enhances YOLO11n’s performance while maintaining its computational efficiency.
The transformer-based RT-DETR-n also benefits from F3M, with mAP50 and mAP50–95 improving by +3.3% and +1.8%, respectively, and precision increasing by +3.6%. This indicates that the frequency-domain modulation introduced by F3M is complementary to the global self-attention mechanism, helping RT-DETR-n mitigate input degradation and retain high-quality token features.
On YOLOv8n, F3M leads to a notable +6.5% increase in precision and a slight +0.3% improvement in mAP50–95, although recall decreases by 6.2%. This suggests that F3M introduces a trade-off toward more conservative but higher-quality predictions, enhancing precision while sacrificing recall in this specific architecture.
Overall, these results demonstrate that F3M serves as a plug-and-play enhancement block that can be seamlessly integrated into both CNN-based and transformer-based detectors. F3M consistently improves high-IoU performance (mAP50–95) with negligible overhead, supporting the claim that explicit frequency-domain feature modulation is a broadly applicable design principle rather than a backbone-specific trick.

3.7. Cross-Dataset Generalization on the TrashCan Dataset

To evaluate the domain robustness of the proposed F3M module, we extended our assessment to the TrashCan-Instance dataset. Unlike the controlled aquarium environment of SCoralDet, this deep-sea benchmark introduces significant domain shifts, such as severe turbidity and unconstrained backgrounds (as detailed in Section 3.1), serving as a rigorous testbed for generalization. We adopted the same training configuration and hyperparameters as used for SCoralDet. Table 5 summarizes the comparison across representative detectors.
On this challenging deep-sea dataset, the YOLO11n-F3M model achieves a peak mAP50 of 0.908, matching the performance of the significantly more computationally expensive YOLOv8n (mAP50 = 0.908) while maintaining 13% fewer parameters and 20% fewer GFLOPs. This demonstrates that F3M provides a substantial boost in accuracy with minimal computational overhead, reinforcing the argument that frequency-domain modulation is a highly efficient enhancement for underwater object detection. Compared to its direct backbone YOLO11n, YOLO11n-F3M improves mAP50 by 1.6% and mAP50–95 by 0.6%, further confirming that F3M effectively addresses the challenges posed by texture-degraded domains without sacrificing real-time performance.
The comparison with specialized underwater detectors highlights the distinct advantages of F3M. LUOD-YOLO, a model designed specifically for underwater detection, achieves a lightweight design (1.63 M parameters) but sacrifices accuracy, lagging behind YOLO11n-F3M by 2.2% in mAP50. Similarly, S2-YOLO, a model we previously developed for marine debris detection, attains the highest mAP50–95 (0.689) thanks to its SPD-Conv modules, but YOLO11n-F3M outperforms it in detection accuracy (mAP50 of 0.908 vs. 0.904) while being approximately 4.7% lighter. A similar trend is observed with SCoralDet. While optimized for underwater scenarios, it exhibited limited generalization capability when facing the severe domain shifts of the deep-sea TrashCan dataset, yielding an mAP50 of 0.879 (as shown in Table 5). Furthermore, its computational cost (GFLOPs) appeared disproportionate to the marginal performance gains observed in our reproduction. Conversely, YOLO11n-F3M, leveraging its lightweight frequency modulation, not only surpassed SCoralDet in accuracy (reaching 0.908 mAP50) but also maintained a minimal computational footprint. This suggests that F3M is particularly beneficial for real-time deployments on resource-constrained ROVs, where high detection accuracy and computational efficiency are prioritized over pixel-perfect bounding box precision or dataset-specific overfitting. These results underscore that frequency-domain enhancement—particularly through the separation and enhancement of high-frequency features—is more effective than complex spatial or channel-based modifications in handling underwater degradations.
Most critically, the comparison with the frequency-domain baseline, YOLO11n-WFU, provides compelling evidence for the efficiency of our design choice. While the WFU-enhanced model improves over the vanilla YOLO11n (0.902 vs. 0.892 mAP50), validating that frequency-domain information is indeed valuable for suppressing deep-sea turbidity, it comes at a steep computational cost, raising GFLOPs to 8.1 due to the channel-expanding wavelet transform. In contrast, YOLO11n-F3M not only surpasses WFU in detection accuracy (0.908 mAP50 and 0.679 mAP50–95 compared to WFU’s 0.902 and 0.676) but does so with significantly lower computational overhead (6.5 GFLOPs vs. 8.1 GFLOPs). This result reinforces that the lightweight, learnable spatial approximation employed by F3M is more robust to the heterogeneous clutter of marine debris than rigid, mathematically complex wavelet decompositions. By avoiding the heavy overhead of strict mathematical transforms, F3M achieves a superior trade-off between domain generalization and inference speed, making it the optimal candidate for energy-constrained deep-sea exploration.
Furthermore, the comparison across different architectural families reveals the limitations of transformer-based models like RT-DETR. Despite its high computational cost (19.8 GFLOPs), RT-DETR fails to outperform CNN-based models, including YOLO11n-F3M, on the TrashCan dataset, with mAP50 of 0.878. This suggests that the global attention mechanism of transformers, while effective in some settings, struggles to extract meaningful features from complex, turbid, low-contrast deep-sea imagery without the inductive bias introduced by frequency-aware convolutions. The evaluation also reveals that YOLO12n, despite being a newer model, falls short of YOLOv8n, further validating that architectural upgrades alone are insufficient compared to F3M’s domain-specific frequency modulation for underwater perception.
Overall, YOLO11n-F3M strikes the optimal balance between accuracy and efficiency, outperforming both heavier models and specialized detectors in terms of real-time deployment suitability. It generalizes robustly to deep-sea debris detection, demonstrating that F3M not only enhances model performance across diverse architectures but also proves to be a viable solution for autonomous underwater vehicles, where computational efficiency and detection accuracy are critical.

4. Conclusions

In this study, we addressed the persistent challenge of feature degradation in underwater object detection by proposing the Frequency-domain Feature Fusion Module (F3M). Unlike traditional approaches that rely on heavy restoration preprocessing or complex backbone modifications, F3M introduces a lightweight “Separate–Project–Fuse” paradigm that explicitly decouples high-frequency structural cues from low-frequency illumination artifacts. By embedding this mechanism into the deep learning pipeline, we enable detectors to actively recover fine-grained morphological details—such as coral textures and debris boundaries—that are typically attenuated by aquatic scattering and absorption.
Extensive quantitative evaluations on the SCoralDet and TrashCan-Instance datasets substantiate the effectiveness and robustness of our method. On the soft-coral dataset, the F3M-enhanced YOLO11n achieved substantial gains of +3.5% mAP50 and +2.6% mAP50–95 over the baseline, surpassing both heavier transformer-based models (RT-DETR) and specialized underwater detectors (LUOD-YOLO, S2-YOLO). Crucially, F3M demonstrated exceptional cross-domain generalization. On the unconstrained deep-sea TrashCan dataset, YOLO11n-F3M matched the accuracy of the larger YOLOv8n while requiring 13% fewer parameters and 20% fewer GFLOPs. These results confirm that frequency-domain modulation serves as a highly efficient inductive bias, offering a superior trade-off between detection accuracy and computational cost, making it suitable for real-time deployment.
A pivotal finding of this work lies in the trade-off between mathematical rigor and detection efficiency. Our comparative analysis with the Wavelet Feature Upgrade (WFU) module reveals that the property of invertibility—while vital for image restoration tasks—becomes a computational liability in downsampling-based detection architectures like YOLO. The rigorous discrete wavelet transform (DWT) requires significant channel expansion to preserve reconstruction information, which contradicts the lightweight design principles of real-time detectors. By relaxing this constraint and adopting a learnable spatial approximation, F3M eliminates redundant information flow. It successfully demonstrates that, for object detection, capturing discriminative frequency features is more critical than maintaining mathematical reversibility, achieving higher accuracy (+6.4% mAP50–95 over WFU) with significantly lower latency.
While F3M has shown promising results, several avenues for future research remain:
  • Hardware Deployment and Optimization: Although our theoretical GFLOPs and parameter counts indicate suitability for edge devices, we plan to deploy and test the F3M-YOLO architecture on physical underwater platforms (e.g., Jetson Orin or FPGA-based AUVs) to evaluate real-world inference latency and energy consumption under varying battery constraints.
  • Learnable Frequency Decomposition: Currently, F3M utilizes fixed pooling operators for frequency separation. Future iterations could explore learnable spectral filters or adaptive wavelet transforms to dynamically adjust the frequency cutoff based on the turbidity levels of different water bodies.
  • Video-Based Detection: Given the dynamic nature of marine environments (e.g., swaying corals, moving fish), extending F3M to utilize temporal information in video streams could further suppress background noise and improve the consistency of detection tracks.

Author Contributions

Conceptualization, T.W., H.W. and W.W.; methodology, T.W.; software, T.W.; validation, T.W., H.W. and W.W.; formal analysis, T.W.; investigation, T.W.; resources, H.W. and W.W.; data curation, H.D.; writing—original draft preparation, T.W.; writing—review and editing, T.W. and W.W.; visualization, B.Y.; supervision, H.W.; project administration, T.W.; funding acquisition, H.W. and W.W.; critical advice, K.Z.; idea exchange, B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research and Development Program of Hainan Province (Grant No. ZDYF2024GXJS034); the Major Science and Technology Program of Yazhou Bay Innovation Institute, Hainan Tropical Ocean University (Grant No. 2023CXYZD001); the School-Level Research and Practice Projects of Hainan Tropical Ocean University (RHYxgnw2024-12); the Sanya Science and Technology Special Fund (Grant No. 2022KJCX30); and the University-Level Graduate Innovation Research Projects of Hainan Tropical Ocean University (Grant Nos. RHDYC-202514 and RHDYC-202515).

Data Availability Statement

The authors declare that the data supporting the findings of this study are from publicly available datasets. Further inquiries can be directed to the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Moberg, F.; Folke, C. Ecological goods and services of coral reef ecosystems. Ecol. Econ. 1999, 29, 215–233. [Google Scholar] [CrossRef]
  2. Souter, D.; Planes, S.; Wicelowski, C.; Obura, D.; Staub, F. (Eds.) Status of Coral Reefs of the World: 2020; GCRMN/UNEP: Nairobi, Kenya, 2021. [Google Scholar]
  3. Ferrario, F.; Beck, M.W.; Storlazzi, C.D.; Micheli, F.; Shepard, C.C.; Airoldi, L. The effectiveness of coral reefs for coastal hazard risk reduction and adaptation. Nat. Commun. 2014, 5, 3794. [Google Scholar] [CrossRef] [PubMed]
  4. Spalding, M.D.; Burke, L.; Wood, S.A.; Ashpole, J.; Hutchison, J.; zu Ermgassen, P. Mapping the global value and distribution of coral reef tourism. Mar. Policy 2017, 82, 104–113. [Google Scholar] [CrossRef]
  5. Zeng, K.; He, S.; Zhan, P. Inevitable global coral reef decline under climate change-induced thermal stresses. Commun. Earth Environ. 2025, 6, 827. [Google Scholar] [CrossRef]
  6. Hoegh-Guldberg, O.; Mumby, P.J.; Hooten, A.J.; Steneck, R.S.; Greenfield, P.; Gomez, E.; Harvell, C.D.; Sale, P.F.; Edwards, A.J.; Caldeira, K.; et al. Coral Reefs Under Rapid Climate Change and Ocean Acidification. Science 2007, 318, 1737–1742. [Google Scholar] [CrossRef]
  7. Hoegh-Guldberg, O. Coral reef ecosystems under climate change and ocean acidification. Front. Mar. Sci. 2017, 4, 158. [Google Scholar] [CrossRef]
  8. Bruno, J.F.; Selig, E.R. Regional Decline of Coral Cover in the Indo-Pacific: Timing, Extent, and Subregional Comparisons. PLoS ONE 2007, 2, e711. [Google Scholar] [CrossRef]
  9. Beijbom, O.; Edmunds, P.J.; Roelfsema, C.; Smith, J.; Kline, D.I.; Neal, B.P.; Dunlap, M.J.; Moriarty, V.; Fan, T.Y.; Tan, C.J.; et al. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation. PLoS ONE 2015, 10, e0130312. [Google Scholar] [CrossRef]
  10. Raphael, A.; Dubinsky, Z.; Iluz, D.; Benichou, J.I.C.; Netanyahu, N.S. Deep neural network recognition of shallow water corals in the Gulf of Eilat (Aqaba). Sci. Rep. 2020, 10, 12959. [Google Scholar] [CrossRef]
  11. Shen, S.; Wang, H.; Chen, W.; Wang, P.; Liang, Q.; Qin, X. A novel edge-feature attention fusion framework for underwater image enhancement. Front. Mar. Sci. 2025, 12, 1555286. [Google Scholar] [CrossRef]
  12. Soom, J.; Pattanaik, V.; Leier, M.; Tuhtan, J.A. Environmentally adaptive fish or no-fish classification for river video fish counters using high-performance desktop and embedded hardware. Ecol. Inform. 2022, 72, 101817. [Google Scholar] [CrossRef]
  13. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
  14. Liu, R.; Fan, X.; Zhu, M.; Hou, M.; Luo, Z. Real-World Underwater Enhancement: Challenges, Benchmarks, and Solutions Under Natural Light. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4861–4875. [Google Scholar] [CrossRef]
  15. Ditria, E.M.; Lopez-Marcano, S.; Sievers, M.; Jinks, E.L.; Brown, C.J.; Connolly, R.M. Automating the analysis of fish abundance using object detection: Optimizing animal ecology with deep learning. Front. Mar. Sci. 2020, 7, 429. [Google Scholar] [CrossRef]
  16. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef] [PubMed]
  17. Mirzaei, B.; Nezamabadi-pour, H.; Raoof, A.; Derakhshani, R. Small Object Detection and Tracking: A Comprehensive Review. Sensors 2023, 23, 6887. [Google Scholar] [CrossRef]
  18. Ruan, J.; Cui, H.; Huang, Y.; Li, T.; Wu, C.; Zhang, K. A review of occluded objects detection in real complex scenarios for autonomous driving. Green Energy Intell. Transp. 2023, 2, 100092. [Google Scholar] [CrossRef]
  19. González-Rivero, M.; Beijbom, O.; Garcia, R.; Rodriguez-Ramirez, A.; Bryant, D.E.; Ganase, A.; Gonzalez-Marrero, Y.; Herrera-Reveles, A.; Kennedy, E.V.; Kim, C.J.; et al. Monitoring of coral reefs using artificial intelligence: A feasible and cost-effective approach. Remote Sens. 2020, 12, 489. [Google Scholar] [CrossRef]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  21. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  22. Mahmood, A.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F.; Hovey, R.; Kendrick, G.; Fisher, R.B. Coral classification with hybrid feature representations. In Proceedings of the IEEE International Conference on Image Processing (ICIP 2016), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 519–523. [Google Scholar] [CrossRef]
  23. Deo, R.; John, C.M.; Zhang, C.; Whitton, K.; Salles, T.; Webster, J.M.; Chandra, R. Deepdive: Leveraging Pre-trained Deep Learning for Deep-Sea ROV Biota Identification in the Great Barrier Reef. Sci. Data 2024, 11, 957. [Google Scholar] [CrossRef]
  24. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar] [CrossRef]
  25. Lu, Z.; Liao, L.; Xie, X.; Yuan, H. SCoralDet: Efficient real-time underwater soft coral detection with YOLO. Ecol. Inform. 2024, 85, 102937. [Google Scholar] [CrossRef]
  26. Cai, S.; Zhang, X.; Mo, Y. A lightweight underwater detector enhanced by Attention mechanism, GSConv and WIoU on YOLOv8. Sci. Rep. 2024, 14, 25797. [Google Scholar] [CrossRef]
  27. Zhao, L.; Yun, Q.; Yuan, F.; Ren, X.; Jin, J.; Zhu, X. YOLOv7-CHS: An emerging model for underwater object detection. J. Mar. Sci. Eng. 2023, 11, 1949. [Google Scholar] [CrossRef]
  28. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9626–9635. [Google Scholar] [CrossRef]
  29. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6568–6577. [Google Scholar] [CrossRef]
  30. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  31. Zhu, X.; Su, W.; Li, L.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations (ICLR 2021), Virtual, 3–7 May 2021. [Google Scholar]
  32. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  33. Chen, Y.; Fan, H.; Xu, B.; Yan, Z.; Kalantidis, Y.; Rohrbach, M.; Shuicheng, Y.; Feng, J. Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks With Octave Convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3434–3443. [Google Scholar] [CrossRef]
  34. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 763–772. [Google Scholar] [CrossRef]
  35. Yang, Y.; Soatto, S. FDA: Fourier Domain Adaptation for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4084–4094. [Google Scholar] [CrossRef]
  36. Chi, L.; Jiang, B.; Mu, Y. Fast Fourier Convolution. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; pp. 4479–4491. [Google Scholar]
  37. Li, Q.; Shen, L.; Guo, S.; Lai, Z. WaveCNet: Wavelet Integrated CNNs to Suppress Aliasing Effect for Noise-Robust Image Classification. IEEE Trans. Image Process. 2021, 30, 7074–7089. [Google Scholar] [CrossRef] [PubMed]
  38. Wolter, M.; Garcke, J. Adaptive wavelet pooling for convolutional neural networks. Proc. Mach. Learn. Res. 2021, 130, 1936–1944. Available online: https://proceedings.mlr.press/v130/wolter21a.html (accessed on 16 December 2025).
  39. Li, W.; Guo, H.; Liu, X.; Liang, K.; Hu, J.; Ma, Z.; Guo, J. Efficient Face Super-Resolution via Wavelet-based Feature Enhancement Network. In Proceedings of the ACM Multimedia (MM 2024), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 4515–4523. [Google Scholar] [CrossRef]
  40. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  41. Lv, C.; Pan, W. LUOD-YOLO: A lightweight underwater object detection model based on dynamic feature fusion, dual path rearrangement and cross-scale integration. J. Real-Time Image Process. 2025, 22, 204. [Google Scholar] [CrossRef]
  42. Wang, T.; Wang, H.; Wang, W.; Zhang, K.; Ye, B. An Improved YOLOv8n Based Marine Debris Recognition Algorithm. In Proceedings of the 5th International Conference on Computer Graphics, Image and Virtualization (ICCGIV), Johor Bahru, Malaysia, 6–8 June 2025; pp. 1–4. [Google Scholar] [CrossRef]
  43. Zhang, H.; Tian, C.; Zhang, A.; Liu, Y.; Gao, G.; Zhuang, Z.; Yin, T.; Zhang, N. A Bridge Defect Detection Algorithm Based on UGMB Multi-Scale Feature Extraction and Fusion. Symmetry 2025, 17, 1025. [Google Scholar] [CrossRef]
Figure 1. Typical underwater degradation and visual distractors.
Figure 2. Monotonous benthic textures (sand, rubble, and algae patches) that can trigger false positives in coral detection. The blue rectangular frame is a survey quadrat used during data acquisition to define the sampling area and provide a scale reference; it is not a target object.
Figure 3. Architecture of the Frequency-domain Feature Fusion Module (F3M).
Figure 4. Step 1 of F3M: frequency separation.
Figure 5. Visualization of the high-frequency response produced by F3M: left, the original underwater image; right, the normalized high-frequency component overlaid on the input image, highlighting texture-rich coral boundaries and fine structural details.
Figure 6. Qualitative visualization of F3M modulation at the stem stage: (a) original RGB image; (b) input feature response (channel mean) from the first convolution layer; (c) output feature response after F3MWithSA processing; (d,e) edge-weighted difference maps overlaid on the grayscale and color images, respectively. The red regions indicate pixels with significant feature modulation, obtained by computing the absolute difference between (b) and (c) and retaining the top-15% responses on Canny edges.
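For readers who wish to reproduce the edge-weighted difference maps in Figure 6d,e, the following sketch implements the procedure stated in the caption (absolute difference of the channel-mean responses, restricted to Canny edges, with a top-15% threshold). The function name, Canny thresholds, and blending weights are illustrative assumptions, not the authors' plotting code.

```python
# Hedged sketch of the edge-weighted difference-map visualization described in Figure 6.
import cv2
import numpy as np


def edge_weighted_difference(feat_in: np.ndarray, feat_out: np.ndarray,
                             image_bgr: np.ndarray, top_percent: float = 15.0) -> np.ndarray:
    """Overlay the strongest feature changes (restricted to Canny edges) in red."""
    # feat_in / feat_out are assumed to be 2-D channel-mean response maps.
    diff = np.abs(feat_out - feat_in).astype(np.float32)
    diff = cv2.resize(diff, (image_bgr.shape[1], image_bgr.shape[0]))

    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200) > 0          # keep only edge pixels

    edge_vals = diff[edges]
    if edge_vals.size == 0:
        return image_bgr
    thresh = np.percentile(edge_vals, 100.0 - top_percent)
    mask = edges & (diff >= thresh)                # top-15% responses on edges

    overlay = image_bgr.copy()
    overlay[mask] = (0, 0, 255)                    # mark modulated pixels in red (BGR)
    return cv2.addWeighted(image_bgr, 0.6, overlay, 0.4, 0.0)
```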
Figure 7. The overall architecture of the proposed YOLO11-F3M.
Figure 8. The integration of F3M into the RT-DETR backbone (the Transformer encoder, decoder, and prediction heads are standard RT-DETR components and are omitted for brevity).
Figure 9. Representative examples of the six soft-coral categories in the SCoralDet dataset. (a) Euphyllia ancora; (b) Favosites sp.; (c) Platygyra sp.; (d) Sarcophyton sp.; (e) Sinularia sp.; (f) Waving Hand coral (Xenia sp.).
Figure 10. Representative samples from the TrashCan-Instance dataset. (a) animal; (b) plant; (c) trash; (d) ROV.
Table 1. Generalization on the SCoralDet soft-coral dataset (best epoch selected by mAP50–95; bold values indicate the best performance in each column).
Model | Precision | Recall | mAP50 | mAP50–95 | Parameters (M) | GFLOPs
YOLOv8n | 0.787 | 0.696 | 0.760 | 0.499 | 3.01 | 8.1
LUOD-YOLO (based on YOLOv8) [41] | 0.849 | 0.728 | 0.779 | 0.518 | 1.63 | 5.9
S2-YOLO (based on YOLOv8) [42] | 0.869 | 0.679 | 0.764 | 0.516 | 2.74 | 6.9
SCoralDet (based on YOLOv10) [25] | 0.779 | 0.650 | 0.724 | 0.483 | 3.38 | 8.8
YOLO11n-WFU | 0.761 | 0.661 | 0.761 | 0.475 | 3.60 | 8.1
YOLO11n | 0.763 | 0.686 | 0.762 | 0.513 | 2.58 | 6.3
YOLO12n | 0.859 | 0.634 | 0.738 | 0.513 | 2.56 | 6.3
RT-DETR-n | 0.751 | 0.691 | 0.736 | 0.514 | 9.14 | 19.8
RT-DETR-n-F3M | 0.787 | 0.701 | 0.769 | 0.532 | 9.22 | 21.3
YOLO11n-F3M | 0.861 | 0.708 | 0.797 | 0.539 | 2.61 | 6.5
Table 2. Ablation of F3M placement and attention on the coral dataset (best epoch selected by mAP50–95; bold values indicate the best performance in each column).
Model | Precision | Recall | mAP50 | mAP50–95 | GFLOPs | Stem F3MWithSA | Deep F3M (SPPF-1)
YOLO11n (Baseline) | 0.763 | 0.686 | 0.762 | 0.513 | 6.3 | No | No
YOLO11n-onlyF3M | 0.843 | 0.719 | 0.776 | 0.517 | 6.3 | No | Yes
YOLO11n-onlyF3MWithSA | 0.850 | 0.678 | 0.769 | 0.522 | 6.5 | Yes | No
YOLO11n-F3M (Dual-site) | 0.861 | 0.708 | 0.797 | 0.539 | 6.5 | Yes | Yes
Table 3. Per-class performance comparison (mAP50–95) on the SCoralDet dataset. The best results for each category are highlighted in bold.
Category | YOLO11 (Baseline) | onlyF3M (Deep-Only) | onlyF3MWithSA (Stem-Only) | F3M (Dual-Site)
Euphyllia ancora (Tentacles/High-freq) | 0.552 | 0.559 | 0.548 | 0.601
Favosites sp. (Dense pattern) | 0.543 | 0.537 | 0.552 | 0.559
Platygyra sp. (Ridge texture) | 0.667 | 0.671 | 0.705 | 0.685
Sarcophyton sp. (Fine polyps) | 0.516 | 0.538 | 0.547 | 0.587
Sinularia sp. (Low contrast) | 0.382 | 0.367 | 0.370 | 0.397
Waving Hand (Dynamic/Blur) | 0.419 | 0.428 | 0.413 | 0.407
Table 4. Effect of inserting F3M into different detector architectures on SCoralDet (bold values indicate the best performance in each column).
Model | Precision | Recall | mAP50 | mAP50–95 | Parameters (M) | GFLOPs
YOLOv8n | 0.787 | 0.696 | 0.760 | 0.499 | 3.01 | 8.1
YOLOv8n-F3M | 0.852 | 0.634 | 0.744 | 0.502 | 3.04 | 8.3
RT-DETR-n | 0.751 | 0.691 | 0.736 | 0.514 | 9.14 | 19.8
RT-DETR-n-F3M | 0.787 | 0.701 | 0.769 | 0.532 | 9.22 | 21.3
YOLO11n | 0.763 | 0.686 | 0.762 | 0.513 | 2.58 | 6.3
YOLO11n-F3M | 0.861 | 0.708 | 0.797 | 0.539 | 2.61 | 6.5
Table 5. Cross-dataset generalization results on the TrashCan-Instance dataset. The best results in each column are highlighted in bold.
Model | Precision | Recall | mAP50 | mAP50–95 | Parameters (M) | GFLOPs
YOLOv8n | 0.885 | 0.870 | 0.908 | 0.679 | 3.01 | 8.1
LUOD-YOLO (based on YOLOv8) [41] | 0.885 | 0.828 | 0.886 | 0.663 | 1.63 | 5.9
S2-YOLO (based on YOLOv8) [42] | 0.873 | 0.870 | 0.904 | 0.689 | 2.74 | 6.9
SCoralDet (based on YOLOv10) [25] | 0.868 | 0.823 | 0.879 | 0.676 | 3.38 | 8.8
YOLO11n-WFU | 0.849 | 0.873 | 0.902 | 0.676 | 3.60 | 8.1
YOLO11n | 0.885 | 0.851 | 0.892 | 0.673 | 2.58 | 6.3
YOLO12n | 0.893 | 0.827 | 0.885 | 0.671 | 2.56 | 6.3
RT-DETR-n | 0.865 | 0.822 | 0.878 | 0.666 | 9.14 | 19.8
YOLO11n-F3M | 0.895 | 0.844 | 0.908 | 0.679 | 2.61 | 6.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
