Article

Joint Inference of Image Enhancement and Object Detection via Cross-Domain Fusion Transformer

School of Airspace Science and Engineering, Shandong University, Weihai 264209, China
* Author to whom correspondence should be addressed.
Computers 2026, 15(1), 43; https://doi.org/10.3390/computers15010043
Submission received: 22 December 2025 / Revised: 4 January 2026 / Accepted: 8 January 2026 / Published: 10 January 2026
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))

Abstract

Underwater vision is fundamental to ocean exploration, yet it is frequently impaired by degradations such as low contrast, color distortion, and blur, which present significant challenges for underwater object detection (UOD). Most existing methods employ underwater image enhancement as a preprocessing step to improve visual quality prior to detection. However, image enhancement and object detection are optimized for fundamentally different objectives, and directly cascading them leads to feature distribution mismatch. Moreover, prevailing dual-branch architectures process enhancement and detection independently, overlooking multi-scale interactions across domains and thus constraining the learning of cross-domain feature representations. To overcome these limitations, we propose an underwater cross-domain fusion Transformer detector (UCF-DETR). UCF-DETR jointly leverages image enhancement and object detection by exploiting complementary information from the enhanced and original image domains. Specifically, an underwater image enhancement module is employed to improve visibility. We then design a cross-domain feature pyramid to integrate fine-grained structural details from the enhanced domain with semantic representations from the original domain. A cross-domain query interaction mechanism is introduced to model inter-domain query relationships, leading to accurate object localization and boundary delineation. Extensive experiments on the challenging DUO and UDD benchmarks demonstrate that UCF-DETR consistently outperforms state-of-the-art methods for UOD.

1. Introduction

The ocean plays a vital role in maintaining ecological balance, preserving biodiversity, and supporting economic development [1,2]. Vision-enabled exploration platforms, such as autonomous underwater vehicles (AUVs) and remotely operated vehicles (ROVs), have become indispensable for marine environmental monitoring and resource exploration [3]. However, complex underwater imaging conditions pose significant challenges to visual object detection. As illustrated in Figure 1, compared with terrestrial scenes [4,5], underwater environments are severely affected by light absorption, scattering, and non-uniform illumination [6]. Consequently, captured images often suffer from severe degradations, including low contrast, blurred details, color distortion, and haze-like turbidity [7,8,9].
To alleviate these issues, underwater image enhancement (UIE) techniques have been widely adopted to improve visual quality [10], and enhanced images generally contribute to improved detection performance [11,12,13]. Nevertheless, a fundamental mismatch exists between the optimization objectives of image enhancement and the requirements of object detection. Specifically, enhancement methods primarily target pixel-level visual fidelity, whereas object detection focuses on instance-level localization and classification accuracy [14,15]. Directly relying on enhanced images for detection may lead to unstable performance, or even degrade detection accuracy due to artifacts, amplified noise, or domain shifts introduced during the enhancement process [16,17]. In particular, image enhancement methods often irreversibly alter the statistical distribution of the original images [18], which may further compromise the consistency of the underlying semantic information.
Accordingly, to address the aforementioned challenges, we investigate a cross-domain joint learning framework that encourages the model to effectively exploit informative representations from the enhanced image domain while optimizing cross-domain feature fusion between the original and enhanced domains in a detection-oriented manner. Inspired by collaborative learning paradigms [12,19,20], we explore how to leverage the inherent complementarity between the enhanced and original image domains through bidirectional interactions, enabling the integration of mutually beneficial features. In particular, the enhanced domain typically provides clearer edges and richer texture details, which are advantageous for object localization under low-contrast conditions [12]. In contrast, the original domain preserves the intrinsic data distribution, thereby retaining high-fidelity semantic information.
Building upon this framework, we propose UCF-DETR, an Underwater Cross-domain Fusion Transformer detector, which explicitly exploits complementary cross-domain representations to improve underwater object detection performance. We adopt the water-MSR method from GCC-Net [12] to enhance underwater images degraded by low contrast, color cast, and insufficient illumination. To fully leverage information from both domains, we further design a Cross-domain Feature Pyramid (CFP) that dynamically integrates fine-grained structural details from the enhanced domain with semantic features from the original domain via bidirectional, multi-scale cross-domain interactions. Moreover, we introduce a Cross-domain Query Interaction (CQI) mechanism to explicitly model relationships between detection queries across the two domains, thereby jointly improving detection accuracy in complex underwater environments.
The main contributions are summarized as follows:
  • We propose UCF-DETR, a cross-domain fusion Transformer-based detector for underwater object detection. By explicitly integrating complementary representations from the enhanced and original image domains, the proposed method effectively mitigates common underwater degradations, such as low visibility and poor contrast.
  • We design a Cross-Domain Feature Pyramid (CFP) that enables bidirectional, multi-scale feature interactions between the enhanced and original domains. By leveraging complementary information extracted from the enhanced domain to strengthen feature representations in the original domain, CFP substantially improves the model’s representational capacity under severely degraded underwater conditions.
  • We introduce a cross-domain query interaction mechanism to explicitly model the relationships between query embeddings from the enhanced and original domains, thereby facilitating cross-domain instance-level information exchange.
  • We validate the proposed approach on two challenging underwater benchmarks, DUO and UDD. Extensive experimental results demonstrate that UCF-DETR achieves superior performance on both datasets. The code is available at https://github.com/bibabu555/UCF-DETR.git.

2. Related Work

2.1. Underwater Image Enhancement

Physics-based underwater image enhancement methods primarily improve visual quality by explicitly modeling the physical characteristics of underwater imaging environments. The MLLE [21] enhances underwater images by reducing color distortion through the integration of maximum attenuation maps, adaptive integral-image contrast enhancement, and CIELAB color balancing, resulting in improved visual fidelity. Song et al. [22] proposed a voxelized light-field modeling approach based on photometric consistency under appearance variations and the Lambertian surface assumption, enabling efficient recovery of scene reflectance in deep-sea environments. Berman et al. [23] leveraged multispectral water-column profile information to estimate attenuation rates in the blue-red and blue-green channels, and further selected optimal attenuation parameters via color distribution optimization. On the other hand, deep learning-based approaches learn enhancement mappings directly from large-scale data using neural networks. Peng et al. [24] designed a U-shaped Transformer architecture that incorporates both channel-wise and spatial attention mechanisms, together with a fusion perceptual loss, achieving substantial improvements in underwater image enhancement. Wang et al. [25] proposed a differentiable enhancement framework guided by underwater color deviation and softly assigned chromaticity maps, which integrates visual-textual priors to enable zero-shot generalization, and further employs wavelet decomposition to separately guide low-frequency color correction and high-frequency detail restoration. UIEDP [26] formulates underwater image enhancement as a posterior sampling process conditioned on degraded inputs, combining pretrained diffusion models with traditional enhancement techniques to generate more visually natural results. 
LiteEnhanceNet [27] introduces a lightweight enhancement network based on depthwise separable convolutions and single-aggregation connections, achieving a favorable trade-off between enhancement performance and computational efficiency.
Although UIE has the potential to provide informative cues for high-level downstream tasks, such as object detection and segmentation [12,19,20], jointly modeling UIE and UOD remains a challenging problem. This challenge primarily arises from the pronounced domain discrepancy between UIE-oriented datasets and UOD benchmarks [12], as well as the general lack of corresponding clear-reference images in UOD datasets. Consequently, the joint modeling of these two tasks is hindered by inherent cross-domain objective inconsistencies [14,15]. Moreover, treating UIE as an independent preprocessing step may introduce irreversible information loss and amplified noise, which can even degrade UOD performance [15,17]. To overcome these limitations, we propose a unified framework that integrates UIE and UOD in a cross-domain joint learning paradigm. By mining complementary representations from enhanced and original underwater images and performing effective cross-domain feature fusion, the proposed approach substantially improves underwater object detection performance.

2.2. Underwater Object Detection

In recent years, underwater object detection has attracted increasing attention due to its broad application potential in underwater engineering, autonomous underwater vehicles (AUVs), and marine exploration. Most state-of-the-art methods are based on deep learning and can be broadly categorized into two groups. The first category builds upon enhanced general-purpose object detection frameworks, aiming to improve detection performance in degraded underwater environments through architectural optimization and advanced feature extraction strategies. BSR5-DETR [28] adopts a partial differential equation-inspired formulation and introduces a lightweight backbone network, termed BSR5, in which residual connections are designed to suppress feature drift. This design significantly reduces model complexity while improving detection accuracy and robustness in underwater scenarios. Ji et al. [29] proposed a feature enhancement and differential pyramid network that improves multi-scale underwater object localization and classification while reducing information redundancy. Zhang et al. [30] introduced a lightweight model, PRCII-Net, which enhances single-scale feature interactions and strengthens cross-scale information fusion, thereby improving small-object detection performance in complex underwater environments.
The second category leverages image enhancement techniques to alleviate underwater visual degradations by improving image quality, thereby providing clearer boundaries and structural cues for object detection. Yeh et al. [31] designed a color conversion module to mitigate underwater color absorption by mapping images to the grayscale domain. IF-USOD [32] improves detection performance through multimodal interactions between RGB and depth information and introduces a cross-scale learning strategy for multi-granularity saliency prediction. Liu et al. [7] jointly trained an object detector with a UnitModule to generate denoised underwater images, resulting in improved detection accuracy. DJL-Net [19] employs a dual-branch multi-task learning framework to fuse complementary features from enhancement and detection tasks, effectively compensating for feature degradation caused by underwater imaging conditions. Dai et al. [12] proposed a gated cross-domain collaboration network that performs dynamic feature fusion via gated mechanisms and cross-domain interactions between the image restoration and original image domains, substantially enhancing representation capability.
Enhanced images can effectively alleviate color distortion and contrast degradation in underwater scenes [33], providing clearer edge and texture cues that facilitate more accurate object localization under low-contrast conditions. In contrast, original underwater images preserve the intrinsic statistical distribution of the scene and retain high-fidelity semantic information. Motivated by this observation, we propose a cross-domain joint learning architecture that jointly exploits image enhancement and object detection, fostering effective interaction and complementary feature fusion between domains, thereby improving underwater object detection performance.
Despite recent progress in joint underwater image enhancement and object detection frameworks, several fundamental limitations continue to constrain their performance. First, many existing methods exhibit an enhancement-driven bias, in which the emphasis on pixel-level visual fidelity can inadvertently suppress or distort task-relevant semantic representations that are critical for accurate detection. Second, current fusion strategies are largely restricted to feature-level integration, overlooking instance-level cross-domain interactions that are crucial for robust object localization. To address these limitations, we propose a Cross-domain Feature Pyramid (CFP) that strengthens feature representations through effective cross-domain fusion, together with a Cross-domain Query Interaction (CQI) mechanism that enhances instance-level localization in complex underwater environments by explicitly modeling cross-domain interactions.

3. Method

3.1. Overview

The overall architecture of UCF-DETR is illustrated in Figure 2. Given an input underwater image, we first apply an underwater image enhancement (UIE) module to generate its enhanced counterpart, which defines an auxiliary enhanced domain. Both the original image and the enhanced image are subsequently fed into a shared-weight backbone network to extract multi-scale hierarchical features. Let $F_{\mathrm{raw}} \in \mathbb{R}^{C \times H_l \times W_l}$ and $F_{\mathrm{enh}} \in \mathbb{R}^{C \times H_l \times W_l}$ denote the feature maps produced at the $l$-th backbone layer for the raw and enhanced domains, respectively, where $C$ represents the channel dimension, and $H_l$ and $W_l$ denote the spatial resolution at level $l$. These multi-scale features are flattened and linearly projected into a unified embedding space before being processed by the Transformer encoder.
Through self-attention operations, high-level feature representations $F \in \mathbb{R}^{C \times H_h \times W_h}$ are obtained. To effectively exploit complementary information between the raw and enhanced domains, we introduce a Cross-domain Feature Pyramid (CFP) within the encoder. Specifically, CFP constructs a dual-stream feature pyramid and integrates a series of cascaded bidirectional cross-domain attention layers, enabling efficient exchange of semantic and structural cues across domains at multiple spatial scales. The outputs from all pyramid levels are concatenated to form a unified encoder memory representation $M \in \mathbb{R}^{\sum_l (H_l \times W_l) \times D}$, where $D$ denotes the embedding dimension.
At the decoder, we adopt a query selection strategy to initialize object queries. Specifically, a scoring function $\Psi(\cdot)$ is applied to the memory $M$ to estimate the classification confidence at each spatial location. The top-$N_q$ locations with the highest scores are selected as the initial object queries $Q = \{q_i\}_{i=1}^{N_q}$, where $Q \subset M$. These queries are further refined through the proposed Cross-domain Query Interaction (CQI) module, which explicitly captures inter-domain dependencies via a gated cross-attention mechanism. Through CQI, each query selectively aggregates contextual cues from both the raw and enhanced domains, leading to more discriminative instance-level representations and improved localization priors. Finally, the refined queries are fed into the detection head to produce the final prediction set $Y = \{(\hat{p}_i, \hat{b}_i)\}_{i=1}^{N_q}$, where $\hat{p}_i$ denotes the predicted class logits and $\hat{b}_i \in [0, 1]^4$ represents the normalized bounding box coordinates. This design facilitates accurate and robust object detection in challenging underwater environments.
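As an illustration, the top-$N_q$ query selection described above can be sketched in NumPy. The linear `score_weights` head standing in for the scoring function $\Psi(\cdot)$, the memory size, and the default `num_queries` are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def select_queries(memory, score_weights, num_queries=300):
    """Pick the top-scoring memory locations as initial object queries.

    memory:        (L, D) flattened encoder memory (L = sum_l H_l * W_l)
    score_weights: (D, num_classes) linear scoring head (stand-in for Psi)
    """
    logits = memory @ score_weights              # (L, num_classes)
    scores = logits.max(axis=1)                  # best-class confidence per location
    top_idx = np.argsort(-scores)[:num_queries]  # indices of the top-N_q locations
    return memory[top_idx], top_idx
```

The selected rows of the memory then serve as the initial object queries refined by the decoder.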

3.2. Retinex-Based Model for Underwater Image Enhancement

As this work focuses on object detection frameworks that jointly exploit the image enhancement domain, we adopt the real-time physics-based UIE model from GCC-Net [12] and integrate it into the UOD framework. Based on Retinex theory, the underwater image formation process is decomposed into two independent components: incident illumination (L) and object reflectance (R). The corresponding image formation model is expressed as
$I(x, y) = R(x, y) \odot L(x, y)$
where $\odot$ denotes element-wise multiplication, $I$ represents the observed image, and $(x, y)$ indicates the pixel location. By applying a logarithmic transformation, the illumination and reflectance components can be separated as
$\log I = \log R + \log L, \qquad R = \exp(\log I - \log L)$
Building upon this formulation, Multi-Scale Retinex (MSR) employs Gaussian surround functions at multiple scales to estimate the illumination component, thereby enabling dynamic range compression, color constancy, and luminance restoration. The MSR formulation is given by
$R_n^i(x, y) = \log I^i(x, y) - \log\big(G(x, y, \sigma_n) * I^i(x, y)\big)$
where $G(\cdot)$ denotes a Gaussian kernel with scale parameter $\sigma_n$, and $*$ represents the convolution operation.
To suppress underwater color cast and achieve color pre-correction, channel-wise normalization is applied to the input image $I$. Let $I^i_{\mathrm{mean}}$ and $I^i_{\mathrm{var}}$ denote the mean and standard deviation of the $i$-th channel, respectively. The normalized image is computed as
$I^i = \dfrac{I^i - (I^i_{\mathrm{mean}} - I^i_{\mathrm{var}})}{2\, I^i_{\mathrm{var}}} \times 255$
To balance detail enhancement and noise suppression, a three-scale weighted fusion strategy is adopted with $\sigma_n \in \{30, 150, 300\}$. The small-scale kernel ($\sigma = 30$) preserves fine-grained texture details, while the larger-scale kernels ($\sigma = 150, 300$) effectively suppress noise and non-uniform illumination. The final enhanced image $I_{\mathrm{enh}}$ is obtained as
$I^i_{\mathrm{enh}} = \sum_{n=1}^{3} \omega_n R_n^i, \qquad \sum_{n=1}^{3} \omega_n = 1$
To reduce computational overhead, Gaussian filtering is replaced with fast filtering based on a recursive image pyramid. During the recursive process, the image resolution is halved at each pyramid level, and the corresponding Gaussian kernel scale $\sigma_n$ is proportionally reduced until a predefined lower bound $\sigma_{\min} = 10$ is reached. This strategy significantly lowers computational complexity while preserving filtering quality, enabling real-time underwater image restoration and producing enhanced images for subsequent detection stages.
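A minimal NumPy sketch of the multi-scale Retinex step described above, using direct separable Gaussian filtering rather than the recursive-pyramid fast filter. The kernel truncation radius, the equal default weights, and the epsilon constants are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1D Gaussian kernel truncated at 3*sigma, normalized to sum to 1
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(channel, sigma):
    # Separable Gaussian filtering: two 1D convolutions with reflect padding
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(channel, pad, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def msr_enhance(img, sigmas=(30, 150, 300), weights=None):
    """Multi-Scale Retinex per channel: R_n = log I - log(G_sigma_n * I)."""
    img = img.astype(np.float64) + 1.0            # avoid log(0)
    weights = weights or [1.0 / len(sigmas)] * len(sigmas)
    out = np.zeros_like(img)
    for c in range(img.shape[2]):
        ch = img[..., c]
        for w, s in zip(weights, sigmas):
            out[..., c] += w * (np.log(ch) - np.log(gaussian_blur(ch, s) + 1e-6))
    # stretch the log-domain result back to [0, 255]
    out = (out - out.min()) / (out.max() - out.min() + 1e-6) * 255.0
    return out.astype(np.uint8)
```

In practice the paper's pipeline also applies the channel-wise color pre-correction before MSR and replaces the Gaussian filter with the recursive pyramid for real-time speed.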

3.3. Cross-Domain Feature Pyramid

Underwater imaging often suffers from severe degradations, such as low contrast and insufficient illumination, which pose substantial challenges for reliable object detection. Underwater image enhancement techniques can partially restore visual details and improve perceptual quality; enhanced images typically exhibit clearer appearances and sharper boundary cues, thereby alleviating localization ambiguity under low-contrast conditions. In contrast, original underwater images preserve the intrinsic statistical distribution of the scene and retain high-fidelity semantic information. These two image domains therefore exhibit strong and complementary characteristics.
To fully exploit the synergistic information across the original and enhanced domains, we propose a Cross-Domain Feature Pyramid (CFP). CFP constructs multi-scale hierarchical representations by explicitly modeling cross-domain semantic dependencies through bidirectional attention mechanisms. Unlike conventional single-path feature fusion strategies, CFP adopts a dual-stream architecture that enables deep cross-domain association and complementary feature mining between the original and enhanced domains. The overall architecture of CFP is illustrated in Figure 3.
Cross-Domain Feature Pyramid Construction. Given the hierarchical features extracted by the shared backbone from the original and enhanced images, denoted as $F_{\mathrm{raw}} = \{F_{\mathrm{raw}}^l\}_{l=1}^{L}$ and $F_{\mathrm{enh}} = \{F_{\mathrm{enh}}^l\}_{l=1}^{L}$, where $l$ indexes the feature levels, we first employ a shared encoder with two-dimensional sinusoidal positional embeddings to project both feature streams into a unified latent space. Inspired by feature pyramid network (FPN) architectures [34], a top-down pathway is constructed to propagate high-level semantic information to lower-resolution features. However, we observe that vanilla FPNs relying on element-wise addition are insufficient to model the complex and non-linear interactions between the raw and enhanced domains.
At feature level $l$, the intermediate representation $\tilde{F}^l \in \mathbb{R}^{C \times H \times W}$ is computed as:
$\tilde{F}^l = C_{\mathrm{lat}}(F^l) + U_{2\times}(\tilde{F}^{l+1})$
where $C_{\mathrm{lat}}(\cdot)$ denotes lateral convolution and $U_{2\times}(\cdot)$ represents nearest-neighbor upsampling. This operation is independently applied to both the original and enhanced streams, producing their respective intermediate feature representations.
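The top-down merge at a single pyramid level can be sketched as follows; the per-channel linear map standing in for the lateral convolution, and the channel count, are illustrative simplifications:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map
    return x.repeat(2, axis=1).repeat(2, axis=2)

def topdown_merge(f_l, f_l1, lateral):
    """One top-down FPN step: lateral projection of level l plus
    upsampled intermediate feature from level l+1.

    f_l:     (C, 2H, 2W) backbone feature at level l
    f_l1:    (C, H, W)   intermediate feature at level l+1
    lateral: (C_out, C)  stand-in for the 1x1 lateral convolution
    """
    lat = np.einsum("oc,chw->ohw", lateral, f_l)
    return lat + upsample2x(f_l1)
```

Applying this step from the coarsest level downward yields the intermediate representations that feed the bidirectional cross-domain interaction.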
Bidirectional Cross-Domain Interaction. To effectively associate features across domains and exploit their complementarity, we design a bidirectional interaction module based on deformable attention [35]. The design is driven by two key considerations: (i) enabling active and adaptive information exchange across domains, and (ii) preserving computational efficiency. Since the computational cost of global attention scales quadratically with the spatial resolution ( H × W ), we adopt a sparse sampling strategy guided by learnable reference points.
At feature level $l$, the feature maps $F_{\mathrm{raw}}^l$ and $F_{\mathrm{enh}}^l$ are flattened and projected into query, key, and value embeddings. Specifically, learnable projection matrices $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$ are used to obtain $\{Q_{\mathrm{raw}}^l, K_{\mathrm{raw}}^l, V_{\mathrm{raw}}^l\}$ and $\{Q_{\mathrm{enh}}^l, K_{\mathrm{enh}}^l, V_{\mathrm{enh}}^l\}$.
The bidirectional deformable cross attention is formulated as:
$\mathrm{DCA}(Q_{\mathrm{raw}}^l, K_{\mathrm{enh}}^l, V_{\mathrm{enh}}^l) = \sum_{m=1}^{M} W_m \sum_{k=1}^{K} A_{mk} \cdot W'_m\, V_{\mathrm{enh}}^l(p_q + \Delta p_{mk})$
$\mathrm{DCA}(Q_{\mathrm{enh}}^l, K_{\mathrm{raw}}^l, V_{\mathrm{raw}}^l) = \sum_{m=1}^{M} W_m \sum_{k=1}^{K} A_{mk} \cdot W'_m\, V_{\mathrm{raw}}^l(p_q + \Delta p_{mk})$
where $m \in [1, M]$ indexes the attention heads, $k \in [1, K]$ denotes the sampled points per head, $W_m$ and $W'_m$ are learnable linear projection matrices, $p_q$ is the 2D reference point associated with the query, $\Delta p_{mk}$ denotes the learnable sampling offset, and $A_{mk}$ represents the attention weight. By learning reference points and their associated offsets, deformable cross-domain attention performs computation only at a sparse set of informative sampling locations, thereby avoiding redundant operations over the entire spatial domain. This design effectively suppresses background interference while emphasizing salient object cues, which is crucial for robust classification and precise localization in challenging underwater environments.
This bidirectional attention mechanism enables each domain to selectively attend to the most informative sparse regions in the counterpart domain, effectively alleviating spatial misalignment. Through such interaction, fine-grained structural details from the enhanced domain refine original-domain features, while high-level semantic cues from the original domain guide feature optimization in the enhanced domain, achieving mutual reinforcement.
Feature Fusion and Hierarchical Aggregation. The bidirectionally interacted features are fused using residual connections followed by layer normalization:
$\hat{F}_{\mathrm{raw}}^l = \mathrm{LN}\big(F_{\mathrm{raw}}^l + \mathrm{Dropout}(\mathrm{DCA}(F_{\mathrm{raw}}^l, F_{\mathrm{enh}}^l))\big), \qquad \hat{F}_{\mathrm{enh}}^l = \mathrm{LN}\big(F_{\mathrm{enh}}^l + \mathrm{Dropout}(\mathrm{DCA}(F_{\mathrm{enh}}^l, F_{\mathrm{raw}}^l))\big)$
where $\mathrm{LN}(\cdot)$ denotes layer normalization.
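A simplified sketch of the bidirectional interaction and residual fusion, with plain dense single-head cross-attention standing in for the sparse deformable attention (the single-head design, the omission of dropout, and all shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attend(queries, context):
    # Dense single-head cross-attention: each query row attends over
    # all context rows (a stand-in for sparse deformable sampling)
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d))
    return attn @ context

def fuse_bidirectional(f_raw, f_enh):
    """Residual cross-domain fusion followed by layer normalization."""
    f_raw_hat = layer_norm(f_raw + cross_attend(f_raw, f_enh))
    f_enh_hat = layer_norm(f_enh + cross_attend(f_enh, f_raw))
    return f_raw_hat, f_enh_hat
```

Each domain's features thus aggregate context from the other domain before the residual update, mirroring the mutual-reinforcement behaviour described above.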
To further enhance localization capability, we introduce a bottom-up aggregation pathway. In the original stream, the final pyramid feature at level l is constructed by fusing the downsampled feature from the lower level with the cross-attention enhanced representation:
$F^l = C_{\mathrm{pan}}\big(\big[\,D_{2\times}(F^{l-1}),\ \hat{F}^l\,\big]\big)$
where $[\cdot, \cdot]$ denotes channel-wise concatenation, $D_{2\times}$ represents a downsampling convolution, and $C_{\mathrm{pan}}$ denotes the convolutional module in the path aggregation network.
Through hierarchical bidirectional interaction and aggregation, CFP aligns and fuses cross-domain features to produce representations that simultaneously encode high-level semantic information and fine-grained structural details. As a result, the original domain benefits from the clear boundary cues provided by the enhanced domain, while the enhanced domain receives semantic guidance from the original domain, collectively leading to improved object detection performance in complex underwater environments.

3.4. Cross-Domain Query Interaction

In Transformer-based object detection frameworks, object queries encode instance-level semantic information and localization priors. However, severe underwater imaging degradations, such as low contrast and insufficient illumination, often attenuate the localization cues embedded in queries derived from raw images. In contrast, while enhanced underwater images can provide clearer boundary and texture cues, the enhancement process may introduce artifacts or domain shifts, resulting in suboptimal feature alignment and rendering direct cross-domain feature fusion ineffective. To address these issues, we propose a Cross-Domain Query Interaction (CQI) module, as illustrated in Figure 4, which collaboratively leverages queries from the raw and enhanced domains to effectively exploit complementary instance-level representations.
Cross-Domain Query Interaction. We denote the query features as $Q \in \mathbb{R}^{N_q \times D}$, where $N_q$ is the number of object queries and $D$ denotes the feature dimension. Let $Q_{\mathrm{raw}}$ and $Q_{\mathrm{enh}}$ represent queries originating from the original and enhanced domains, respectively. To alleviate feature distribution discrepancies between the two domains and to suppress noise induced by direct interaction, we introduce a lightweight domain-adaptive adapter implemented as a multi-layer perceptron (MLP) that performs a nonlinear transformation on the input queries:
$\tilde{Q}_{\mathrm{raw}} = \mathrm{LN}\big(\mathrm{MLP}(Q_{\mathrm{raw}})\big), \qquad \tilde{Q}_{\mathrm{enh}} = \mathrm{LN}\big(\mathrm{MLP}(Q_{\mathrm{enh}})\big)$
where $\mathrm{LN}(\cdot)$ denotes layer normalization.
To facilitate complementary information exchange across domains, we further introduce a cross-attention interaction mechanism, which allows queries from one domain to actively retrieve relevant contextual information from the other domain. The interacted query features are computed as
$H_{\mathrm{raw}} = \mathrm{Attn}\big(Q_{\mathrm{raw}}, \tilde{Q}_{\mathrm{enh}}, \tilde{Q}_{\mathrm{enh}}\big), \qquad H_{\mathrm{enh}} = \mathrm{Attn}\big(Q_{\mathrm{enh}}, \tilde{Q}_{\mathrm{raw}}, \tilde{Q}_{\mathrm{raw}}\big)$
Adaptive Gated Fusion. Since enhanced-domain features may contain artifacts or domain-induced noise, not all cross-domain information is beneficial for object detection. Blind fusion can therefore lead to semantic drift in the original queries. To address this issue, we design an adaptive gating network that dynamically regulates the contribution of cross-domain features. The gating network predicts element-wise confidence weights based on the concatenation of the current query and its corresponding interacted feature, using two fully connected layers followed by nonlinear activations:
$G_{\mathrm{raw}} = \sigma\big(W_2\, \delta(W_1 [Q_{\mathrm{raw}}, H_{\mathrm{raw}}])\big), \qquad G_{\mathrm{enh}} = \sigma\big(W_2\, \delta(W_1 [Q_{\mathrm{enh}}, H_{\mathrm{enh}}])\big)$
where $[\cdot, \cdot]$ denotes channel-wise concatenation, $W_1$ and $W_2$ are learnable linear transformations, $\delta(\cdot)$ denotes the ReLU activation function, and $\sigma(\cdot)$ is the Sigmoid function, producing confidence weights in the range $[0, 1]$.
The updated object queries are obtained through gated residual connections followed by layer normalization:
$Q'_{\mathrm{raw}} = \mathrm{LN}\big(Q_{\mathrm{raw}} + G_{\mathrm{raw}} \odot H_{\mathrm{raw}}\big), \qquad Q'_{\mathrm{enh}} = \mathrm{LN}\big(Q_{\mathrm{enh}} + G_{\mathrm{enh}} \odot H_{\mathrm{enh}}\big)$
where $\odot$ denotes element-wise multiplication.
The CQI module is inserted before the feed-forward network (FFN) in each decoder layer, enabling progressive cross-domain query interaction as the decoding depth increases. Through multi-level interactions between enhanced-domain queries rich in structural details and original-domain queries that preserve high-fidelity semantic information, CQI effectively exploits complementary localization cues and significantly improves object localization accuracy in complex and severely degraded underwater environments.
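The gated residual update for one domain's queries might be sketched as follows; the plain weight matrices standing in for the two fully connected layers, the hidden width, and all shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_query_update(q, h, w1, w2, eps=1e-5):
    """Adaptive gated residual update for one domain's object queries.

    q:  (N_q, D) object queries of the current domain
    h:  (N_q, D) cross-attended features retrieved from the other domain
    w1: (2D, D_hidden), w2: (D_hidden, D) gating weights
    """
    # gate predicted from [query, interacted feature], values in (0, 1)
    hidden = np.maximum(np.concatenate([q, h], axis=-1) @ w1, 0.0)  # ReLU
    g = sigmoid(hidden @ w2)
    out = q + g * h                      # gated residual connection
    mu = out.mean(axis=-1, keepdims=True)
    var = out.var(axis=-1, keepdims=True)
    return (out - mu) / np.sqrt(var + eps)  # layer normalization
```

The gate lets each query suppress unreliable cross-domain cues element-wise rather than fusing them blindly.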

4. Experiment

4.1. Experimental Settings

4.1.1. Datasets

We evaluate the proposed UCF-DETR on two widely used underwater object detection benchmarks: DUO [36] and UDD [37].
The DUO dataset was released as part of the 2020 Underwater Object Detection Algorithm Competition and contains 7782 images with 75,514 annotated instances across four representative underwater species, namely echinus, holothurian, starfish, and scallop.
The UDD dataset is a real-world underwater object detection benchmark comprising 3194 in situ underwater images. Among them, 2560 images are used for training, 128 for validation, and 506 for testing. The dataset covers three object categories: holothurian, echinus, and scallop.

4.1.2. Experiment Setup

Training Details. Our method is implemented based on the RT-DETR framework [38]. For the main experiments, we employ the AdamW optimizer for parameter optimization, with a base learning rate of $1 \times 10^{-4}$. The backbone network is trained with a smaller learning rate of $1 \times 10^{-5}$. The weight decay is set to $1 \times 10^{-4}$, and the momentum parameters $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively.
During training, a dynamic data-loading strategy is adopted. Specifically, in the first 11 training epochs, both the data augmentation pipeline and the batch collation function are progressively adjusted to gradually improve the model’s adaptation to underwater scenarios. The data augmentation strategy consists solely of standard operations, including RandomHorizontalFlip, RandomZoomOut, and RandomIoUCrop. Training is conducted on a single NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA), and all remaining hyperparameters follow the default settings of RT-DETR.
Inference Details. Inference speed, reported in frames per second (FPS), is measured on a single RTX 4090 GPU with a batch size of 1.
Comparison Details. To ensure a rigorous and equitable comparison, all competing methods are evaluated following their official implementations and recommended configurations. For results directly cited from original papers, we maintain the reported backbones and training schedules. In cases where re-training was necessary, we adhered to the same experimental protocols and data partitions as our proposed method to eliminate potential biases arising from hyperparameter tuning or hardware variations.
Evaluation Metrics. All experimental results are reported following the COCO-style evaluation protocol. We adopt the standard Average Precision (AP) metrics, including $AP_{50}$ (IoU threshold = 0.50), $AP_{75}$ (IoU threshold = 0.75), and the overall $AP$, which is averaged over IoU thresholds ranging from 0.50 to 0.95 with a step size of 0.05. These metrics provide a comprehensive evaluation of detection performance under varying localization accuracy requirements.
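For concreteness, a minimal sketch of the box IoU computation underlying these thresholds (the corner-format boxes and toy values are illustrative):

```python
import numpy as np

def box_iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes, as used by the AP_50/AP_75 thresholds."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# COCO-style IoU thresholds: 0.50 to 0.95 in steps of 0.05;
# overall AP averages the per-threshold AP values
thresholds = np.linspace(0.50, 0.95, 10)
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class reaches that threshold.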

4.2. Experiment Results

We compare our method with several state-of-the-art approaches on the two benchmark datasets and systematically evaluate the performance of the UCF-DETR model.

4.2.1. Results on DUO

Under the DUO benchmark, we perform a comprehensive evaluation of the proposed UCF-DETR against a wide range of state-of-the-art (SOTA) methods, with quantitative results reported in Table 1. The results clearly indicate that UCF-DETR consistently outperforms existing approaches. UCF-DETR achieves a +2.2% improvement in Average Precision (AP) over the baseline RT-DETR, demonstrating the effectiveness of the proposed design.
When compared with generic object detection frameworks, UCF-DETR surpasses Relation DETR [39] by +0.7% AP and outperforms DINO [40] by +1.5% AP, highlighting its superior capability in handling underwater visual characteristics. Furthermore, relative to representative underwater object detection methods, UCF-DETR exceeds DJL-Net [19] by +0.8% AP, thereby establishing a new state-of-the-art on the DUO benchmark.
The performance gains can be attributed to the inherent challenges of underwater object detection, including severe image degradations such as low contrast, color distortion, and non-uniform illumination. To mitigate these adverse effects, UCF-DETR integrates an Underwater Image Enhancement (UIE) module, which substantially improves visual quality and enhances object visibility in degraded regions. In addition, the proposed cross-domain feature pyramid effectively leverages complementary information from both original and enhanced representations, leading to more discriminative and robust feature learning under challenging underwater conditions.
Beyond accuracy improvements, UCF-DETR achieves an inference speed of 18.9 FPS on an NVIDIA RTX 4090 GPU, indicating that the proposed method delivers superior detection performance and exhibits strong potential for real-time underwater applications.
Table 1. Comparison with the state-of-the-art on DUO. The results with red and blue indicate the best and second-best results of each column, respectively.
Model | Backbone | Epochs | Params (M) | AP^val | AP50^val | AP75^val | FPS
General Object Detector
YOLO11 | - | 36 | 20.13 | 67.7 | 85.7 | - | 74.1
DETR [41] | ResNet50 | 50 | 41.28 | 51.7 | 78.8 | 60.2 | 8.7
DINO [40] | ResNet50 | 12 | 42.28 | 64.9 | 85.1 | 71.3 | 21.3
RT-DETR [38] | ResNet50 | 12 | 36.57 | 64.2 | 84.5 | 71.9 | 57.9
Relation DETR [39] | ResNet50 | 12 | 43.48 | 65.7 | 85.2 | 72.6 | 20.7
Underwater Object Detector
Boosting-RCNN [42] | ResNet50 | 12 | 48.07 | 60.8 | 80.6 | 69.0 | 37.0
DJL-Net [19] | ResNet50 | 12 | 58.48 | 65.6 | 84.2 | 73.0 | -
RoIAttn [43] | ResNet50 | 12 | - | 62.3 | 82.8 | 71.4 | 14.2
GCC-Net [12] | Swin-T | 12 | 39.46 | 61.1 | 81.6 | 67.3 | 20.8
Underwater Cross-Domain Fusion Transformer Detector (ours)
UCF-DETR | ResNet50 | 12 | 44.28 | 66.4 | 85.5 | 73.9 | 18.9

4.2.2. Results on UDD

We conduct a comprehensive comparison between the proposed UCF-DETR and representative state-of-the-art methods on the UDD dataset, with quantitative results summarized in Table 2. Compared with generic object detectors, UCF-DETR achieves 30.9% AP using 12 training epochs, outperforming the majority of competing approaches and exhibiting a clear advantage over RT-DETR (26.8% AP) and DINO (28.3% AP). Moreover, UCF-DETR attains 19.3% AP75 and 71.1% AP50, indicating its strong localization accuracy and detection reliability under challenging underwater conditions.
Further comparisons with underwater-specific detection methods show that UCF-DETR ranks second in terms of overall AP, marginally behind the current top-performing DJL-Net (31.5% AP), while maintaining a noticeable performance margin over Boosting R-CNN (28.4% AP) and RoIAttn (28.3% AP). Notably, UCF-DETR achieves 18.7% APS, outperforming DJL-Net [19] (17.5% APS), which demonstrates its superior robustness in detecting small and medium-scale objects. Although UCF-DETR exhibits slightly inferior performance on APL compared with certain competing methods, it maintains balanced and competitive performance across multiple evaluation metrics, collectively validating the effectiveness of the proposed cross-domain feature fusion strategy in complex and severely degraded underwater environments.

4.3. Ablation Study

Ablation on the Effectiveness of Each Component in UCF-DETR. Table 3 presents the ablation results for the core components of UCF-DETR on the DUO dataset using a ResNet-50 backbone, with a particular focus on the impact of the Cross-domain Feature Pyramid (CFP) module and the Cross-domain Query Interaction (CQI) module on detection performance.
After incorporating the CFP module, the model exhibits consistent improvements across all evaluation metrics. Specifically, the Average Precision (AP) increases from 64.2% to 65.8% (+1.6%), while AP50 improves by 1.0% to 85.5%, and AP75 rises by 2.0% to 73.9%. Notably, the performance on small-object detection (APS) is significantly boosted by 4.5%, reaching 56.1%, which indicates that CFP effectively enhances the extraction of fine-grained features and thereby substantially improves detection accuracy for small-scale targets. Meanwhile, the detection performance for medium- and large-scale objects also improves steadily, with APM and APL increasing by 2.0% and 1.8%, respectively.
Further integrating the CQI module to form the complete model leads to additional performance gains. The overall AP is further improved by 0.6% to 66.4%, with the small-object detection metric APS again showing a notable increase of 1.5% to 58.6%, and APM exhibiting a modest improvement of 0.4%. Although slight fluctuations are observed in AP50 and APL after introducing CQI, the overall results demonstrate that the CQI and CFP modules exhibit strong synergy, jointly enhancing the model's adaptability to complex underwater scenes and providing complementary gains, particularly in boosting small-object detection performance.

4.3.1. Effect of Underwater Image Enhancement

Figure 5 qualitatively illustrates the effectiveness of the proposed Underwater Image Enhancement (UIE) module. The UIE module effectively mitigates several prevalent underwater image degradations, including color cast, insufficient illumination, and low contrast, leading to more accurate color correction and improved illumination recovery. Compared with the original inputs, the enhanced images exhibit clearer structural details and greater visual clarity, thereby providing richer and more reliable visual cues for subsequent cross-domain feature interaction.
These improvements enable the model to better exploit discriminative and complementary information from the enhanced domain, which in turn facilitates more robust feature representation under degraded conditions and contributes to the observed gains in overall detection performance.

4.3.2. Effect of Cross-Domain Feature Fusion in CFP

To qualitatively evaluate the representational capacity of the learned features, Figure 6 provides a comparative visualization of feature discriminability between the baseline RT-DETR encoder and the proposed CFP module. As observed, CFP produces stronger and more spatially concentrated activations over foreground objects. This enhanced semantic discriminability highlights the effectiveness of the bidirectional interaction and deformable sampling mechanisms in suppressing background interference while accentuating salient object cues—properties that are essential for robust classification and accurate localization in challenging underwater environments.
As shown in Figure 6, the baseline model without the CFP module exhibits spurious activations in background regions, while foreground objects, particularly those with low contrast, receive insufficient attention. This results in limited overall discriminability and poses challenges for precise object localization in complex underwater scenes. In contrast, after incorporating the CFP module, the encoder features display markedly improved spatial focus. Even under adverse conditions characterized by low contrast and blurred textures, the model generates strong, well-localized activations on target regions, demonstrating that CFP effectively enhances the separation between foreground and background and guides the network to attend to semantically relevant areas.
These qualitative observations are highly consistent with the preceding quantitative results. By leveraging cross-domain feature interaction, the CFP module integrates complementary structural and contrast information from the enhanced domain, thereby improving feature discriminability. This enhanced representation not only enables the model to maintain robust detection performance under severe underwater degradations, but also provides more informative and discriminative features for subsequent detection heads.

4.4. Analysis

Convergence Curves. Figure 7 compares the AP50 convergence behavior of UCF-DETR with representative state-of-the-art (SOTA) detectors over 12 training epochs on both the DUO and UDD datasets, providing insight into the feature learning dynamics of the proposed method. On the DUO dataset (Figure 7a), UCF-DETR consistently achieves higher AP50 values than the baseline throughout the entire training process, indicating more effective and stable optimization. On the UDD dataset (Figure 7b), although DINO and Relation-DETR demonstrate relatively competitive performance during the early training stages, UCF-DETR exhibits a steeper performance growth rate. Specifically, it surpasses DINO around the 7th epoch and ultimately establishes a clear and stable performance margin. This behavior suggests that, while complex underwater conditions may hinder initial feature alignment, the proposed cross-domain fusion mechanism in UCF-DETR facilitates more effective representation learning and enables a higher convergence peak.
Detection Result Visualization. Figure 8 also presents qualitative comparisons between the baseline detector and UCF-DETR in representative underwater scenes. Under severe underwater degradations, such as color distortion, low contrast, and water turbidity, the baseline model is prone to false positives and missed detections. Even when objects are correctly identified, the baseline often produces low confidence scores, reflecting limited discriminative capability under extreme conditions. In contrast, UCF-DETR demonstrates markedly improved detection robustness. Benefiting from the proposed feature fusion strategy, UCF-DETR effectively suppresses environmental interference, significantly reduces false alarms and missed detections, and yields more accurate bounding box localization. Notably, the proposed method is able to precisely detect targets with higher confidence scores even in low-contrast and blurred regions, highlighting its enhanced feature representation capacity and superior generalization ability in complex underwater environments.
Error Analysis. To gain deeper insight into the detection improvements achieved by UCF-DETR across different object scales, we conduct a multi-dimensional error analysis comparing the baseline method and UCF-DETR on the DUO dataset. Figure 9 presents the precision-recall (PR) curves for large-, medium-, and small-scale objects, along with their progressive evolution under increasingly relaxed evaluation criteria. The labels C75, C50, Loc, Sim, Oth, BG, and FN correspond to AP values computed after sequentially eliminating specific error sources. For example, Loc reflects performance after ignoring localization errors, whereas BG denotes results obtained after removing background false positives.
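As context for these curves, each of the C75, C50, Loc, Sim, Oth, BG, and FN scores is an AP value, i.e., the area under a precision-recall curve computed after the corresponding error sources are removed. The sketch below shows this area computation with all-point precision interpolation; the inputs are toy values for illustration, not data from the paper.

```python
# Area under a precision-recall curve with monotone (interpolated) precision,
# the quantity underlying the C75/C50/Loc/... scores in the error analysis.
def average_precision(recalls, precisions):
    """AP from sorted (recall, precision) samples via all-point interpolation."""
    # Make precision non-increasing from right to left.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    # Accumulate rectangle areas between successive recall points.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Toy curve: perfect precision up to recall 0.5, then precision 0.5 to recall 1.0.
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
```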
Overall, UCF-DETR consistently outperforms the baseline across all object scales and evaluation settings. For large- and medium-scale objects, UCF-DETR achieves C75 scores of 0.704 and 0.736, respectively, under the stringent IoU threshold of 0.75, compared with 0.671 and 0.700 for the baseline, yielding absolute improvements of 3.3 and 3.6 points. These gains indicate that UCF-DETR produces more accurately localized bounding boxes, effectively alleviating the ambiguity caused by blurred object boundaries and uneven illumination in underwater environments. The proposed CQI module strengthens instance-level query representations by integrating informative cues from enhanced underwater images through cross-domain interactions, thereby significantly improving detection reliability under strict localization requirements.
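The localization criterion behind C75 can be made concrete with a standard box-IoU computation: under C75, a detection counts as correct only if its IoU with a ground-truth box is at least 0.75. The function below is a generic sketch, not code from the paper.

```python
# Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A 10x10 box shifted horizontally by one unit still passes the 0.75 bar...
print(round(box_iou((0, 0, 10, 10), (1, 0, 11, 10)), 3))  # 0.818
```

Because IoU falls quickly with small shifts, the stricter C75 criterion rewards exactly the tighter boundary delineation that the error analysis attributes to UCF-DETR.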
As illustrated in Figure 9c,f, small-object detection remains particularly challenging in underwater scenarios. The baseline method attains a C75 score of only 0.560 for small objects, highlighting its limited capability in precisely localizing tiny targets. In contrast, UCF-DETR improves this score to 0.617, an absolute gain of 5.7 points. This improvement confirms that the proposed approach preserves richer fine-grained information within the feature hierarchy, thereby enhancing sensitivity to distant or small-scale biological targets.
With respect to false positives and background interference, the comparison between the Loc and BG curves reveals that UCF-DETR consistently achieves a larger area under the curve than the baseline across all object scales. Notably, for small objects, the baseline exhibits similar values for C50 (0.721) and Loc (0.729), yet a pronounced gap relative to BG (0.782), indicating that background clutter constitutes a dominant error source. In contrast, UCF-DETR not only improves all metrics consistently, but also effectively suppresses spurious activations by enhancing feature discriminability, thereby reducing background-induced false positives and producing a more reliable confidence distribution.
In summary, this error analysis demonstrates that UCF-DETR surpasses the baseline not merely in terms of overall average precision, but more importantly, achieves a structural improvement in detection quality by substantially reducing localization errors and mitigating background confusion across object scales.

4.5. Limitations

To facilitate further research on underwater object detection, Figure 10 presents representative failure cases of the proposed UCF-DETR under extreme underwater degradations. While UCF-DETR demonstrates strong robustness across most scenarios, these examples reveal several limitations when visual information is severely compromised. From left to right, the figure illustrates five representative failure modes commonly encountered in challenging underwater environments:
(a) Severe color distortion: In this case, the texture of the starfish becomes nearly indistinguishable from the surrounding background due to extreme visual degradation. Moreover, under such severely degraded conditions, noise artifacts introduced by the UIE module adversely affect the quality of feature representations in the enhanced domain. Although UCF-DETR is designed to extract complementary cues from the restored enhanced domain to improve detection performance, severe degradation may corrupt these informative features, preventing them from providing reliable semantic guidance and ultimately leading to the observed missed detection.
(b) Low-light conditions: The contour and texture information of the holothurian are severely attenuated, causing the object to blend into the background and leading to detection failure.
(c) Turbid and low-contrast environments: Background textures are spuriously activated by the model, resulting in the scallop being incorrectly detected as a foreground object.
(d) Low-contrast hazy blur: The combined effects of haze and contrast loss cause severe degradation of visual features. Under such conditions, the model simultaneously fails to detect the starfish and erroneously classifies background regions as echinus.
(e) Blurred and densely crowded scenes: In severely blurred underwater environments with high biological density, marine organisms often exhibit substantial spatial overlap, while small-scale targets lack well-defined boundaries and discriminative texture cues. Under these adverse conditions, the detection model is prone to confusion among overlapping structures and ambiguous contours. This results in frequent false positives for echinus, triggered by cluttered background patterns, as well as missed detections of starfish whose fine-grained visual characteristics are obscured by blur and occlusion. This failure case underscores the intrinsic challenge of accurately distinguishing small, densely clustered objects in degraded underwater imaging scenarios.
Figure 10. Representative Failure Cases of UCF-DETR on the DUO Benchmark.
These failure cases highlight the intrinsic challenges posed by extreme underwater degradations and underscore the need for continued advances in robust feature representation, enhancement mechanisms, and cross-domain learning strategies to further improve detection reliability in highly adverse conditions.

5. Conclusions

This paper presents UCF-DETR, a cross-domain fusion Transformer-based detector designed for underwater scenes, aiming to address the inherent challenges of underwater object detection, including low contrast, color distortion, and blur. The proposed approach establishes a unified collaborative reasoning framework that jointly leverages complementary information from both the raw image domain and the enhanced image domain, thereby significantly improving detection robustness and accuracy in complex underwater environments. The primary contributions of this work stem from the proposed Cross-domain Feature Pyramid (CFP) and Cross-domain Query Interaction (CQI) modules, which enable effective cross-domain collaboration at both the feature representation level and the instance-level query level, respectively. CFP introduces a bidirectional attention-based fusion mechanism that integrates rich edge and texture cues from the enhanced domain with multi-scale features from the raw domain, alleviating the representational limitations of relying on a single domain under severe underwater degradations. Meanwhile, the CQI module explicitly models cross-domain interactions among object queries, leading to improved localization accuracy and more discriminative instance representations. Extensive experimental evaluations on two challenging underwater object detection benchmarks, DUO and UDD, demonstrate that UCF-DETR consistently outperforms existing state-of-the-art methods. Nonetheless, under extremely degraded conditions, the proposed framework still exhibits limitations, as the image enhancement process may introduce artifacts or incur information loss, thereby constraining the upper bound of detection performance. Future work will focus on preserving semantic consistency during cross-domain interactions to mitigate performance degradation induced by domain shifts. 
In addition, exploring computationally efficient cross-domain interaction mechanisms constitutes an important research direction toward achieving real-time underwater object detection.

Author Contributions

Conceptualization, Y.C.; methodology, Y.C. and B.Z.; software, B.Z.; validation, Y.C. and B.Z.; formal analysis, B.Z.; investigation, Y.C.; resources, Y.C.; data curation, B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, Y.C.; visualization, B.Z.; supervision, Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (No. 2023YFD2401104), the Natural Science Foundation of Shandong Province (No. ZR2019MEE019 and No. ZR2020ME112), the Instrument and Equipment Development Project of Shandong University (No. zy20240203), and the Guangdong Basic and Applied Basic Research Foundation (No. 2024A1515011555).

Data Availability Statement

The datasets presented in this study can be downloaded here: https://github.com/chongweiliu/DUO.git (accessed on 25 October 2025) and https://github.com/chongweiliu/UDD_Official.git (accessed on 15 November 2025). The code is available at: https://github.com/bibabu555/UCF-DETR.git (accessed on 19 December 2025).

Acknowledgments

We would like to thank the editors and reviewers for their time, hard work, and valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cai, W.; Zhu, J.; Zhang, M.; Liu, M. Optical Flow Prompts Distractor-Aware Siamese Network for Tracking Autonomous Underwater Vehicle with Sonar and Camera Videos. Neural Netw. 2025, 196, 108328. [Google Scholar] [CrossRef]
  2. Liu, K.; Peng, L.; Tang, S. Underwater object detection using TC-YOLO with attention mechanisms. Sensors 2023, 23, 2567. [Google Scholar] [CrossRef]
  3. Wang, Q.; Zhang, Y.; He, B. Automatic seabed target segmentation of AUV via multilevel adversarial network and marginal distribution adaptation. IEEE Trans. Ind. Electron. 2023, 71, 749–759. [Google Scholar] [CrossRef]
  4. Dodić, D.; Vujović, V.; Jovković, S.; Milutinović, N.; Trpkoski, M. SAHI-Tuned YOLOv5 for UAV Detection of TM-62 Anti-Tank Landmines: Small-Object, Occlusion-Robust, Real-Time Pipeline. Computers 2025, 14, 448. [Google Scholar] [CrossRef]
  5. Bai, J.; Zhu, W.; Nie, Z.; Yang, X.; Xu, Q.; Li, D. HFC-YOLO11: A Lightweight Model for the Accurate Recognition of Tiny Remote Sensing Targets. Computers 2025, 14, 195. [Google Scholar] [CrossRef]
  6. Chen, X.; Yuan, M.; Fan, C.; Chen, X.; Li, Y.; Wang, H. Research on an underwater object detection network based on dual-branch feature extraction. Electronics 2023, 12, 3413. [Google Scholar] [CrossRef]
  7. Liu, Z.; Wang, B.; Li, Y.; He, J.; Li, Y. UnitModule: A lightweight joint image enhancement module for underwater object detection. Pattern Recognit. 2024, 151, 110435. [Google Scholar] [CrossRef]
  8. Chang, L.; Wang, Y.; Du, B.; Xu, C. Rectangling and enhancing underwater stitched image via content-aware warping and perception balancing. Neural Netw. 2025, 181, 106809. [Google Scholar] [CrossRef]
  9. Cao, R.; Zhang, R.; Yan, X.; Zhang, J. BG-YOLO: A bidirectional-guided method for underwater object detection. Sensors 2024, 24, 7411. [Google Scholar] [CrossRef]
  10. Shao, J.; Zhang, H.; Miao, J. LAMSNN: Learnable adaptive modulation for artifact suppression in spiking underwater image enhancement networks. Neural Netw. 2025, 195, 108210. [Google Scholar] [CrossRef]
  11. Zhang, W.; Li, X.; Huang, Y.; Xu, S.; Tang, J.; Hu, H. Underwater image enhancement via frequency and spatial domains fusion. Opt. Lasers Eng. 2025, 186, 108826. [Google Scholar] [CrossRef]
  12. Dai, L.; Liu, H.; Song, P.; Liu, M. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222. [Google Scholar] [CrossRef]
  13. Fu, Z.; Wang, W.; Huang, Y.; Ding, X.; Ma, K.K. Uncertainty inspired underwater image enhancement. In Computer Vision—ECCV 2022; Springer: Cham, Switzerland, 2022; pp. 465–482. [Google Scholar]
  14. Wang, Y.; Guo, J.; He, W.; Gao, H.; Yue, H.; Zhang, Z.; Li, C. Is underwater image enhancement all object detectors need? IEEE J. Ocean. Eng. 2023, 49, 606–621. [Google Scholar] [CrossRef]
  15. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
  16. Chen, L.; Jiang, Z.; Tong, L.; Liu, Z.; Zhao, A.; Zhang, Q.; Dong, J.; Zhou, H. Perceptual underwater image enhancement with deep learning and physical priors. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3078–3092. [Google Scholar] [CrossRef]
  17. Saleem, A.; Awad, A.; Paheding, S.; Lucas, E.; Havens, T.C.; Esselman, P.C. Understanding the influence of image enhancement on underwater object detection: A quantitative and qualitative study. Remote Sens. 2025, 17, 185. [Google Scholar] [CrossRef]
  18. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-adaptive YOLO for object detection in adverse weather conditions. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1792–1800. [Google Scholar] [CrossRef]
  19. Wang, B.; Wang, Z.; Guo, W.; Wang, Y. A dual-branch joint learning network for underwater object detection. Knowl.-Based Syst. 2024, 293, 111672. [Google Scholar] [CrossRef]
  20. Liu, R.; Jiang, Z.; Yang, S.; Fan, X. Twin adversarial contrastive learning for underwater image enhancement and beyond. IEEE Trans. Image Process. 2022, 31, 4922–4936. [Google Scholar] [CrossRef]
  21. Zhang, W.; Zhuang, P.; Sun, H.H.; Li, G.; Kwong, S.; Li, C. Underwater image enhancement via minimal color loss and locally adaptive contrast enhancement. IEEE Trans. Image Process. 2022, 31, 3997–4010. [Google Scholar] [CrossRef]
  22. Song, Y.; She, M.; Köser, K. Advanced underwater image restoration in complex illumination conditions. ISPRS J. Photogramm. Remote Sens. 2024, 209, 197–212. [Google Scholar] [CrossRef]
  23. Berman, D.; Levy, D.; Avidan, S.; Treibitz, T. Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2822–2837. [Google Scholar] [CrossRef]
  24. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef]
  25. Wang, H.; Zhang, W.; Xu, Y.; Li, H.; Ren, P. WaterCycleDiffusion: Visual-textual fusion empowered underwater image enhancement. Inf. Fusion 2025, 127, 103693. [Google Scholar] [CrossRef]
  26. Du, D.; Li, E.; Si, L.; Zhai, W.; Xu, F.; Niu, J.; Sun, F. UIEDP: Boosting underwater image enhancement with diffusion prior. Expert Syst. Appl. 2025, 259, 125271. [Google Scholar] [CrossRef]
  27. Zhang, S.; Zhao, S.; An, D.; Li, D.; Zhao, R. LiteEnhanceNet: A lightweight network for real-time single underwater image enhancement. Expert Syst. Appl. 2024, 240, 122546. [Google Scholar] [CrossRef]
  28. Zhou, J.; He, Z.; Zhang, D.; Liu, S.; Fu, X.; Li, X. Spatial residual for underwater object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4996–5013. [Google Scholar] [CrossRef]
  29. Ji, X.; Chen, S.; Hao, L.Y.; Zhou, J.; Chen, L. FBDPN: CNN-Transformer hybrid feature boosting and differential pyramid network for underwater object detection. Expert Syst. Appl. 2024, 256, 124978. [Google Scholar] [CrossRef]
  30. Zhang, D.; Yu, C.; Li, Z.; Qin, C.; Xia, R. A lightweight network enhanced by attention-guided cross-scale interaction for underwater object detection. Appl. Soft Comput. 2025, 184, 113811. [Google Scholar] [CrossRef]
  31. Yeh, C.H.; Lin, C.H.; Kang, L.W.; Huang, C.H.; Lin, M.H.; Chang, C.Y.; Wang, C.C. Lightweight deep neural network for joint learning of underwater object detection and color conversion. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6129–6143. [Google Scholar] [CrossRef] [PubMed]
  32. Yuan, G.; Song, J.; Li, J. IF-USOD: Multimodal information fusion interactive feature enhancement architecture for underwater salient object detection. Inf. Fusion 2025, 117, 102806. [Google Scholar] [CrossRef]
  33. Zhang, D.; Wu, C.; Zhou, J.; Zhang, W.; Lin, Z.; Polat, K.; Alenezi, F. Robust underwater image enhancement with cascaded multi-level sub-networks and triple attention mechanism. Neural Netw. 2024, 169, 685–697. [Google Scholar] [CrossRef] [PubMed]
  34. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125. [Google Scholar]
  35. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  36. Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A dataset and benchmark of underwater object detection for robot picking. In 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW); IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  37. Jiang, L.; Wang, Y.; Jia, Q.; Xu, S.; Liu, Y.; Fan, X.; Li, H.; Liu, R.; Xue, X.; Wang, R. Underwater species detection using channel sharpening attention. In Proceedings of the 29th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2021; pp. 4259–4267. [Google Scholar]
  38. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar]
  39. Hou, X.; Liu, M.; Zhang, S.; Wei, P.; Chen, B.; Lan, X. Relation DETR: Exploring explicit position relation prior for object detection. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2024; pp. 89–105. [Google Scholar]
  40. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  41. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  42. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164. [Google Scholar] [CrossRef]
  43. Liang, X.; Song, P. Excavating RoI attention for underwater object detection. In 2022 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2022; pp. 2651–2655. [Google Scholar]
  44. Girshick, R. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448. [Google Scholar]
  45. Chen, Z.; Yang, C.; Li, Q.; Zhao, F.; Zha, Z.J.; Wu, F. Disentangle your dense object detector. In Proceedings of the 29th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2021; pp. 4939–4948. [Google Scholar]
Figure 1. Challenges of Object Detection in Underwater Degraded Scenarios. (a) Low Contrast. (b) Turbid and Blurry. (c) Color Distortion. (d) Low Visibility.
Figure 2. Overview of the UCF-DETR architecture. The UCF-DETR proposed for cross-domain reasoning comprises three modules: the Underwater Image Enhancement (UIE), Cross-domain Feature Pyramid (CFP), and Cross-domain Query Interaction (CQI).
Figure 3. Detailed architecture of the proposed Cross-domain Feature Pyramid (CFP). The yellow paths indicate the Enhanced Image branch, while the blue paths indicate the Raw Image branch.
Figure 4. Detailed architecture of the proposed Cross-Domain Query Interaction (CQI) module. Yellow denotes queries from the Enhanced Image branch, and blue denotes queries from the Raw Image branch. Solid arrows represent feature propagation, while dashed arrows indicate cross-domain query interaction.
Figure 5. Visualization of the Underwater Image Enhancement (UIE) module.
Figure 6. Visualization of Feature Activation Strength in Encoder. Red regions indicate strong activation (foreground), while blue regions indicate weak activation (background). The visualization demonstrates that UCF-DETR yields more distinct and localized feature responses.
Figure 7. Convergence Curves of Different Methods over 12 Training Epochs. (a) Convergence curve of the training process on DUO dataset. (b) Convergence curve of the training process on UDD dataset.
Figure 8. Qualitative Comparison Between UCF-DETR and the Baseline Detector Under Representative Underwater Degradation Scenarios.
Figure 9. Multi-dimensional error analysis comparison between UCF-DETR and RT-DETR on the DUO dataset. (a,d) Error analysis on large object detection; (b,e) error analysis on medium object detection; (c,f) error analysis on small object detection.
Table 2. Comparison with the state-of-the-art on UDD. Red and blue indicate the best and second-best results in each column, respectively.

| Model | Epoch | AP^val | AP_50^val | AP_75^val | AP_S^val | AP_M^val | AP_L^val |
|---|---|---|---|---|---|---|---|
| General Object Detectors | | | | | | | |
| Faster R-CNN [44] | 12 | 27.3 | 65.9 | 17.2 | 15.9 | 26.3 | 31.3 |
| Relation-DETR [39] | 12 | 27.3 | 68.1 | 12.4 | 17.9 | 25.1 | 29.6 |
| DINO [40] | 12 | 28.3 | 68.7 | 14.8 | 18.5 | 26.3 | 33.0 |
| DDOD [45] | 12 | 28.3 | 66.0 | 16.8 | 17.7 | 26.5 | 31.0 |
| RT-DETR [38] | 12 | 26.8 | 60.4 | 17.2 | 15.3 | 24.8 | 33.1 |
| Underwater Object Detectors | | | | | | | |
| Boosting-R-CNN [42] | 12 | 28.4 | 64.3 | 18.8 | 16.4 | 28.0 | 30.6 |
| RoIAttn [43] | 12 | 28.3 | 64.5 | 16.9 | 15.0 | 28.3 | 31.5 |
| DJL-Net [19] | 12 | 31.5 | 72.3 | 19.1 | 17.5 | 30.7 | 33.1 |
| GCC-Net [12] | 12 | 26.3 | 64.6 | 14.1 | 15.2 | 24.7 | 29.4 |
| Underwater Cross-Domain Fusion Transformer Detector (ours) | | | | | | | |
| UCF-DETR | 12 | 30.9 | 71.1 | 19.3 | 18.7 | 30.1 | 31.3 |
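As a quick sanity check on the rankings behind the red/blue highlighting in Table 2, the per-column best and second-best entries can be recomputed from the printed values (data transcribed from the table; model names abbreviated, and ties are broken by dictionary insertion order):

```python
# Per-model results on UDD, transcribed from Table 2:
# columns are AP, AP_50, AP_75, AP_S, AP_M, AP_L.
results = {
    "Faster R-CNN":   [27.3, 65.9, 17.2, 15.9, 26.3, 31.3],
    "Relation-DETR":  [27.3, 68.1, 12.4, 17.9, 25.1, 29.6],
    "DINO":           [28.3, 68.7, 14.8, 18.5, 26.3, 33.0],
    "DDOD":           [28.3, 66.0, 16.8, 17.7, 26.5, 31.0],
    "RT-DETR":        [26.8, 60.4, 17.2, 15.3, 24.8, 33.1],
    "Boosting-R-CNN": [28.4, 64.3, 18.8, 16.4, 28.0, 30.6],
    "RoIAttn":        [28.3, 64.5, 16.9, 15.0, 28.3, 31.5],
    "DJL-Net":        [31.5, 72.3, 19.1, 17.5, 30.7, 33.1],
    "GCC-Net":        [26.3, 64.6, 14.1, 15.2, 24.7, 29.4],
    "UCF-DETR":       [30.9, 71.1, 19.3, 18.7, 30.1, 31.3],
}

def top_two(column: int):
    """Return (best, second-best) model names for one metric column."""
    ranked = sorted(results, key=lambda m: results[m][column], reverse=True)
    return ranked[0], ranked[1]

for i, name in enumerate(["AP", "AP_50", "AP_75", "AP_S", "AP_M", "AP_L"]):
    best, second = top_two(i)
    print(f"{name}: best={best}, second={second}")
```

Recomputing this way confirms the pattern the paper reports: UCF-DETR leads on AP_75 and AP_S and is second-best on the overall AP, AP_50, and AP_M columns.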
Table 3. Ablation study on the effectiveness of each component of the proposed UCF-DETR on the DUO dataset with a ResNet-50 backbone. Values highlighted in red indicate performance degradation, whereas values highlighted in blue indicate performance improvement.

| CFP | CQI | AP^val | AP_50^val | AP_75^val | AP_S^val | AP_M^val | AP_L^val |
|---|---|---|---|---|---|---|---|
| – | – | 64.2 | 84.5 | 71.9 | 51.6 | 65.7 | 63.4 |
| ✓ | – | 65.8 (↑1.6) | 85.5 (↑1.0) | 73.9 (↑2.0) | 56.1 (↑4.5) | 67.7 (↑2.0) | 65.2 (↑1.8) |
| ✓ | ✓ | 66.4 (↑0.6) | 85.5 (0.0) | 73.9 (0.0) | 58.6 (↑2.5) | 68.1 (↑0.4) | 65.0 (↓0.2) |
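The parenthesized arrows in each ablation row are per-metric differences against the row above. A minimal sketch of that bookkeeping (values transcribed from Table 3; variable names are illustrative):

```python
# Ablation rows from Table 3: AP, AP_50, AP_75, AP_S, AP_M, AP_L.
baseline  = [64.2, 84.5, 71.9, 51.6, 65.7, 63.4]  # neither CFP nor CQI
with_cfp  = [65.8, 85.5, 73.9, 56.1, 67.7, 65.2]  # + CFP
with_both = [66.4, 85.5, 73.9, 58.6, 68.1, 65.0]  # + CFP + CQI

def deltas(prev, curr):
    """Per-metric change versus the previous row, rounded as printed."""
    return [round(c - p, 1) for p, c in zip(prev, curr)]

print(deltas(baseline, with_cfp))   # contribution of CFP alone
print(deltas(with_cfp, with_both))  # additional contribution of CQI
```

Recomputing the differences this way doubles as a consistency check on the printed arrow annotations.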

Share and Cite

MDPI and ACS Style

Zhao, B.; Chen, Y. Joint Inference of Image Enhancement and Object Detection via Cross-Domain Fusion Transformer. Computers 2026, 15, 43. https://doi.org/10.3390/computers15010043

