Article

JTA-GAN: A Physics-Informed Framework for Realistic Underwater Image Generation and Improved Object Detection

1 Department of Mechanical Engineering, National Pingtung University of Science and Technology, Pingtung 912301, Taiwan
2 Department of Systems and Naval Mechatronics Engineering, National Cheng Kung University, Tainan 701401, Taiwan
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(4), 605; https://doi.org/10.3390/math14040605
Submission received: 22 December 2025 / Revised: 31 January 2026 / Accepted: 6 February 2026 / Published: 9 February 2026
(This article belongs to the Special Issue Advances in Machine Learning and Intelligent Systems)

Abstract

Accurate object detection in underwater environments is severely challenged by light attenuation, wavelength-dependent color distortion, and scattering-induced turbidity, which create a substantial domain gap between terrestrial and underwater imagery. Conventional Generative Adversarial Network (GAN)-based translation models, such as CycleGAN, attempt to mitigate this gap but often suffer from instability and unrealistic color shifts due to their black-box design. To address these limitations, we propose JTA-GAN (Joint Turbidity–Attenuation GAN), a physics-informed generative framework that explicitly disentangles underwater image formation into scene radiance (J, derived from the physical imaging model), transmission (T), and ambient light (A). By enforcing a simplified physical imaging model within the generator architecture, JTA-GAN enables spatially coherent haze and attenuation synthesis without requiring ground-truth depth supervision. An asymmetric architecture stabilizes reverse mapping, while a Learned Perceptual Image Patch Similarity (LPIPS)-based perceptual loss further improves reconstruction realism. Using the JTA-GAN network, we generated 65,153 physically plausible synthetic images for training You Only Look Once (YOLO)-based detectors. Evaluation on the SUIM benchmark demonstrates consistent performance improvements; specifically, YOLOv8s trained with synthetic data from JTA-GAN achieves 17.3% mAP (mean Average Precision), outperforming the land-only baseline (13.2%) and CycleGAN-based augmentation (10.8%). These results confirm that physics-informed generative modeling provides a theoretically grounded and effective solution for underwater domain adaptation under the high-turbidity and low-light conditions represented in the study.

1. Introduction

Object detection in underwater environments remains a challenging problem despite rapid advances in deep learning for general object recognition [1,2,3,4,5,6]. State-of-the-art detectors such as YOLO variants typically require large amounts of labelled training data to achieve robust performance. However, collecting and annotating underwater imagery is costly, time-consuming, and often impractical for many application domains (e.g., diver detection, wreck inspection). Consequently, there exists a substantial domain gap between large, well-annotated terrestrial datasets and the limited, diverse, and visually degraded underwater image collections; this gap significantly degrades detector generalization when models trained on land images are deployed underwater [7,8,9].
A common strategy to reduce this gap is to use generative models to synthesize realistic target-domain images for data augmentation. Unpaired image-to-image translation frameworks such as CycleGAN are widely adopted because they do not require pixel-level paired images. Nevertheless, our experiments and broader empirical experience indicate that off-the-shelf, “black-box” style GANs are ill-suited for the land → underwater adaptation task. In particular, standard CycleGAN often learns superficial color transforms (e.g., uniform blue/green washes) rather than physically plausible degradations such as distance-dependent haze, wavelength-selective attenuation, and structured ambient illumination. When used to generate training data for detectors, these artifact-laden synthetic images can be actively harmful: in our evaluation a YOLOv8s model trained on CycleGAN-generated data achieved a lower mAP (10.8% on SUIM) [10] than the land-only baseline (13.2%), demonstrating that naive synthetic augmentation can degrade downstream performance.
This work proposes a different philosophy: rather than treating domain translation as an appearance-only style transfer, we inject physically motivated constraints into the generative model so that the synthesis process respects the dominant phenomena in underwater optics. We present JTA-GAN, a hybrid, non-symmetric GAN framework that explicitly disentangles scene radiance, per-pixel transmission (turbidity), and ambient light, and synthesizes underwater images by applying a differentiable physics layer based on a simplified underwater image formation model: I(x) = J(x)·T(x) + A·(1 − T(x)), where J denotes scene radiance (content), T the spatially varying transmission map (turbidity/haze), and A the ambient background illumination (attenuation/color cast). Crucially, the T-map is learned in a self-supervised manner from 2-D image features derived from land images; we do not rely on external depth prediction models or ground-truth depth maps.
Two practical design choices proved essential for stable and useful synthesis. First, we adopt an asymmetric dual-generator architecture. The forward generator G_{L→S} (land → sea) is physics-informed: its decoder is architected to output disentangled J, T, A components, which are combined by the physics layer to produce synthetic underwater images. The reverse mapping G_{S→L} (sea → land) is implemented as a stable black-box U-Net; attempts to model the inverse operation via explicit division (I − A)/T introduce numerical instability when T approaches zero. Second, balanced domain sampling proved critical: extreme imbalance (e.g., earlier experiments with 1479 land vs. 120 underwater) led to discriminator domination and generator collapse, whereas a balanced setup (1197 land vs. 890 underwater) stabilized training and yielded meaningful T-maps. Additional training choices (PatchGAN discriminators; a combined loss including adversarial terms; cycle-consistency (L1) with λ_cyc = 10; a perceptual LPIPS loss with λ_percep = 1; and a TV-style physics regularization with λ_phy = 1) were necessary to obtain sharp, physically plausible results. Empirically, LPIPS was decisive: adding LPIPS in our final V5 model removed painterly artifacts and restored high-frequency detail that L1 alone could not.
We thoroughly validate JTA-GAN in two ways. First, we qualitatively compare its outputs against CycleGAN and show that JTA-GAN produces structured, depth-aware T-maps and visually plausible attenuation and caustics, while CycleGAN tends to produce global color casts and texture distortions. Second, we evaluate the downstream utility of the synthetic data by training YOLO detectors on datasets augmented with generated images. For GAN training, we used 1197 land images (COCO-debris subset) [11] and 890 underwater images (UIEB raw) [12]. Using our script generate_yolo_data_v6.py we synthesized a YOLO training set from 65,153 COCO images (and corrected bounding boxes to avoid edge shifts). On the SUIM test set (376 images), the YOLOv8s model trained on JTA-GAN synthetic data achieved 17.3% mAP50–95, substantially outperforming the land-only baseline (13.2%) and the CycleGAN baseline (10.8%). Per-class analysis shows the largest gains for semantically consistent classes (person: 34.3% vs. baseline 23.2%), while semantically mismatched classes (boat vs. underwater wrecks) remain challenging for all methods (boat mAP remains < 3%), pointing to an orthogonal limitation caused by dataset semantic mismatch rather than synthesis fidelity.
In summary, this paper makes three principal contributions:
  • We identify and empirically characterize the failure modes of black-box GANs in underwater synthesis, showing how unrealistic artifacts can actively degrade detector performance.
  • We propose JTA-GAN, which introduces three specific architectural innovations:
    (a) a physics-informed decoder that explicitly embeds an underwater-optimized optical model;
    (b) a self-supervised learning strategy for transmission maps (T-maps) that requires no depth supervision;
    (c) a task-oriented asymmetric architecture that ensures numerical stability by avoiding singularities during reverse mapping.
  • We demonstrate through rigorous evaluation on the SUIM benchmark that JTA-GAN synthetic data significantly improves downstream underwater object detection.
The remainder of the paper is organized as follows. Section 2 reviews related work on domain adaptation and physics-aware image synthesis. Section 3 details the JTA-GAN architecture and loss formulation. Section 4 describes datasets, experimental protocols and quantitative/qualitative results. Section 5 discusses limitations and future directions and Section 6 concludes.
The pursuit of balancing detection precision with computational efficiency has led to several innovative architectural designs in recent underwater and industrial vision research. For instance, PRCII-Net [13] employs a lightweight attention-guided cross-scale interaction module to enhance small target detection in complex underwater environments while maintaining low resource consumption. In the domain of image enhancement and alignment, GAN-based unsupervised frameworks [14] integrated with feature frequency-aware decomposition have demonstrated superior robustness in processing textures and structural details. Furthermore, the development of efficient lightweight convolutional neural networks (CNNs) for industrial surface defect detection [15] highlights the effectiveness of coordinate attention mechanisms and weighted feature pyramid networks in resource-constrained scenarios. Our proposed JTA-GAN aligns with these trends by integrating a physics-informed inductive bias into an efficient dual-generator architecture, ensuring high-quality synthesis and real-time detection performance.

2. Related Work

2.1. Underwater Image Formation and Physics-Based Restoration

Underwater images suffer from wavelength-dependent attenuation, forward scattering, and spatially varying ambient illumination. Classical models such as the Jaffe–McGlamery formulation describe underwater image formation as a combination of attenuated scene radiance and backscattered light. However, these models typically require depth measurements or multiple images to estimate transmission and background illumination. Learning-based restoration methods (e.g., UWCNN, Water-Net) attempt to approximate these physical processes [16], but they rely on paired underwater/clean datasets or synthetic depth-based renderings that are not available for our unpaired land → underwater translation setting. Moreover, these methods aim to restore underwater images, not synthesize underwater degradations compatible with downstream detector training.
In contrast, our work adopts a simplified yet physically meaningful image formation model, I = J·T + A·(1 − T), and embeds it directly into the generator architecture. This allows the model to learn per-pixel turbidity and ambient light without depth supervision, bridging the gap between physically grounded image formation and unpaired image synthesis.

2.2. Unpaired Image-to-Image Translation (CycleGAN and Variants)

Unpaired translation frameworks such as CycleGAN, UNIT, MUNIT, and DRIT have achieved strong performance on style transfer tasks by enforcing cycle consistency between two domains. However, because these models are optimized to match global appearance statistics rather than domain-specific physical processes, they often rely on color shifts and texture hallucination to satisfy the discriminator. For underwater synthesis, this manifests as “blue-wash” or “green-wash” artifacts, low-frequency smearing, and the loss of fine-scale structure.
Multiple studies have reported that CycleGAN-style models struggle with domains exhibiting complex light propagation effects. Our experiments confirm this: CycleGAN frequently collapses into global color mapping and produces unstable or noisy reverse translations. When these artifact-prone images are used to train YOLO, performance deteriorates (mAP dropping from 13.2% to 10.8%), demonstrating that naïve image translation can actively harm detector training. These limitations highlight the need for models that incorporate underwater-specific priors rather than relying purely on adversarial style alignment.

2.3. Synthetic Data for Underwater Vision and Domain Adaptation

Synthetic data is an increasingly important tool for improving underwater vision, where annotations are scarce. Existing approaches include physics-based rendering, domain randomization, and unpaired image translation [17,18,19,20,21]. While these methods show promise, two challenges remain unresolved. First, a lack of physically grounded degradations: most synthetic datasets rely on global color transformations or heuristic filters, failing to capture the depth-dependent haze and wavelength-selective attenuation essential for underwater realism. Second, weak validation on downstream detection tasks: many works evaluate synthetic data qualitatively or with enhancement metrics, but rarely examine how synthesis quality affects modern detector performance. Our results empirically show a direct negative impact when using artifact-heavy CycleGAN data to train YOLO.
Among existing underwater datasets, WaterPairs [22] demonstrates the benefit of paired supervision for enhancement and detection, but such paired data are unavailable for land → underwater translation. Our work is among the few that rigorously analyze the causal link between synthetic image fidelity and downstream object detection accuracy, providing a more robust evaluation of domain adaptation strategies.
The integration of Artificial Intelligence in underwater domains extends far beyond image enhancement. In marine ecology, AI-powered frameworks are now used for real-time monitoring of biological indicators, such as identifying coral reef pests like Crown-of-Thorns Starfish. For submerged infrastructure and archeological surveys, enhanced detectors such as UDINO and composite-enhanced YOLOv11 algorithms facilitate the precise localization of shipwrecks and debris. Furthermore, recent advancements in Autonomous Underwater Vehicles have introduced integrated systems that combine real-time detection (e.g., YOLOv11) with Large Language Model (LLM)-based summary generation to automate sea exploration. Beyond visual data, AI has also revolutionized underwater acoustic target recognition, where deep learning enables the autonomous classification of ship-radiated signals [23].
Beyond generative strategies, recent advances in detector architectures have also addressed underwater challenges. For instance, recent studies have proposed lightweight networks enhanced by attention-guided cross-scale interaction to improve feature extraction in turbid media. While these architectural innovations optimize how a detector processes degraded features, JTA-GAN provides a complementary approach by addressing the domain gap at the data source through physics-informed synthesis. By combining such robust feature interaction logic with the high-fidelity training data generated by JTA-GAN, the generalization of underwater perception systems can be significantly reinforced [13,24,25,26,27].

2.4. Physics-Guided GANs and Disentanglement Approaches

Recent advances explore integrating physical constraints or disentangled representations into GANs to improve interpretability and stability. Some models decompose content and style (e.g., DRIT, MUNIT) or combine differentiable renderers with GAN training. However, these works generally target generic domain transfer or 3D-aware synthesis and do not model underwater-specific optical processes. More importantly, they do not disentangle the three fundamental components of underwater image formation—scene radiance, transmission, and ambient light—nor do they integrate a differentiable physics layer within the generator pipeline.
Compared with existing approaches, JTA-GAN uniquely enforces a structured decomposition J, T, A and simulates underwater degradations through a physics-informed rendering equation. This yields a disentanglement that is both physically interpretable and directly beneficial for downstream detector training, addressing the limitations of purely appearance-based GAN models.
To better situate JTA-GAN within the current research landscape, it is helpful to contrast its design philosophy with other physics-informed generative models such as WaterGAN and UWGAN. A primary distinction lies in the data requirements: WaterGAN often relies on depth maps or pre-defined 3D scene structures to supervise the scattering process. In contrast, JTA-GAN is designed for the more challenging unpaired 2D domain adaptation task, where it learns to estimate a structured transmission map (T) directly from monocular RGB images without any depth supervision. Furthermore, while many physics-based models focus purely on visual enhancement, JTA-GAN’s architecture—specifically its asymmetric design and perceptual loss—is optimized to preserve object-level discriminative features, making it more suitable for downstream detection tasks than purely appearance-oriented generative frameworks.

2.5. Summary

To summarize, existing underwater enhancement models require depth or paired data, standard CycleGAN-type models neglect underwater physics and produce harmful artifacts, and prior synthetic-data works seldom evaluate downstream detection performance. No existing GAN framework explicitly disentangles radiance, turbidity, and ambient light under an underwater formation model. These gaps motivate the development of JTA-GAN, which integrates physics-aware decomposition with stable cycle-consistent training to generate high-fidelity synthetic underwater images for robust object detection.

2.6. Contextual Support from Recent SOTA Studies

Recent years (2024–2025) have seen a surge in specialized deep learning frameworks for underwater vision. State-of-the-art (SOTA) object detection models, such as AGW-YOLOv8 and UDINO, have introduced multi-scale frequency enhancement modules to mitigate scattering and color shifts. To address data scarcity and domain shift, integrated frameworks like EnYOLO have been proposed to perform simultaneous image enhancement and object detection with domain-adaptation capabilities. Furthermore, advancements in generative AI have introduced hybrid models, such as ALDiff-UIE and HyNPhyAttnGAN, which combine the strengths of Diffusion Models and GANs with physics-based simulations to generate high-fidelity underwater imagery. Other researchers have explored edge-deployable online domain adaptation using Adaptive Batch Normalization (AdaBN) to handle real-time data drift in AUV-based surveys. Our JTA-GAN aligns with these recent trends by explicitly embedding a simplified physical model to ensure structural consistency, a strategy also echoed in very recent physics-informed VAE approaches [28,29,30,31].

3. Methodology

This section presents the proposed JTA-GAN (Joint Turbidity–Attenuation Generative Adversarial Network), a physics-informed framework designed for unpaired land-to-underwater domain adaptation. The overall architecture and translation results are illustrated in Figure 1. Unlike conventional black-box translation models such as CycleGAN, which rely purely on statistical appearance transformations, JTA-GAN explicitly embeds a simplified underwater optical model into the generative process. This enables the system to disentangle scene radiance from environmental degradations—particularly turbidity and ambient light—thereby producing synthetic underwater images with higher structural fidelity and physical plausibility.
In addition, JTA-GAN adopts a deliberately asymmetric dual-generator architecture, motivated by the inherent ill-posedness of underwater image inversion. The forward generator (Land → Sea) is physically guided and outputs interpretable degradation components, whereas the backward generator (Sea → Land) is a stable black-box reconstruction network trained under cycle consistency and perceptual supervision. This design avoids the numerical instabilities introduced by directly inverting the physical formation model and results in significantly improved training stability.

3.1. Simplified Underwater Optical Model

Underwater imaging is governed by a combination of wavelength-dependent absorption, scattering, and depth-dependent turbidity, which significantly alter color and contrast. While the full Jaffe–McGlamery model provides an accurate description of these phenomena, it requires depth maps, multiple images, or specialized sensors that are unavailable in typical unpaired datasets.
Thus, we employ its widely adopted simplified form [32]:
I(x) = J(x)·T(x) + A·(1 − T(x)),  (1)
where J(x) is the clean scene radiance (corresponding to terrestrial content), T(x) ∈ (0, 1] is the transmission map representing spatially varying turbidity, and A ∈ ℝ³ is the global ambient light vector representing wavelength-selective attenuation.
Although simplified, Equation (1) retains the essential physical behavior required for synthetic training data: attenuation increases with effective depth, and the ambient illumination produces characteristic blue/green color casts. Importantly, this model allows learning without depth supervision, because JTA-GAN estimates T(x) directly from 2-D land images in a self-supervised manner.
Embedding Equation (1) into the generator enforces a structured decomposition of underwater effects and prevents the degenerate global color mappings frequently observed in standard CycleGAN training.
The rationale for adopting this simplified formulation is two-fold. First, while the comprehensive radiative transfer equation (RTE) describes underwater image formation through three components (direct attenuation, forward scattering, and backscattering), our framework adopts a simplified model that retains only attenuation and backscattering. This simplification is physically justified for object detection tasks: forward scattering typically produces small-angle blur, which behaves like a Gaussian low-pass filter, and convolutional neural networks are generally robust to minor blurring. They are, however, highly sensitive to the contrast reduction and color distortion caused by backscattering (the “veiling light”). By focusing on the latter, Equation (1) captures the dominant degradation factor affecting detector performance.
Second, and more importantly, this formulation serves as a strong inductive bias for the generator. In a standard unconstrained GAN (e.g., CycleGAN), the generator G learns a direct mapping G(I_land) → I_water. This black-box approach often leads to overfitting, where the model satisfies the discriminator by applying unrealistic global color filters or hallucinating textures that do not exist in the physical world. By enforcing the structure defined in Equation (1), we compel the network to explicitly disentangle the scene radiance J from the environmental factors T and A. This structural constraint acts as a regularizer, preventing the generator from learning physically impossible mappings (e.g., dense fog with high contrast) and ensuring that the synthesized images adhere to the natural laws of light propagation.
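As a concrete illustration, the formation model in Equation (1) applies directly to an RGB array; the sketch below (NumPy, with toy values for T and A, not the paper's learned quantities) shows how low transmission pulls pixels toward the ambient color:

```python
import numpy as np

def synthesize_underwater(J, T, A):
    """Apply the simplified formation model I = J*T + A*(1 - T).

    J : (H, W, 3) clean scene radiance in [0, 1]
    T : (H, W, 1) transmission map in (0, 1]
    A : (3,)     global ambient light vector (e.g., a blue-green cast)
    """
    return J * T + A * (1.0 - T)

# Toy example: a mid-gray scene with heavier haze in the top half.
J = np.full((4, 4, 3), 0.5)
T = np.ones((4, 4, 1))
T[:2] = 0.2                        # low transmission = more haze
A = np.array([0.1, 0.5, 0.6])      # blue-green ambient illumination

I = synthesize_underwater(J, T, A)
# Hazy pixels (T = 0.2) shift toward A; clear pixels (T = 1) keep J.
```

With T = 1 the pixel is the untouched radiance; with T = 0.2 it becomes 0.5·0.2 + A·0.8, i.e., dominated by the ambient cast, which is exactly the veiling-light effect described above.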

3.2. Asymmetric Network Architecture

Given the asymmetry between underwater degradation (well-posed, deterministic) and underwater restoration (ill-posed, unstable), JTA-GAN adopts a non-symmetric architecture composed of:
(1) a physics-constrained forward generator G_{L→S};
(2) a black-box inverse generator G_{S→L}.
We will first examine the overall architecture of the proposed JTA-GAN, as visualized in Figure 2. This schematic highlights the asymmetric dual-generator design, wherein the land-to-underwater path is explicitly constrained by a physics-based image formation model, while the reverse path is implemented as a black-box reconstruction network to ensure numerical stability.

3.2.1. Physics-Informed Forward Generator G_{L→S}

The forward generator converts a land image J into a synthetic underwater image I. Instead of directly outputting RGB pixels, the network predicts two physically meaningful quantities:
(1) Transmission Head H_T: outputs a spatially varying transmission map T̂(x) ∈ (0, 1], enforced via a Sigmoid activation. The T-map captures the haze/turbidity distribution and implicitly encodes depth-like structure.
(2) Ambient Light Head H_A: predicts a global ambient vector Â ∈ ℝ³. Global average pooling ensures that Â represents scene-level illumination rather than per-pixel noise.
(3) Differentiable Physics Layer: the final underwater image is synthesized by substituting the network-predicted transmission and ambient light into the physical image formation process:
Î_sea(x) = J(x)·T̂(x) + Â·(1 − T̂(x)),
Here, J(x) denotes the input land image, which is treated as the clean scene radiance. The terms T̂(x) and Â represent the transmission map and ambient light vector estimated by the generator, rather than ground-truth physical quantities. This formulation can be viewed as a learned, differentiable approximation of the underlying underwater image formation model introduced earlier, and it enables end-to-end optimization under physical constraints.
This design tightly constrains the generator to produce physically grounded degradations, discouraging mode collapse and suppressing the unrealistic texture distortions seen in black-box GANs.
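A minimal functional sketch of the three components above, with stand-in NumPy reductions in place of the actual convolutional heads (the `transmission_head` and `ambient_head` bodies here are illustrative placeholders, not the paper's layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def transmission_head(features):
    """Map decoder features to a transmission map in (0, 1] via Sigmoid."""
    logits = features.mean(axis=-1, keepdims=True)   # stand-in for a 1x1 conv
    return np.clip(sigmoid(logits), 1e-3, 1.0)       # keep T strictly positive

def ambient_head(features):
    """Global average pooling -> a single scene-level ambient vector."""
    pooled = features.mean(axis=(0, 1))              # (C,) pooled descriptor
    return sigmoid(pooled[:3])                       # 3 channels as RGB ambient

def forward_generator(J, features):
    """Differentiable physics layer: combine J with predicted T-hat, A-hat."""
    T = transmission_head(features)
    A = ambient_head(features)
    return J * T + A * (1.0 - T), T, A

J = np.random.rand(8, 8, 3)            # land image as clean radiance
feats = np.random.randn(8, 8, 16)      # decoder feature map (toy)
I_hat, T, A = forward_generator(J, feats)
```

The key design point survives even in this toy version: the network never emits RGB pixels directly; it emits T̂ and Â, and the output image is forced through the formation equation.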
Next, we will illustrate the complete training workflow of JTA-GAN, as depicted in Figure 3. The diagram serves to clarify how adversarial, cycle-consistency, perceptual, and physics-based losses are jointly optimized, emphasizing that physical constraints are applied exclusively in the forward direction to regulate the synthesis process.

3.2.2. Stable Black-Box Inverse Generator G_{S→L}

While the forward land-to-underwater mapping can be naturally guided by a well-defined physical image formation process, the inverse underwater-to-land transformation is inherently ill-posed. Recovering the clean scene radiance from a degraded underwater observation requires undoing attenuation and scattering effects, which is highly sensitive to noise and estimation errors, particularly under severe turbidity.
In principle, inverting the simplified underwater image formation model (Equation (1)) to recover the clean scene radiance J(x) would require solving:
J(x) = (I(x) − A·(1 − T(x))) / T(x),
Here, the numerator removes the additive backscattering component, while the division by T(x) attempts to compensate for the multiplicative attenuation. However, this formulation reveals a critical numerical instability: small values of T(x) (which occur in deep or turbid regions) cause the denominator to approach zero. Consequently, even negligible noise in the input I(x) is amplified drastically as T(x) → 0.
In practice, explicitly enforcing this inverse formulation during training causes gradient explosion, unstable optimization, and frequent model collapse, as confirmed by our preliminary experiments. To avoid these issues, we deliberately adopt an asymmetric design and implement the inverse generator G_{S→L} as a purely data-driven U-Net without physical constraints. Rather than enforcing physical interpretability, this network focuses on learning a robust and stable mapping that supports:
  • stable gradient propagation during cycle-consistent training.
  • perceptually consistent reconstruction of land-domain images.
  • robustness under extreme turbidity and low-transmission conditions.
  • reliable enforcement of cycle-consistency constraints.
This asymmetric strategy allows the physics-informed forward generator to model underwater degradation accurately, while the black-box inverse generator ensures training stability and effective semantic preservation. Together, they form a complementary and practical solution for unpaired underwater domain adaptation.
Unlike standard unconstrained GANs that employ symmetric cycles, JTA-GAN’s asymmetry is a direct response to the singular nature of the underwater imaging equation. In the reverse mapping G_{S→L}, explicitly enforcing the inverse physical model J = (I − A·(1 − T))/T introduces a critical numerical risk: in scenarios of high turbidity where T(x) → 0, the division operation triggers gradient explosion and model collapse.
Therefore, our asymmetric choice is specifically tailored to underwater synthesis: it allows the forward generator G_{L→S} to produce physically interpretable components (J, T, A) for high-quality data generation, while delegating the ill-posed restoration task to a robust black-box U-Net. This ensures that the adversarial game remains stable even under extreme synthetic degradations, a prerequisite for generating the large-scale datasets required for robust object detection.
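The instability argument can be checked numerically: under the explicit inversion, a fixed amount of input noise is amplified by exactly 1/T. A short sketch with a hypothetical noise level and toy values:

```python
import numpy as np

def invert_formation(I, T, A):
    """Explicit inversion J = (I - A*(1 - T)) / T; fragile as T -> 0."""
    return (I - A * (1.0 - T)) / T

A = np.array([0.1, 0.5, 0.6])      # toy ambient light
J_true = np.full(3, 0.5)           # true radiance of one pixel
noise = 1e-3                       # tiny additive noise on the observation

errs = []
for T in (0.9, 0.1, 0.01):
    I = J_true * T + A * (1.0 - T) + noise   # forward model, then noise
    errs.append(np.abs(invert_formation(I, T, A) - J_true).max())
# Reconstruction error equals noise / T: negligible at T = 0.9,
# a 100x amplification at T = 0.01.
```

The same 1/T factor multiplies gradients when the inversion sits inside a training loop, which is why the reverse path is delegated to a black-box U-Net instead.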

3.3. PatchGAN Discriminators

Two discriminators, D_L and D_S, are employed to distinguish real and generated images in the land and underwater domains, respectively. Both discriminators adopt a PatchGAN architecture, which classifies overlapping local image patches rather than enforcing a single global realism constraint.
Compared with image-level discriminators, PatchGAN is particularly effective for underwater image synthesis, as it emphasizes high-frequency structural cues such as local contrast attenuation, caustic patterns, wave-induced distortions, and shadow consistency. These fine-grained details are critical for preserving object boundaries and texture cues that directly affect downstream object detection performance.
By providing localized adversarial supervision, the PatchGAN discriminators encourage the generator to produce spatially coherent degradations while avoiding the global color bias and texture hallucination commonly observed in black-box GAN frameworks. This design complements the physics-informed generator and contributes to stable training and improved detection robustness.
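To illustrate the idea of patch-level supervision, the toy sketch below scores non-overlapping tiles with a stand-in contrast statistic; a real PatchGAN achieves the same effect with strided convolutions over overlapping receptive fields, so this is a structural analogy only:

```python
import numpy as np

def patch_scores(image, patch=16, score_fn=None):
    """Return a grid of per-patch scores instead of one global score.

    Each tile gets its own 'realism' value, so the generator receives
    spatially localized feedback rather than a single scalar.
    """
    if score_fn is None:
        score_fn = lambda p: p.std()     # toy realism cue: local contrast
    H, W, _ = image.shape
    rows, cols = H // patch, W // patch
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            tile = image[i * patch:(i + 1) * patch,
                         j * patch:(j + 1) * patch]
            out[i, j] = score_fn(tile)
    return out

img = np.random.rand(64, 64, 3)
scores = patch_scores(img)               # 4x4 grid of local scores
```

The grid output is the essential property: a flat, over-smoothed region and a sharp textured region receive different adversarial pressure, which is what discourages global color washes.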

3.4. Composite Loss Function

To jointly enforce visual realism, semantic consistency, perceptual fidelity, and physical plausibility, the proposed JTA-GAN is optimized using a composite objective function. Each loss term plays a complementary role in stabilizing training and ensuring that the synthesized underwater images are both physically interpretable and beneficial for downstream object detection. The overall training objective is defined as:
L_total = L_GAN + λ_cyc·L_cyc + λ_percep·L_percep + λ_phy·L_phy,
where λ_cyc, λ_percep, and λ_phy are scalar weights controlling the relative importance of cycle consistency, perceptual similarity, and physical regularization, respectively.
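Assembling the objective with the weights reported earlier (λ_cyc = 10, λ_percep = 1, λ_phy = 1) is a one-liner; the individual loss values below are illustrative only:

```python
# Weighted sum of the four JTA-GAN loss terms. The default lambdas match
# the weights stated in the paper; the input loss values are toy numbers.
def total_loss(l_gan, l_cyc, l_percep, l_phy,
               lam_cyc=10.0, lam_percep=1.0, lam_phy=1.0):
    return l_gan + lam_cyc * l_cyc + lam_percep * l_percep + lam_phy * l_phy

# Illustrative values: the large lambda makes a small cycle error
# (0.04) contribute as much as the other terms combined.
L = total_loss(l_gan=0.25, l_cyc=0.04, l_percep=0.3, l_phy=0.05)
```

Note how λ_cyc = 10 deliberately dominates: even a small per-pixel cycle error is penalized heavily, which is what preserves scene content through the translation cycle.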
Let x ∼ p(L) denote an image sampled from the land domain and y ∼ p(S) an image sampled from the underwater domain. The generators G_{L→S} and G_{S→L} map images between the two domains, while D_S denotes the discriminator operating in the underwater domain.
(a)
Adversarial Loss (L_GAN)
To encourage domain-level realism of the synthesized underwater images, we employ the Least Squares GAN (LSGAN) objective, which is known to provide more stable gradients than the original GAN formulation. The adversarial loss for the forward generator G_LS and discriminator D_S is defined as:
L_GAN(G_LS, D_S) = E_{y∼p(S)}[(D_S(y) − 1)²] + E_{x∼p(L)}[D_S(G_LS(x))²],
where D_S(·) outputs a patch-level realism score for underwater images. This loss encourages the generated images G_LS(x) to match the distribution of real underwater images while maintaining stable adversarial training.
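The LSGAN objective reduces to simple squared errors on the discriminator's patch scores. A minimal sketch (the helper names `lsgan_d_loss` and `lsgan_g_loss` are hypothetical):

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # E[(D_S(y) - 1)^2] + E[D_S(G_LS(x))^2]: real patches pushed toward 1, fakes toward 0
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # The generator instead tries to make its fake patches score 1
    return ((d_fake - 1) ** 2).mean()

# A perfectly separating discriminator incurs zero loss on these targets.
perfect = lsgan_d_loss(torch.ones(2, 1, 30, 30), torch.zeros(2, 1, 30, 30))
```

Because the targets are least-squares rather than log-likelihood, gradients do not saturate for confidently classified samples, which is the stability property cited above.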
(b)
Cycle Consistency Loss (L_cyc)
To preserve semantic structure and spatial correspondence during unpaired translation, we impose a cycle-consistency constraint between the two domains. The cycle loss is formulated using the ℓ1-norm:
L_cyc = E_x[‖G_SL(G_LS(x)) − x‖₁] + E_y[‖G_LS(G_SL(y)) − y‖₁],
which enforces structural consistency and discourages geometric distortion across the translation cycle.
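The two-directional cycle term can be written directly from the definition; a minimal sketch with the generators passed in as callables (`cycle_loss` is a hypothetical helper):

```python
import torch

def cycle_loss(x, y, G_LS, G_SL):
    # ||G_SL(G_LS(x)) - x||_1 + ||G_LS(G_SL(y)) - y||_1, averaged per element
    fwd = (G_SL(G_LS(x)) - x).abs().mean()
    bwd = (G_LS(G_SL(y)) - y).abs().mean()
    return fwd + bwd

# Sanity check: identity generators reconstruct exactly, so the loss is zero.
ident = lambda t: t
z = cycle_loss(torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8), ident, ident)
```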
(c)
Perceptual Loss (LPIPS) (L_percep)
While the ℓ1 cycle loss enforces pixel-level consistency, it often leads to over-smoothed or “painterly” artifacts. To improve perceptual fidelity, we incorporate the Learned Perceptual Image Patch Similarity (LPIPS) loss [33,34,35].
Let φ(·) denote deep feature representations extracted from a pre-trained network. The perceptual loss is defined as:
L_percep = ‖φ(x) − φ(x̂_cyc)‖₂² + ‖φ(y) − φ(ŷ_cyc)‖₂²,
where x̂_cyc = G_SL(G_LS(x)) and ŷ_cyc = G_LS(G_SL(y)) denote the cycle-reconstructed images. This loss encourages perceptual similarity in feature space and is critical for preserving high-frequency textures.
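The structure of the perceptual term can be sketched as a squared feature distance. In practice φ would come from a pre-trained network (as in the LPIPS formulation); here a small random conv stack stands in for it purely to keep the sketch self-contained:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in feature extractor phi. This is NOT a pre-trained LPIPS backbone;
# it only mimics the interface (image -> feature map) for illustration.
phi = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 8, 3, padding=1))

def perceptual_loss(x, x_cyc, y, y_cyc):
    # ||phi(x) - phi(x_cyc)||_2^2 + ||phi(y) - phi(y_cyc)||_2^2
    return ((phi(x) - phi(x_cyc)) ** 2).sum() + ((phi(y) - phi(y_cyc)) ** 2).sum()

x = torch.randn(1, 3, 16, 16)
y = torch.randn(1, 3, 16, 16)
zero = perceptual_loss(x, x, y, y)   # identical reconstructions -> zero loss
```

Unlike the pixel-wise ℓ1 term, this distance is computed on feature maps, so it penalizes loss of texture and edge structure even when per-pixel averages match.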
(d)
Physics Regularization Loss (L_phy)
To ensure that the predicted transmission map T̂(x) is physically plausible and spatially coherent, we apply a Total Variation (TV) regularization term:
L_phy = Σ_{i,j} (|T_{i+1,j} − T_{i,j}| + |T_{i,j+1} − T_{i,j}|),
where (i, j) indexes spatial locations in the transmission map. This regularization suppresses high-frequency noise and promotes smooth haze transitions, consistent with natural underwater scattering behavior.
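The TV term is a sum of absolute forward differences along both spatial axes; a minimal sketch for a single-channel transmission map (`tv_loss` is a hypothetical helper name):

```python
import torch

def tv_loss(T):
    # Anisotropic total variation of a transmission map T with shape (H, W):
    # sum of |T[i+1,j] - T[i,j]| and |T[i,j+1] - T[i,j]| over all valid (i, j).
    dh = (T[1:, :] - T[:-1, :]).abs().sum()   # vertical differences
    dw = (T[:, 1:] - T[:, :-1]).abs().sum()   # horizontal differences
    return dh + dw

flat = tv_loss(torch.full((32, 32), 0.5))    # constant map -> zero penalty
noisy = tv_loss(torch.rand(32, 32))          # pixel noise incurs a large penalty
```

A constant or smoothly varying map is nearly free under this penalty, while high-frequency noise is heavily charged, which is exactly the smooth-haze prior described above.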

3.5. Implementation Details

Selection of λ_cyc = 10: This value was chosen following the original CycleGAN implementation, which has been widely validated as an optimal balance for maintaining structural consistency in unpaired image-to-image translation. Our preliminary tests confirmed that lower values (e.g., λ_cyc = 1) led to significant geometric distortions, while excessively high values restricted the generator’s ability to learn domain-specific underwater degradations.
Selection of λ_percep = 1 and λ_phy = 1: These weights were empirically determined to ensure that perceptual realism and physical consistency were effectively integrated without overwhelming the primary adversarial loss. During the initial tuning phase, we observed that setting λ_percep = 1 was sufficient to eliminate “painterly” artifacts and restore high-frequency textures. Similarly, λ_phy = 1 provided enough regularization to ensure a smooth and spatially coherent transmission map (T-map) while still allowing the network to adapt to diverse turbidity levels.
Sensitivity Analysis: Our observations indicated that the model performance is relatively robust to minor variations in these weights. Significant performance drops (measured by synthesis quality and downstream mAP) only occurred when the weights were adjusted by an order of magnitude, suggesting that our current configuration sits within a stable and effective regime for underwater domain adaptation.

3.6. Dataset Construction and Alignment

In addition to realistic image synthesis, effective domain adaptation for object detection requires precise geometric consistency between generated images and their associated annotations. Misalignment between synthesized images and bounding boxes can severely degrade detector training, particularly for anchor-based models such as YOLO.
To ensure stable adversarial training, JTA-GAN is trained on a balanced subset consisting of 1197 land images from the COCO-debris dataset and 890 underwater images from the UIEB dataset. This balanced configuration prevents discriminator domination and promotes stable estimation of physically meaningful transmission maps.
After GAN training, the forward generator G L S is applied to a large-scale land dataset comprising 65,153 COCO images to construct synthetic underwater training data for object detection. During GAN preprocessing, images are resized to a fixed resolution using aspect-ratio preserving padding, which introduces spatial offsets between original image coordinates and synthesized outputs.
To address this issue, we implement an automated bounding-box remapping procedure that explicitly accounts for padding and scaling operations applied during synthesis. This procedure recalibrates all bounding box coordinates to maintain precise spatial correspondence with the generated underwater images. The alignment process is applied uniformly to all synthetic datasets, including those generated by CycleGAN and JTA-GAN, ensuring a fair and controlled comparison.
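The remapping described above can be sketched as follows, assuming uniform scaling with centered padding. The helper `remap_bbox` is hypothetical and only illustrates the geometry; the paper's actual logic lives in its custom script (generate_yolo_data_v6.py, see Section 4.1.2):

```python
def remap_bbox(box, src_wh, dst=256):
    """Map a COCO-style (x, y, w, h) box through aspect-ratio-preserving
    resize plus centered padding onto a dst x dst canvas.
    Illustrative sketch; padding placement is an assumption."""
    W, H = src_wh
    r = dst / max(W, H)                 # uniform scale factor
    new_w, new_h = round(W * r), round(H * r)
    pad_x = (dst - new_w) // 2          # left padding (centered)
    pad_y = (dst - new_h) // 2          # top padding (centered)
    x, y, w, h = box
    # Scale coordinates, then shift by the padding offsets.
    return (x * r + pad_x, y * r + pad_y, w * r, h * r)

# Example: a 640x480 image is scaled by 0.4 and padded 32 px top/bottom,
# so a box at (100, 50, 200, 100) lands at (40.0, 52.0, 80.0, 40.0).
mapped = remap_bbox((100, 50, 200, 100), (640, 480))
```

Skipping this shift-and-scale step would leave every label offset by the padding amount, which is exactly the annotation misalignment the text warns can cripple anchor-based detectors.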
This detection-aware alignment step is essential for isolating the impact of image synthesis quality on downstream detection performance and preventing confounding effects caused by annotation misalignment.

4. Experiments and Results

This section presents a comprehensive evaluation of the proposed JTA-GAN framework, including experimental settings, quantitative performance comparisons, ablation studies, and qualitative visual analysis. The effectiveness of the physics-informed synthetic data generation is validated through downstream object detection experiments on the SUIM benchmark, with comparisons against a land-only baseline and a conventional CycleGAN-based data augmentation strategy.

4.1. Experimental Setup

To provide a fair and reproducible evaluation of the proposed JTA-GAN framework, this section describes the experimental configuration used for both generative model training and downstream object detection. We first outline the datasets employed for GAN training, synthetic data generation, and detection evaluation, followed by the implementation details and training protocols. All experimental settings are kept consistent across baselines to ensure that performance differences can be attributed solely to the quality of the generated data rather than variations in detector architecture or optimization strategy.

4.1.1. GAN Training Setup

The detailed information of the training datasets and experimental configurations is summarized in Table 1.
JTA-GAN is trained in an unpaired manner using images from two distinct domains:
  • Land domain: 1197 images selected from the COCO-debris subset, containing person and boat categories.
  • Underwater domain: 890 raw underwater images from the UIEB dataset, covering diverse visibility conditions and color distributions.
All images are resized to 256 × 256. Training is conducted for 200 epochs using the Adam optimizer (β1 = 0.5) with a batch size of 1, following standard practices for instance-normalized GAN architectures.
The same training protocol (epochs, optimizer, batch size) is used for CycleGAN, ensuring a fair comparison.

4.1.2. YOLO Training Setup

After GAN training, the final JTA-GAN model (V5) is applied to the full COCO-debris source set to generate synthetic underwater images. A total of 65,153 images are synthesized using our custom script generate_yolo_data_v6.py, which also recalculates bounding-box coordinates to compensate for padding-induced spatial shifts.
YOLO training configuration is summarized as follows:
  • Models: YOLOv8s, YOLOv8l
  • Epochs: 50
  • Input resolution: 640
  • Batch size: 16
  • Optimizer: SGD (Ultralytics default)
  • Evaluation set: SUIM test set (376 images)

4.1.3. Computational Resources

All experiments were conducted on a workstation equipped with:
  • GPU: NVIDIA RTX 4080S (16 GB Video Random Access Memory, VRAM) × 1
  • CPU: AMD 9700X (8 cores, 16 threads)
  • RAM: 32 GB
  • OS: Windows 11
  • Frameworks: PyTorch 2.9.0, CUDA 12.8

4.2. Baselines and Compared Models

To clearly assess the effectiveness of the proposed physics-informed synthesis strategy, we compare JTA-GAN against representative baseline configurations commonly used in underwater domain adaptation. These baselines are designed to isolate the impact of synthetic data quality on downstream detection performance, ranging from no domain adaptation to standard black-box GAN-based augmentation. All compared models share identical detector architectures and training settings to ensure a fair and controlled comparison.
To systematically isolate the influence of synthetic data quality on detection performance, we designed four distinct experimental configurations. Table 2 details these settings, comparing the proposed JTA-GAN against a land-only baseline and a standard CycleGAN benchmark, while also incorporating an ablation study on detector capacity.

4.3. Quantitative Results

We quantitatively evaluate the impact of different training data configurations by measuring object detection accuracy on a held-out underwater benchmark. The primary evaluation metric is mean Average Precision over Intersection over Union (IoU) thresholds from 0.5 to 0.95 (mAP50–95). To rigorously evaluate the numerical stability and reproducibility of the proposed JTA-GAN framework, we conducted a statistical analysis across three independent training sessions for the YOLOv8s detector. Each performance metric is reported as mean ± standard deviation (σ), rather than a simple average. This experimental protocol ensures that the observed performance gains are consistent and not the result of specific random initializations, directly addressing the stability of the physics-informed inductive bias.
To evaluate the efficacy of physics-informed synthesis in mitigating the domain gap, Table 3 presents a comprehensive performance comparison (mAP50–95) between JTA-GAN and the baseline methods, including results for the different detector models. As the data indicate, the inclusion of physical constraints yields significant gains in detection accuracy.
Figure 4 compares detection accuracy under different training datasets. CycleGAN-based synthetic data consistently performs worse than the land-only baseline, confirming that naive image translation can introduce harmful artifacts rather than reducing the domain gap.
To provide an intuitive overview of the quantitative results, Figure 5 visualizes the mAP50–95 performance of YOLO detectors trained under different data augmentation strategies on the SUIM test set.
YOLOv8s trained with JTA-GAN synthetic data achieves 17.3% mAP, substantially outperforming:
  • the land-only baseline (13.2%),
  • the CycleGAN-based synthetic dataset (10.8%).
The improvement of +4.1 percentage points over the baseline and +6.5 points over CycleGAN demonstrates that physics-guided synthesis improves detector generalization under underwater conditions.
Notably, YOLOv8l exhibits similar trends, achieving comparable overall performance. This observation indicates that data quality, rather than detector capacity, is the dominant factor limiting underwater detection performance in this setting.
Key Observations:
  • CycleGAN degrades detector performance
    CycleGAN-generated images exhibit global color-wash artifacts and structural distortions. Despite identical training protocols, YOLO performance drops from 13.2% to 10.8%, confirming that artifact-prone synthetic data can actively harm downstream learning.
  • JTA-GAN significantly improves detection accuracy
    JTA-GAN yields the highest mAP across all configurations. The improvement is attributed to physics-guided degradation modeling and stable cycle-consistent training.
  • Detector size is not the determining factor
    The absence of significant gains from YOLOv8l over YOLOv8s suggests that architectural scaling alone cannot compensate for poor data quality.

4.4. Qualitative Comparison

To better understand the performance trends observed in the quantitative evaluation, we conduct a qualitative comparison of the synthetic underwater images generated by different GAN-based translation models. This analysis focuses on visual characteristics such as color attenuation, haze distribution, and structural integrity, providing insight into how physically guided image synthesis contributes to more reliable training data for underwater object detection.
CycleGAN typically produces flat blue/green overlays and distorted reconstructions, failing to simulate depth-dependent attenuation or scattering.
In contrast, JTA-GAN generates:
  • structured T-maps;
  • smooth, low-frequency ambient illumination;
  • naturalistic color shifts;
  • preserved object boundaries and scene geometry.
These properties are critical for maintaining detector-relevant features such as edges and object silhouettes.

Analysis of Self-Supervised T-Maps

The visualization of the learned T-maps in Figure 6 provides compelling evidence of the model’s internal physical consistency. Although JTA-GAN is trained without any ground-truth depth data or transmission labels, the generator successfully learns to predict spatially varying transmission values that strongly correlate with scene geometry. As shown in the visualizations, objects closer to the camera (e.g., the person in the foreground) are assigned higher transmission values (brighter regions), indicating high visibility. Conversely, distant background elements are assigned lower transmission values (darker regions), simulating the stronger scattering and attenuation effects that occur over longer optical paths.
This behavior indicates that the network has implicitly learned “depth cues” from the monocular land images—such as object scale, occlusion, and perspective—to modulate the scattering effect locally. For instance, in the second row of Figure 6, the T-map accurately segments the umbrella and the pedestrian from the background, applying heavy fog only to the distant street. This capability highlights a critical advantage over CycleGAN: instead of applying a uniform “underwater style” filter across the entire image, JTA-GAN simulates the volumetric nature of underwater turbidity. The T-map acts as a pixel-wise control gate, preserving the contrast of foreground objects while degrading the background. This physical plausibility is not merely a visual enhancement but a functional one; it ensures that the synthetic training data contains realistic signal-to-noise ratio (SNR) gradients, training the downstream detector to distinguish objects from the haze just as it would need to in real-world underwater scenarios.

4.5. Error Analysis and Confusion Analysis

To further understand the strengths and limitations of the proposed JTA-GAN framework beyond aggregate mAP metrics, we conduct a detailed error analysis combining qualitative inspection and class-wise confusion statistics. In particular, we focus on identifying whether residual detection failures originate from synthesis quality, detector capacity, or semantic discrepancies between training and evaluation datasets.

4.5.1. Failure Modes of CycleGAN-Based Synthetic Data

As observed in Figure 7 and Figure 8, detectors trained with CycleGAN-generated images consistently exhibit degraded performance across most categories. Visual inspection reveals that CycleGAN primarily applies global color shifts (blue/green washes) while failing to model depth-dependent attenuation and scattering. These artifacts distort local textures and edges, leading to unstable feature learning in YOLO. As a result, CycleGAN-based augmentation not only fails to improve detection accuracy but actively degrades performance, reducing mAP from 13.2% (land-only baseline) to 10.8%.
This phenomenon highlights that visually plausible style transfer does not necessarily translate into task-relevant synthetic data, especially for detection tasks that rely heavily on geometric and structural cues.

4.5.2. Analysis of JTA-GAN Improvements

In contrast, JTA-GAN significantly reduces false negatives and improves localization stability for semantically consistent categories such as person. The physics-guided synthesis produces structured T-maps and spatially coherent attenuation, which preserve object boundaries and relative contrast under turbidity. These properties lead to more robust feature representations during YOLO training, explaining the observed mAP50–95 improvement to 17.3%.
Importantly, the performance gain is consistent across YOLOv8s and YOLOv8l, confirming that the improvement originates from data quality rather than detector capacity.
The superiority of JTA-GAN in preserving discriminative features is further evidenced by a qualitative analysis of the synthesis behavior. As shown in Figure 7, while CycleGAN tends to distort local textures through global color mapping, JTA-GAN maintains high structural fidelity. This is because the physics-informed architecture explicitly disentangles scene radiance J from environmental degradations. By treating the terrestrial content as a rigid prior and only modulating it through the transmission map T (visualized in Figure 6), the generator prevents the loss of object-level cues such as silhouettes and edges. This preservation of ‘detector-friendly’ features explains the significant mAP gains on real-world underwater benchmarks, as the downstream detector can learn robust representations that are invariant to turbidity-induced contrast reduction.

4.5.3. Confusion Matrix Analysis

Figure 9 presents the normalized confusion matrices for YOLO detectors trained under different data augmentation strategies. Several critical insights emerge.
First, for semantically aligned classes (e.g., person), JTA-GAN substantially increases true positive rates while suppressing cross-class confusion, indicating that physically grounded degradations improve category discriminability.
Second, the boat category exhibits persistently low mAP across all configurations. The confusion matrix reveals frequent misclassification between boat and background or debris-like structures. This behavior is expected and not attributable to synthesis failure. During training, the boat class primarily consists of small surface vessels such as yachts and fishing boats from COCO, whereas the SUIM evaluation set predominantly contains large underwater wrecks and ship remnants. This severe semantic mismatch causes a domain shift at the object level, which cannot be resolved solely through appearance-level domain adaptation.
Crucially, despite this mismatch, JTA-GAN does not exacerbate confusion compared with the land-only baseline, demonstrating that the proposed synthesis remains structurally faithful and does not introduce harmful biases.

4.5.4. Analysis of Results

The combined error and confusion analysis clarifies that JTA-GAN effectively addresses appearance-induced domain gaps while revealing an orthogonal limitation caused by semantic inconsistency between training and evaluation object definitions. These findings underscore an important distinction: physics-consistent image synthesis can improve detector robustness, but resolving semantic mismatch requires complementary strategies such as class redefinition, hierarchical labeling, or instance-level domain alignment.

4.5.5. Failure Case and Limitation Analysis

While JTA-GAN demonstrates superior performance in most underwater scenarios, it still faces challenges in extreme conditions. Figure 10 illustrates typical failure cases where the synthesized images may not fully reach the desired level of realism. These failures primarily occur in two scenarios:
Extreme Turbidity and Information Loss: When the input land image contains regions with very low contrast or intricate textures, the generator may estimate a near-zero transmission map (T(x)), leading to excessive haze synthesis that obscures the object’s semantic features.
Complex Lighting Artifacts: Since our physical model employs a global ambient light vector A, it cannot fully simulate localized, dynamic lighting effects such as caustics or non-uniform refractions found in shallow water.
Identifying these failure modes provides a clear boundary for the current physics-informed approach and suggests that incorporating more complex optical models could be a focus for future development.

4.6. Ablation Study

To quantify the contribution of each component in JTA-GAN, we evaluate three ablation settings consistent with the models actually trained during development (V3, V4, V5). The comparison focuses on training stability, reconstruction behavior, and the final SUIM detection performance using YOLOv8s. Table 4 presents the ablation results across various model configurations, demonstrating the necessity of each proposed component in our framework.
The stability of JTA-GAN training is highly sensitive to the balance between the land domain (XL) and the underwater domain (XS). We conducted experiments using various sample ratios to determine the operational boundaries of our framework. In preliminary trials, we tested an imbalanced configuration of 1479:120 (approx. 12:1). However, the results were unsatisfactory as the discriminator in the underwater domain saturated prematurely, leading to vanishing gradients and generator collapse. Through further empirical testing, we observed that discriminator domination consistently occurred when the land-to-underwater ratio exceeded 10:1. In contrast, the adopted ratio of 1197:890 (approx. 1.3:1) maintained a competitive adversarial game, ensuring that the generator could accurately disentangle scene radiance (J) from transmission (T) and ambient light (A).
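The stability boundary above is a simple ratio check on the two domain sample counts; the numbers below are those reported in the text:

```python
# Land-to-underwater sample ratios from the ablation experiments:
# discriminator saturation was observed once the ratio exceeded roughly 10:1.
unstable_ratio = 1479 / 120   # approx. 12.3:1 -> discriminator domination, collapse
stable_ratio = 1197 / 890     # approx. 1.3:1 -> balanced adversarial competition
saturation_threshold = 10.0   # empirical boundary reported in the text
```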
To rigorously evaluate the indispensability of the physics layer and the asymmetric architecture, we extended our ablation study to include variants V1 and V2. V1 represents a standard black-box CycleGAN without any physical constraints, which results in a significantly lower mAP (10.8%) due to unrealistic global color shifts. V2 incorporates the physics layer but employs a symmetric inverse mapping (I = J·T + A(1 − T)); as expected, this configuration proved to be numerically unstable, frequently encountering gradient explosions in highly turbid regions where T approaches zero. These results empirically confirm that the physics layer provides the necessary inductive bias for realistic synthesis, while the asymmetric design is the prerequisite for training stability in underwater environments.

4.7. Computational Efficiency Analysis

To evaluate the practical feasibility of the proposed framework for real-time underwater applications, we measured the computational overhead and latency on an NVIDIA RTX 4080S GPU. As summarized in Table 5, the JTA-GAN training process for 200 epochs was completed in approximately 4.03 h (72 s per epoch), demonstrating that the physics-informed architecture does not introduce prohibitive overhead during the synthesis phase. For the downstream task, the JTA-GAN-trained YOLOv8s detector achieves a total per-image latency of 3.0 ms on the SUIM validation set (comprising 0.5 ms preprocessing, 1.9 ms inference, and 0.6 ms post-processing). This corresponds to a high-throughput rate of approximately 333.3 FPS (Frames Per Second), which significantly exceeds the real-time requirements of most AUVs. Furthermore, the peak GPU VRAM consumption was recorded at 8.0 GB, with system RAM usage ranging between 10.0 and 18.0 GB. These metrics indicate that our model is highly compatible with the hardware constraints of modern underwater robotic platforms.
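The throughput figure follows directly from the per-stage latency budget in Table 5:

```python
# Per-image latency budget measured on the SUIM validation set (Table 5).
pre_ms, infer_ms, post_ms = 0.5, 1.9, 0.6
total_ms = pre_ms + infer_ms + post_ms   # 3.0 ms total per image
fps = 1000.0 / total_ms                  # approx. 333.3 frames per second
```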

5. Discussion

To further clarify the causal links between our methodology and the observed performance gains, we provide a targeted analysis of our key design choices. First, the integration of the physics-informed decoder (J-T-A decomposition) is directly responsible for closing the domain gap between land and underwater environments. Unlike CycleGAN, which produces generic color shifts, our model enforces distance-dependent degradation, allowing the YOLO detector to learn features that are physically consistent with real-world underwater optics, resulting in a 6.5% mAP improvement over the CycleGAN baseline. Second, the adoption of the asymmetric dual-generator architecture is the prerequisite for the high-quality synthesis observed in V5. By avoiding the numerical singularities inherent in symmetric physical inversion, we achieved a stable training signal that prevents artifacts and preserves object-level discriminative features. Finally, the perceptual loss (LPIPS) specifically addresses the structural fidelity of underwater objects. The qualitative clarity of synthesized textures (as seen in Figure 7) correlates with the enhanced detection precision for detailed classes like ‘Human’ and ‘Robot,’ as it prevents the ‘oil-painting’ artifacts that often confuse supervised detectors.

5.1. Interpretation of the Experimental Results

The experimental results demonstrate that the effectiveness of synthetic data for underwater object detection is highly dependent on whether the generation process respects the underlying physical characteristics of the target domain. While unpaired image-to-image translation frameworks such as CycleGAN are often assumed to be beneficial for domain adaptation, our results clearly indicate that naive, appearance-driven synthesis can be counterproductive.
Specifically, detectors trained on CycleGAN-generated underwater images consistently underperformed even the land-only baseline. This degradation can be attributed to the tendency of CycleGAN to satisfy adversarial objectives through global color shifts and texture hallucination, rather than modeling physically meaningful underwater degradation. Such artifacts distort object boundaries and introduce spurious textures, which negatively impact feature learning in downstream detectors such as YOLOv8. These findings highlight that synthetic data augmentation is not inherently beneficial; without appropriate constraints, it may amplify domain noise rather than reduce domain discrepancy.
In contrast, JTA-GAN consistently improves detection performance across all evaluated settings. By explicitly disentangling scene radiance, transmission, and ambient light, the proposed framework generates underwater images that preserve semantic structure while introducing realistic degradation patterns. The observed mAP improvement—from 13.2% (land-only) to 17.3% (JTA-GAN)—confirms that physics-consistent synthesis produces training data that is both visually plausible and functionally useful for detection models. Importantly, these gains are achieved without reliance on depth supervision or paired data, underscoring the practicality of the proposed approach.
Physics-Based Synthesis as Adversarial Data Augmentation: From a representation learning perspective, the success of JTA-GAN can be interpreted as a form of physically grounded adversarial data augmentation. The fundamental challenge in domain adaptation is the misalignment of feature distributions between source and target domains. Traditional style transfer methods (like CycleGAN) attempt to align these distributions by modifying the global appearance (texture and color). However, they often inadvertently alter the semantic content (e.g., distorting object shapes), which confuses the detector.
JTA-GAN addresses this by keeping the semantic content (scene radiance J) rigid while injecting physical noise (turbidity T and attenuation A) in a structurally consistent manner. When a YOLO detector is trained on this data, it is essentially being trained to be invariant to these specific physical degradations. The network learns to ignore the “veiling light” and focus on the underlying structural features that remain constant across domains. This explains why the mAP improvement is so significant (from 13.2% to 17.3%): the detector is not just learning to recognize “blue objects,” but is learning to extract robust features that persist even when the signal is attenuated by the medium. Furthermore, because the synthesis is constrained by the LPIPS perceptual loss, the semantic consistency of the bounding boxes is preserved. The generated objects remain strictly aligned with their labels, ensuring that the detector is provided with high-quality, reliable supervision signals, thereby solving the “label shift” problem often encountered in GAN-based augmentation.

5.2. Analysis of Category-Specific Performance and Semantic Mismatch

A notable observation is the consistently low detection performance for the boat category across all experimental configurations. This behavior should not be interpreted as a failure of the proposed synthesis model. Instead, it reflects a fundamental semantic mismatch between the training and evaluation datasets.
During training, the boat class primarily consists of small-scale surface vessels such as yachts and fishing boats derived from terrestrial COCO images. In contrast, the SUIM evaluation set predominantly contains large underwater structures, including wrecks and submerged ruins, which differ substantially in scale, geometry, and visual context. As a result, even physically realistic underwater synthesis cannot bridge this high-level semantic gap. This conclusion is further supported by the fact that increasing detector capacity (YOLOv8l vs. YOLOv8s) does not significantly improve boat detection performance.
The confusion matrix analysis (Figure 9) reinforces this interpretation by revealing systematic misclassification patterns rather than random noise. These errors suggest that semantic inconsistency—rather than insufficient visual realism—is the dominant limiting factor for this category. Consequently, the low boat mAP should be viewed as an inherent dataset limitation, not as evidence of ineffective domain translation.

5.3. Stability, Loss Design, and the Role of Perceptual Supervision

Another critical insight from the experiments concerns training stability and loss formulation. Our ablation studies demonstrate that balanced domain sampling is a prerequisite for stable adversarial learning in asymmetric GAN architectures. Severe imbalance between land and underwater domains leads to discriminator domination and generator collapse, whereas a more balanced configuration enables meaningful T-map estimation and stable convergence.
Moreover, the inclusion of LPIPS perceptual loss plays a decisive role in preserving high-frequency structure during cycle reconstruction. Models trained solely with L1-based cycle consistency exhibit painterly artifacts and over-smoothed textures, particularly in the reverse mapping. Incorporating perceptual supervision mitigates these effects by aligning feature-level representations, resulting in reconstructions that are visually sharper and semantically more faithful [36]. This improvement is not merely cosmetic; it directly contributes to better downstream detection performance by maintaining discriminative object features.

5.4. Limitations and Implications for Underwater Domain Adaptation

Despite its advantages, the proposed JTA-GAN framework has several inherent limitations. First, the simplified optical model assumes a single global ambient light vector and does not explicitly account for wavelength-dependent attenuation or spatially varying illumination beyond transmission effects. While this approximation is sufficient for effective data augmentation, it does not capture the full complexity of underwater light propagation.
Second, the framework operates on single images and does not model temporal consistency, which may limit its applicability to video-based underwater perception tasks. Finally, as demonstrated by the boat category results, physics-consistent synthesis alone cannot resolve high-level semantic mismatches between training and evaluation datasets.
Nevertheless, these limitations also clarify an important implication: effective domain adaptation for underwater detection requires both physically grounded low-level modeling and semantically aligned training data. JTA-GAN addresses the former by constraining image synthesis through interpretable physical components, thereby preventing the harmful artifacts observed in black-box GANs. The remaining challenges point toward complementary research directions, such as semantic-aware data selection and multi-modal supervision, rather than deficiencies in the proposed generative framework itself.

6. Conclusions and Future Work

This paper presents JTA-GAN, a physics-guided generative framework for unpaired land-to-underwater domain adaptation aimed at improving underwater object detection. Unlike conventional image-to-image translation methods that rely solely on appearance alignment, JTA-GAN explicitly embeds a simplified underwater image formation model into the generative process. By disentangling scene radiance, transmission (T), and ambient light, the proposed framework synthesizes underwater images that are both physically interpretable and structurally consistent.
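The formation model underlying this disentanglement can be stated directly as code. The following NumPy sketch renders the simplified imaging equation I(x) = J(x)·T(x) + A·(1 − T(x)); the array shapes and numeric values are illustrative only:

```python
import numpy as np

def render_underwater(J, T, A):
    """Differentiable rendering step of the simplified formation model:

        I(x) = J(x) * T(x) + A * (1 - T(x))

    J : (H, W, 3) scene radiance in [0, 1]
    T : (H, W, 1) transmission map in (0, 1]
    A : (3,)      global ambient-light vector
    """
    return J * T + A * (1.0 - T)

# Toy example: a mid-grey scene, uniform transmission 0.5, and a
# blue-green-dominant ambient light (values are illustrative only).
J = np.full((4, 4, 3), 0.5)
T = np.full((4, 4, 1), 0.5)
A = np.array([0.1, 0.4, 0.6])
I = render_underwater(J, T, A)

# Sanity checks: with T = 1 the scene passes through unattenuated;
# as T -> 0 the output collapses to the ambient light alone.
assert np.allclose(render_underwater(J, np.ones_like(T), A), J)
assert np.allclose(render_underwater(J, np.zeros_like(T), A),
                   np.broadcast_to(A, J.shape))
```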
A key design choice of JTA-GAN is its asymmetric dual-generator architecture. The forward generator is constrained by a differentiable physics layer to model underwater degradation in a well-posed manner, while the inverse generator adopts a stable black-box design to avoid the numerical instability inherent in reversing the physical model. Together with balanced domain sampling, PatchGAN discriminators, perceptual supervision via LPIPS, and physics-based regularization, this architecture achieves stable training and high-fidelity image synthesis without requiring depth supervision or paired data.
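Why the reverse direction is left to a black-box generator can be seen by inverting the formation model analytically: J = (I − A·(1 − T)) / T. The sketch below (scalar, single-channel values, all illustrative) shows how a fixed observation error in I is amplified by 1/T, becoming unbounded as T → 0:

```python
import numpy as np

def invert_formation(I, T, A):
    """Naive analytic inversion J = (I - A * (1 - T)) / T.

    This is what a symmetric, physics-constrained reverse generator would
    have to compute; the division by T makes the mapping ill-posed as
    T -> 0, which motivates the stable black-box reverse generator.
    """
    return (I - A * (1.0 - T)) / T

A = 0.4          # illustrative global ambient light (single channel)
J_true = 0.5     # illustrative true scene radiance
delta = 1e-3     # small observation error in I
for T in (0.5, 1e-2, 1e-4):
    I_obs = J_true * T + A * (1.0 - T) + delta
    err = abs(invert_formation(I_obs, T, A) - J_true)
    # The fixed 1e-3 error in I is amplified to delta / T
    # (2e-3, 1e-1, and 1e1 here), so the inverse blows up near T -> 0.
    assert np.isclose(err, delta / T)
```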
Comprehensive experiments demonstrate that synthetic data quality is a decisive factor in underwater detection performance. While CycleGAN-based synthesis degrades YOLO performance due to unrealistic artifacts, detectors trained with JTA-GAN–generated images achieve substantial improvements, reaching 17.3% mAP50–95 on the SUIM benchmark, compared to 13.2% for the land-only baseline. These results confirm that physics-consistent image synthesis can effectively reduce domain gaps and enhance the robustness of underwater perception systems. Furthermore, category-wise and confusion-matrix analyses reveal that remaining performance bottlenecks are largely attributable to semantic mismatches between training and evaluation datasets rather than limitations of the proposed generative model.
Several directions can be explored to further extend the proposed framework. First, the current simplified optical model assumes a single global ambient light vector and does not explicitly model wavelength-dependent attenuation. Incorporating multi-spectral or wavelength-aware formulations may further improve realism under diverse underwater conditions.
Second, extending JTA-GAN to temporal or video-based synthesis would enable the generation of temporally consistent underwater sequences, which are crucial for applications such as autonomous underwater vehicles and long-term monitoring. Integrating motion-aware constraints or temporal coherence losses is a promising avenue for future research.
Third, the semantic mismatch observed in certain categories (e.g., surface vessels versus underwater wrecks) suggests that semantic-aware data curation or class-level adaptation strategies could complement physics-guided synthesis. Combining JTA-GAN with category-specific domain alignment or hybrid real–synthetic datasets may further enhance detection robustness.
Finally, optimizing computational efficiency and exploring lightweight implementations could facilitate real-time deployment on embedded or edge devices, expanding the applicability of physics-informed generative learning to practical underwater inspection and surveillance scenarios.

Author Contributions

Conceptualization, Y.-H.C., L.-Y.Y. and Y.-Y.C.; methodology, Y.-H.C., L.-Y.Y. and Y.-Y.C.; software, Y.-H.C. and L.-Y.Y.; validation, Y.-H.C., L.-Y.Y. and Y.-Y.C.; formal analysis, Y.-H.C. and L.-Y.Y.; investigation, Y.-H.C., L.-Y.Y. and Y.-Y.C.; resources, Y.-H.C., L.-Y.Y. and Y.-Y.C.; data curation, Y.-H.C., L.-Y.Y. and Y.-Y.C.; writing—original draft preparation, Y.-H.C., L.-Y.Y. and Y.-Y.C.; writing—review and editing, Y.-H.C., L.-Y.Y. and Y.-Y.C.; visualization, Y.-H.C., L.-Y.Y. and Y.-Y.C.; supervision, Y.-H.C. and Y.-Y.C.; funding acquisition, Y.-H.C. and Y.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council (NSTC) of Taiwan under Grant No. NSTC 113-228-E-110-006.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Adam, M.A.A.; Tapamo, J.R. Survey on Image-Based Vehicle Detection Methods. World Electr. Veh. J. 2025, 16, 303. [Google Scholar] [CrossRef]
  2. Wei, W.; Chen, H.; Gong, J.; Che, K.; Ren, W.; Zhang, B. Real-Time Parking Space Detection Based on Deep Learning and Panoramic Images. Sensors 2025, 25, 6449. [Google Scholar] [CrossRef] [PubMed]
  3. Xiao, J.; Shen, H.; Ding, Y.; Guo, B. Industrial-AdaVAD: Adaptive Industrial Video Anomaly Detection Empowered by Edge Intelligence. Mathematics 2025, 13, 2711. [Google Scholar] [CrossRef]
  4. Chen, K.; Zhou, X.; Ren, J. DLF-YOLO: A Dynamic Synergy Attention-Guided Lightweight Framework for Few-Shot Clothing Trademark Defect Detection. Electronics 2025, 14, 2113. [Google Scholar] [CrossRef]
  5. Slimi, H.; Balti, A.; Sayadi, M.; Ben Khelifa, M.M. Augmented Gait Classification: Integrating YOLO, CNN-SNN Hybridization, and GAN Synthesis for Knee Osteoarthritis and Parkinson’s Disease. Signals 2025, 6, 64. [Google Scholar] [CrossRef]
  6. Min, X.; Ye, Y.; Xiong, S.; Chen, X. Computer Vision Meets Generative Models in Agriculture: Technological Advances, Challenges and Opportunities. Appl. Sci. 2025, 15, 7663. [Google Scholar] [CrossRef]
  7. Liu, M.; Jiang, W.; Hou, M.; Qi, Z.; Li, R.; Zhang, C. A Deep Learning Approach for Object Detection of Rockfish in Challenging Underwater Environments. Front. Mar. Sci. 2023, 10, 1242041. [Google Scholar] [CrossRef]
  8. Wang, X.; Zhang, Z.; Shang, X. Research on Improved YOLO11 for Detecting Small Targets in Sonar Images Based on Data Enhancement. Appl. Sci. 2025, 15, 6919. [Google Scholar] [CrossRef]
  9. Zhu, H.; Zhu, D.; Qin, X.; Guo, F. Efficient and Accurate Beach Litter Detection Method Based on QSB-YOLO. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 460–467. [Google Scholar] [CrossRef]
  10. Islam, M.J.; Xia, Y.; Sattar, J. Fast Underwater Image Enhancement for Improved Visual Perception. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
  11. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  12. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An Underwater Image Enhancement Benchmark Dataset and Beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef]
  13. Zhang, D.; Yu, C.; Li, Z.; Qin, C.; Xia, R. A lightweight network enhanced by attention-guided cross-scale interaction for underwater object detection. Appl. Soft Comput. 2025, 184, 113811. [Google Scholar] [CrossRef]
  14. Qin, C.; Ran, X.; Zhang, D. Unsupervised image stitching based on Generative Adversarial Networks and feature frequency awareness algorithm. Appl. Soft Comput. 2025, 183, 113466. [Google Scholar] [CrossRef]
  15. Zhang, D.; Hao, X.; Wang, D. An efficient lightweight convolutional neural network for industrial surface defect detection. Artif. Intell. Rev. 2023, 56, 10651–10677. [Google Scholar] [CrossRef]
  16. Lee, S.-W.; Lee, S.-H.; Son, D.-M.; Lee, S.-H. Image Visibility Enhancement Under Inclement Weather with an Intensified Generative Training Set. Mathematics 2025, 13, 2833. [Google Scholar] [CrossRef]
  17. Yang, Z.; Yin, Y.; Jing, Q.; Shao, Z. A High-Precision Detection Model of Small Objects in Maritime UAV Perspective Based on Improved YOLOv5. J. Mar. Sci. Eng. 2023, 11, 1680. [Google Scholar] [CrossRef]
  18. Wei, Z.; Dong, S.; Wang, X. Petrochemical Equipment Tracking by Improved Yolov7 Network and Hybrid Matching in Moving Scenes. Sensors 2023, 23, 4546. [Google Scholar] [CrossRef] [PubMed]
  19. Li, C.; Wang, Y.; Liu, X. An Improved YOLOv7 Lightweight Detection Algorithm for Obscured Pedestrians. Sensors 2023, 23, 5912. [Google Scholar] [CrossRef]
  20. Wang, Y.; Jia, Y.; Gu, L. EFM-Net: Feature Extraction and Filtration with Mask Improvement Network for Object Detection in Remote Sensing Images. Remote Sens. 2021, 13, 4151. [Google Scholar] [CrossRef]
  21. Zhang, R.; Song, Y.; Zhang, R.; Lei, Y.; Cheng, H.; Zhong, J. A Novel Anti-UAV Detection Method for Airport Safety Based on Style Transfer Learning and Deep Learning. Electronics 2025, 14, 4620. [Google Scholar] [CrossRef]
  22. Chen, L.; Dong, X.; Xie, Y.; Wang, S. WaterPairs: A Paired Dataset for Underwater Image Enhancement and Underwater Object Detection. Intell. Mar. Technol. Syst. 2024, 2, 6. [Google Scholar] [CrossRef]
  23. Feng, S.; Ma, S.; Zhu, X.; Yan, M. Artificial Intelligence-Based Underwater Acoustic Target Recognition: A Survey. Remote Sens. 2024, 16, 3333. [Google Scholar] [CrossRef]
  24. Li, J.; Skinner, K.A.; Eustice, R.M.; Johnson-Roberson, M. WaterGAN: Unsupervised-to-supervised deep learning for optical detection of marine debris. IEEE Robot. Autom. Lett. 2017, 2, 1564–1571. [Google Scholar]
  25. Wang, N.; Zhou, Y.; Han, F.; Zhu, H.; Zheng, J. UWGAN: Underwater GAN for Real-World Underwater Image Restoration. arXiv 2019, arXiv:1912.03465. [Google Scholar]
  26. Jyothimurugan, M.; Pavithra, S.; Roselind, J.D. Efficient underwater ecological monitoring with embedded AI: Detecting Crown-of-Thorns Starfish via DCGAN and YOLOv6. Front. Mar. Sci. 2025, 12, 1658205. [Google Scholar] [CrossRef]
  27. Almazrouei, H.; Al Nasseri, M.; Alzaabi, M. An AI-Powered Autonomous Underwater System for Sea Exploration and Scientific Research. arXiv 2025, arXiv:2512.07652. [Google Scholar] [CrossRef]
  28. Wang, W.; Yu, Z.; Huang, M. Refining features for underwater object detection at the frequency level. Front. Mar. Sci. 2025, 12, 1544839. [Google Scholar] [CrossRef]
  29. Wen, J.; Cui, J.; Chen, B.M. EnYOLO: A Real-Time Framework for Domain-Adaptive Underwater Object Detection with Image Enhancement. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 11054–11060. [Google Scholar]
  30. Ding, X.; Chen, X.; Sui, Y.; Wang, Y.; Zhang, J. Underwater Image Enhancement Using a Diffusion Model with Adversarial Learning. J. Imaging 2025, 11, 212. [Google Scholar] [CrossRef]
  31. Zhu, M. HyNPhyAttnGAN: Underwater image enhancement with Unet-GAN by fusing hybrid normalization, physics simulation, and multi-scale attention mechanisms. In Proceedings of the Seventeenth International Conference on Digital Image Processing (ICDIP 2025), Haikou, China, 25–27 April 2025; Volume 13709, p. 1370932. [Google Scholar]
  32. Chiang, J.Y.; Chen, Y.C. Underwater Image Enhancement by Wavelength Compensation and Dehazing. IEEE Trans. Image Process. 2012, 21, 1756–1769. [Google Scholar] [CrossRef]
  33. Ghildyal, A.; Liu, F. Attacking Perceptual Similarity Metrics. Trans. Mach. Learn. Res. 2023, 2023, 898. [Google Scholar]
  34. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  35. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image Quality Assessment: Unifying Structure and Texture Similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef]
  36. Ghazanfari, S.; Araujo, A.; Krishnamurthy, P.; Khorrami, F.; Garg, S. LipSim: A Provably Robust Perceptual Similarity Metric. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
Figure 1. Overview of the proposed JTA-GAN. Illustration of the forward and backward translation results, including (a) the land-domain input image, (b) the synthesized underwater output, (c) the reconstructed land-domain image obtained from the inverse generator, and (d) the self-supervised transmission map predicted by the physics-guided decoder.
Figure 2. Training pipeline of JTA-GAN. Land images and underwater images form two unpaired domains. The forward generator estimates transmission and ambient-light parameters and produces synthetic underwater images, while the inverse generator reconstructs land images to enforce cycle consistency. PatchGAN discriminators supervise domain realism on both sides.
Figure 3. Architecture and information flow of the proposed JTA-GAN. The physics-informed forward generator predicts a transmission map and global ambient-light vector, which are fused through a differentiable rendering layer to synthesize underwater images. The reverse generator is implemented as a black-box U-Net to ensure stable reconstruction and gradient flow during cycle-consistency training. Green and orange arrows represent the forward data flow and gradient backpropagation, respectively.
Figure 4. Detection accuracy comparison under different training datasets.
Figure 5. Final performance summary on SUIM (mAP50–95).
Figure 6. Visualization of self-supervised transmission maps predicted by JTA-GAN. The model learns depth-like turbidity distributions directly from unpaired land images, producing smooth and physically coherent T-maps without any depth supervision. These maps govern the attenuation strength and contribute to realistic underwater synthesis.
Figure 7. Qualitative comparison of translation behavior between CycleGAN and JTA-GAN. CycleGAN produces global blue/green washes and structurally distorted reconstructions, revealing instability in the backward mapping. In contrast, JTA-GAN yields spatially structured transmission maps and produces consistent forward and reverse translations with preserved structural content.
Figure 8. YOLO detection results on the SUIM validation set. Comparison of detection outputs across four training conditions: (a) land-only baseline, (b) CycleGAN-based synthetic data, (c) JTA-GAN synthetic data with YOLOv8s, and (d) JTA-GAN synthetic data with YOLOv8l. JTA-GAN significantly improves cross-domain detection performance, while CycleGAN further degrades accuracy.
Figure 9. Confusion matrix comparison for YOLOv8s trained under three configurations: (a) land-only baseline, (b) CycleGAN-based synthetic data, and (c) JTA-GAN synthetic data. CycleGAN introduces color-wash artifacts that increase FP and FN across most categories, whereas JTA-GAN reduces haze-related FN and preserves structural cues. The “Boat” class consistently shows low accuracy due to semantic mismatch between COCO (training) and SUIM (evaluation).
Figure 10. Typical failure cases generated by JTA-GAN. (Left) Information loss caused by overestimated turbidity in low-contrast scenes, leading to total signal attenuation. (Right) Unrealistic global color cast and flat lighting, occurring when the simplified physical model fails to capture complex, non-uniform underwater illumination.
Table 1. Datasets used in the JTA-GAN training pipeline, synthetic data generation, YOLO detector training, and final evaluation.

| Stage | Dataset | Source | Images | Role in Pipeline |
|---|---|---|---|---|
| GAN Training | COCO-debris | COCO 2017 | 1197 | Land-domain input for learning scene radiance distributions; semantic classes are person and small boats used for YOLO training. |
| GAN Training | UIEB (raw) | UIEB | 890 | Underwater-domain input; provides real underwater turbidity, color attenuation, and illumination patterns. |
| Synthetic Generation | COCO → JTA-GAN | Generated via generate_yolo_data_v6.py | 65,153 | Physically plausible synthetic underwater images rendered from the COCO land-domain inputs. |
| YOLO Training | JTA-GAN Synthetic Set | (Above) | 65,153 | Used to train YOLOv8s and YOLOv8l; represents underwater-like labeled data for supervised detection. |
| Evaluation | SUIM Test Set | SUIM dataset | 376 | Used only for final performance evaluation; includes underwater scenes with wrecks, ruins, and large structures producing low boat-class mAP (semantic mismatch). |
Table 2. Compared training configurations and experimental intent.

| Training Data | GAN Model | Detector | Experimental Intent |
|---|---|---|---|
| COCO-debris (land only) | None | YOLOv8s | Domain-gap baseline |
| CycleGAN synthetic underwater | CycleGAN | YOLOv8s | Standard GAN baseline |
| JTA-GAN synthetic underwater | JTA-GAN v5 | YOLOv8s | Proposed method |
| JTA-GAN synthetic underwater | JTA-GAN v5 | YOLOv8l | Ablation on detector size |
Table 3. Quantitative comparison of detection performance on the SUIM benchmark. Results are presented as mean ± standard deviation based on three independent training runs (n = 3).

| Experiment | YOLO Model | Person (%) | Boat (%) | All (%) | Interpretation |
|---|---|---|---|---|---|
| Baseline | YOLOv8s | 23.2 | 3.2 | 13.2 | Land-only baseline; shows severe domain gap |
| CycleGAN | YOLOv8s | 21.0 | 0.5 | 10.8 | CycleGAN degrades performance (artifact-heavy images) |
| JTA-GAN | YOLOv8s | 34.3 ± 0.2 | 0.3 ± 0.1 | 17.3 ± 0.1 | Our method; +60% relative mAP over CycleGAN |
| JTA-GAN | YOLOv8l | 31.4 | 2.9 | 17.1 | Larger YOLO improves "boat" but not "overall" |
Table 4. Ablation results across model variants showing the necessity of each core module.

| Model Variant | LPIPS | Asymmetric Arch. | Dataset Balance (Land/Underwater) | Physics Layer | T-Map Quality | mAP (%) |
|---|---|---|---|---|---|---|
| CycleGAN | – | – | 1197/890 (balanced) | – | N/A | 10.8 |
| V2 (Sym-Physics) | – | – | 1479/120 (imbalanced) | ✓ | Unstable | N/A 1 |
| V3 | – | ✓ | 1479/120 (imbalanced) | ✓ | Broken, noisy | N/A 2 |
| V4 | – | ✓ | 1197/890 (balanced) | ✓ | Smooth but oversmoothed | N/A 3 |
| V5 | ✓ | ✓ | 1197/890 (balanced) | ✓ | Structured, depth-like | 17.3 |
Note 1: V2 suffered from gradient explosion during reverse mapping due to numerical singularities (T → 0). Note 2: Training collapsed; synthesis was too degraded for detector training. Note 3: Significant loss of high-frequency detail; excluded from detection experiments.
Table 5. Detailed computational efficiency metrics of the proposed JTA-GAN and YOLO framework.

| Category | Metric | Value |
|---|---|---|
| JTA-GAN Training | Total Time (200 epochs) | 4.03 h |
| | Time per Epoch | 72 s |
| YOLOv8 Training | Total Time (50 epochs) | 4.76 h |
| Real-Time Inference | Pre-processing | 0.5 ms/image |
| | Inference Latency | 1.9 ms/image |
| | Post-processing | 0.6 ms/image |
| | Total Latency | 3.0 ms/image |
| | Throughput | ≈333 FPS |
| Memory Footprint | Peak GPU VRAM | 8.0 GB |
| | System RAM Usage | 10.0–18.0 GB |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, Y.-H.; Yu, L.-Y.; Chen, Y.-Y. JTA-GAN: A Physics-Informed Framework for Realistic Underwater Image Generation and Improved Object Detection. Mathematics 2026, 14, 605. https://doi.org/10.3390/math14040605

