1. Introduction
Object detection in underwater environments remains a challenging problem despite rapid advances in deep learning for general object recognition [1,2,3,4,5,6]. State-of-the-art detectors such as YOLO variants typically require large amounts of labelled training data to achieve robust performance. However, collecting and annotating underwater imagery is costly, time-consuming, and often impractical for many application domains (e.g., diver detection, wreck inspection). Consequently, there exists a substantial domain gap between large, well-annotated terrestrial datasets and the limited, diverse, and visually degraded underwater image collections; this gap significantly degrades detector generalization when models trained on land images are deployed underwater [7,8,9].
A common strategy to reduce this gap is to use generative models to synthesize realistic target-domain images for data augmentation. Unpaired image-to-image translation frameworks such as CycleGAN are widely adopted because they do not require pixel-level paired images. Nevertheless, our experiments and broader empirical experience indicate that off-the-shelf, “black-box” style GANs are ill-suited for the land → underwater adaptation task. In particular, standard CycleGAN often learns superficial color transforms (e.g., uniform blue/green washes) rather than physically plausible degradations such as distance-dependent haze, wavelength-selective attenuation, and structured ambient illumination. When used to generate training data for detectors, these artifact-laden synthetic images can be actively harmful: in our evaluation a YOLOv8s model trained on CycleGAN-generated data achieved a lower mAP (10.8% on SUIM) [10] than the land-only baseline (13.2%), demonstrating that naive synthetic augmentation can degrade downstream performance.
This work proposes a different philosophy: rather than treating domain translation as an appearance-only style transfer, we inject physically motivated constraints into the generative model so that the synthesis process respects the dominant phenomena in underwater optics. We present JTA-GAN, a hybrid, non-symmetric GAN framework that explicitly disentangles scene radiance, per-pixel transmission (turbidity), and ambient light, and synthesizes underwater images by applying a differentiable physics layer based on a simplified underwater image formation model: I(x) = J(x) · T(x) + A(x) (1 − T(x)), where J denotes scene radiance (content), T the spatially varying transmission map (turbidity/haze) and A the ambient background illumination (attenuation/color cast). Crucially, the T-map is learned in a self-supervised manner from 2-D image features derived from land images—we do not rely on external depth prediction models or ground-truth depth maps.
Two practical design choices proved essential for stable and useful synthesis. First, we adopt an asymmetric dual-generator architecture. The forward generator (land → sea) is physics-informed: its decoder is architected to output disentangled J, T, A components which are combined by the physics layer to produce synthetic underwater images. The reverse mapping (sea → land) is implemented as a stable black-box U-Net; attempts to model the inverse operation via explicit division (I − A)/T introduce numerical instability when T approaches zero. Second, balanced domain sampling proved critical: extreme imbalance (e.g., earlier experiments with 1479 land vs. 120 underwater) led to discriminator domination and generator collapse, whereas a balanced setup (1197 land vs. 890 underwater) stabilized training and yielded meaningful T-maps. Additional training choices—PatchGAN discriminators, a combined loss including adversarial terms, cycle-consistency (L1) with λ_cyc = 10, a perceptual LPIPS loss (λ_lpips = 1) and a physics regularization (TV-style, λ_phys = 1)—were necessary to obtain sharp, physically plausible results. Empirically, LPIPS was decisive: adding LPIPS in our final V5 model removed painterly artifacts and restored high-frequency detail that L1 alone could not.
We thoroughly validate JTA-GAN in two ways. First, we qualitatively compare its outputs against CycleGAN and show that JTA-GAN produces structured, depth-aware T-maps and visually plausible attenuation and caustics, while CycleGAN tends to produce global color casts and texture distortions. Second, we evaluate the downstream utility of the synthetic data by training YOLO detectors on datasets augmented with generated images. For GAN training, we used 1197 land images (COCO-debris subset) [11] and 890 underwater images (UIEB raw) [12]. Using our script generate_yolo_data_v6.py we synthesized a YOLO training set from 65,153 COCO images (and corrected bounding boxes to avoid edge shifts). On the SUIM test set (376 images), the YOLOv8s model trained on JTA-GAN synthetic data achieved 17.3% mAP50–95, substantially outperforming the land-only baseline (13.2%) and the CycleGAN baseline (10.8%). Per-class analysis shows the largest gains for semantically consistent classes (person: 34.3% vs. baseline 23.2%), while semantically mismatched classes (boat vs. underwater wrecks) remain challenging for all methods (boat mAP remains < 3%), pointing to an orthogonal limitation caused by dataset semantic mismatch rather than synthesis fidelity.
In summary, this paper makes three principal contributions:
We identify and empirically characterize the failure modes of black-box GANs in underwater synthesis, showing how unrealistic artifacts can actively degrade detector performance.
We propose JTA-GAN, which introduces three specific architectural innovations:
- (a) a physics-informed decoder that explicitly embeds an underwater-optimized optical model;
- (b) a self-supervised learning strategy for transmission maps (T-maps) that requires no depth supervision;
- (c) a task-oriented asymmetric architecture that ensures numerical stability by avoiding singularities during reverse mapping.
We demonstrate through rigorous evaluation on the SUIM benchmark that JTA-GAN synthetic data significantly improves downstream underwater object detection.
The remainder of the paper is organized as follows.
Section 2 reviews related work on domain adaptation and physics-aware image synthesis.
Section 3 details the JTA-GAN architecture and loss formulation.
Section 4 describes datasets, experimental protocols and quantitative/qualitative results.
Section 5 discusses limitations and future directions, and Section 6 concludes.
The pursuit of balancing detection precision with computational efficiency has led to several innovative architectural designs in recent underwater and industrial vision research. For instance, PRCII-Net [13] employs a lightweight attention-guided cross-scale interaction module to enhance small-target detection in complex underwater environments while maintaining low resource consumption. In the domain of image enhancement and alignment, GAN-based unsupervised frameworks [14] integrated with feature frequency-aware decomposition have demonstrated superior robustness in processing textures and structural details. Furthermore, the development of efficient lightweight convolutional neural networks (CNNs) for industrial surface defect detection [15] highlights the effectiveness of coordinate attention mechanisms and weighted feature pyramid networks in resource-constrained scenarios. Our proposed JTA-GAN aligns with these trends by integrating a physics-informed inductive bias into an efficient dual-generator architecture, ensuring high-quality synthesis and real-time detection performance.
2. Related Work
2.1. Underwater Image Formation and Physics-Based Restoration
Underwater images suffer from wavelength-dependent attenuation, forward scattering, and spatially varying ambient illumination. Classical models such as the Jaffe–McGlamery formulation describe underwater image formation as a combination of attenuated scene radiance and backscattered light. However, these models typically require depth measurements or multiple images to estimate transmission and background illumination. Learning-based restoration methods (e.g., UWCNN, Water-Net) attempt to approximate these physical processes [16], but they rely on paired underwater/clean datasets or synthetic depth-based renderings that are not available for our unpaired land → underwater translation setting. Moreover, these methods aim to restore underwater images, not synthesize underwater degradations compatible with downstream detector training.
In contrast, our work adopts a simplified yet physically meaningful image formation model, I = J · T + A (1 − T), and embeds it directly into the generator architecture. This allows the model to learn per-pixel turbidity and ambient light without depth supervision, bridging the gap between physically grounded image formation and unpaired image synthesis.
2.2. Unpaired Image-to-Image Translation (CycleGAN and Variants)
Unpaired translation frameworks such as CycleGAN, UNIT, MUNIT, and DRIT have achieved strong performance on style transfer tasks by enforcing cycle consistency between two domains. However, because these models are optimized to match global appearance statistics rather than domain-specific physical processes, they often rely on color shifts and texture hallucination to satisfy the discriminator. For underwater synthesis, this manifests as “blue-wash” or “green-wash” artifacts, low-frequency smearing, and the loss of fine-scale structure.
Multiple studies have reported that CycleGAN-style models struggle with domains exhibiting complex light propagation effects. Our experiments confirm this: CycleGAN frequently collapses into global color mapping and produces unstable or noisy reverse translations. When these artifact-prone images are used to train YOLO, performance deteriorates (mAP dropping from 13.2% to 10.8%), demonstrating that naïve image translation can actively harm detector training. These limitations highlight the need for models that incorporate underwater-specific priors rather than relying purely on adversarial style alignment.
2.3. Synthetic Data for Underwater Vision and Domain Adaptation
Synthetic data is an increasingly important tool for improving underwater vision, where annotations are scarce. Existing approaches include physics-based rendering, domain randomization, and unpaired image translation [17,18,19,20,21]. While these methods show promise, two challenges remain unresolved:
- (a) Lack of physically grounded degradations. Most synthetic datasets rely on global color transformations or heuristic filters, failing to capture the depth-dependent haze and wavelength-selective attenuation essential for underwater realism.
- (b) Weak validation on downstream detection tasks. Many works evaluate synthetic data qualitatively or using enhancement metrics, but rarely examine how synthesis quality affects modern detector performance. Our results empirically show a direct negative impact when using artifact-heavy CycleGAN data to train YOLO.
Among existing underwater datasets, WaterPairs [22] demonstrates the benefit of paired supervision for enhancement and detection, but such paired data are unavailable for land → underwater translation. Our work is among the few that rigorously analyze the causal link between synthetic image fidelity and downstream object detection accuracy, providing a more robust evaluation of domain adaptation strategies.
The integration of Artificial Intelligence in underwater domains extends far beyond image enhancement. In marine ecology, AI-powered frameworks are now used for real-time monitoring of biological indicators, such as identifying coral reef pests like Crown-of-Thorns Starfish. For submerged infrastructure and archeological surveys, enhanced detectors such as UDINO and composite-enhanced YOLOv11 algorithms facilitate the precise localization of shipwrecks and debris. Furthermore, recent advancements in Autonomous Underwater Vehicles have introduced integrated systems that combine real-time detection (e.g., YOLOv11) with Large Language Model (LLM)-based summary generation to automate sea exploration. Beyond visual data, AI has also revolutionized underwater acoustic target recognition, where deep learning enables the autonomous classification of ship-radiated signals [23].
Beyond generative strategies, recent advances in detector architectures have also addressed underwater challenges. For instance, recent studies have proposed lightweight networks enhanced by attention-guided cross-scale interaction to improve feature extraction in turbid media. While these architectural innovations optimize how a detector processes degraded features, JTA-GAN provides a complementary approach by addressing the domain gap at the data source through physics-informed synthesis. By combining such robust feature interaction logic with the high-fidelity training data generated by JTA-GAN, the generalization of underwater perception systems can be significantly reinforced [13,24,25,26,27].
2.4. Physics-Guided GANs and Disentanglement Approaches
Recent advances explore integrating physical constraints or disentangled representations into GANs to improve interpretability and stability. Some models decompose content and style (e.g., DRIT, MUNIT) or combine differentiable renderers with GAN training. However, these works generally target generic domain transfer or 3D-aware synthesis and do not model underwater-specific optical processes. More importantly, they do not disentangle the three fundamental components of underwater image formation—scene radiance, transmission, and ambient light—nor do they integrate a differentiable physics layer within the generator pipeline.
Compared with existing approaches, JTA-GAN uniquely enforces a structured decomposition J, T, A and simulates underwater degradations through a physics-informed rendering equation. This yields a disentanglement that is both physically interpretable and directly beneficial for downstream detector training, addressing the limitations of purely appearance-based GAN models.
To better situate JTA-GAN within the current research landscape, it is helpful to contrast its design philosophy with other physics-informed generative models such as WaterGAN and UWGAN. A primary distinction lies in the data requirements: WaterGAN often relies on depth maps or pre-defined 3D scene structures to supervise the scattering process. In contrast, JTA-GAN is designed for the more challenging unpaired 2D domain adaptation task, where it learns to estimate a structured transmission map (T) directly from monocular RGB images without any depth supervision. Furthermore, while many physics-based models focus purely on visual enhancement, JTA-GAN’s architecture—specifically its asymmetric design and perceptual loss—is optimized to preserve object-level discriminative features, making it more suitable for downstream detection tasks than purely appearance-oriented generative frameworks.
2.5. Summary
To summarize, existing underwater enhancement models require depth or paired data, standard CycleGAN-type models neglect underwater physics and produce harmful artifacts, and prior synthetic-data works seldom evaluate downstream detection performance. No existing GAN framework explicitly disentangles radiance, turbidity, and ambient light under an underwater formation model. These gaps motivate the development of JTA-GAN, which integrates physics-aware decomposition with stable cycle-consistent training to generate high-fidelity synthetic underwater images for robust object detection.
2.6. Contextual Support from Recent SOTA Studies
Recent years (2024–2025) have seen a surge in specialized deep learning frameworks for underwater vision. State-of-the-art (SOTA) object detection models, such as AGW-YOLOv8 and UDINO, have introduced multi-scale frequency enhancement modules to mitigate scattering and color shifts. To address data scarcity and domain shift, integrated frameworks like EnYOLO have been proposed to perform simultaneous image enhancement and object detection with domain-adaptation capabilities. Furthermore, advancements in generative AI have introduced hybrid models, such as ALDiff-UIE and HyNPhyAttnGAN, which combine the strengths of Diffusion Models and GANs with physics-based simulations to generate high-fidelity underwater imagery. Other researchers have explored edge-deployable online domain adaptation using Adaptive Batch Normalization (AdaBN) to handle real-time data drift in AUV-based surveys. Our JTA-GAN aligns with these recent trends by explicitly embedding a simplified physical model to ensure structural consistency, a strategy also echoed in very recent physics-informed VAE approaches [28,29,30,31].
3. Methodology
This section presents the proposed JTA-GAN (Joint Turbidity–Attenuation Generative Adversarial Network), a physics-informed framework designed for unpaired land-to-underwater domain adaptation. The overall architecture and translation results are illustrated in Figure 1. Unlike conventional black-box translation models such as CycleGAN, which rely purely on statistical appearance transformations, JTA-GAN explicitly embeds a simplified underwater optical model into the generative process. This enables the system to disentangle scene radiance from environmental degradations—particularly turbidity and ambient light—thereby producing synthetic underwater images with higher structural fidelity and physical plausibility.
In addition, JTA-GAN adopts a deliberately asymmetric dual-generator architecture, motivated by the inherent ill-posedness of underwater image inversion. The forward generator (Land → Sea) is physically guided and outputs interpretable degradation components, whereas the backward generator (Sea → Land) is a stable black-box reconstruction network trained under cycle consistency and perceptual supervision. This design avoids the numerical instabilities introduced by directly inverting the physical formation model and results in significantly improved training stability.
3.1. Simplified Underwater Optical Model
Underwater imaging is governed by a combination of wavelength-dependent absorption, scattering, and depth-dependent turbidity, which significantly alter color and contrast. While the full Jaffe–McGlamery model provides an accurate description of these phenomena, it requires depth maps, multiple images, or specialized sensors that are unavailable in typical unpaired datasets.
Thus, we employ its widely adopted simplified form [32]:

I(x) = J(x) · T(x) + A (1 − T(x)),    (1)

where J(x) is the clean scene radiance (corresponding to terrestrial content), T(x) ∈ (0, 1] is the transmission map representing spatially varying turbidity, and A ∈ ℝ³ is the global ambient light vector representing wavelength-selective attenuation.
Although simplified, Equation (1) retains the essential physical behavior required for synthetic training data: attenuation increases with effective depth, and the ambient illumination produces characteristic blue/green color casts. Importantly, this model allows learning without depth supervision, because JTA-GAN estimates T(x) directly from 2-D land images in a self-supervised manner.
Embedding Equation (1) into the generator enforces a structured decomposition of underwater effects and prevents the degenerate global color mappings frequently observed in standard CycleGAN training.
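To make the behavior of Equation (1) concrete, the following minimal sketch renders a toy scene under two transmission levels (all numeric values are illustrative assumptions, not values from our experiments): at T = 1 the radiance passes through unchanged, while at low T the blue-green ambient light dominates, reproducing the characteristic underwater color cast.

```python
import numpy as np

def render_underwater(J, T, A):
    # Simplified formation model of Equation (1): I = J*T + A*(1 - T).
    # J: (H, W, 3) radiance, T: (H, W, 1) transmission, A: (3,) ambient light.
    return J * T + A * (1.0 - T)

# Toy inputs: a mid-gray scene and a blue-green ambient light vector.
J = np.full((4, 4, 3), 0.5)
A = np.array([0.1, 0.4, 0.6])

I_clear = render_underwater(J, np.ones((4, 4, 1)), A)         # T = 1
I_turbid = render_underwater(J, np.full((4, 4, 1), 0.05), A)  # heavy haze
```

In the turbid case the blue channel of the output exceeds the red channel everywhere, which is exactly the color shift the formation model is meant to capture.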
The rationale for adopting this simplified formulation is two-fold. First, while the comprehensive radiative transfer equation (RTE) describes underwater image formation through three components—direct attenuation, forward scattering, and backscattering—our framework focuses primarily on attenuation and backscattering. This simplification is physically justified for object detection tasks: forward scattering typically results in small-angle blur, which behaves similarly to a Gaussian low-pass filter, and deep learning-based detectors built on convolutional neural networks are generally robust to minor blurring. They are, however, highly sensitive to the contrast reduction and color distortion caused by backscattering (the “veiling light”). By focusing on the latter, Equation (1) captures the dominant degradation factor affecting detector performance.
Second, and more importantly, this formulation serves as a strong inductive bias for the generator. In a standard unconstrained GAN (e.g., CycleGAN), the generator G learns a direct mapping from land images to underwater images, I = G(J). This “black-box” approach often leads to overfitting, where the model satisfies the discriminator by applying unrealistic global color filters or hallucinating textures that do not exist in the physical world. By enforcing the structure defined in Equation (1), we compel the network to explicitly disentangle the scene radiance J from the environmental factors T and A. This structural constraint acts as a regularizer, preventing the generator from learning physically impossible mappings (e.g., dense fog with high contrast) and ensuring that the synthesized images adhere to the natural laws of light propagation.
3.2. Asymmetric Network Architecture
Given the asymmetry between underwater degradation (well-posed, deterministic) and underwater restoration (ill-posed, unstable), JTA-GAN adopts a non-symmetric architecture composed of:
- (1) a physics-constrained forward generator (Land → Sea);
- (2) a black-box inverse generator (Sea → Land).
We will first examine the overall architecture of the proposed JTA-GAN, as visualized in Figure 2. This schematic highlights the asymmetric dual-generator design, wherein the land-to-underwater path is explicitly constrained by a physics-based image formation model, while the reverse path is implemented as a black-box reconstruction network to ensure numerical stability.
3.2.1. Physics-Informed Forward Generator
The forward generator converts a land image J into a synthetic underwater image I. Instead of directly outputting RGB pixels, the network predicts two physically meaningful quantities, which are then combined by a differentiable physics layer:
- (1) Transmission Map Head
Outputs a spatially varying transmission map T̂(x) ∈ (0, 1], enforced via a Sigmoid activation. The T-map captures the haze/turbidity distribution and implicitly encodes depth-like structures.
- (2) Ambient Light Head
Predicts a global ambient vector Â ∈ ℝ³. Using global average pooling ensures that Â represents scene-level illumination rather than per-pixel noise.
- (3) Differentiable Physics Layer
The final underwater image is synthesized by substituting the network-predicted transmission and ambient light into the physical image formation process:

Î(x) = J(x) · T̂(x) + Â (1 − T̂(x)).

Here, J(x) denotes the input land image, which is treated as the clean scene radiance. The terms T̂(x) and Â represent the transmission map and ambient light vector estimated by the generator, rather than ground-truth physical quantities. This formulation can be viewed as a learned, differentiable approximation of the underlying underwater image formation model introduced earlier, and enables end-to-end optimization under physical constraints.
This design tightly constrains the generator to produce physically grounded degradations, discouraging mode collapse and suppressing the unrealistic texture distortions seen in black-box GANs.
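The three decoder outputs can be sketched as follows, with NumPy arrays standing in for trained decoder activations; the random feature tensors, the small eps floor on T̂, and all shapes are hypothetical stand-ins, not details from our implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transmission_head(features, eps=1e-3):
    # Map decoder features (H, W) to T̂(x) in (eps, 1); the eps floor is our
    # assumption for numerical safety, not a detail stated in the paper.
    return eps + (1.0 - eps) * sigmoid(features)[..., None]

def ambient_head(features):
    # Global average pooling over spatial dims -> one scene-level RGB vector.
    return sigmoid(features).mean(axis=(0, 1))

def physics_layer(J, T, A):
    # Differentiable compositing: Î = J * T̂ + Â * (1 - T̂).
    return J * T + A * (1.0 - T)

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(8, 8, 3))   # input land image (radiance)
t_feat = rng.normal(size=(8, 8))            # stand-in decoder activations
a_feat = rng.normal(size=(8, 8, 3))

T = transmission_head(t_feat)               # (8, 8, 1), values in (0, 1)
A = ambient_head(a_feat)                    # (3,)
I = physics_layer(J, T, A)                  # synthetic underwater image
```

Because the output is a convex combination of J and Â, the synthesized pixels stay in a valid range by construction, which is one practical benefit of placing the physics layer at the end of the decoder.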
Next, we will illustrate the complete training workflow of JTA-GAN, as depicted in Figure 3. The diagram serves to clarify how adversarial, cycle-consistency, perceptual, and physics-based losses are jointly optimized, emphasizing that physical constraints are applied exclusively in the forward direction to regulate the synthesis process.
3.2.2. Stable Black-Box Inverse Generator
While the forward land-to-underwater mapping can be naturally guided by a well-defined physical image formation process, the inverse underwater-to-land transformation is inherently ill-posed. Recovering the clean scene radiance from a degraded underwater observation requires undoing attenuation and scattering effects, which is highly sensitive to noise and estimation errors, particularly under severe turbidity.
In principle, inverting the simplified underwater image formation model (Equation (1)) to recover the clean scene radiance J(x) would require solving:

J(x) = (I(x) − A (1 − T(x))) / T(x).

Here, the numerator represents the removal of the additive backscattering component, while the division by T(x) attempts to compensate for the multiplicative attenuation. However, this formulation reveals a critical numerical instability: small values of T(x) (which occur in deep or turbid regions) cause the denominator to approach zero. Consequently, even negligible noise in the input is amplified drastically as T(x) → 0.
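The noise amplification can be demonstrated numerically. In the sketch below (illustrative scalar values, not measured data), a fixed observation error of 10⁻³ propagates through the explicit inverse and grows as noise/T:

```python
import numpy as np

def forward_model(J, T, A):
    # Simplified formation model: I = J*T + A*(1 - T).
    return J * T + A * (1.0 - T)

def naive_inverse(I, T, A):
    # Explicit inversion J = (I - A*(1 - T)) / T: unstable as T -> 0.
    return (I - A * (1.0 - T)) / T

J, A = 0.6, 0.4          # illustrative scalar radiance and ambient light
noise = 1e-3             # small, fixed observation error

errors = {}
for T in (0.5, 0.05, 0.005):
    I = forward_model(J, T, A) + noise
    errors[T] = abs(naive_inverse(I, T, A) - J)  # grows like noise / T
```

At T = 0.005 the recovered radiance is off by roughly 0.2 even though the observation error was only 0.001, which is precisely the behavior that destabilizes gradient-based training.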
In practice, explicitly enforcing this inverse formulation during training causes gradient explosion, unstable optimization, and frequent model collapse, as confirmed by our preliminary experiments. To avoid these issues, we deliberately adopt an asymmetric design and implement the inverse generator as a purely data-driven U-Net without physical constraints. Rather than enforcing physical interpretability, this network focuses on learning a robust and stable mapping that supports:
- stable gradient propagation during cycle-consistent training;
- perceptually consistent reconstruction of land-domain images;
- robustness under extreme turbidity and low-transmission conditions;
- reliable enforcement of cycle-consistency constraints.
This asymmetric strategy allows the physics-informed forward generator to model underwater degradation accurately, while the black-box inverse generator ensures training stability and effective semantic preservation. Together, they form a complementary and practical solution for unpaired underwater domain adaptation.
Unlike standard unconstrained GANs that employ symmetric cycles, JTA-GAN’s asymmetry is a direct response to the singular nature of the underwater imaging equation. In the reverse mapping (Sea → Land), explicitly enforcing the inverse physical model J = (I − A (1 − T)) / T introduces a critical numerical risk: in scenarios of high turbidity where T(x) → 0, the division operation triggers gradient explosion and model collapse.
Therefore, our asymmetric choice is specifically tailored to underwater synthesis: it allows the forward generator to produce physically interpretable components (J, T, A) for high-quality data generation, while delegating the ill-posed restoration task to a robust black-box U-Net. This ensures that the adversarial game remains stable even under extreme synthetic degradations, a prerequisite for generating the large-scale datasets required for robust object detection.
3.3. PatchGAN Discriminators
Two discriminators, D_land and D_sea, are employed to distinguish real and generated images in the land and underwater domains, respectively. Both discriminators adopt a PatchGAN architecture, which classifies overlapping local image patches rather than enforcing a single global realism constraint.
Compared with image-level discriminators, PatchGAN is particularly effective for underwater image synthesis, as it emphasizes high-frequency structural cues such as local contrast attenuation, caustic patterns, wave-induced distortions, and shadow consistency. These fine-grained details are critical for preserving object boundaries and texture cues that directly affect downstream object detection performance.
By providing localized adversarial supervision, the PatchGAN discriminators encourage the generator to produce spatially coherent degradations while avoiding the global color bias and texture hallucination commonly observed in black-box GAN frameworks. This design complements the physics-informed generator and contributes to stable training and improved detection robustness.
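The patch each discriminator score "sees" is fixed by the convolutional stack. As a side calculation, the sketch below derives the receptive field for the common 70×70 PatchGAN layout (4×4 kernels, strides 2-2-2-1-1); this specific layer configuration is an assumption for illustration, since the exact discriminator layers are not listed here.

```python
def receptive_field(layers):
    # layers: list of (kernel, stride) pairs, first layer first.
    # Walking from the last layer back: rf = (rf - 1) * stride + kernel.
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# A common PatchGAN stack: three stride-2 convs (C64-C128-C256) followed by
# a stride-1 conv (C512) and the final 1-channel conv, all with 4x4 kernels.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(patchgan)  # each output score sees a 70x70 patch
```

Keeping the receptive field local (70 pixels rather than the whole image) is what lets the discriminator penalize high-frequency artifacts without imposing a single global realism judgment.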
3.4. Composite Loss Function
To jointly enforce visual realism, semantic consistency, perceptual fidelity, and physical plausibility, the proposed JTA-GAN is optimized using a composite objective function. Each loss term plays a complementary role in stabilizing training and ensuring that the synthesized underwater images are both physically interpretable and beneficial for downstream object detection. The overall training objective is defined as:

L_total = L_adv + λ_cyc L_cyc + λ_lpips L_lpips + λ_phys L_phys,

where λ_cyc, λ_lpips, and λ_phys are scalar weights controlling the relative importance of cycle consistency, perceptual similarity, and physical regularization, respectively.
Let x_L denote an image sampled from the land domain and x_S an image sampled from the underwater domain. The generators G_{L→S} and G_{S→L} map images between the two domains, while D_sea denotes the discriminator operating in the underwater domain.
- (a) Adversarial Loss (L_adv)
To encourage domain-level realism of the synthesized underwater images, we employ the Least Squares GAN (LSGAN) objective, which is known to provide more stable gradients than the original GAN formulation. The adversarial loss for the forward generator G_{L→S} and discriminator D_sea is defined as:

L_adv = E_{x_S}[(D_sea(x_S) − 1)²] + E_{x_L}[(D_sea(G_{L→S}(x_L)))²],

where D_sea outputs a patch-level realism score for underwater images. This loss encourages the generated images G_{L→S}(x_L) to match the distribution of real underwater images while maintaining stable adversarial training.
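The LSGAN objective can be sketched in a few lines, split into the usual discriminator and generator halves; the patch-score arrays below are toy stand-ins for real PatchGAN output maps.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push real patch scores toward 1 and fake scores toward 0.
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's scores on fakes toward 1.
    return np.mean((d_fake - 1.0) ** 2)

# Toy patch-level scores, as a 30x30 PatchGAN output map might provide.
d_real = np.full((30, 30), 0.9)
d_fake = np.full((30, 30), 0.2)

d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)
```

Because the targets are least-squares rather than log-likelihood terms, gradients stay informative even for samples the discriminator classifies confidently, which is the stability property exploited here.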
- (b) Cycle Consistency Loss (L_cyc)
To preserve semantic structure and spatial correspondence during unpaired translation, we impose a cycle-consistency constraint between the two domains. The cycle loss is formulated using the L1-norm:

L_cyc = E_{x_L}[‖G_{S→L}(G_{L→S}(x_L)) − x_L‖₁] + E_{x_S}[‖G_{L→S}(G_{S→L}(x_S)) − x_S‖₁],

which enforces structural consistency and discourages geometric distortion across the translation cycle.
- (c) Perceptual Loss (LPIPS) (L_lpips)
While the L1 cycle loss enforces pixel-level consistency, it often leads to over-smoothed or “painterly” artifacts. To improve perceptual fidelity, we incorporate the Learned Perceptual Image Patch Similarity (LPIPS) loss [33,34,35].
Let φ(·) denote deep feature representations extracted from a pre-trained network. The perceptual loss is defined as:

L_lpips = LPIPS(x̂_L, x_L) + LPIPS(x̂_S, x_S),

where x̂_L = G_{S→L}(G_{L→S}(x_L)) and x̂_S = G_{L→S}(G_{S→L}(x_S)) denote the cycle-reconstructed images. This loss encourages perceptual similarity in feature space and is critical for preserving high-frequency textures.
- (d) Physics Regularization Loss (L_phys)
To ensure that the predicted transmission map T̂ is physically plausible and spatially coherent, we apply a Total Variation (TV) regularization term:

L_phys = Σ_{i,j} ( |T̂_{i+1,j} − T̂_{i,j}| + |T̂_{i,j+1} − T̂_{i,j}| ),

where (i, j) indexes spatial locations in the transmission map. This regularization suppresses high-frequency noise and promotes smooth haze transitions, consistent with natural underwater scattering behavior.
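For concreteness, a TV term of this kind takes only a few lines; the three maps below are synthetic toy inputs, and the penalty behaves as intended: zero for a constant map, small for a smooth haze ramp, and large once per-pixel noise is added.

```python
import numpy as np

def tv_loss(T):
    # Anisotropic total variation on an (H, W) transmission map: the sum of
    # absolute differences between vertical and horizontal neighbors.
    dv = np.abs(T[1:, :] - T[:-1, :]).sum()
    dh = np.abs(T[:, 1:] - T[:, :-1]).sum()
    return dv + dh

flat = np.full((16, 16), 0.5)                        # uniform turbidity
ramp = np.tile(np.linspace(1.0, 0.2, 16), (16, 1))   # smooth haze gradient
noisy = ramp + np.random.default_rng(0).normal(0.0, 0.1, ramp.shape)

tv_flat, tv_ramp, tv_noisy = tv_loss(flat), tv_loss(ramp), tv_loss(noisy)
```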
3.5. Implementation Details
Selection of λ_cyc = 10: This value was chosen following the original CycleGAN implementation, which has been widely validated as an optimal balance for maintaining structural consistency in unpaired image-to-image translation. Our preliminary tests confirmed that lower values led to significant geometric distortions, while excessively high values restricted the generator’s ability to learn domain-specific underwater degradations.
Selection of λ_lpips and λ_phys: These weights were empirically determined to ensure that perceptual realism and physical consistency were effectively integrated without overwhelming the primary adversarial loss. During the initial tuning phase, we observed that setting λ_lpips = 1 was sufficient to eliminate “painterly” artifacts and restore high-frequency textures. Similarly, λ_phys = 1 provided enough regularization to ensure a smooth and spatially coherent transmission map (T-map) while still allowing the network to adapt to diverse turbidity levels.
Sensitivity Analysis: Our observations indicated that the model performance is relatively robust to minor variations in these weights. Significant performance drops (measured by synthesis quality and downstream mAP) only occurred when the weights were adjusted by an order of magnitude, suggesting that our current configuration sits within a stable and effective regime for underwater domain adaptation.
3.6. Dataset Construction and Alignment
In addition to realistic image synthesis, effective domain adaptation for object detection requires precise geometric consistency between generated images and their associated annotations. Misalignment between synthesized images and bounding boxes can severely degrade detector training, particularly for anchor-based models such as YOLO.
To ensure stable adversarial training, JTA-GAN is trained on a balanced subset consisting of 1197 land images from the COCO-debris dataset and 890 underwater images from the UIEB dataset. This balanced configuration prevents discriminator domination and promotes stable estimation of physically meaningful transmission maps.
After GAN training, the forward generator is applied to a large-scale land dataset comprising 65,153 COCO images to construct synthetic underwater training data for object detection. During GAN preprocessing, images are resized to a fixed resolution using aspect-ratio preserving padding, which introduces spatial offsets between original image coordinates and synthesized outputs.
To address this issue, we implement an automated bounding-box remapping procedure that explicitly accounts for padding and scaling operations applied during synthesis. This procedure recalibrates all bounding box coordinates to maintain precise spatial correspondence with the generated underwater images. The alignment process is applied uniformly to all synthetic datasets, including those generated by CycleGAN and JTA-GAN, ensuring a fair and controlled comparison.
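The remapping described above can be sketched as follows. This is a minimal illustration under the stated letterbox-style preprocessing (uniform scale to fit the target resolution, then centered padding), not the actual generate_yolo_data_v6.py implementation; the function name and signature are hypothetical:

```python
def remap_bbox(box, orig_w, orig_h, target=256):
    """Map a pixel-space box (x1, y1, x2, y2) from the original image
    into a target x target letterboxed image (uniform scale + centered pad)."""
    scale = min(target / orig_w, target / orig_h)  # aspect-ratio preserving
    pad_x = (target - orig_w * scale) / 2.0        # horizontal padding offset
    pad_y = (target - orig_h * scale) / 2.0        # vertical padding offset
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_x, y1 * scale + pad_y,
            x2 * scale + pad_x, y2 * scale + pad_y)
```

For a 400 × 200 source image, the scale is 0.64 and all padding falls on the vertical axis, so a full-image box maps to (0, 64, 256, 192).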
This detection-aware alignment step is essential for isolating the impact of image synthesis quality on downstream detection performance and preventing confounding effects caused by annotation misalignment.
4. Experiments and Results
This section presents a comprehensive evaluation of the proposed JTA-GAN framework, including experimental settings, quantitative performance comparisons, ablation studies, and qualitative visual analysis. The effectiveness of the physics-informed synthetic data generation is validated through downstream object detection experiments on the SUIM benchmark, with comparisons against a land-only baseline and a conventional CycleGAN-based data augmentation strategy.
4.1. Experimental Setup
To provide a fair and reproducible evaluation of the proposed JTA-GAN framework, this section describes the experimental configuration used for both generative model training and downstream object detection. We first outline the datasets employed for GAN training, synthetic data generation, and detection evaluation, followed by the implementation details and training protocols. All experimental settings are kept consistent across baselines to ensure that performance differences can be attributed solely to the quality of the generated data rather than variations in detector architecture or optimization strategy.
4.1.1. GAN Training Setup
Detailed information on the training datasets and experimental configurations is summarized in Table 1.
JTA-GAN is trained in an unpaired manner using images from two distinct domains:
Land domain: 1197 images selected from the COCO-debris subset, containing person and boat categories.
Underwater domain: 890 raw underwater images from the UIEB dataset, covering diverse visibility conditions and color distributions.
All images are resized to 256 × 256. Training is conducted for 200 epochs using the Adam optimizer (β1 = 0.5) with a batch size of 1, following standard practices for instance-normalized GAN architectures.
The same training protocol (epochs, optimizer, batch size) is used for CycleGAN, ensuring a fair comparison.
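In PyTorch terms, the stated optimizer settings correspond to roughly the following sketch. Note that the learning rate is our assumption (the common CycleGAN default), since only β1 = 0.5, the batch size, and the epoch count are given in the text:

```python
import torch

# Placeholder parameter standing in for the generator/discriminator weights.
params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.Adam(
    params,
    lr=2e-4,             # assumed; standard CycleGAN learning rate, not stated here
    betas=(0.5, 0.999),  # beta1 = 0.5 as specified in the training protocol
)
```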
4.1.2. YOLO Training Setup
After GAN training, the final JTA-GAN model (V5) is applied to the full COCO-debris source set to generate synthetic underwater images. A total of 65,153 images are synthesized using our custom script generate_yolo_data_v6.py, which also recalculates bounding-box coordinates to compensate for padding-induced spatial shifts.
YOLO training configuration is summarized as follows:
4.1.3. Computational Resources
All experiments were conducted on a workstation equipped with:
GPU: NVIDIA RTX 4080S (16 GB Video Random Access Memory, VRAM) × 1
CPU: AMD 9700X (8 cores, 16 threads)
RAM: 32 GB
OS: Windows 11
Frameworks: PyTorch 2.9.0, CUDA 12.8
4.2. Baselines and Compared Models
To clearly assess the effectiveness of the proposed physics-informed synthesis strategy, we compare JTA-GAN against representative baseline configurations commonly used in underwater domain adaptation. These baselines are designed to isolate the impact of synthetic data quality on downstream detection performance, ranging from no domain adaptation to standard black-box GAN-based augmentation. All compared models share identical detector architectures and training settings to ensure a fair and controlled comparison.
To systematically isolate the influence of synthetic data quality on detection performance, we designed four distinct experimental configurations.
Table 2 details these settings, comparing the proposed JTA-GAN against a land-only baseline and a standard CycleGAN benchmark, while also incorporating an ablation study on detector capacity.
4.3. Quantitative Results
We quantitatively evaluate the impact of different training data configurations by measuring object detection accuracy on a held-out underwater benchmark. The primary evaluation metric is mean Average Precision over Intersection over Union (IoU) thresholds from 0.5 to 0.95 (mAP50–95). To rigorously evaluate the numerical stability and reproducibility of the proposed JTA-GAN framework, we conducted a statistical analysis across three independent training sessions for the YOLOv8s detector. Each performance metric is reported as mean ± standard deviation rather than as a simple average. This experimental protocol ensures that the observed performance gains are consistent and not the result of specific random initializations, directly addressing the stability of the physics-informed inductive bias.
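The reporting protocol reduces to simple summary statistics over the three runs; for example, with hypothetical per-seed mAP values (the numbers below are illustrative, not the paper's measurements):

```python
import statistics

# Hypothetical mAP50-95 (%) from three independent YOLOv8s training runs.
runs = [17.1, 17.3, 17.5]

mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation (n - 1 denominator)
print(f"{mean:.1f} +/- {std:.1f}")
```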
To evaluate the efficacy of physics-informed synthesis in mitigating the domain gap, Table 3 presents a comprehensive performance comparison (mAP50–95) between JTA-GAN and baseline methods. As the data indicate, the inclusion of physical constraints yields significant gains in detection accuracy.
Figure 4 compares detection accuracy under different training datasets. CycleGAN-based synthetic data consistently performs worse than the land-only baseline, confirming that naive image translation can introduce harmful artifacts rather than reducing the domain gap.
To provide an intuitive overview of the quantitative results, Figure 5 visualizes the mAP50–95 performance of YOLO detectors trained under different data augmentation strategies on the SUIM test set.
YOLOv8s trained with JTA-GAN synthetic data achieves 17.3% mAP50–95, substantially outperforming both the land-only baseline (13.2%) and the CycleGAN-augmented model (10.8%). The absolute improvement of +4.1 mAP points over the baseline and +6.5 points over CycleGAN demonstrates that physics-guided synthesis improves detector generalization under underwater conditions.
Notably, YOLOv8l exhibits similar trends, achieving comparable overall performance. This observation indicates that data quality, rather than detector capacity, is the dominant factor limiting underwater detection performance in this setting.
Key Observations:
CycleGAN degrades detector performance
CycleGAN-generated images exhibit global color-wash artifacts and structural distortions. Despite identical training protocols, YOLO performance drops from 13.2% to 10.8%, confirming that artifact-prone synthetic data can actively harm downstream learning.
JTA-GAN significantly improves detection accuracy
JTA-GAN yields the highest mAP across all configurations. The improvement is attributed to physics-guided degradation modeling and stable cycle-consistent training.
Detector size is not the determining factor
The absence of significant gains from YOLOv8l over YOLOv8s suggests that architectural scaling alone cannot compensate for poor data quality.
4.4. Qualitative Comparison
To better understand the performance trends observed in the quantitative evaluation, we conduct a qualitative comparison of the synthetic underwater images generated by different GAN-based translation models. This analysis focuses on visual characteristics such as color attenuation, haze distribution, and structural integrity, providing insight into how physically guided image synthesis contributes to more reliable training data for underwater object detection.
CycleGAN typically produces flat blue/green overlays and distorted reconstructions, failing to simulate depth-dependent attenuation or scattering.
In contrast, JTA-GAN generates:
structured T-maps;
smooth, low-frequency ambient illumination;
naturalistic color shifts;
preserved object boundaries and scene geometry.
These properties are critical for maintaining detector-relevant features such as edges and object silhouettes.
Analysis of Self-Supervised T-Maps
The visualization of the learned T-maps in Figure 6 provides compelling evidence of the model’s internal physical consistency. Although JTA-GAN is trained without any ground-truth depth data or transmission labels, the generator successfully learns to predict spatially varying transmission values that strongly correlate with scene geometry. As shown in the visualizations, objects closer to the camera (e.g., the person in the foreground) are assigned higher transmission values (brighter regions), indicating high visibility. Conversely, distant background elements are assigned lower transmission values (darker regions), simulating the stronger scattering and attenuation effects that occur over longer optical paths.
This behavior indicates that the network has implicitly learned “depth cues” from the monocular land images—such as object scale, occlusion, and perspective—to modulate the scattering effect locally. For instance, in the second row of Figure 6, the T-map accurately segments the umbrella and the pedestrian from the background, applying heavy fog only to the distant street. This capability highlights a critical advantage over CycleGAN: instead of applying a uniform “underwater style” filter across the entire image, JTA-GAN simulates the volumetric nature of underwater turbidity. The T-map acts as a pixel-wise control gate, preserving the contrast of foreground objects while degrading the background. This physical plausibility is not merely a visual enhancement but a functional one; it ensures that the synthetic training data contains realistic signal-to-noise ratio (SNR) gradients, training the downstream detector to distinguish objects from the haze just as it would need to in real-world underwater scenarios.
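The gating behavior described here follows directly from the image formation model I(x) = J(x)·T(x) + A(1 − T(x)). A minimal sketch of the compositing step, assuming channel-last float arrays in [0, 1] (function name is ours):

```python
import numpy as np

def composite_underwater(J: np.ndarray, T: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Blend scene radiance J (H x W x 3) toward ambient light A (3,)
    according to the per-pixel transmission map T (H x W)."""
    T3 = T[..., None]  # broadcast transmission across color channels
    return J * T3 + A * (1.0 - T3)
```

Where T is close to 1 (foreground), the original radiance dominates; where T is close to 0 (distant background), the pixel collapses toward the ambient color A, which is exactly the contrast-preserving versus contrast-degrading split described above.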
4.5. Error Analysis and Confusion Analysis
To further understand the strengths and limitations of the proposed JTA-GAN framework beyond aggregate mAP metrics, we conduct a detailed error analysis combining qualitative inspection and class-wise confusion statistics. In particular, we focus on identifying whether residual detection failures originate from synthesis quality, detector capacity, or semantic discrepancies between training and evaluation datasets.
4.5.1. Failure Modes of CycleGAN-Based Synthetic Data
As observed in Figure 7 and Figure 8, detectors trained with CycleGAN-generated images consistently exhibit degraded performance across most categories. Visual inspection reveals that CycleGAN primarily applies global color shifts (blue/green washes) while failing to model depth-dependent attenuation and scattering. These artifacts distort local textures and edges, leading to unstable feature learning in YOLO. As a result, CycleGAN-based augmentation not only fails to improve detection accuracy but actively degrades performance, reducing mAP from 13.2% (land-only baseline) to 10.8%.
This phenomenon highlights that visually plausible style transfer does not necessarily translate into task-relevant synthetic data, especially for detection tasks that rely heavily on geometric and structural cues.
4.5.2. Analysis of JTA-GAN Improvements
In contrast, JTA-GAN significantly reduces false negatives and improves localization stability for semantically consistent categories such as person. The physics-guided synthesis produces structured T-maps and spatially coherent attenuation, which preserve object boundaries and relative contrast under turbidity. These properties lead to more robust feature representations during YOLO training, explaining the observed mAP50–95 improvement to 17.3%.
Importantly, the performance gain is consistent across YOLOv8s and YOLOv8l, confirming that the improvement originates from data quality rather than detector capacity.
The superiority of JTA-GAN in preserving discriminative features is further evidenced by a qualitative analysis of the synthesis behavior. As shown in Figure 7, while CycleGAN tends to distort local textures through global color mapping, JTA-GAN maintains high structural fidelity. This is because the physics-informed architecture explicitly disentangles scene radiance J from environmental degradations. By treating the terrestrial content as a rigid prior and only modulating it through the transmission map T (visualized in Figure 6), the generator prevents the loss of object-level cues such as silhouettes and edges. This preservation of ‘detector-friendly’ features explains the significant mAP gains on real-world underwater benchmarks, as the downstream detector can learn robust representations that are invariant to turbidity-induced contrast reduction.
4.5.3. Confusion Matrix Analysis
Figure 9 presents the normalized confusion matrices for YOLO detectors trained under different data augmentation strategies. Several critical insights emerge.
First, for semantically aligned classes (e.g., person), JTA-GAN substantially increases true positive rates while suppressing cross-class confusion, indicating that physically grounded degradations improve category discriminability.
Second, the boat category exhibits persistently low mAP across all configurations. The confusion matrix reveals frequent misclassification between boat and background or debris-like structures. This behavior is expected and not attributable to synthesis failure. During training, the boat class primarily consists of small surface vessels such as yachts and fishing boats from COCO, whereas the SUIM evaluation set predominantly contains large underwater wrecks and ship remnants. This severe semantic mismatch causes a domain shift at the object level, which cannot be resolved solely through appearance-level domain adaptation.
Crucially, despite this mismatch, JTA-GAN does not exacerbate confusion compared with the land-only baseline, demonstrating that the proposed synthesis remains structurally faithful and does not introduce harmful biases.
4.5.4. Analysis of Results
The combined error and confusion analysis clarifies that JTA-GAN effectively addresses appearance-induced domain gaps while revealing an orthogonal limitation caused by semantic inconsistency between training and evaluation object definitions. These findings underscore an important distinction: physics-consistent image synthesis can improve detector robustness, but resolving semantic mismatch requires complementary strategies such as class redefinition, hierarchical labeling, or instance-level domain alignment.
4.5.5. Failure Case and Limitation Analysis
While JTA-GAN demonstrates superior performance in most underwater scenarios, it still faces challenges in extreme conditions.
Figure 10 illustrates typical failure cases where the synthesized images may not fully reach the desired level of realism. These failures primarily occur in two scenarios:
Extreme Turbidity and Information Loss: When the input land image contains regions with very low contrast or intricate textures, the generator may estimate a near-zero transmission map (T(x)), leading to excessive haze synthesis that obscures the object’s semantic features.
Complex Lighting Artifacts: Since our physical model employs a global ambient light vector A, it cannot fully simulate localized, dynamic lighting effects such as caustics or non-uniform refractions found in shallow water.
Identifying these failure modes provides a clear boundary for the current physics-informed approach and suggests that incorporating more complex optical models could be a focus for future development.
4.6. Ablation Study
To quantify the contribution of each component in JTA-GAN, we evaluate three ablation settings consistent with the models actually trained during development (V3, V4, V5). The comparison focuses on training stability, reconstruction behavior, and the final SUIM detection performance using YOLOv8s.
Table 4 presents the ablation results across various model configurations, demonstrating the necessity of each proposed component in our framework.
The stability of JTA-GAN training is highly sensitive to the balance between the land domain (XL) and the underwater domain (XS). We conducted experiments using various sample ratios to determine the operational boundaries of our framework. In preliminary trials, we tested an imbalanced configuration of 1479:120 (approx. 12:1). However, the results were unsatisfactory as the discriminator in the underwater domain saturated prematurely, leading to vanishing gradients and generator collapse. Through further empirical testing, we observed that discriminator domination consistently occurred when the land-to-underwater ratio exceeded 10:1. In contrast, the adopted ratio of 1197:890 (approx. 1.3:1) maintained a competitive adversarial game, ensuring that the generator could accurately disentangle scene radiance (J) from transmission (T) and ambient light (A).
To rigorously evaluate the indispensability of the physics layer and the asymmetric architecture, we extended our ablation study to include variants V1 and V2. V1 represents a standard black-box CycleGAN without any physical constraints, which results in a significantly lower mAP (10.8%) due to unrealistic global color shifts. V2 incorporates the physics layer but employs a symmetric inverse mapping; as expected, this configuration proved to be numerically unstable, frequently encountering gradient explosions in highly turbid regions where T approaches zero. These results empirically confirm that the physics layer provides the necessary inductive bias for realistic synthesis, while the asymmetric design is the prerequisite for training stability in underwater environments.
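The instability of the symmetric inversion can be seen directly from the formation model: recovering radiance requires J = (I − A(1 − T)) / T, so any perturbation in the observed image I is amplified by a factor of 1/T. A small numeric sketch (values chosen for illustration):

```python
def invert_formation(I, A, T):
    # Naive algebraic inversion of I = J*T + A*(1 - T); ill-posed as T -> 0.
    return (I - A * (1.0 - T)) / T

J, A, T = 0.5, 0.8, 0.01           # highly turbid pixel (T near zero)
I = J * T + A * (1.0 - T)          # forward synthesis
noise = 0.01                       # tiny observation perturbation
J_hat = invert_formation(I + noise, A, T)
# The 0.01 perturbation is amplified by 1/T = 100 into an error of 1.0 in J.
```

This amplification of small residuals is what produces the gradient explosions observed in variant V2.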
4.7. Computational Efficiency Analysis
To evaluate the practical feasibility of the proposed framework for real-time underwater applications, we measured the computational overhead and latency on an NVIDIA RTX 4080S GPU. As summarized in Table 5, the JTA-GAN training process for 200 epochs was completed in approximately 4.03 h (72 s per epoch), demonstrating that the physics-informed architecture does not introduce prohibitive overhead during the synthesis phase. For the downstream task, the JTA-GAN-trained YOLOv8s detector achieves a total per-image latency of 3.0 ms on the SUIM validation set (comprising 0.5 ms preprocessing, 1.9 ms inference, and 0.6 ms post-processing). This corresponds to a high-throughput rate of approximately 333.3 FPS (Frames Per Second), which significantly exceeds the real-time requirements of most AUVs. Furthermore, the peak GPU VRAM consumption was recorded at 8.0 GB, with system RAM usage ranging between 10.0 and 18.0 GB. These metrics indicate that our model is highly compatible with the hardware constraints of modern underwater robotic platforms.
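The throughput figure follows directly from the per-stage latencies reported in the text:

```python
# Per-image latency stages (ms) for the JTA-GAN-trained YOLOv8s detector.
preprocess_ms, inference_ms, postprocess_ms = 0.5, 1.9, 0.6

total_ms = preprocess_ms + inference_ms + postprocess_ms  # 3.0 ms per image
fps = 1000.0 / total_ms                                   # ~333.3 frames per second
```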
5. Discussion
To further clarify the causal links between our methodology and the observed performance gains, we provide a targeted analysis of our key design choices. First, the integration of the physics-informed decoder (J-T-A decomposition) is directly responsible for closing the domain gap between land and underwater environments. Unlike CycleGAN, which produces generic color shifts, our model enforces distance-dependent degradation, allowing the YOLO detector to learn features that are physically consistent with real-world underwater optics, resulting in a 6.5-point mAP improvement over the CycleGAN baseline. Second, the adoption of the asymmetric dual-generator architecture is the prerequisite for the high-quality synthesis observed in V5. By avoiding the numerical singularities inherent in symmetric physical inversion, we achieved a stable training signal that prevents artifacts and preserves object-level discriminative features. Finally, the perceptual loss (LPIPS) specifically addresses the structural fidelity of underwater objects. The qualitative clarity of synthesized textures (as seen in Figure 7) correlates with the enhanced detection precision for detailed classes like ‘Human’ and ‘Robot,’ as it prevents the ‘oil-painting’ artifacts that often confuse supervised detectors.
5.1. Interpretation of the Experimental Results
The experimental results demonstrate that the effectiveness of synthetic data for underwater object detection is highly dependent on whether the generation process respects the underlying physical characteristics of the target domain. While unpaired image-to-image translation frameworks such as CycleGAN are often assumed to be beneficial for domain adaptation, our results clearly indicate that naive, appearance-driven synthesis can be counterproductive.
Specifically, detectors trained on CycleGAN-generated underwater images consistently underperformed even the land-only baseline. This degradation can be attributed to the tendency of CycleGAN to satisfy adversarial objectives through global color shifts and texture hallucination, rather than modeling physically meaningful underwater degradation. Such artifacts distort object boundaries and introduce spurious textures, which negatively impact feature learning in downstream detectors such as YOLOv8. These findings highlight that synthetic data augmentation is not inherently beneficial; without appropriate constraints, it may amplify domain noise rather than reduce domain discrepancy.
In contrast, JTA-GAN consistently improves detection performance across all evaluated settings. By explicitly disentangling scene radiance, transmission, and ambient light, the proposed framework generates underwater images that preserve semantic structure while introducing realistic degradation patterns. The observed mAP improvement—from 13.2% (land-only) to 17.3% (JTA-GAN)—confirms that physics-consistent synthesis produces training data that is both visually plausible and functionally useful for detection models. Importantly, these gains are achieved without reliance on depth supervision or paired data, underscoring the practicality of the proposed approach.
Physics-Based Synthesis as Adversarial Data Augmentation: From a representation learning perspective, the success of JTA-GAN can be interpreted as a form of physically grounded adversarial data augmentation. The fundamental challenge in domain adaptation is the misalignment of feature distributions between source and target domains. Traditional style transfer methods (like CycleGAN) attempt to align these distributions by modifying the global appearance (texture and color). However, they often inadvertently alter the semantic content (e.g., distorting object shapes), which confuses the detector.
JTA-GAN addresses this by keeping the semantic content (scene radiance J) rigid while injecting physical noise (turbidity T and attenuation A) in a structurally consistent manner. When a YOLO detector is trained on this data, it is essentially being trained to be invariant to these specific physical degradations. The network learns to ignore the “veiling light” and focus on the underlying structural features that remain constant across domains. This explains why the mAP improvement is so significant (from 13.2% to 17.3%): the detector is not just learning to recognize “blue objects,” but is learning to extract robust features that persist even when the signal is attenuated by the medium. Furthermore, because the synthesis is constrained by the LPIPS perceptual loss, the semantic consistency of the bounding boxes is preserved. The generated objects remain strictly aligned with their labels, ensuring that the detector is provided with high-quality, reliable supervision signals, thereby solving the “label shift” problem often encountered in GAN-based augmentation.
5.2. Analysis of Category-Specific Performance and Semantic Mismatch
A notable observation is the consistently low detection performance for the boat category across all experimental configurations. This behavior should not be interpreted as a failure of the proposed synthesis model. Instead, it reflects a fundamental semantic mismatch between the training and evaluation datasets.
During training, the boat class primarily consists of small-scale surface vessels such as yachts and fishing boats derived from terrestrial COCO images. In contrast, the SUIM evaluation set predominantly contains large underwater structures, including wrecks and submerged ruins, which differ substantially in scale, geometry, and visual context. As a result, even physically realistic underwater synthesis cannot bridge this high-level semantic gap. This conclusion is further supported by the fact that increasing detector capacity (YOLOv8l vs. YOLOv8s) does not significantly improve boat detection performance.
The confusion matrix analysis (Figure 9) reinforces this interpretation by revealing systematic misclassification patterns rather than random noise. These errors suggest that semantic inconsistency—rather than insufficient visual realism—is the dominant limiting factor for this category. Consequently, the low boat mAP should be viewed as an inherent dataset limitation, not as evidence of ineffective domain translation.
5.3. Stability, Loss Design, and the Role of Perceptual Supervision
Another critical insight from the experiments concerns training stability and loss formulation. Our ablation studies demonstrate that balanced domain sampling is a prerequisite for stable adversarial learning in asymmetric GAN architectures. Severe imbalance between land and underwater domains leads to discriminator domination and generator collapse, whereas a more balanced configuration enables meaningful T-map estimation and stable convergence.
Moreover, the inclusion of LPIPS perceptual loss plays a decisive role in preserving high-frequency structure during cycle reconstruction. Models trained solely with L1-based cycle consistency exhibit painterly artifacts and over-smoothed textures, particularly in the reverse mapping. Incorporating perceptual supervision mitigates these effects by aligning feature-level representations, resulting in reconstructions that are visually sharper and semantically more faithful [36]. This improvement is not merely cosmetic; it directly contributes to better downstream detection performance by maintaining discriminative object features.
5.4. Limitations and Implications for Underwater Domain Adaptation
Despite its advantages, the proposed JTA-GAN framework has several inherent limitations. First, the simplified optical model assumes a single global ambient light vector and does not explicitly account for wavelength-dependent attenuation or spatially varying illumination beyond transmission effects. While this approximation is sufficient for effective data augmentation, it does not capture the full complexity of underwater light propagation.
Second, the framework operates on single images and does not model temporal consistency, which may limit its applicability to video-based underwater perception tasks. Finally, as demonstrated by the boat category results, physics-consistent synthesis alone cannot resolve high-level semantic mismatches between training and evaluation datasets.
Nevertheless, these limitations also clarify an important implication: effective domain adaptation for underwater detection requires both physically grounded low-level modeling and semantically aligned training data. JTA-GAN addresses the former by constraining image synthesis through interpretable physical components, thereby preventing the harmful artifacts observed in black-box GANs. The remaining challenges point toward complementary research directions, such as semantic-aware data selection and multi-modal supervision, rather than deficiencies in the proposed generative framework itself.
6. Conclusions and Future Work
This paper presents JTA-GAN, a physics-guided generative framework for unpaired land-to-underwater domain adaptation aimed at improving underwater object detection. Unlike conventional image-to-image translation methods that rely solely on appearance alignment, JTA-GAN explicitly embeds a simplified underwater image formation model into the generative process. By disentangling scene radiance, transmission (T), and ambient light, the proposed framework synthesizes underwater images that are both physically interpretable and structurally consistent.
A key design choice of JTA-GAN is its asymmetric dual-generator architecture. The forward generator is constrained by a differentiable physics layer to model underwater degradation in a well-posed manner, while the inverse generator adopts a stable black-box design to avoid the numerical instability inherent in reversing the physical model. Together with balanced domain sampling, PatchGAN discriminators, perceptual supervision via LPIPS, and physics-based regularization, this architecture achieves stable training and high-fidelity image synthesis without requiring depth supervision or paired data.
Comprehensive experiments demonstrate that synthetic data quality is a decisive factor in underwater detection performance. While CycleGAN-based synthesis degrades YOLO performance due to unrealistic artifacts, detectors trained with JTA-GAN–generated images achieve substantial improvements, reaching 17.3% mAP50–95 on the SUIM benchmark, compared to 13.2% for the land-only baseline. These results confirm that physics-consistent image synthesis can effectively reduce domain gaps and enhance the robustness of underwater perception systems. Furthermore, category-wise and confusion-matrix analyses reveal that remaining performance bottlenecks are largely attributable to semantic mismatches between training and evaluation datasets rather than limitations of the proposed generative model.
Several directions can be explored to further extend the proposed framework. First, the current simplified optical model assumes a single global ambient light vector and does not explicitly model wavelength-dependent attenuation. Incorporating multi-spectral or wavelength-aware formulations may further improve realism under diverse underwater conditions.
Second, extending JTA-GAN to temporal or video-based synthesis would enable the generation of temporally consistent underwater sequences, which are crucial for applications such as autonomous underwater vehicles and long-term monitoring. Integrating motion-aware constraints or temporal coherence losses is a promising avenue for future research.
Third, the semantic mismatch observed in certain categories (e.g., surface vessels versus underwater wrecks) suggests that semantic-aware data curation or class-level adaptation strategies could complement physics-guided synthesis. Combining JTA-GAN with category-specific domain alignment or hybrid real–synthetic datasets may further enhance detection robustness.
Finally, optimizing computational efficiency and exploring lightweight implementations could facilitate real-time deployment on embedded or edge devices, expanding the applicability of physics-informed generative learning to practical underwater inspection and surveillance scenarios.