1. Introduction
Deep Learning (DL) has revolutionized medical image analysis, but its success often hinges on the availability of large, labeled datasets. In the medical domain, data are notoriously scarce and imbalanced due to privacy constraints and the cost of expert annotation [
1,
2]. GANs offer a compelling solution by learning to generate realistic synthetic images that can augment training data or assist in tasks such as image enhancement and modality translation [
3]. A GAN typically consists of a generator network that creates fake images and a Discriminator network that attempts to distinguish fakes from real images, both trained in an adversarial loop. This paradigm has produced photorealistic outputs in computer vision and shows promise in medical imaging for tackling data scarcity and improving model generalization [
4,
5].
In recent years, specialized GAN variants and conditional GANs have been applied to a range of medical imaging problems. For example, conditional GANs (cGANs) like Pix2Pix [
6] enable supervised image-to-image translation and have been used for tasks such as synthesis of medical modalities and segmentation mask generation [
7]. SPADE (Spatially Adaptive) GAN introduced spatially-aware normalization to better preserve semantic layouts during image synthesis [
8], making it well-suited for generating realistic images from segmentation maps in medical contexts (e.g., synthesizing MRI from label maps). Meanwhile, the Wasserstein GAN (WGAN) improved training stability by using an Earth-Mover distance metric for the GAN loss [
9] and has been employed for medical image restoration and augmentation due to its ability to mitigate mode collapse and produce sharper outputs [
10]. A recent survey counted over 460 GAN-based studies spanning various medical applications [
11] reflecting the widespread enthusiasm for GANs as de novo data generators, augmenters, and enhancers in fields from radiology to pathology [
12].
Despite these advances, a critical question remains: Do GAN-generated medical images truly benefit downstream clinical tasks, and how do different GAN architectures compare in this regard? Prior works have reported mixed outcomes. Frid-Adar et al. demonstrated that augmenting a small liver lesion CT dataset with GAN-synthesized images improved a classifier’s sensitivity from 78.6% to 85.7% [
13]. In brain MRI, GAN-based augmentation yielded segmentation performance reaching ~77–93% of using real images, depending on the GAN type. Skandarani et al. found that while advanced GANs (e.g., StyleGAN, SPADE) can generate realistic images that achieve low FID and even fool experts in a Turing test, none could fully replace real data for training segmentation models [
14]. These studies indicate that the efficacy of GANs may vary with the network architecture and the complexity of the dataset [
15].
In this work, we perform a multi-GAN, multi-dataset evaluation to assess the relative strengths of Pix2Pix, SPADE GAN, and WGAN in medical image processing tasks. Three datasets covering different organs and imaging modalities—cardiac MRI, brain MRI, and abdominal MRI—were used to ensure a diverse evaluation. Each GAN model is applied to tasks matching its intended use (segmentation, image synthesis, inpainting, or enhancement), and we evaluate using both standard computer vision metrics (FID, PSNR, SSIM) and task-specific metrics (Dice and Accuracy for segmentation quality). By analyzing comprehensive results table and visualizations, we highlight which GANs produce the most realistic images and which actually improve or degrade downstream segmentation performance. To our knowledge, this is one of the first studies directly comparing these GAN architectures across multiple medical imaging applications in a consistent experimental setup.
Recent advancements in medical image processing have enabled artificial intelligence (AI) to play a central role in clinical diagnosis, disease segmentation, and multimodal image analysis. Among AI methods, Generative Adversarial Networks (GANs) have emerged as powerful tools for producing high-quality synthetic medical images, enhancing image realism, and improving downstream diagnostic performance. These networks consist of a generator–discriminator pair trained in opposition to synthesize images that are indistinguishable from real data.
Despite significant progress, challenges persist in applying GANs to medical imaging. Limited annotated datasets, modality imbalance (e.g., MRI vs. CT), and unstable adversarial training often hinder the generation of clinically reliable images. Many existing studies focus on single GAN variants or specific imaging tasks, leaving open the question of which GAN architectures perform best across different modalities and objectives such as segmentation, image translation, or enhancement.
To address this gap, this work conducts a comprehensive comparative analysis of three representative GAN architectures—Pix2Pix, SPADE GAN, and Wasserstein GAN (WGAN)—applied to diverse medical image processing tasks. Using three benchmark datasets (ACDC, Brain Tumor MRI, and CHAOS), we evaluate each model’s capability in terms of visual realism, image quality, and segmentation accuracy, assessed by FID, PSNR, SSIM, Dice, and accuracy metrics.
Major Contributions of This Study:
We provide a unified GAN framework for medical image processing covering segmentation, synthesis, and restoration tasks.
We implement and compare three architectures—Pix2Pix, SPADE GAN, and WGAN—under consistent preprocessing, augmentation, and hyperparameter settings.
We introduce an integrated evaluation pipeline combining both quantitative (FID, PSNR, SSIM, Dice) and qualitative (visual inspection) metrics across datasets.
We analyze the trade-off between visual realism and clinical relevance, offering practical insights for selecting suitable GANs for specific medical applications.
We demonstrate that GANs can generate anatomically realistic medical images that enhance model generalization, particularly when data availability is limited.
The remainder of this paper is organized as follows:
Section 2 reviews related works on GANs in medical imaging;
Section 3 outlines the proposed methodology and datasets;
Section 4 details the experimental setup and training configurations;
Section 5 presents results and discussion; and
Section 6 concludes the study with key findings and future directions.
2. Related Work
Recent progress in DL, especially with Generative GANs, has significantly influenced medical imaging tasks such as image synthesis, modality completion, and segmentation enhancement. Several studies have explored the effectiveness of various GAN architectures in improving data quality and model performance in clinical contexts.
Skandarani et al. (2023) [
14] conducted an empirical evaluation of six GAN types—DCGAN, LSGAN, WGAN, HingeGAN, SPADE, and StyleGAN—for medical image synthesis using three datasets: ACDC (cardiac MRI), SLiver07 (liver CT), and IDRiD (retinal fundus). Their method combined FID for visual fidelity assessment with a segmentation task using U-Net and Dice scores to measure practical utility. Results showed that advanced models (SPADE, StyleGAN) achieved the best performance, with StyleGAN reaching ~87% Dice on ACDC, close to real-data results. Simpler GANs consistently had low Dice scores, indicating poor suitability for medical imaging. In terms of FID, StyleGAN often produced the most realistic images (e.g., ~24.7 on ACDC) compared to much higher values for DCGAN (~60). However, a low FID did not always mean higher segmentation accuracy, as seen in SLiver07 where SPADE outperformed StyleGAN despite a worse FID. The study concludes that while state-of-the-art GANs can generate visually convincing medical images, they still cannot fully match the diagnostic richness of real datasets [
14].
Raut et al. (2024) [
16] developed a 3D Pix2PixNIfTI GAN to synthesize missing MRI modalities for multi-class brain tumor segmentation. Using the BraTS2021 dataset (1251 patients, T1w, T2w, T1CE, FLAIR), synthetic images replaced missing inputs in a DeepMedic segmentation model. The best performance came from synthetic T2w images, achieving Dice scores of 0.74 (necrotic core), 0.81 (edema), 0.84 (enhancing tumor), and 0.90 (whole tumor), close to original-image results. Image quality assessment showed lowest MSE for synthetic T2w and FLAIR, and highest for T1CE, which correlated with poorer segmentation. An inverse relationship was found between MSE and Dice scores, confirming that better synthetic quality improves segmentation. While original inputs still gave the highest accuracy, the performance drop with synthetic inputs was small. This demonstrates GAN-based synthesis as a viable solution for accurate segmentation when MRI sequences are missing [
16].
Eker et al. (2024) [
17] presented BrainPixGAN, a mask-based GAN designed to generate realistic intraoperative MRI (iMRI) images from preoperative MRI scans and automatically segmented resection cavity masks. Using RESECT-SEG for segmentation and BITE for MRI synthesis, the method follows a two-stage process: U-Net with EfficientNetB7 for cavity segmentation, then BrainPixGAN for iMRI generation. The segmentation model achieved 97.82% Dice and 99.55% IoU, showing excellent accuracy. Generated images scored 0.87 SSIM, 35.89 PSNR, and 0.0037 LPIPS, indicating high visual quality. Compared to Pix2Pix and SPADE, BrainPixGAN delivered superior results in both segmentation and generation metrics. This demonstrates its robustness and capability in producing high-fidelity synthetic iMRI. Clinically, it offers a cost-effective alternative to expensive iMRI systems for surgical guidance [
17].
Wang et al. (2024) [
18] proposed an EGAUNet, a GAN-assisted U-Net with global spatial-channel attention (GSCA) and efficient mapping convolutional blocks (EMCB) for multi-organ medical image segmentation. The model was tested on CHAOS T2SPIR, CHAOS T1DUAL, brain MRI, and chest X-ray datasets. GAN integration helped refine segmentation outputs beyond standard U-Net performance. EGAUNet achieved Dice scores from ~76% (brain MRI) to ~96% (chest X-ray), outperforming U-Net and recent models like U-Net++ and TransUnet. GSCA improved focus on relevant regions, while EMCB enhanced multi-scale feature extraction. The approach showed consistent accuracy gains across all datasets. This demonstrates EGAUNet’s robustness and generalizability for diverse medical imaging tasks [
18].
Rafiq et al. (2025) [
19] proposed an EssNet+U-Net to improve liver segmentation by synthesizing MRI from unpaired CT scans and training with both real and synthetic images. Using the CHAOS abdominal CT and MRI dataset, EssNet (a CycleGAN-based model with a segmentation branch) generates better-aligned synthetic MRIs than standard CycleGAN. The U-Net trained on 1064 combined MRIs (350 real + 714 synthetic) achieved its best results with Dice = 0.9524 and IoU = 0.9091. This outperformed training with only real MRIs (Dice = 0.9459, IoU = 0.8974). The ablation study confirmed EssNet’s superiority over basic CycleGAN in both image alignment and segmentation accuracy. The technique reduces dependence on large annotated MRI datasets. Overall, it offers an effective cross-modality synthesis solution for tackling medical data scarcity [
19].
3. Generative Adversarial Networks
Generative Adversarial Networks (GANs) represent a class of deep learning models that were introduced recently [
20], comprising two competing neural networks: a generator and a discriminator.
Figure 1 illustrates a summary: the generator produces synthetic data samples, whereas the discriminator assesses these samples against real data, offering evaluation to enhance the generator’s performance. In this adversarial framework, the generator incrementally acquires the ability to generate data that increasingly mirrors the original distribution [
21]. This framework has demonstrated significant efficacy in tasks including image synthesis, data augmentation, and domain translation, facilitating the generation of high-dimensional, complex, and realistic data representations without the necessity for explicit labeling or paired datasets [
22].
As shown in
Figure 2, Generative Adversarial Networks (GANs) have evolved from the original Vanilla GAN into a diverse family of architectures targeting stability, controllability, and image quality. DCGAN introduced convolutional architectures, enabling more realistic outputs. Stability-focused variants such as WGAN and WGAN-GP improved convergence by replacing the JS-divergence with Wasserstein distance and adding gradient penalty, while LSGAN and Hinge GAN modified loss functions for better gradient flow and sharper results. Conditional generation emerged with cGAN, extended to paired image translation in Pix2Pix and unpaired translation in CycleGAN; StarGAN unified multi-domain translation, and SPADE improved semantic-to-image synthesis via spatially adaptive normalization. Representation-learning GANs such as InfoGAN and BiGAN/ALI learned interpretable latent factors and bidirectional mappings. For high-fidelity synthesis, Progressive GAN introduced resolution-growing training, and StyleGAN/StyleGAN2/StyleGAN3 achieved state-of-the-art realism with style-modulated layers; BigGAN demonstrated large-scale class-conditional generation. Task-specific variants include SRGAN for super-resolution, panoramic GANs, text-to-image models like StackGAN and AttnGAN, and medical adaptations (e.g., MedGAN, Pix2PixHD, CycleGAN, SPADE) for MRI↔CT translation, enhancement, and segmentation. The “best” GAN is application-dependent: StyleGAN2/3 excel at photorealism, WGAN-GP at stability, CycleGAN at unpaired translation, and specialized variants at domain-specific tasks.
3.1. GAN Selection
GANs are widely used in medical imaging for tasks such as image generation, translation, and inpainting. Choosing the right architecture is essential to achieve optimal performance and stability. To comprehensively evaluate the strengths and weaknesses of various GAN designs, we selected four architectures for this study, each representing a unique operational philosophy:
These models were selected to cover paired/unpaired data, image-to-image and mask-to-image synthesis, and to compare generation quality with and without spatial context. Prior studies demonstrated that SPADE and WGAN outperform simpler GANs like DCGAN and LSGAN on tasks involving the ACDC and CHAOS datasets, particularly when measured by FID and Dice Score. However, visual realism does not always correlate with segmentation accuracy—an issue we also explore in this paper [
14].
3.1.1. Pix2Pix (cGAN for Paired Image Translation)
Pix2Pix is a conditional GAN designed for one-to-one image translation tasks. The generator is a U-Net which takes an input image (e.g., an MRI) and outputs an image of the target domain (e.g., a segmentation mask or translated modality), ensuring that low-level spatial information is preserved through skip connections [
26]. The discriminator is a PatchGAN that evaluates image patches for realism, focusing on high-frequency details. Pix2Pix is trained with a combination of adversarial loss and
$L1
$ pixel-wise loss between the generated and ground truth images [
27]. In our experiments, we leverage Pix2Pix in two ways:
Segmentation model: Treating it as a segmentation network by inputting an image and training it to output the segmentation mask (with $L1$ loss on the mask and adversarial loss to refine mask realism). This is essentially using Pix2Pix for semantic segmentation of ACDC and CHAOS images, as noted in the dataset descriptions.
Modality synthesis: We also experimented with Pix2Pix for translating segmentation maps into realistic medical images—similar to the task performed by SPADE—by training on paired image–mask datasets (e.g., CHAOS masks mapped to MRI scans). We adopted the standard Pix2Pix configuration, which uses a U-Net generator with 8 downsampling layers and a PatchGAN discriminator with a 70 by 70 receptive field [
28]. The model was trained using a learning rate of 0.0002, a batch size of 16, and for 200 epochs until convergence. Pix2Pix employs an L1 loss function alongside adversarial loss, which encourages the generated output to closely match the ground truth, making it effective for tasks such as denoising and image enhancement [
29]. However, Pix2Pix may produce blurry outputs if the adversarial and L1 loss components are not properly balanced. Based on preliminary tuning, we set the adversarial-to-L1 loss ratio to 0.5 to 1.
3.1.2. SPADE GAN (Semantic Image Synthesis)
SPADE (Spatially-Adaptive Denormalization) GAN is a conditional GAN specifically designed to generate images from input segmentation maps or layouts. The key innovation is the SPADE layer, which injects the semantic label information at multiple scales in the generator through adaptive normalization. This prevents the “washing out” of semantic structure that can happen with traditional normalization layers [
30]. Our SPADE GAN implementation uses the official architecture: a ResNet-based generator with SPADE layers conditioned on segmentation masks, and a multi-scale PatchGAN discriminator [
31]. We applied SPADE to two tasks:
3.1.3. Wasserstein GAN (WGAN) for Image Enhancement
The WGAN formulation replaces the traditional GAN loss with a Wasserstein distance approximation, yielding more stable training and a meaningful loss metric that correlates with output quality [
33]. Our WGAN model follows the improved WGAN-GP approach (gradient penalty) for enforcement of the Lipschitz constraint. We deploy WGAN in an image restoration/enhancement context: the generator is a CNN that takes a degraded image as input and aims to produce a restored image, and the discriminator judges real vs. restored. For example, on CHAOS we simulate low-resolution images by downsampling and have WGAN learn to super-resolve them to the original resolution (a super-resolution task). On the Brain MRI dataset, we also experimented with WGAN for denoising, where we added synthetic noise to images and trained the GAN to remove it. The generator uses a U-Net-like architecture (so it can leverage the input image structure) but optimized under the WGAN loss; the discriminator is a standard CNN classifier. We included an L1 reconstruction loss in WGAN training for these tasks, effectively making it a conditional GAN (cGAN) with Wasserstein loss. Training WGAN models can be resource-intensive; we trained each model for 100 epochs using the RMSProp optimizer (as recommended in the original WGAN paper) with a learning rate of 0.00005. The gradient penalty coefficient was set to 10. In our configuration, the discriminator was updated five times for every generator update. One key advantage of WGAN is its training stability, even when both the generator and discriminator have high capacity—this enabled us to train effectively on limited datasets such as brain MRI without experiencing severe overfitting. For image enhancement tasks, we observed that WGAN often produced outputs with sharper details and higher contrast compared to Pix2Pix, which tended to over-smooth images in order to optimize the L1 loss.
4. Materials and Methods
This section describes the experimental setup, including the model configurations, hyperparameter search strategy, and overall GAN pipeline used in this study. We implemented three different GAN architectures and evaluated them on multiple medical imaging datasets.
4.1. GANs Configurations
Three GAN architectures—Pix2Pix, SPADE, and WGAN—were investigated, each selected based on its suitability for a specific medical imaging task: segmentation, inpainting, and enhancement, respectively. While all models follow the standard GAN paradigm of a generator (G) and a discriminator (D) trained through adversarial learning [
33], they differ in architecture, training objectives, and target applications.
Pix2Pix is a conditional GAN that performs well in paired translation tasks such as segmentation map to image or image to segmentation. Its architecture features a U-Net generator and a PatchGAN discriminator, optimized using a combination of adversarial loss and L1 reconstruction loss (
Figure 3a).
SPADE GAN is designed for semantic image synthesis and inpainting, where spatially adaptive normalization allows the generator to integrate spatial information from segmentation masks into the synthesis process. This model is particularly effective for restoring masked or corrupted regions in medical images (
Figure 3b).
Wasserstein GAN (WGAN) focuses on improving training stability and output sharpness by using a Wasserstein loss with gradient penalty. It is especially suitable for generating high-fidelity images from limited data, such as in low-dose MRI enhancement tasks (
Figure 3c).
Each model was trained separately using datasets relevant to its design goals. The training process, loss functions, and hyperparameters are detailed in
Section 5.
Figure 3 illustrates the high-level pipeline of each GAN architecture.
4.2. Hyperparameter Search
To ensure reproducibility and a fair comparison, all models were trained under consistent preprocessing and evaluation conditions. Images were resized to 256 × 256, converted to grayscale, and normalized to either [–1, 1] for (Pix2Pix) or [0, 1] for both (SPADE, WGAN). Data were split into 70% training, 15% validation, and 15% testing subsets using a fixed random seed (42) for deterministic reproducibility.
Pix2Pix: The U-Net generator and PatchGAN discriminator were trained for 30 epochs using batch size = 1, learning rate = 2 × 10−4, and the Adam optimizer (β1 = 0.5, β2 = 0.999). The generator loss combined adversarial BCE, L1 (λ = 100), and Dice (λ = 10) terms. Augmentations included horizontal/vertical flips, rotations (±15°), and affine translations (±15 px).
SPADE GAN: The SPADE-ResBlock generator and PatchGAN discriminator were trained for 30 epochs, with batch size = 1, learning rate = 2 × 10−4, and Adam optimizer (β1 = 0.5, β2 = 0.999). The generator used adversarial, L1 (λ = 100), and Dice (λ = 10) losses. The input images contained a fixed 50 × 50 masked region (for inpainting), and no stochastic augmentation was applied.
WGAN: The ResUNet generator and critic were trained for 30 epochs, with batch size = 4 and learning rate = 5 × 10−5, using RMSProp optimization. The critic was updated twice per generator step (k = 2) with the Wasserstein loss + gradient penalty (λ_gp = 10). The generator minimized L1 (λ = 100) + Dice (λ = 10). Degraded input images were generated via blur (σ ≈ 1.5), downsampling (64 × 64 → 256 × 256), and Gaussian noise (σ ≈ 15).
To overcome overfitting and improve generalization, various data augmentation techniques were applied during training, including random rotations (±15°), horizontal and vertical flips, affine transformations, and intensity normalization. These augmentations effectively increased dataset variability and reduced the likelihood of the model memorizing training samples. As shown in
Table 1.
All experiments were performed on an NVIDIA GPU with deterministic CUDA settings to ensure identical results across runs.
4.3. GAN Training
A typical training pipeline was followed in which, for each dataset, the GAN model was trained on the training set images (with or without conditional inputs, depending on the GAN type). For Pix2Pix and SPADE, training is supervised using paired data (e.g., an input image and target output). For WGAN, training can be unsupervised or use degraded images as input for restoration tasks. During training, the generator produces an output image based on an input, which could be an image, segmentation map, or random noise vector. The discriminator then attempts to classify this generated image as either real or fake by comparing it to actual images from the dataset. The adversarial loss function encourages the generator to produce outputs that are indistinguishable from real data. In conditional GAN models, an additional reconstruction loss—such as mean absolute error (L1) or mean squared error (L2)—is typically applied between the generated output and the ground truth to ensure structural fidelity. Once trained, the generator is used to produce synthetic images for evaluation (either stand-alone image quality metrics or to train a segmentation network for Dice/Accuracy evaluation).
4.4. Evaluation of GAN Performance in Medical Imaging
Evaluating GAN performance in the medical imaging domain requires both quantitative metrics and qualitative analysis. Quantitatively, we assessed the realism and fidelity of generated images using established metrics such as Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). These metrics reflect visual quality, noise suppression, and structural consistency, which are critical in clinical applications. Qualitatively, expert visual inspection was used to judge anatomical plausibility, boundary preservation, and artifact presence in the synthetic images. For example, SPADE GAN consistently yielded more anatomically faithful outputs in tasks such as liver MRI synthesis and organ inpainting, while Pix2Pix performed well in segmentation translation tasks. WGAN showed strength in low-data settings, producing sharper and higher-contrast results. Together, these evaluation methods allowed us to compare the effectiveness of different GAN models across multiple datasets and clinical objectives.
4.5. Datasets
We evaluate the GAN models on three public medical imaging datasets covering different anatomies and imaging modalities.
Figure 4 shows example images and corresponding annotations from each dataset.
4.5.1. ACDC (Automated Cardiac Diagnosis Challenge) Dataset
The ACDC dataset consists of MRI exams from 150 patients, divided into training (100) and test (50) sets. Each exam provides short-axis cardiac MR images covering the full cardiac cycle. Expert annotations are provided for the left ventricular endocardium, left ventricular myocardium, and right ventricular cavity in end-diastole and end-systole frames. The patients span 5 categories of pathology (healthy, previous myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, abnormal right ventricle), with 20 cases per category in training [
34]. This balanced design enables evaluating GAN performance across both healthy and pathological cardiac anatomy. In our experiments, ACDC is used primarily for cardiac MRI segmentation and synthetic image generation from segmentation maps. MRI slices are resized to 256 × 256 and normalized. The challenge’s high-quality annotations and relatively large dataset size (1902 labeled slices) make ACDC a strong benchmark for segmentation accuracy.
4.5.2. Brain Tumor MRI Dataset
We utilize an open brain tumor MRI dataset comprising T1-weighted contrast-enhanced images with pixel-level tumor annotations. This dataset includes MRI slices from patients with three tumor types: glioma, meningioma, and pituitary tumor. In total, there are approximately 3000 axial slices (e.g., ~1426 glioma, ~930 pituitary, and the rest meningioma) with corresponding binary masks delineating tumor regions [
35]. Each slice is 256 × 256 pixels, and the tumor sizes vary from small nodules to large masses. We select this dataset as representative of pathology-focused imaging with limited data. Only a few hundred images per class are available, making it an ideal testbed for GAN-based augmentation. We use the training split (which we augmented via GANs) and a reserved test set of 20% of images for evaluation. This dataset challenges GANs to generate realistic pathological patterns (tumor textures, shapes) while preserving global brain structure—a difficult task since the mask (tumor location) provides sparse conditioning information. We apply GANs here for lesion-focused image synthesis (generating new tumor brain images) and inpainting, where regions of an MRI are masked and then restored.
4.5.3. CHAOS MRI Dataset
The Combined Healthy Abdominal Organ Segmentation (CHAOS) challenge dataset provides multi-sequence abdominal MRI scans for organ segmentation. We use the MR subset which contains 20 training and 20 testing cases, each with two MRI sequences per patient: T1-DUAL and T2-SPIR [
36]. The organs annotated are the liver, left kidney, right kidney, and spleen (four-class segmentation). After extracting axial slices (yielding on the order of 1200 2D images from the 20 training volumes), we center-crop and resize them to 256 × 256. The CHAOS dataset is employed for two purposes: (1) semantic segmentation via Pix2Pix, treating it as a multi-class segmentation task (to compare GAN-based segmentation against ground truth), and (2) cross-domain image synthesis, such as generating one MRI contrast from another’s segmentation. The presence of multiple organ structures and two MRI contrasts makes CHAOS a good test for whether GANs can handle complex, multi-organ scenes. Notably, CHAOS has significantly fewer patient cases than ACDC, so we expect data augmentation via GANs to potentially yield a larger relative benefit for segmentation [
37]. All experiments on CHAOS use the provided ground truth masks for supervised training of Pix2Pix (segmentation) or as inputs for SPADE (image synthesis from masks) [
38].
Each dataset underwent standard preprocessing, including intensity normalization to a [0, 1] range and appropriate train/validation/test splitting. For segmentation tasks without official splits, we adopted a 70% training, 15% validation, and 15% testing strategy. To improve model generalization, standard data augmentation techniques—such as horizontal and vertical flips, rotations, and random cropping—were applied during training.
4.6. Dataset Generation
To evaluate the utility of GAN-generated data across different medical imaging tasks, we synthesized new datasets using Pix2Pix, SPADE GAN, and WGAN architectures.
Figure 5 illustrates representative examples from each GAN model across three datasets: ACDC, Brain-Tumor-MRI, and CHAOS T1-T2.
For the ACDC dataset, Pix2Pix was used for semantic segmentation, while SPADE GAN was applied for synthesizing realistic anatomical structures from semantic masks. A second SPADE GAN configuration focused on image inpainting and missing region restoration. WGAN was used to enhance overall contrast and structure in the generated images.
In the Brain-Tumor-MRI dataset, Pix2Pix exhibited limitations due to complex tumor shapes, while SPADE GAN improved the reconstruction of anatomical details. WGAN outputs showed enhanced sharpness and noise reduction, making them useful for downstream tasks such as classification or radiomics.
In the CHAOS T1-T2 dataset, all three GANs produced plausible anatomical outputs, with SPADE GAN performing strongly in mask-to-image translation and WGAN excelling at enhancing degraded inputs.
These generated datasets were later used to train or augment segmentation models, enabling comparative performance evaluations with and without synthetic data.
5. Experimental Results
We systematically evaluated Pix2Pix, SPADE GAN (synthesis and inpainting variants), and WGAN across the ACDC, Brain Tumor MRI, and CHAOS T1–T2 datasets. Each model was trained and tested under consistent preprocessing and hyperparameter settings. We present quantitative metrics (
Table 2) alongside illustrative plots and qualitative results, followed by a statistical summary. Across the three datasets, results show that SPADE inpainting consistently achieved the highest PSNR, SSIM, and Dice scores, indicating superior restoration quality and structural fidelity in medical images. Pix2Pix excelled in segmentation accuracy, particularly in the ACDC dataset, but generally produced lower Dice scores in complex multi-organ cases. SPADE synthesis performed variably, showing moderate success in brain tumor synthesis but weaker results in cardiac and abdominal image generation, likely due to the difficulty of fully synthesizing anatomically complex structures. WGAN demonstrated stable enhancement performance across datasets but often exhibited higher FID values, suggesting noticeable distributional differences from real images. Overall, inpainting-based approaches proved most effective for both visual quality and anatomical accuracy, while conditional GANs like Pix2Pix offered strong segmentation capabilities.
5.1. Hyperparameter Selection
A grid search was conducted to select optimal learning rates, batch sizes, and training epochs. Pix2Pix converged best at a learning rate of 2 × 10
−4 and batch size 16 for 30 epochs. SPADE models required a smaller batch size (1) due to larger memory footprints, with best performance at 2 × 10
−4 learning rate and 30 epochs. WGAN used RMSProp optimizer with learning rate 5 × 10
−5 and gradient penalty of 10 over 30 epochs.
Table 2 summarizes the average train accuracy, validation accuracy, PSNR, SSIM, FID and Dice scores across the models and datasets.
5.2. Overall Results
Segmentation evaluation measures how accurately GAN models predict anatomical regions. Results from the ACDC dataset illustrate model performance and mask quality compared to expert-verified labels.
5.2.1. ACDC Dataset Results
Pix2Pix was employed for cardiac segmentation, while SPADE and WGAN were tested for synthesis and restoration. The performance trends across epochs are shown in
Figure 6, which plots the training, validation and test accuracies. The plot indicates steady convergence around 99.7 % for all splits. A qualitative example in
Figure 7 demonstrates that Pix2Pix-predicted masks align closely with ground truth, although the SPADE synthesis variant struggles to reconstruct heart structures due to limited contextual information.
The quantitative metrics reveal that SPADE inpainting achieved the highest Dice score (0.9417) and PSNR (36.31 dB), reflecting excellent anatomical fidelity. Pix2Pix achieved moderate Dice (0.7345) with near-perfect SSIM (0.9723) but slightly lower PSNR. SPADE synthesis registered low Dice (0.3184) and PSNR due to weak guidance by segmentation masks alone. WGAN provided balanced enhancement, with PSNR 32.39 dB and Dice 0.9109; however, its FID (1.1072) was the highest, indicating less realistic outputs compared to SPADE inpainting.
5.2.2. Brain Tumor MRI Dataset Results
For brain MRIs, Pix2Pix performed segmentation, while SPADE and WGAN handled synthesis and inpainting.
Figure 8 depicts the accuracy curves, showing WGAN achieving consistent performance around 99 %.
Figure 9 illustrates WGAN’s restoration ability: starting from a noisy input, the output closely resembles the ground truth in terms of structural detail.
Quantitatively, SPADE inpainting achieved the highest Dice (0.9279) and PSNR (34.76 dB) with near-zero FID, underscoring its capability to fill missing regions realistically. WGAN yielded PSNR 27.98 dB and Dice 0.8498, with FID 4.4912—a higher value reflecting differences between generated and real distributions. SPADE synthesis again recorded lower Dice (0.3220) despite a good PSNR (25.05 dB). Pix2Pix’s Dice (0.8016) and PSNR (17.48 dB) were moderate; its segmentation masks were slightly blurrier, which reduced fine detail.
5.2.3. CHAOS T1–T2 Dataset Results
The CHAOS dataset comprises abdominal MRI with multiple organs and two contrasts. SPADE inpainting excelled in this dataset, achieving the highest Dice score (0.9499) and PSNR (33.35 dB), as shown in
Figure 10 (accuracy trends) and
Figure 11 (qualitative example). Pix2Pix segmentation underperformed (Dice 0.4089) because its U-Net struggled with multi-organ variability. SPADE synthesis again achieved low Dice (0.3402) but improved FID (0.0928) after more training. WGAN performed moderately (Dice 0.5600) with balanced PSNR (26.07 dB). The results emphasize that strong conditioning (inpainting) is crucial to maintain organ boundaries, while unconditional or mask-only approaches can miss finer structures.
5.3. Quantitative Results
To subjectively assess image realism, a visual Turing test was conducted in which radiologists classified a randomized set of real and GAN-generated images. SPADE inpainting outputs for the CHAOS dataset achieved a misclassification rate exceeding 35%, indicating high visual fidelity. In contrast, WGAN outputs were more frequently correctly identified due to subtle texture inconsistencies. Pix2Pix and SPADE synthesis images were generally distinguishable because of blurred edges and unnatural textures. As shown in
Figure 12.
5.4. Statistical Analysis
Across all three datasets, SPADE-inpainting consistently achieved the highest performance in both SSIM and Dice Score, with SSIM values around 0.98 and Dice Scores above 0.90 for ACDC, brain tumor dataset, and CHAOS. This indicates strong structural and semantic preservation, especially on complex datasets like CHAOS.
WGAN performed relatively well in PSNR, scoring approximately 32.34 dB (ACDC), 27.98 dB (Brain Tumor), and 26.07 dB (CHAOS). Its Dice Scores were also fairly high—0.91 for ACDC and for 0.84 for Brain Tumor—but dropped to 0.56 for CHAOS. However, FID values were notably high: around 4.4–1.2, indicating poor image realism despite strong pixel-level fidelity. SSIM values showed the high rate of 0.88 in ACDC.
SPADE-synthesis produced relatively low PSNR—13.64 (ACDC), 25.05 (Brain Tumor), and 16.06 (CHAOS)—and low Dice Scores: ~0.3 across all datasets, reflecting limited anatomical accuracy. However, it performed well in terms of FID, with scores around 0.14 (ACDC), 0.11 (Brain Tumor), and 0.1 (CHAOS)—better than Pix2Pix and WGAN in terms of realism.
Pix2Pix showed the most dataset-dependent performance. On ACDC, it achieved a Dice Score of ~0.73, SSIM of ~0.97, FID low rate 0.004 and PSNR of ~27.6 dB, indicating strong reconstruction quality. However, for Brain Tumor and CHAOS, Dice dropped to ~0.8 and ~0.42, and FID increased to ~0.03–0.07, highlighting reduced realism and segmentation accuracy in complex cases.
SPADE-inpainting delivered the best overall performance with PSNR ~36, SSIM > 0.97, Dice ~0.94, and the lowest FID (<0.01), demonstrating excellent realism and anatomical accuracy.
In general, SPADE-inpainting outperformed all other models across the ACDC, Brain Tumor, and CHAOS datasets, achieving PSNR ~36, SSIM > 0.97, Dice ~0.94, and FID < 0.01. These results demonstrate its strong ability to preserve anatomical structure, generate realistic images, and maintain high reconstruction quality, even on complex, multi-organ datasets. It stands out as the most effective and reliable model overall.
6. Discussion
In our study we found that very realistic GAN-generated images do not always improve medical segmentation and can even hurt performance when ample real data exists. For example, adding synthetic scans to the large ACDC cardiac dataset (where a segmentation model already achieved a high Dice score ~0.91) actually lowered accuracy (to ~0.86), whereas on the much smaller CHAOS abdominal dataset, GAN images expanded the limited training set and boosted the Dice score by about 8%. This suggests that synthetic data are most valuable when real datasets are small or incomplete, and add little or even adverse effects when real data are already abundant and diverse. We also saw that the choice of GAN matters: a paired image-to-image GAN like Pix2Pix produces sharp, accurate segmentations but little new variability; SPADE (which conditions on segmentation maps) creates highly realistic, varied images that preserve anatomy well (as long as rich labels are available), while an unpaired WGAN captures the overall image distribution and looks globally realistic but can sometimes yield anatomically implausible samples without label constraints. In practice this means evaluations should focus on task-specific and anatomical metrics, not just visual fidelity—for example, in our brain MRI experiment the WGAN achieved a lower FID (more realistic images) but SPADE-generated images produced slightly better tumor segmentation (higher Dice). In other words, simply maximizing realism isn’t enough to boost clinical metrics: if the goal is better segmentation accuracy, it may be necessary to incorporate that goal into the GAN’s training (for example, by using a segmentation network’s feedback or task-focused loss). Indeed, a model trained only on synthetic images underperformed the same model trained on real images (Dice ~0.86 vs. ~0.95 on ACDC), indicating that current GANs still miss some crucial details.
More advanced generative models (such as diffusion models or ensembles) may be needed to capture fine textures and rare cases. We believe these insights generalize beyond segmentation to classification or detection tasks as well—for instance, GANs could be used to balance class-imbalanced classification by creating more examples of rare abnormalities, but with caution: synthetic images can contain subtle artifacts that a model might overfit to, reducing its real-world performance. One particular advantage of the GAN-augmented segmentation approach (using SPADE) is that the generated images come with perfectly aligned, noise-free labels (since they’re conditioned on ground-truth masks), which contributed to the large performance gain on CHAOS. In contrast, for classification there is no inherent “perfect label” guarantee unless the GAN is explicitly conditioned on class labels or anatomy; thus, conditional generation (with known labels) is more reliable for augmentation, whereas unconditioned GAN outputs risk introducing mislabeled or misleading examples. Finally, a radiologist in our team reviewed synthetic images and found that most SPADE and WGAN outputs were indistinguishable from real scans in normal cases (supporting the high SSIM/FID fidelity metrics), though a few outliers were clearly unrealistic (for example, one SPADE brain MRI had anatomically incorrect sulci, and one WGAN abdominal slice was missing part of an organ). These occasional flaws (reflected in the variance of FID scores) underscore the need for caution: any clinical use of GAN-augmented data should ensure no systematic anatomical errors slip into training data. The good news is that in our experiments we never saw synthetic data cause catastrophic failures—small tumors were actually segmented slightly better with augmentation, not worse—but further work is needed to detect or eliminate anatomically implausible GAN outputs (for example, by enforcing physical or structural constraints or using hybrid models) to guarantee the safety and reliability of GAN-generated data in medical AI applications.
The proposed GAN-based frameworks are designed to be scalable to larger datasets and higher-resolution medical images, subject to computational constraints and data availability. All models were implemented using Python 3.11 with PyTorch with modular architectures that allow flexible adjustment of image size and batch processing. The Pix2Pix and SPADE GAN architectures can scale to resolutions up to 512 × 512 or higher by proportionally increasing GPU memory and training time, as the convolutional layers and skip connections are resolution-invariant. The WGAN model, based on a residual U-Net generator, also scales efficiently due to its patch-wise discriminator and gradient-penalty stabilization, allowing training on larger or multi-institutional datasets.
Scalability assumes availability of sufficient GPU resources (≥15 GB memory) and adequate paired or unpaired medical data for retraining. Under these conditions, the system can generalize to different imaging modalities (e.g., CT, PET) with only minor adjustments to normalization and loss weighting. However, model performance may saturate if the dataset lacks anatomical diversity or if memory limits restrict batch size. Future work will explore distributed training and mixed-precision optimization to further enhance scalability and training efficiency on large-scale clinical datasets.
In summary, our comparative analysis confirms that SPADE GAN and WGAN represent powerful tools in the medical imaging GAN toolbox, each with unique advantages. SPADE is preferable when structured inputs (like segmentation maps) are available, delivering superior image quality and segmentation augmentation, whereas WGAN is a strong choice for tasks needing unsupervised realism or enhancement of existing images. Pix2Pix remains a solid baseline for paired tasks and essentially matches the performance of dedicated segmentation networks while also being capable of translation tasks. The choice of model should thus be guided by the specific application: e.g., for generating anatomy-consistent images from labels (data augmentation for segmentation or training anatomically-aware AI), SPADE is recommended; for improving image quality (denoising, super-resolution), a WGAN or Pix2Pix with appropriate losses may be more suitable. Overall, selecting the appropriate GAN architecture depends on the dataset, case study, imaging modality, data quality, and specific clinical objectives, making it critical to customize the model choice to achieve optimal performance in medical imaging applications.
7. Comparison with Previous Studies
To evaluate the effectiveness of the proposed approach, its performance is compared with existing GAN-based methods on the ACDC cardiac MRI dataset. This comparison highlights notable improvements in segmentation accuracy and image quality achieved over previous studies.
In the ACDC cardiac MRI dataset, the proposed Pix2Pix-based segmentation model achieves higher accuracy and competitive Dice score compared to prior GAN-based methods. For example, Skandarani et al. (2023) [
14] reported a WGAN augmentation approach attaining ~70% Dice and FID ~74 on ACDC, whereas our Pix2Pix model achieves ~73% Dice with much lower FID (≈0.0046). SPADE GAN image synthesis from segmentation masks yields a higher Dice (0.86), but our model focuses on direct segmentation accuracy, reaching ~99.7% pixel accuracy (train/val)—underscoring superior performance on this dataset [
14].
For brain tumor MRI (BraTS 2021), Raut et al. (2024) [
16] used a Pix2Pix-based model (Pix2PixNIfTI) to generate missing MRI sequences, achieving whole-tumor Dice around 0.90. A recent “BrainPixGAN” (2024) for tumor removal image synthesis attained high image fidelity (SSIM ≈0.87, PSNR ≈35.9 dB). Our SPADE GAN-based inpainting approach yields superior results—~0.928 Dice and SSIM ~0.963—with nearly perfect train/validation accuracy (~99.1%), outperforming prior studies in both segmentation quality and image similarity [
16,
17].
On the CHAOS abdominal MRI dataset, state-of-the-art segmentation models augmented with GANs reach roughly 90% Dice at best. For instance, an adversarial U-Net (EGAUNet) achieving ~90.3% Dice on CHAOS T2 MRI (with ~99.2% accuracy). Cross-modality augmentation using CycleGAN-based methods has shown modest gains (e.g., ~1.17% IoU improvement in liver segmentation). In contrast, our SPADE GAN inpainting approach produces a Dice of 0.9499—markedly higher than comparative studies—with excellent fidelity (FID ~0.0045) and structural similarity (SSIM ~0.975) on CHAOS, underlining the superior performance of our model on multi-organ MRI segmentation [
18,
19]. As shown in
Table 3.
In contrast to previous studies that primarily focused on GAN-based image synthesis or data augmentation as a traditional method, the proposed work emphasizes achieving high-accuracy segmentation through GAN-driven inpainting. While earlier methods prioritized visual realism or modality generation, our approach combines SPADE and Pix2Pix architectures to deliver both structural fidelity and superior segmentation performance across diverse datasets. Our model consistently outperforms prior methods, achieving notably higher Dice scores (up to 0.95), better SSIM (0.97), and significantly lower FID (0.0045), demonstrating enhanced image quality and segmentation accuracy. This clear improvement underscores the effectiveness and robustness of our approach compared to existing GAN-based techniques.
8. Limitation of This Study
Despite the promising results, this study has several limitations. The evaluation was limited to three GAN models, excluding newer architectures like Style GAN or diffusion-based methods that may offer improved performance. Additionally, the datasets used (ACDC, Brain Tumor, CHAOS) are representative but relatively limited in size and modality. The lack of large-scale, diverse, and demographically varied datasets may restrict model generalizability to real-world clinical settings. Although SPADE-inpainting achieved the highest performance across most evaluation metrics, its effectiveness relies heavily on the availability of accurate segmentation masks, which serve as input for conditioning. This dependency may limit its flexibility across datasets with limited or imprecise annotations. Furthermore, while metrics such as PSNR, FID, SSIM, and Dice provide valuable quantitative insights, they may help clinical relevance or diagnostic accuracy.
9. Conclusions
In this paper, we presented a comprehensive comparative study of three GAN architectures—Pix2Pix, SPADE GAN, and WGAN—on diverse medical image processing tasks. Using three representative datasets (cardiac MRI, brain tumor MRI, and abdominal MRI from CHAOS), each GAN’s ability to generate realistic images and enhance downstream segmentation performance was evaluated. Results show that no single GAN is universally best: each has strengths aligned with its design. Pix2Pix excels as a supervised translator/segmented on sufficiently large datasets, SPADE produces highly realistic images when guided by segmentation maps (dramatically boosting multi-organ segmentation accuracy in CHAOS experiments), and WGAN offers stable training and good fidelity, particularly shining on small, complex distributions like brain tumor MRIs.
Crucially, we found that GAN-generated synthetic data can improve medical image analysis in data-scarce scenarios, validating the potential of GANs to alleviate training data bottlenecks in medicine. However, the improvements come with caveats—the GAN outputs must be of high anatomical fidelity to be truly beneficial. Simply optimizing for visual realism (low FID) is not enough if the images do not represent the full variability of real data or if they introduce subtle artifacts. This underscores the importance of task-specific evaluation of GANs: as this study illustrated, a model like SPADE with slightly higher FID on one dataset still yielded better segmentation outcomes than a lower-FID WGAN, because SPADE respected anatomical boundaries better.
In conclusion, GANs are a promising technology for medical image synthesis and augmentation, capable of generating photo-realistic and useful medical images. Our comparative analysis provides guidance for practitioners on which GAN model may be most suitable for a given task. For segmentation augmentation and modality synthesis from labels, SPADE GAN is recommended; for general image enhancement or when only unpaired examples exist, WGAN (with appropriate loss terms) is a strong choice; and for routine paired translations (including straightforward segmentation with adequate data), Pix2Pix remains effective. Future work will explore combining the strengths of these models—for example, integrating SPADE’s semantic control with WGAN’s training stability—and investigating other generative models like diffusion models in similar comparative setups. We also advocate for further research into evaluation metrics that correlate better with clinical task performance, as our study highlights the sometimes-weak link between conventional image fidelity metrics and actual utility in medical analysis.
Ultimately, the integration of GAN-generated data in training pipelines should be done thoughtfully: with careful curation and possibly expert review of synthetic outputs. With improving GAN technology and proper validation, we anticipate that GANs will become a standard tool to enrich medical AI models, enabling robust performance even when real data are limited or hard to acquire. Insights from this comparative study advance the understanding of the capabilities and limitations of each GAN type within medical imaging applications.