1. Introduction
Coronary artery disease (CAD) is a serious health challenge that affects a substantial number of individuals and imposes burdens on healthcare systems worldwide [1,2]. CAD involves the narrowing and blockage of the coronary arteries, leading to severe complications such as heart attacks and heart failure [3,4]. Collaborative efforts by researchers and professionals have resulted in advancements in treating and managing this disease. Evolving CAD prevention and management procedures embrace modern AI-driven architectures and cutting-edge diagnostics, including optical coherence tomography (OCT) [5,6,7]. OCT provides high-resolution images of the coronary artery walls for detailed analysis of plaque composition and identification of vulnerable plaques [8]. With OCT, different plaque components can be distinguished based on their optical properties, including fibrous tissue, lipid cores and calcium deposits [9]. OCT has enabled clinicians to detect vulnerable plaques and assess stent deployment efficacy with exceptional precision [10,11]. Recent studies have further expanded the role of OCT by integrating it with automated systems to improve diagnostic accuracy and risk stratification in CAD [12]. Moreover, intravascular imaging techniques, including OCT and intravascular ultrasound (IVUS), have been shown to complement each other in the assessment and management of CAD [13,14]. Despite its advantages, OCT faces limitations in the availability of diverse and annotated image datasets. Prior research has demonstrated that healed coronary plaques, imaged through OCT and validated histologically, exhibit considerable morphological variability, highlighting the need for synthetic data that reflect this diversity [15].
Recent advancements in artificial intelligence (AI) have enhanced the analysis of OCT imaging, particularly for plaque classification and 3D reconstruction in CAD [16]. Techniques such as deep-learning-based identification of coronary calcifications [17] and automated diagnosis of plaque vulnerability [18] have demonstrated high diagnostic accuracy and potential for improving clinical outcomes. The integration of AI into cardiovascular imaging facilitates earlier detection and improved clinical decision making [19,20]. Moreover, AI-driven interpretation of coronary angiography and vulnerable plaque imaging has shown promise in enhancing image-based risk assessment and interventional planning [21,22]. However, the scarcity of diverse [23] and labeled OCT datasets [24] remains a critical limitation, restricting the generalizability and robustness of current AI models. In parallel, AI-based clinical decision support systems (CDSS) are evolving rapidly to incorporate such synthetic data for enhanced diagnostic reliability in real-world settings [25,26,27]. In [28], a fully automated, two-step deep-learning approach for characterizing coronary calcified plaque in intravascular optical coherence tomography (IVOCT) images is reported, where a 3D convolutional neural network (CNN) is used with a SegNet deep-learning model. Similarly, automated atherosclerotic plaque characterization methods that used a hybrid learning approach and a self-attention-based U-Net architecture were reported elsewhere [29,30], achieving better classification performance for coronary arterial plaques than existing methods.
Vision transformers (ViTs) are being employed for coronary plaque detection, especially in cases where global feature relationships are essential. In this regard, [31] explored the use of ViTs for coronary plaque detection, where ViTs outperformed CNN-based models on large datasets. Instead of relying on lumen segmentation, the proposed method identifies the bifurcation image using a ViT-based classification model and then estimates the bifurcation ostium points using a ViT-based landmark detection model. The performance of the proposed ViT-based models is 2.54% and 16.08% higher than that of traditional non-deep-learning methods [32]. For better generalization characteristics, a transformer-based pyramid network called AFS-TPNet was devised for robust, end-to-end segmentation of CCP from OCT images [33]. Researchers also used the physics-informed deep network QOCT-Net to recover pixel-level optical attenuation coefficients directly from standard IVOCT B-scan images [34]. While these cutting-edge architectures delivered satisfactory results, from a clinical standpoint, acquiring large and diverse datasets of patients across different disease stages is challenging, which obscures the real potential of the proposed deep-learning architectures. To address this issue, data augmentation (DA) techniques were deployed to augment limited medical imaging training data that could then be fed to deep-learning algorithms for better insights. Generative adversarial networks (GANs) have offered promising solutions by generating high-quality images that closely resemble real OCT scans, thereby aiding in dataset augmentation, image enhancement and training of deep-learning classifiers. In [35], pseudo-labeling, using model predictions as labels for unlabeled data, was employed as a data augmentation technique. This method demonstrated improvements in model performance by increasing the effective size of the training dataset. The StyleGAN2 and Cyclic GAN frameworks were used to generate high-resolution synthetic patches for improved data augmentation performance in cases of low data availability across three different OCT datasets, encompassing a range of scanning parameters [36,37]. To improve generalization across different datasets, sparsity-constrained GANs with baseline accuracy are available in the literature [38].
However, recent works show that despite the high structural similarity between synthetic data and real images, a considerable distortion is observed in the frequency domain; therefore, dual- and triple-discriminator architectures, including the Fourier acquisitive GAN (DDFA-GAN), have been proposed to generate more realistic OCT images [39,40,41]. By applying multiple discriminators, the proposed models were jointly trained with the Fourier and spatial details of the images, and the results were compared with popular GANs, including the deep convolutional GAN (DCGAN), the Wasserstein GAN with gradient penalty (WGAN-GP) and the least-squares GAN (LS-GAN). In [42], a multi-stage and multi-discriminatory generative adversarial network (Multi-SDGAN), designed specifically for super-resolution and segmentation of OCT scans, was proposed. This resulted in improved performance by satisfying all the discriminators at multiple scales and including a perceptual loss function. While the use of multiple discriminators in GANs has attracted much attention, achieving the desired generalization and diversity in the generated images remains an open problem, owing to the overloading of a single generator in the network.
The aim of this study was to perform data augmentation on an OCT dataset using a novel dual-generator multiple-fusion discriminator network for synthesizing high-quality coronary images. The use of dual generators exploited in this paper helps achieve better generalization and generate a greater diversity of images of coronary arterial plaques. In our model, the two generators, G1 and G2, receive the same conditional input, y, but they generate different variations of the OCT images, as each generator acts as a regularizer for the other. One generator’s output can serve as a reference for improving the other, and this mutual regularization enhances the ability of both generators to generalize better across different features and conditions. Additionally, we avoided the need for a separate classification architecture and instead used a discriminator within our presented GAN architecture for the classification of coronary arterial plaques into three classes. The objective functions of the generators and the five discriminators were set in competition against each other. We derived a novel objective function for DGDFGAN and optimized it during model training. This made the model training more stable and improved the quality and diversity of the generated images. Using assessments, we illustrated how populating real images with created instances during the training phase increased our confidence in reliable label prediction. Experimental results demonstrated that DGDFGAN achieves optimal results in terms of similarity and generalization characteristics. DGDFGAN offers a scalable and clinically applicable framework for OCT image synthesis and augmentation, as it directly tackles the core limitations of existing GAN-based methods and presents a promising path toward real-world clinical deployment of AI models with improved generalization and interpretability.
The main contributions of the paper are summarized as follows:
A novel dual-generator architecture is designed using mutual regularization to achieve better generalization for OCT images of coronary arterial plaques. A discriminator (D1) within the network is used for classification, obviating the need for a separate classification architecture, resulting in computational efficiency.
A novel architecture of dynamic multi-discriminator fusion is designed to play an adversarial game against two generators. We introduce a dynamic fusion mechanism to adjust the weighting of discriminators based on the specific conditions of the image.
We incorporate adversarial loss, perceptual loss, L1 loss, cross-entropy loss and diversity loss into our combined loss function for enhanced realism in the generated images.
2. Methods and Materials
Generally, a conditional GAN (CGAN) takes two inputs, namely the latent variable z and the conditional constraint, into both the generator G and the discriminator D, as illustrated in Figure 1. The generator combines the initial noise input z with the label information and, after reshaping, passes it through transposed convolutions, batch normalization and activation functions to produce the generated output, as presented in Figure 1. The final layer of the generator typically uses an activation function that maps the output to the image intensity range.
2.1. OCT Dataset and Its Pre-Processing
A dataset of 51 patients was created using a commercially available OCT system, the C7-XR (St. Jude Medical, St. Paul, MA, USA), with the C7 Dragonfly intravascular OCT catheter (St. Jude). This system provides a spatial resolution of up to 10 µm and tissue penetration of up to 3 mm.
The focus of the study was on vessels affected by stenosis, though for simplicity, cases involving serial stenosis, mixed plaques and bypass graft stenosis were excluded. The ethical approval for this research was granted by the Galway Clinical Research Ethics Committee (GCREC), and informed consent was obtained from all participants.
Three clinicians independently annotated the OCT images, but final labels were determined through consensus agreement. The classification task was structured by designating a specific label in contrast to the others. Prior to model input, pre-processing steps were applied to the raw OCT images. Vulnerable plaques meeting the pre-defined fibrous cap thickness criteria were excluded from the analysis. Plaque characterization was based on signal intensity relative to the lumen, classifying plaques into three classes, namely “lipid plaques”, “calcified plaques” and “no plaque”. The pre-processing stage involved removal of motion artifacts, intensity normalization, cropping the region of interest to 256 × 256 and preparation of masks for fibrous, lipidic and calcified plaques.
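A minimal sketch of two of the pre-processing steps described above, intensity normalization and cropping the region of interest to 256 × 256, is given below. The function names and the simple min–max normalization are illustrative assumptions; motion-artifact removal and plaque mask preparation are system-specific steps omitted here.

```python
import numpy as np

def normalize_intensity(img: np.ndarray) -> np.ndarray:
    """Min-max normalize an OCT frame to [0, 1] (illustrative choice of normalization)."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)

def center_crop(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Crop a size x size region of interest around the frame centre."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Apply the illustrative normalization and cropping chain to one raw frame."""
    return center_crop(normalize_intensity(frame), 256)
```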
2.2. Proposed Dual-Generator Multi-Fusion GAN
In our configuration of the dual generators presented in Figure 2A, each generator learns different plausible ways to generate an OCT image based on the same conditioning input y. The two generators G1 and G2 in Figure 2B learn to capture different aspects of the underlying OCT image structure. G1 focuses on capturing the fine details of the OCT image, while G2 emphasizes abstract features. This separation allows each generator to learn a different distribution, hence enabling more diverse image synthesis. Specifically, G1 is trained to simulate plaque-specific characteristics, such as fibrous, lipidic and calcified plaques. Conversely, G2 emphasizes structural consistency, image quality and perceptual realism.
With dual generators, the risk of mode collapse is also reduced because each generator learns a distinct aspect of the data. In cases where one generator fails to cover the diversity of the distribution, the other generator fills in the missing modes, which helps ensure more realistic and varied synthetic OCT images. Likewise, the two generators learn different modes of data distribution, thus helping the model better generalize for unseen OCT data.
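The snippet below sketches how the two generators can receive the same conditional input while being pushed toward different outputs. The diversity penalty used here (negative L1 distance between the two outputs) is an illustrative assumption, not the paper's exact regularization term.

```python
import torch
import torch.nn.functional as F

def diversity_penalty(g1_out: torch.Tensor, g2_out: torch.Tensor) -> torch.Tensor:
    """Encourage G1 and G2 to cover different modes: penalize near-identical outputs."""
    return -F.l1_loss(g1_out, g2_out)   # more negative when the two outputs differ more

def dual_generator_step(G1, G2, z, y_onehot):
    # Both generators see the same condition y; their outputs regularize each other.
    x1 = G1(z, y_onehot)   # plaque-specific detail path (fibrous/lipidic/calcified)
    x2 = G2(z, y_onehot)   # structural-consistency / perceptual-realism path
    return x1, x2, diversity_penalty(x1, x2)

# usage with the CondGenerator sketch above (two independently initialized instances)
```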
In the presented formulation, the role and parameters of each discriminator are illustrated in Figure 3. The ultimate aim of dynamic discriminator fusion is to improve the quality, realism and diversity of the generated OCT images. As presented in Figure 3, Discriminator 1 (D1) is used for the classification of coronary arterial plaques into three pre-defined classes, and each discriminator is designed to focus on a different characteristic of the OCT image. D2 has the same architecture as D1 and is therefore represented by dotted lines. D2 evaluates the temporal consistency of the generated OCT images. D3 is a conditional discriminator that plays a critical role in enforcing conditional consistency during image generation. Specifically, D3 evaluates whether the generated OCT image accurately reflects the input condition y, such as a specific plaque type. During training, D3 receives both the generated image and its corresponding conditioning label, and its objective is to distinguish whether the image not only appears realistic but also corresponds correctly to the given condition. This ensures that the generator cannot produce arbitrary and mislabeled images, thereby maintaining semantic correctness and clinical reliability in the synthetic data.
The role of D4 is to ensure global realism of the generated images by taking into account spatial and temporal features. D5 embeds perceptual loss, which is computed as the distance between the feature maps of the real and generated images. This loss is combined with the adversarial loss to guide the discriminators on high-level perceptual differences.
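A minimal sketch of the dynamic fusion idea follows: per-image weights are derived from the conditioning vector and used to combine the five discriminator scores into one adversarial signal. The small softmax gating network is an assumption about how condition-dependent weighting could be realized, not the paper's documented mechanism.

```python
import torch
import torch.nn as nn

class DiscriminatorFusion(nn.Module):
    """Condition-dependent weighting of five discriminator outputs (illustrative)."""
    def __init__(self, n_classes=3, n_disc=5):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(n_classes, 16), nn.ReLU(),
                                  nn.Linear(16, n_disc), nn.Softmax(dim=1))

    def forward(self, disc_scores: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
        # disc_scores: (batch, 5) raw scores from D1..D5 for the same generated image
        weights = self.gate(y_onehot)                 # (batch, 5), sums to 1 per image
        return (weights * disc_scores).sum(dim=1)     # fused adversarial signal per image
```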
2.3. Mathematical Formulation
We use dual-generator CGANs and multiple discriminators for OCT synthetic image generation. The generator takes a latent vector z (random noise) and a label y, and it produces a synthetic OCT image x, as given in Equation (1):

x = G(z, y),  (1)

where z is the latent vector (random noise), y is the label, and x is the generated OCT image.
The latent vector is sampled from a Gaussian distribution to introduce the variability into the generated images. In our model, z is a 1D latent vector of size 100, sampled from a standard Gaussian distribution. Vector z is concatenated with the condition vector y and passed through the generator, which outputs a 256 × 256 grayscale image. This latent vector controls diverse image synthesis, while y ensures class-specific structure.
The conditioning information, in the form of a one-hot vector representing the class label of the OCT image, is concatenated with the random noise z, which forms the initial input to the generators. Each discriminator receives the same generated image and label but has a slightly different focus in its evaluation. D1 is used for the classification of the generated image using the expression given in Equation (2), where E denotes the expected value (or average) of the output of the generator over the distribution of latent variables z; Pdata is the real data distribution; Pz is the distribution over the latent space; G(z) refers to the blended image generated from a latent code z; and yc is the one-hot encoding vector indicating the true class of image x.
D2 performs label-specific evaluation of whether the generated image matches the expected distribution of the label using Equation (3), where D2(x) is the probability that image x is real.
Discriminator 3 evaluates perceptual quality, which compares the high-level features of the generated and real images. This can be modeled using a perceptual loss based on the difference in feature representations between real and generated images, as given in Equation (4), where x is the real OCT image, G(z) is the synthetic image generated from latent code z, and f(G(z)) indicates the features of the generated image.
Discriminator 4 evaluates the quality of the image, which can be linked to the luminance or image gradient loss. It attempts to measure how sharp and high-quality the generated images are using the L1 loss, where ‖x − G(z)‖1 denotes the L1 norm between the real image x and the generated image G(z).
Discriminator 5 evaluates the consistency loss between the generated images to maintain style over different transformations of the input, where z refers to a latent vector, and G−1(G(z)) represents the inverse transformation.
The total combined loss LT can be written as

LT = λ1·LD1 + λ2·LD2 + λ3·LD3 + λ4·LD4 + λ5·LD5,  (7)

where LDi denotes the loss term associated with Discriminator i; λ1 controls the importance of the classification loss in Discriminator 1; λ2 controls the importance of the realism loss via Discriminator 2; λ3 controls the importance of the perceptual loss in Discriminator 3; λ4 controls the importance of the quality loss in Discriminator 4; and λ5 controls the importance of the consistency loss via Discriminator 5. The values of lambda optimized and used in the simulations for the expression in Equation (7) are λ1 = 0.7, λ2 = 1.1, λ3 = 0.2, λ4 = 0.3 and λ5 = 1.2. The lambda values were selected through a systematic hyperparameter tuning process. Initially, we conducted a grid search over a range of plausible values for each λ and the relative scale of each loss component in the preliminary experiments. The goal was to ensure that no single loss term dominated during training, which could lead to suboptimal convergence.
To refine these values, we monitored key performance metrics, including SSIM, FID and classification accuracy, across several validation runs. The chosen λ values yielded the best trade-off between image quality, structural fidelity and class-specific realism. Moreover, these values were cross-validated over multiple training runs to confirm their generalization across different subsets of the OCT dataset.
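For reference, a minimal sketch of how the five weighted terms in Equation (7) can be combined with the reported λ values is shown below. The five individual loss values are placeholders whose definitions follow Equations (2)–(6); only the weights are taken from the text.

```python
# Reported weights for Equation (7): lambda_1 ... lambda_5.
LAMBDAS = dict(cls=0.7, realism=1.1, perceptual=0.2, quality=0.3, consistency=1.2)

def total_loss(l_cls, l_real, l_perc, l_qual, l_cons):
    """Weighted sum of the five discriminator-driven loss terms (Equation (7))."""
    return (LAMBDAS["cls"] * l_cls
            + LAMBDAS["realism"] * l_real
            + LAMBDAS["perceptual"] * l_perc
            + LAMBDAS["quality"] * l_qual
            + LAMBDAS["consistency"] * l_cons)
```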
We use latent interpolation between the two generators to encourage diverse image generation. By interpolating between the latent spaces of G1 and G2, we force both generators to produce more diverse outputs, allowing them to explore a broader space of possibilities using Equation (9):

zblend = α·z1 + (1 − α)·z2,  (9)

where z1 and z2 are two different latent codes, and α ∈ [0, 1] is a blending factor that encourages diverse outputs. α modulates the latent vector before passing it to the generator. Its value is dynamic and sampled randomly from a distribution for the desired exploration in the latent space. The generator now operates on the blended latent space, as outlined in Equation (10).
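A short sketch of the latent blending in Equation (9) follows. Sampling α uniformly from [0, 1) per sample is an assumption consistent with the description of a randomly sampled, dynamic blending factor.

```python
import torch

def blended_latent(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Interpolate two latent codes with a randomly sampled blending factor alpha."""
    alpha = torch.rand(z1.size(0), 1, device=z1.device)   # alpha in [0, 1), one per sample
    return alpha * z1 + (1.0 - alpha) * z2

# usage: the generator then operates on the blended latent space (Equation (10))
z1, z2 = torch.randn(8, 100), torch.randn(8, 100)
z_blend = blended_latent(z1, z2)    # shape: (8, 100)
```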
Gray wolf optimization (GWO) is used to explore the search space effectively for an optimal solution, where each wolf represents a candidate solution in the optimization space. We preferred GWO due to its adaptive exploration–exploitation balance and global search capabilities. This is particularly needed for our dual-generator framework, where balancing two generative paths is critical. GWO allows for more diverse convergence pathways and lower mode collapse frequency. Each wolf represents a set of parameters of the neural network, which is optimized by the GWO algorithm.
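The core gray wolf position update is sketched below for a generic fitness function to be minimized. The encoding of network parameters into wolf positions and the toy fitness are assumptions made only to illustrate how the alpha, beta and delta wolves guide the search.

```python
import numpy as np

def gwo_step(positions, fitness, t, max_iter):
    """One gray wolf optimization iteration over candidate parameter vectors."""
    scores = np.array([fitness(p) for p in positions])
    alpha, beta, delta = positions[np.argsort(scores)[:3]]   # three best wolves lead
    a = 2.0 - 2.0 * t / max_iter                              # decreases linearly from 2 to 0
    new_positions = np.empty_like(positions)
    for i, x in enumerate(positions):
        candidates = []
        for leader in (alpha, beta, delta):
            r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
            A, C = 2 * a * r1 - a, 2 * r2
            candidates.append(leader - A * np.abs(C * leader - x))
        new_positions[i] = np.mean(candidates, axis=0)        # move toward the leaders
    return new_positions

# usage: minimize a toy fitness over 10 wolves, each encoding 5 parameters
wolves = np.random.randn(10, 5)
for t in range(50):
    wolves = gwo_step(wolves, fitness=lambda p: np.sum(p ** 2), t=t, max_iter=50)
```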
3. Results and Analysis
In this paper, the experiments were conducted according to the baseline architecture displayed in Figure 3. Our original dataset contained 27 “no plaque” images, 22 “calcified plaque” images and 20 “lipid plaque” images. After augmentation by a factor of 100, the dataset’s size was increased to 6900 images, with 2700 “no plaque” images, 2200 “calcified plaque” images and 2000 “lipid plaque” images. The samples generated by our model for each class are presented in Figure 4. Then, we applied 10-fold cross-validation (10-CV) to estimate the generalization performance of the compared models on unseen data. During the implementation phase, the original OCT dataset acquired from real patients, labeled by experts and obtained under ethical approval/informed consent, was partitioned across all subsets. Overall, 60% of the images were used for training, 15% of the images were allocated for the validation phase, and the remaining 25% were used in the testing phase.
From a clinical perspective, the images generated by the proposed model, as indicated in Figure 4, offer significant improvements in terms of realism and diagnostic utility compared to conventional GAN-centric augmentation methods. The blue zones in Figure 4 indicate the background with low backscatter intensity and represent areas farther from the catheter probe. The red zones are highly backscattering regions, indicating the presence of plaques. The green zones are interfaces between dense and plaque-rich areas. The proposed model achieves greater diversity in synthetic OCT data generation than reported GAN architectures due to its modular and task-specific design. The introduction of two distinct generators allows the model to independently and effectively capture the complex interplay between plaque types and baseline anatomical variability. Similarly, unlike conventional models that generate images from a single latent space, this framework leverages dual latent distributions and supports interpolation via a blending coefficient (α). This mechanism enables the model to produce a continuum of clinically relevant synthetic OCT samples with high diversity.
The DGDFGAN model was used for experimentation, and we tried different combinations in which the multi-stage or multi-discriminatory aspect was treated incrementally. During the training phase, each loss weight parameter (λ1–λ5) controls the relative importance of a specific learning objective, and their values directly influence which features are emphasized. Similarly, the latent interpolation factor (α) acts as a blending factor that defines the relative contribution of each generator to the final image. The learning rate and GWO were used for stable and adaptive convergence; learning rate tuning dictates how responsively G1 and G2 adapt to the feedback from their respective discriminators. The extended OCT dataset offered high diversity, clinical realism and controlled augmentation, leading to more robust training for classification.
We analyzed the structural similarity index measure (SSIM) loss and the Fréchet inception distance (FID) score as additional cost functions. All experiments were run for 100 epochs for each fold. The SSIM index is a metric that evaluates the degradation of images from a perceptual point of view, and the FID score underlines the diversity impact. We also included the L1 loss over 10-fold cross-validation (10-CV). L1 loss measures the absolute difference between predicted and ground-truth values. It is used to enforce pixel-wise similarity in image generation tasks.
An inter-comparison of our proposed model with the cutting-edge Multi-SDGAN model is presented in Table 1 in terms of SSIM, L1 and FID scores. It is evident that employing the dual-generator and multi-discriminatory fusion modules improves the performance in both similarity and diversity aspects. The dual generator helps in achieving improved diversity in the generated images, as each generator focuses on different aspects of OCT image synthesis. This reduces mode collapse, where a single generator might otherwise get stuck producing similar images. Each generator interacts with different discriminators, ensuring that features are well learned from multiple perspectives for an enhanced feature representation. By training against different discriminators, the generators learn robust representations that can generalize better to unseen OCT images. As illustrated in Table 2, the two generators share the load, leading to faster convergence and resulting in a lower FID (better realism) and a higher SSIM (better structural similarity) compared to other approaches, including DDFAGAN, MHWGAN and Multi-SDGAN.
We computed the SSIM loss, which is a perceptual loss function for evaluating the similarity between the original and generated OCT images. It exploits contrast and structure rather than merely pixel-wise differences. The algorithm and a description of the related parameters are presented in Figure 5.
In Figure 5, i represents the iteration counter; SSIMprev is the SSIM value from the previous iteration; G is the generated image; x is the real image; μG and μx are the mean intensities of the generated and real images; σ2G and σ2x are the variances of the generated and real images; σGx is the covariance between the generated and real images; C1 and C2 are constants to stabilize the calculations; SSIM indicates the structural similarity index; SSIMLoss is defined as 1 − SSIM; ε is the convergence threshold; Maxiter is the maximum number of iterations; LG is the generator loss; and λk represents the weight coefficients for each loss term.
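A compact sketch of the SSIM-based loss summarized in Figure 5 is shown below. For brevity it computes a single global SSIM over the image rather than a windowed average, and the constants follow the common choice C1 = (0.01·L)² and C2 = (0.03·L)², both of which are assumptions rather than the paper's exact settings.

```python
import numpy as np

def ssim_loss(g: np.ndarray, x: np.ndarray, data_range: float = 1.0) -> float:
    """Global SSIM between generated image g and real image x; loss = 1 - SSIM."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_g, mu_x = g.mean(), x.mean()                     # mean intensities
    var_g, var_x = g.var(), x.var()                     # variances
    cov_gx = ((g - mu_g) * (x - mu_x)).mean()           # covariance
    ssim = ((2 * mu_g * mu_x + c1) * (2 * cov_gx + c2)) / \
           ((mu_g ** 2 + mu_x ** 2 + c1) * (var_g + var_x + c2))
    return 1.0 - float(ssim)
```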
The FID score is used to measure the similarity between the generated images and real OCT images by comparing their feature distributions. It exploits the mean and covariance of the feature embeddings. FID considers both realism and diversity in order to create a better generative model. The algorithm used and the relevant variables are illustrated in Figure 6.
In Figure 6, i represents the iteration counter; FIDprev is the FID score from the previous iteration; μr and Σr are the mean and covariance of the real image feature representations; μg and Σg are the mean and covariance of the generated image feature representations; Σr and Σg are the covariance matrices of the real and generated images; the distance term represents the difference in the feature distributions; ε is the convergence threshold; Maxiter is the maximum number of iterations; LG is the generator loss; and λk represents the weight coefficients for each loss term.
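The FID computation summarized in Figure 6 can be sketched as follows, given feature embeddings of real and generated images (e.g., from an Inception network); extracting those embeddings is assumed to happen upstream and is not shown.

```python
import numpy as np
from scipy import linalg

def fid_score(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """Frechet inception distance between two sets of feature embeddings (N x D)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)   # feature means
    sigma_r = np.cov(feat_real, rowvar=False)                    # real covariance
    sigma_g = np.cov(feat_gen, rowvar=False)                     # generated covariance
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)     # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                   # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```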
As illustrated in Table 2, an increase of 10.81% in the SSIM index was observed compared to the leading SSIM measurements reported for Multi-SDGAN [42]. This is indicative of achieving better structural similarity between the original and generated OCT images. An improvement of 41.66% in the FID score over the baseline results of Multi-SDGAN confirms that our model significantly reduces the difference between the feature distributions of real and generated images. The L1 score reported in Table 2 is further suggestive of the fact that our model predictions are close to the ground truth. The dual-generator setup mitigates the problem of biased learning, where a single generator struggles to capture all variations in the OCT dataset. By working in parallel, the generators foster a more comprehensive understanding of the data distribution. Consequently, the multi-discriminator framework enhances the FID scores by ensuring a richer adversarial signal to produce images with better realism and diversity. We ran simulations for computational cost/inference latency and accuracy to assess the feasibility of real-world clinical deployment, as presented in Figure 7. DGDFGAN incurs a higher computational cost (~2.3× baseline), primarily due to the dual generators and the dynamic fusion of five discriminators. Latency increases to 78 ms per image (from a 40 ms baseline), still within acceptable limits for real-time clinical use (typically <100 ms). Despite the higher latency, DGDFGAN shows a notable boost in classification accuracy, which supports its deployment in high-stakes clinical environments.
As indicated in Figure 8, the baseline GAN shows faster loss reduction initially but exhibits more fluctuations across training epochs. DGDFGAN demonstrates smoother convergence with reduced noise due to the mutual regularization effect of the dual generators and the stabilizing influence of multiple discriminators. Similarly, the baseline GAN has a higher and more oscillating variance, implying unstable synthesis behavior, whereas DGDFGAN maintains a lower and more stable variance over time. It can be inferred from Figure 8 that the dual-generator and multi-discriminator framework effectively reduces instability by covering diverse feature distributions while the generators regulate each other’s outputs. Our model allows for customization of plaque scenarios, which is not feasible with monolithic generators, especially in low-resource settings. However, the presented model is computationally intensive and contingent upon the biases in the source dataset, as indicated in Table 3. Additionally, a sophisticated control scheme is required to handle the disentanglement of anatomical and pathological features.
As illustrated in Table 3, Pix2Pix GAN uses a single generator and a single discriminator, which lack the detailed attention and diversity provided by ensemble models. Networks like CycleGAN and StyleGAN can produce diverse outputs through multi-domain learning, but they fail to capture the extent of specialization in different domains. DDFA-GAN and Multi-SDGAN were re-implemented following their original architecture specifications, with hyperparameters tuned to match our dataset’s size and characteristics. StyleGAN2 was implemented using the official open-source codebase and fine-tuned on our OCT dataset, with modifications to the input resolution and training steps to align with our dual-generator setup. To ensure a fair comparison, all models were trained under identical data splits, hardware conditions and evaluation metrics (FID, SSIM and classification accuracy). The proposed model uses five discriminators, each focusing on different aspects of the generated images, hence guiding the generators to explore different features of the image space for broader diversity.
Our approach, while data-constrained, is designed to maximize diversity and realism through dual generators that explicitly model distinct features. To address pathological variability, two expert cardiologists blindly assessed 200 randomly sampled synthetic images. The box plot in Figure 9 represents the distribution of realism scores assigned by the expert cardiologists to both real and synthetic OCT images, rated on a 5-point Likert scale. Real images consistently scored higher, with minimal variability and a median close to 4.5. In contrast, synthetic images, although scoring slightly lower in median realism (4.3), still fell within an acceptable range for clinical use. The wider interquartile range of the synthetic images suggests slightly more variability in expert perception. Nonetheless, the overlap between the score distributions of real and synthetic images supports the claim that the synthetic data generated by the DGDFGAN framework achieve near-realistic visual fidelity.
Figure 10 compares the misclassification rates for three plaque types when models are trained on real versus synthetic OCT images. For real images, the misclassification rates are low across all plaque types, with fibrous plaques at 2% and lipidic and calcified plaques at 1% each. Synthetic images show a slightly higher rate, with fibrous plaques at 4%, lipidic plaques at 3% and calcified plaques at 2%. Despite this modest increase, the misclassification rates remain within clinically acceptable thresholds. The ablation study evaluates several configurations derived from our proposed model for OCT image augmentation in terms of diversity, accuracy and realism. The results are presented in Figure 11, where the baseline model is compared with different settings, including four optimized scenarios.
The baseline configuration, integrating both generators (G1 and G2) with all five discriminators, delivers the best diversity, realism and highest performance across all metrics. It achieves a PSNR of 28.7 dB, a SSIM value of 0.94, a FID score of 9.6 and a Dice score of 0.91. This configuration outperforms every other variant and is indicative of component synergy. In Ablation 1, where G2 (the anatomical generator) is removed, the model achieves a PSNR of 26.1 dB, a SSIM value of 0.89, a FID score of 14.3 and a Dice score of 0.84. The drop in the SSIM and Dice score suggests diminished anatomical fidelity. The removal of D1 in Ablation 2 results in a PSNR of 27 dB, a SSIM of 0.91 and FID and Dice scores of 12.8 and 0.87, respectively. While image quality remains decent, the classification performance suffers, indicating that D1 is essential for ensuring class-discriminative features in the synthetic images. Removing D2 in Ablation 3 results in PSNR = 27.2 dB, SSIM = 0.90, FID = 13.1 and Dice score = 0.86. This moderate drop indicates that local fine-grained textures are compromised. The Ablation 4 version achieves PSNR = 26.5 dB, SSIM = 0.88, FID = 15.0 and Dice score = 0.83. The removal of D3 drastically impacts visual quality, increasing the FID score and reducing PSNR. The setup of Ablation 5 produces PSNR = 26.3 dB, SSIM = 0.87, FID = 14.7 and Dice score = 0.82. The low SSIM and Dice score reflect disruption in OCT layer alignment, whereas the removal of D5 in Ablation 6 negatively impacts perceptual quality.
The optimized scenarios A, B, C and D were examined, with scenario A only including G1, D1, D2 and D3. This scenario exhibits PSNR = 27.3 dB, SSIM = 0.91, FID = 12.2 and Dice score = 0.87. Scenario B only includes G2, D1, D3 and D4, with PSNR = 27.4 dB, SSIM = 0.92, FID = 11.8 and Dice score = 0.88. Configuration C only takes into consideration G1, G2, D1, D2 and D3 while discarding other blocks. The implementation of scenario D for optimized ablation embeds G1, G2, D1, D3 and D5, with PSNR = 27.9 dB, SSIM = 0.92, FID = 10.4 and Dice score = 0.88. All ablation configurations, including scenarios A, B, C and D, perform reasonably well, but they underperform compared with the proposed baseline model.