Article

Perspective Transformation and Viewpoint Attention Enhancement for Generative Adversarial Networks in Endoscopic Image Augmentation

by Laimonas Janutėnas and Dmitrij Šešok *
Department of Information Technologies, Vilnius Gediminas Technical University, Saulėtekio Al. 11, LT-10223 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5655; https://doi.org/10.3390/app15105655
Submission received: 23 April 2025 / Revised: 12 May 2025 / Accepted: 15 May 2025 / Published: 19 May 2025
(This article belongs to the Special Issue Deep Learning in Medical Image Processing and Analysis)

Abstract

This study presents an enhanced version of the StarGAN model, with a focus on medical applications, particularly endoscopic image augmentation. Our model incorporates novel Perspective Transformation and Viewpoint Attention Modules for StarGAN that improve image classification accuracy in a multiclass classification task. The Perspective Transformation Module enables the generation of more diverse viewing angles, while the Viewpoint Attention Module helps focus on diagnostically significant regions. We evaluate the performance of our enhanced architecture using the Kvasir v2 dataset, which contains 8000 images across eight gastrointestinal disease classes, comparing it against baseline models including VGG-16, ResNet-50, DenseNet-121, InceptionNet-V3, and EfficientNet-B7. Experimental results demonstrate that our approach improves accuracy for all evaluated models in this eight-class classification problem, with average gains of 0.70 percentage points for VGG-16 and 0.63 percentage points for EfficientNet-B7. The added perspective transformation capability yields more diverse examples with which to augment the database and provides more samples of specific illnesses. Our approach offers a promising solution for medical image generation, enabling effective training with fewer data samples, which is particularly valuable in medical model development, where data are often scarce due to acquisition challenges. These improvements demonstrate significant potential for advancing machine learning disease classification systems in gastroenterology and medical image augmentation as a whole.

1. Introduction

In recent years, Generative Adversarial Networks (GANs) have been increasingly used in medical diagnostic image analysis. In the last four years, significant work has been devoted to GAN-based detection of COVID-19 [1,2], as well as to GAN applications in diagnosing various benign and malignant tumors [3]. A key application of GANs in medical imaging is image augmentation, a crucial technique used to artificially expand limited datasets and enhance the generalization capability of deep learning models. In the medical domain, where obtaining large annotated datasets is challenging due to privacy concerns, acquisition costs, and the need for expert annotations, augmentation techniques play a vital role in overcoming data scarcity [4,5].
GANs are used to generate synthetic medical images, which help supplement datasets when high-quality annotated medical images are lacking. These generated images complement existing data and enable the development of increasingly reliable diagnostic models. In the work [6], the authors explain six different applications of GANs in medical image analysis: medical image segmentation; classification; reconstruction; detection; synthesis; noise reduction.
Endoscopic examinations serve as the gold standard for diagnosing and assessing a wide range of gastrointestinal diseases. These real-time video examinations provide high-definition visualization of the GI tract’s interior, revealing specific visual signatures that are critical for accurate diagnosis. For instance, polyps appear as mucosal outgrowths with distinctive color and surface patterns that may develop into cancer if untreated. Inflammatory conditions display equally characteristic features—esophagitis presents as red mucosal tongues projecting from the Z-line, while ulcerative colitis exhibits bleeding, swelling, and ulceration with white fibrin coating. The precise assessment of disease severity and sub-classification directly influences treatment decisions and follow-up care protocols. Despite their clinical value, these examinations require both expensive equipment and highly trained personnel. The development of computer-aided detection and diagnostic systems could potentially democratize access to expert-level analysis, reduce inequalities in care delivery, optimize the use of limited medical resources, and allow clinicians to dedicate more time to patient care rather than documentation [5].
Early detection of various cancers, such as colorectal cancer (CRC) and lung cancer, through screening has demonstrated significant benefits, with studies showing steady declines in both CRC incidence and mortality rates. This decline is attributed to improved risk factor management, early cancer detection through screening programs, precancerous polyp removal via colonoscopy, and advances in surgical and treatment approaches [7]. Most colorectal cancers develop through either the adenoma–carcinoma sequence or from sessile serrated lesions, presenting valuable opportunities for cancer prevention through timely intervention [5,7].
Despite significant advancements in medical imaging technologies, substantial challenges persist in developing automated analysis systems for diagnostic applications. Further improvement of deep learning models in medical image analysis is largely bottlenecked by the lack of large, well-annotated datasets [8]. Deep neural network architectures require extensive training data containing all possible variations to achieve high generalization ability and robustness [9]. However, medical images are notoriously scarce due to several interrelated factors: an insufficient number of patients with certain diseases, patients’ unwillingness to allow the use of their images, a lack of appropriate medical equipment, and the inability to obtain images meeting desired criteria [8,9]. Even when sufficient images are acquired, proper annotation requires specialized domain knowledge from medical professionals, making the process both time-consuming and expensive [8]. This data scarcity leads to imbalanced datasets, causing overfitting, biased results, and diminished diagnostic accuracy [9]. The challenge is particularly pronounced when attempting to develop computer-aided diagnosis systems for different organs using images from various modalities, as each presents unique characteristics requiring specialized analytical approaches [8,9]. While data augmentation techniques have emerged as a common solution to address these limitations, the effectiveness of specific augmentation methods varies significantly depending on the disease type and neural network architecture employed. Unlike conventional computer-aided detection schemes that require manual feature development, deep learning models can progressively identify and learn hidden patterns inside regions of interest through their hierarchical architecture, potentially overcoming these data limitations when properly implemented [8].
Among various deep learning architectures, Generative Adversarial Networks (GANs) have emerged as particularly powerful tools for medical image generation, with recent studies demonstrating their capacity to produce realistic synthetic images that can effectively augment limited training datasets while preserving clinically relevant features [10]. GANs operate through a competitive framework between two neural networks: a generator that creates synthetic images and a discriminator that attempts to distinguish real from generated data. This adversarial training process enables GANs to produce realistic medical images over time. In the scope of healthcare, GANs address fundamental challenges posed by limited dataset availability, privacy restrictions, and imbalanced class distributions. GAN-based augmentation extends beyond simple transformations to generate entirely new samples based on learned data distributions, providing enhanced diversity and quality of training datasets. Various GAN architectures have been successfully implemented across different medical imaging modalities. However, there are persistent challenges with GAN implementations, including mode collapse, gradient vanishing problems, and difficulties in maintaining training stability, emphasizing the need for continued research to optimize these approaches for medical imaging applications [11].
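To make the adversarial objective concrete, the sketch below shows one generic GAN training step in PyTorch: the discriminator is updated to separate real from generated images, and the generator is updated to fool it. This is a minimal, unconditional illustration only; StarGAN, discussed below, instead conditions its generator on an input image and a target domain label, and the network definitions and latent dimension used here are assumptions.

```python
import torch
import torch.nn as nn

def gan_step(generator, discriminator, real_images, g_opt, d_opt, latent_dim=128):
    """One adversarial update for a generic (unconditional) GAN -- illustration only."""
    bce = nn.BCEWithLogitsLoss()
    batch, device = real_images.size(0), real_images.device
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)

    # Discriminator: push real images toward "real", generated images toward "fake".
    fake_images = generator(torch.randn(batch, latent_dim, device=device)).detach()
    d_loss = bce(discriminator(real_images), ones) + bce(discriminator(fake_images), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: produce images the discriminator classifies as "real".
    g_loss = bce(discriminator(generator(torch.randn(batch, latent_dim, device=device))), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```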
Previous research has explored GAN architectures for endoscopic image augmentation, with StarGAN demonstrating particular promise in this domain [12]. StarGAN addresses a critical limitation of conventional GANs by enabling multi-domain image-to-image translation using only a single model architecture [13], which significantly improves scalability for medical imaging applications. Park et al. [12] showed that GAN-based augmentation can substantially enhance the classification performance of endoscopic imaging systems. Their research validated the effectiveness of applying data augmentation based on generative networks to endoscopic images of various gastrointestinal conditions and empirically demonstrated that StarGAN augmentation yielded superior outcomes compared to traditional augmentation methods when applied to datasets covering different types of gastrointestinal diseases. This work established a foundation, and our current research builds upon it, introducing further architectural refinements to address the unique challenges posed by endoscopic image synthesis [12].
This study advances GAN-based endoscopic image synthesis by modifying the StarGAN architecture to address domain-specific challenges in gastrointestinal imaging. Our approach introduces specialized perspective and viewpoint transformation layers to the StarGAN model, enabling the network to generate synthetic endoscopic images that better account for the unique positional variations encountered during actual endoscopic procedures. These viewpoint-aware synthetic images represent the diversity of angles and distances from which lesions might be visualized in clinical practice, resulting in augmented datasets that more comprehensively capture the variability clinicians encounter during diagnostic procedures. Additionally, we replace traditional ReLU activations with LeakyReLU activation functions throughout the network, which improves gradient flow during training and effectively mitigates the vanishing gradient problem that commonly plagues GAN stability in medical imaging applications. This modification allows for more efficient learning of the subtle tissue textures and color variations crucial for accurate diagnostic interpretation. Together, these architectural enhancements to the StarGAN framework produce synthetic endoscopic images that maintain clinically relevant features while expanding dataset diversity, ultimately improving downstream diagnostic model performance.
The main contributions of this paper can be summarized as follows:
1. We introduce a novel Perspective Transformation Module for StarGAN that enables the generation of endoscopic images from diverse viewing angles.
2. We develop a Viewpoint Attention Module that focuses on diagnostically significant regions in the generated images, enhancing the clinical relevance of synthetic data.
3. We implement LeakyReLU activation functions throughout the network, replacing traditional ReLU activations to improve gradient flow and network stability.
4. We demonstrate empirical improvements in classification accuracy across multiple state-of-the-art deep learning architectures on the eight-class Kvasir dataset, with gains ranging from 0.298 to 0.704 percentage points.
5. We provide a comprehensive evaluation of our enhanced StarGAN architecture against baseline models, validating its effectiveness for medical image augmentation in multiclass classification tasks.

2. Related Work

2.1. Basic Transformation for Image Augmentation

Rotation is one of the most widely implemented augmentation techniques, involving rotating images by various angles (typically 90°, 180°, or 270°) to increase viewpoint variability. This technique is particularly valuable for object recognition tasks. Rotation has been implemented in numerous recent studies (2022–2023), including works by Üreten et al. [14], Zhang et al. [15], and Li et al. [16], who demonstrated its effectiveness in enhancing detection accuracy in various medical imaging applications.
Scaling and resizing transformations modify image dimensions to simulate varying distances or resolutions. These techniques have been applied in recent studies such as Zhang et al. [17], who used scaling as part of their minimal training data strategy for diverse medical image synthesis, and He et al. [18], who employed proximal updates for differentiable automatic data augmentation by training medical image segmentation models with varying image scales.
Translation involves shifting images horizontally and vertically to simulate changes in object position. This technique helps models become invariant to the location of features within an image. Translation has been frequently employed in recent medical imaging research, including studies by Platscher et al. [6] for stroke lesion segmentation, and Wang et al. [19] for cross-modality LGE–CMR segmentation using image-to-image translation-based augmentation.

2.2. GAN Augmentation

Article [20] presents a detailed analysis of GAN algorithms, highlighting their strengths and limitations. It includes an analysis of experimental studies to provide insight into the current development of GANs in medical imaging. However, the authors acknowledge that applying GANs to medical imaging presents unique challenges that must be addressed to enhance productivity.
Similarly, the authors [21] conducted an extensive review of scientific articles on the use of GANs in medicine. They included 121 studies in their final methodological review, covering GAN-based applications in seven areas of medical imaging: synthesis, classification, segmentation, conversion, reconstruction, denoising, and lesion detection. Their findings highlight several critical issues associated with GANs, such as pattern collapse, instability, and lack of interpretability, which remain unresolved.
In the paper [22], the authors present a synthetic data augmentation technique using GANs. They use a convolutional neural network (CNN) model to classify X-ray images and generate synthetic X-ray images using a deep convolutional generative adversarial network (DCGAN) model. The authors’ methodology improved the CNN model’s performance by 3.2 percent.
In the paper [10], considering that medical images differ from conventional RGB images in terms of complexity and dimensionality, the authors analyzed an adaptive generative adversarial network, namely MedGAN. They used Wasserstein loss as a convergence metric to measure the convergence degree of the generator and discriminator. After adaptively training MedGAN and generating medical images according to this metric, they used it to build multi-frame medical data learning models for disease classification and lesion localization.

2.3. Endoscopic Image Augmentation

GANs have been used increasingly in endoscopy in particular. Endoscopy is a medical procedure that uses specialized equipment to examine internal organs or perform interventional procedures under real-time imaging. It is most commonly used to investigate diseases of the digestive tract.
In the work [23], the authors proposed a Swin Transformer encoder-based StyleGAN (STE-StyleGAN) for unbalanced endoscopic image enhancement, which consists of an adversarial learning encoder and a generator. The encoder extracted multi-scale features layer by layer from endoscopic images. After that, a self-attention mechanism was applied to the generator, which adds detailed information to the image layer by layer through the encoded features. Experimental investigation with real data showed that the images generated by STE-StyleGAN achieved a Fréchet Inception Distance (FID) value of 100.4, and the model achieved an accuracy of 86%.
Endoscopic images are difficult to analyze quickly and thoroughly because their interpretation often relies on human vision alone. In the work [12], the authors aimed to create a classification system for digestive system diseases. During the study, a total of six models were trained: VGG-16, ResNet-50, DenseNet-121, InceptionNet-V3, EfficientNet-B7, and ViT. To increase the amount of medical data, data augmentation was applied using two generative adversarial network-based models. The authors’ experimental studies demonstrated that InceptionNet-V3 showed the best performance improvement with StarGAN-based augmentation.

2.4. StarGAN

StarGAN, introduced by Choi et al. [13], represents a significant advancement in generative adversarial networks by enabling multi-domain image-to-image translation using a single model architecture. Unlike previous approaches that required separate networks for each domain transformation, StarGAN employs a unified framework with one generator and one discriminator, where the generator translates input images to target domains specified by domain labels, while the discriminator both distinguishes real from fake images and classifies domain attributes. This unified approach dramatically improves scalability and training efficiency. Building on this foundation, Park et al. [12] successfully applied StarGAN specifically to endoscopic image augmentation for gastrointestinal disease classification. Their approach implemented the original StarGAN architecture without modifications to generate synthetic endoscopic images for data augmentation. They trained classification models on both the original dataset and an augmented dataset containing StarGAN-generated images. Their research empirically demonstrated that StarGAN-based augmentation yields superior results compared to traditional data augmentation methods when applied to datasets showing various gastrointestinal conditions. Their work established that InceptionNet-V3 showed the most significant performance improvements when enhanced with StarGAN-augmented training data, providing a strong foundation for applying generative approaches to medical image classification tasks where limited training data represents a persistent challenge.

2.5. Deep Learning Models for Medical Image Classification

This research evaluates several state-of-the-art convolutional neural network architectures. VGG-16 [24], developed by the Visual Geometry Group at Oxford, features 16 layers with uniform 3 × 3 convolutions and has been widely used as a baseline model despite its relative simplicity. ResNet-50 [25] introduced residual connections to address the vanishing gradient problem in deeper networks, enabling effective training of 50-layer networks. DenseNet-121 [26] further improved feature propagation by implementing dense connections where each layer receives inputs from all preceding layers, reducing the number of parameters while maintaining strong performance. InceptionNet-V3 [27] employs parallel convolutional filters of different sizes to capture features at various scales, making it particularly effective for medical images with structures of varying dimensions. EfficientNet-B7 [28], the most recent architecture tested, uses compound scaling to systematically balance network depth, width, and resolution, achieving state-of-the-art performance with fewer parameters. In our study, these architectures serve as backbone networks for endoscopic image classification, with each model initialized using pre-trained weights and customized with a fully connected layer added to the end.
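As an illustration of this transfer-learning setup, the sketch below loads ImageNet-pretrained torchvision backbones and swaps in a new fully connected head for the eight Kvasir classes. It is not the authors' exact code; the choice of which layer to replace follows standard torchvision conventions, and InceptionNet-V3 and EfficientNet-B7 can be adapted with the same pattern.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # Kvasir v2 classes

def build_classifier(name: str) -> nn.Module:
    """Pretrained backbone with a new fully connected head (illustrative, not the authors' code)."""
    if name == "vgg16":
        model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, NUM_CLASSES)
    elif name == "resnet50":
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    elif name == "densenet121":
        model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return model
```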

3. Materials and Methods

3.1. Dataset

The Kvasir dataset (Kvasir Dataset v2) is a publicly available endoscopic image collection designed for computer-aided detection and classification of gastrointestinal diseases. It contains a total of 8000 images distributed evenly across eight classes, with 1000 images per class. All images are encoded in JPEG format, with pixel resolutions ranging from 720 × 576 to 1920 × 1072. The dataset is organized into three main categories: anatomical landmarks (normal cecum, normal pylorus, and normal Z-line), pathological findings (esophagitis, polyps, and ulcerative colitis), and images representing the polyp removal process (dyed lifted polyps and dyed resection margins). This dataset provides standardized, annotated images for developing and evaluating machine learning algorithms for endoscopic image classification, particularly for detecting abnormalities in the digestive system [5].

3.2. Architectural Enhancements over Original StarGAN

The proposed model enhances the original StarGAN framework with several architectural modifications aimed at improving the quality of generated medical images and addressing the specific challenges of medical image augmentation.

3.2.1. Attention Mechanisms

A significant enhancement to the original StarGAN is the incorporation of two complementary attention mechanisms: Perspective Transformation Module (PTM) and Viewpoint Attention Module (VAM).

Perspective Transformation Module (PTM)

The Perspective Transformation Module is a specialized neural network module that learns to generate and apply perspective transformations to input images or feature maps. Its functionality can be broken down into several key components:
  • Feature Extraction Component. This component processes the input data to extract relevant information about spatial relationships and visual structures within the image. It reduces the spatial dimensions while increasing the feature richness, effectively compressing the visual information into a representation that is suitable for predicting transformation parameters.
  • Transformation Parameter Prediction. Based on the extracted features, this component generates a set of six parameters that define an affine transformation matrix. These parameters control various aspects of the transformation, including rotation, scaling, shearing, and translation effects. The component is initially configured to predict an identity transformation (no change) and learns to predict appropriate transformations during training.
  • Randomization Mechanism. To introduce diversity in the transformations, this component adds controlled randomness to the predicted parameters. It applies random rotations, scaling factors, and small translations that vary from one input to another. This variability ensures that the module can produce diverse viewpoint changes rather than a single fixed transformation.
  • Transformation Application. The final component constructs a transformation grid based on the combined predicted and randomized parameters. It then applies this transformation to the input data using interpolation techniques that ensure the resulting output maintains visual coherence. This process warps the input according to the specified perspective change while preserving the overall visual content.
In essence, the module functions as a learnable perspective transformer that can adapt its behavior based on input characteristics while incorporating controlled randomness to produce diverse transformations. The entire process is differentiable, allowing the module to be integrated into larger neural network architectures and trained end-to-end. The whole architecture is displayed in Figure 1.
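A hedged sketch of such a module is shown below, in the spirit of a spatial transformer: features are pooled to predict six affine parameters initialized to the identity, a small random rotation, scale, and translation are added, and the combined transform is applied with differentiable grid sampling. Layer widths and the randomization ranges are illustrative assumptions rather than the authors' exact values.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerspectiveTransformationModule(nn.Module):
    """Learnable affine/perspective warp with controlled randomness (illustrative sketch)."""

    def __init__(self, in_channels: int):
        super().__init__()
        # Feature extraction: compress spatial content into a vector for parameter prediction.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Predict the six parameters of a 2 x 3 affine matrix, initialized to the identity.
        self.fc = nn.Linear(64, 6)
        nn.init.zeros_(self.fc.weight)
        with torch.no_grad():
            self.fc.bias.copy_(torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        theta = self.fc(self.features(x).flatten(1)).view(b, 2, 3)

        # Randomization: per-sample rotation (about +-15 deg), scale (+-10%) and small translation.
        angle = (torch.rand(b, device=x.device) - 0.5) * (math.pi / 6)
        scale = 1.0 + (torch.rand(b, device=x.device) - 0.5) * 0.2
        shift = (torch.rand(b, 2, device=x.device) - 0.5) * 0.1
        delta = torch.zeros(b, 2, 3, device=x.device)
        delta[:, 0, 0] = scale * torch.cos(angle) - 1.0
        delta[:, 0, 1] = -scale * torch.sin(angle)
        delta[:, 1, 0] = scale * torch.sin(angle)
        delta[:, 1, 1] = scale * torch.cos(angle) - 1.0
        delta[:, :, 2] = shift

        # Build the sampling grid and warp the features with differentiable interpolation.
        grid = F.affine_grid(theta + delta, list(x.size()), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```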

Viewpoint Attention Module (VAM)

The Viewpoint Attention Module implements a self-attention mechanism within neural networks designed to focus on specific regions in feature representations. This module enables selective emphasis of important spatial locations within visual data. Core components and their functions include the following:
  • Query, Key, and Value Projections. The module utilizes three parallel transformations of the input data:
Query Projection transforms the input features to represent “what we’re looking for” at each position. This projection typically reduces the feature dimensionality to create a more compact representation.
Key Projection creates a representation of “what information is available” at each position, with similar dimensionality reduction as the query projection.
Value Projection represents the actual information content at each position that will be selectively emphasized or attenuated by the attention mechanism. This projection typically maintains the original feature dimensions.
  • Attention Computation. The attention mechanism operates by:
Computing similarity scores between query and key representations, measuring how relevant each position is to every other position.
Normalizing these similarity scores using SoftMax to create attention weights that sum to 1.
These weights effectively create a probability distribution indicating which regions should receive greater focus.
  • Feature Aggregation. After computing attention weights, the module:
Uses these weights to create a weighted combination of value features.
This aggregation process allows information to flow between different positions based on their computed relevance.
Positions with higher attention scores contribute more strongly to the final representation.
  • Functional Significance. The Viewpoint Attention Module enhances neural networks by:
Enabling long-range dependencies between distant spatial positions.
Creating a dynamic, content-dependent focus mechanism.
Selectively emphasizing relevant features while suppressing irrelevant ones.
Providing a flexible way to incorporate global context into local feature processing.
Maintaining training stability through gradual introduction of the attention mechanism.
This attention-based approach allows the network to develop a more nuanced understanding of spatial relationships and focus on the most informative regions for the task at hand, particularly valuable when processing viewpoint-dependent information in computer vision applications. The entire module architecture is presented in Figure 2.
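The sketch below captures this mechanism as a standard spatial self-attention block with 1 × 1 convolutional query, key, and value projections and a learnable gate initialized to zero, which matches the idea of introducing attention gradually. The channel-reduction factor and gating scheme are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewpointAttentionModule(nn.Module):
    """Spatial self-attention over feature positions with a gradually introduced gate (sketch)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # "what to look for"
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)    # "what is available"
        self.value = nn.Conv2d(channels, channels, kernel_size=1)               # content to aggregate
        self.gamma = nn.Parameter(torch.zeros(1))  # starts at 0: attention phased in during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/r)
        k = self.key(x).flatten(2)                      # (B, C/r, HW)
        v = self.value(x).flatten(2)                    # (B, C, HW)

        attn = F.softmax(torch.bmm(q, k), dim=-1)       # similarity scores normalized to sum to 1
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)  # weighted combination of values
        return x + self.gamma * out                     # residual connection keeps training stable
```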

3.2.2. Integration of LeakyReLU Activation

In our modified architecture, we replaced all of the Generator’s ReLU activations with LeakyReLU activations (with a negative slope coefficient of 0.2) and set all of the Discriminator’s LeakyReLU activations, including those in the residual blocks, to the same 0.2 coefficient. This modification addresses the “dying ReLU” problem that can occur when handling the subtle intensity variations common in medical images. LeakyReLU allows small negative gradients to flow through the network, resulting in more stable training dynamics and better preservation of fine details in generated medical images. This change is particularly beneficial in the encoder blocks of the generator network, where preserving gradient information for low-intensity features is crucial for maintaining pathological markers in synthesized medical images.
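A minimal helper of the kind that could perform this substitution is sketched below; it is an assumption about implementation style, not the authors' code.

```python
import torch.nn as nn

def replace_relu_with_leaky(module: nn.Module, slope: float = 0.2) -> None:
    """Recursively swap every ReLU for LeakyReLU with a 0.2 negative slope (illustrative helper)."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(slope, inplace=True))
        else:
            replace_relu_with_leaky(child, slope)
```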

3.2.3. Module Integration into StarGAN

The Perspective Transformation Module and Viewpoint Attention Module are strategically inserted in the middle of the Generator’s processing pipeline, between the down-sampling and residual blocks of the network.
After the input image and domain information are processed through the initial convolution and the two down-sampling layers, the resulting feature maps are passed through the Perspective Transformation Module. This module applies spatial transformations to the features, helping the model adapt to different viewpoints.
Immediately following this, the transformed features are fed into the Viewpoint Attention Module, which applies attention mechanisms to focus on the most relevant regions of the feature maps.
Only after these two attention-based modules have processed the features do they proceed to the residual blocks, which maintain the transformed and attention-weighted representation while adding identity connections for better gradient flow.
This architecture places the attention mechanisms exactly where they can have the most impact: after initial feature extraction but before the deeper processing of the residual blocks, allowing the network to focus on relevant spatial information early in the processing pipeline. The Enhanced Generator architecture is described in Figure 3.
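The sketch below illustrates this ordering only, reusing the two modules sketched earlier: down-sampling, then the Perspective Transformation and Viewpoint Attention Modules, then the residual blocks and up-sampling. The block contents and the StarGAN-style concatenation of the domain label are placeholders under stated assumptions.

```python
import torch
import torch.nn as nn

class EnhancedGenerator(nn.Module):
    """Ordering sketch only: down-sampling -> PTM -> VAM -> residual blocks -> up-sampling."""

    def __init__(self, downsample: nn.Module, residual_blocks: nn.Module,
                 upsample: nn.Module, feature_channels: int):
        super().__init__()
        self.downsample = downsample              # initial convolution + two down-sampling layers
        self.ptm = PerspectiveTransformationModule(feature_channels)
        self.vam = ViewpointAttentionModule(feature_channels)
        self.residual_blocks = residual_blocks    # identity connections for better gradient flow
        self.upsample = upsample

    def forward(self, x: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # StarGAN-style conditioning: tile the target-domain label and concatenate as channels.
        d = domain.view(domain.size(0), domain.size(1), 1, 1).expand(-1, -1, x.size(2), x.size(3))
        h = self.downsample(torch.cat([x, d], dim=1))
        h = self.ptm(h)   # perspective/viewpoint transformation of the feature maps
        h = self.vam(h)   # attention over the transformed features
        h = self.residual_blocks(h)
        return self.upsample(h)
```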

3.3. Experiment Setup

For the experimental setup, the dataset was split into training and validation sets with an 80:20 ratio. With 1000 images per class, this translated to 800 training images and 200 validation images for each category. The Enhanced GAN was trained on these 800 training images per class, using a learning rate of 0.00007 for the Generator and 0.00005 for the Discriminator. Since the reference paper [12] did not specify the size or quantity of the generated images, we generated 224 × 224 images, creating one additional image for each training image. Following the methodology in [12], we did not generate images for the normal-cecum and normal-pylorus classes, as these already demonstrated relatively higher classification performance due to their clear characteristics.
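A sketch of this generation step, under stated assumptions, is given below: one synthetic 224 × 224 image is saved for each training image, skipping the two unaugmented classes. The choice of the image's own class as the target domain, the tanh output range, and the file layout are assumptions; function and directory names are illustrative.

```python
import os
import torch
import torch.nn.functional as F
from torchvision.utils import save_image

SKIP_CLASSES = {"normal-cecum", "normal-pylorus"}  # left unaugmented, as described above

@torch.no_grad()
def augment_training_set(generator, train_loader, class_names, out_dir="generated"):
    """Save one synthetic 224 x 224 image per training image (hypothetical helper)."""
    generator.eval()
    idx = 0
    for images, labels in train_loader:
        # Target domain assumed to be each image's own class (the paper does not specify).
        targets = F.one_hot(labels, num_classes=len(class_names)).float()
        fakes = generator(images, targets)
        for fake, label in zip(fakes, labels):
            cls = class_names[int(label)]
            if cls in SKIP_CLASSES:
                continue
            os.makedirs(os.path.join(out_dir, cls), exist_ok=True)
            # Assumes a tanh output in [-1, 1]; rescale to [0, 1] before saving as JPEG.
            save_image((fake + 1) / 2, os.path.join(out_dir, cls, f"gen_{idx:05d}.jpg"))
            idx += 1
```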
As in paper [12], additional augmentation techniques implemented during training included random vertical and horizontal flipping, with rotation angles randomly selected between −90° and 90°, along with random image cropping. Transfer learning was utilized across all models, which were initialized with pre-trained weights and customized by adding a fully connected layer at the end. Most images were resized to 224 × 224 × 3 pixels for compatibility with the models, with the exception of InceptionNet-V3, which required a larger input size of 299 × 299 × 3 pixels due to its first convolutional layer specifications. The batch size was generally set at 64, except for EfficientNet-B7, which used a smaller batch size of 32 due to memory constraints.
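The classifier-side augmentation pipeline could be expressed with torchvision transforms as in the sketch below; the crop parameters, transform ordering, and ImageNet normalization statistics are assumptions not specified in the text.

```python
from torchvision import transforms

def training_transforms(input_size: int = 224) -> transforms.Compose:
    """Random flips, rotation in [-90 deg, 90 deg], random crop, resize to the backbone's input size."""
    return transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomRotation(degrees=90),                        # angle drawn uniformly from [-90, 90]
        transforms.RandomResizedCrop(input_size, scale=(0.8, 1.0)),   # crop extent is an assumption
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],              # ImageNet statistics (assumed)
                             std=[0.229, 0.224, 0.225]),
    ])

# training_transforms(299) would be used for InceptionNet-V3; 224 for the other backbones.
```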
Some images contained artifacts such as letters or green screens from endoscope equipment in the lower-left corner. However, since these elements appeared consistently across most categories, they were not expected to significantly impact model learning, and no specific preprocessing was applied to remove them, maintaining consistency with the methodology in [12].
Five deep learning models were evaluated in this study using PyTorch (v2.1, Meta Platforms, Inc., Menlo Park, CA, USA) as the implementation framework. Each model underwent 300 epochs of training. Optimizer selection varied by model architecture, with the Adam optimizer proving most effective for VGG-16 and DenseNet-121, while RMSprop delivered superior performance for ResNet-50, InceptionNet-V3, and EfficientNet-B7. The learning rate was initialized at 0.0001, with CosineAnnealingLR employed as the scheduler. CrossEntropyLoss served as the loss function, and the CosineAnnealing scheduler was configured with a maximum of 50 iterations and a minimum learning rate of 0. For every model architecture and for GAN training and generation, we used an NVIDIA RTX A4500 (NVIDIA Corporation, Santa Clara, CA, USA), except for EfficientNet-B7, for which we used an RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA) because the RTX A4500 lacked sufficient memory to run the model.
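The reported optimizer, scheduler, and loss configuration corresponds to the following sketch; optimizer hyperparameters other than the learning rate are left at library defaults, which is an assumption.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_training_objects(model: nn.Module, arch: str):
    """Optimizer, scheduler, and loss as reported: lr 1e-4, CosineAnnealingLR(T_max=50, eta_min=0)."""
    if arch in {"vgg16", "densenet121"}:
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    else:  # resnet50, inceptionnet-v3, efficientnet-b7
        optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0)
    criterion = nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```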
All hyperparameters are listed in Table 1.

4. Experiments and Results

4.1. Augmented Images

Figure 4 displays a comparison of the three image sets, showing the significant changes achieved through advanced augmentation techniques. The original endoscopic images provide the clinical baseline with authentic tissue coloration and lighting. StarGAN-generated images demonstrate the initial capability of generative models but are practically identical to the originals in terms of structure and positioning, providing only minimal changes in color tones and brightness. This limited transformation restricts their effectiveness in dataset diversification. In contrast, the enhanced model, leveraging more sophisticated rotation and translation transformations alongside an attention mechanism, produces visibly improved results with substantially greater variability. By incorporating diverse viewing angles, positions, and feature emphasis, this approach creates crucial viewpoint invariance—essential for endoscopic imaging where camera angles frequently change—while simultaneously emphasizing tissue boundaries and structural characteristics with heightened contrast. Although the enhanced model occasionally produces more vivid coloration that may appear less natural, these very characteristics help accentuate subtle patterns and tissue changes with diagnostic significance. The visual enhancements directly contribute to the model’s improved accuracy by generating genuinely diverse training data while maintaining focus on clinically relevant features, ultimately supporting better generalization across diverse endoscopic presentations.

4.2. Experiment Results

Experimental evaluation was conducted across five state-of-the-art deep learning architectures: VGG-16, ResNet-50, DenseNet-121, InceptionNet-V3, and EfficientNet-B7. All models were trained for the multiclass classification task involving all eight classes from the Kvasir v2 dataset: anatomical landmarks (normal cecum, normal pylorus, and normal Z-line), pathological findings (esophagitis, polyps, and ulcerative colitis), and polyp removal process images (dyed lifted polyps and dyed resection margins). As mentioned before, the dataset was augmented by generating an additional image to the training dataset, using our Enhanced GAN model.
Table 2 compares the accuracy of the five state-of-the-art deep learning architectures as reported in reference [12] with the accuracy achieved after data augmentation using the Enhanced GAN model. The reported confidence intervals are based on 20 training runs for each model architecture, ensuring robust statistical validation of the observed improvements. Below is an analysis of the results:
  • All five models (VGG-16, ResNet-50, DenseNet-121, InceptionNet-V3, and EfficientNet-B7) showed improvements in accuracy after augmenting the training dataset with additional GAN-generated images.
  • VGG-16 showed the largest relative improvement, increasing from 93.43% to 94.13%, resulting in a gain of 0.704 percentage points with a confidence interval of ±0.118.
  • ResNet-50 improved from 94.18% to 94.68%, with a gain of 0.504 percentage points (±0.106).
  • DenseNet-121 improved from 94.50% to 94.81%, with a gain of 0.318 percentage points (±0.131).
  • InceptionNet-V3 had the highest baseline accuracy and achieved 95.22% after augmentation, a gain of 0.298 percentage points (±0.082). This model had the narrowest confidence interval, suggesting more consistent improvement.
  • EfficientNet-B7 showed the second-largest improvement and achieved the highest overall accuracy after augmentation (95.25%), with a gain of 0.636 percentage points (±0.097).
These results demonstrate that the Enhanced GAN-based data augmentation consistently improved classification performance across different model architectures, with improvements ranging from approximately 0.3 to 0.7 percentage points. The confidence intervals indicate that all improvements are statistically significant, as none of them include zero.
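For reference, a confidence interval of this kind can be computed from the per-run accuracies as sketched below. The paper does not state the exact interval construction, so the normal-approximation 95% interval (z = 1.96) is an assumption for illustration only.

```python
import statistics

def mean_and_ci(accuracies, z=1.96):
    """Mean accuracy and half-width of a normal-approximation 95% interval (assumed construction)."""
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / (len(accuracies) ** 0.5)  # standard error over the runs
    return mean, z * sem

# Example with hypothetical per-run accuracies for one architecture:
# mean, half_width = mean_and_ci([0.941, 0.943, 0.940, 0.942, 0.944])
```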
The observed improvements across all tested architectures highlight several important findings. First, the fact that older architectures like VGG-16 showed the largest relative gains suggests that data augmentation particularly benefits models with fewer parameters or less sophisticated feature extraction capabilities.
Conversely, more recent architectures like InceptionNet-V3 showed smaller but still significant improvements, likely because they already incorporate various forms of architectural regularization that help prevent overfitting. The narrow confidence interval for InceptionNet-V3 also suggests that this architecture produces more consistent results with augmented data.
EfficientNet-B7’s performance is particularly noteworthy, as it achieved both the highest overall accuracy and a substantial improvement from augmentation. This suggests that its efficient scaling of network depth, width, and resolution works synergistically with GAN-augmented data to capture more nuanced features in the images.

5. Conclusions

This study presents a novel enhancement to the StarGAN architecture for endoscopic image augmentation through the integration of Perspective Transformation and Viewpoint Attention Modules. Our approach addresses the critical challenges of generating diverse and diagnostically relevant synthetic medical images while maintaining essential clinical features.
Experimental results demonstrate consistent improvements across all evaluated deep learning architectures, with accuracy gains ranging from 0.298 to 0.704 percentage points. The most substantial improvements were observed in older architectures like VGG-16, suggesting that our augmentation approach particularly benefits less complex networks. Meanwhile, EfficientNet-B7 achieved the highest overall accuracy of 95.25% when trained with our enhanced augmentation.
Qualitative analysis reveals that our model produces more varied images than the original StarGAN by introducing diversity in viewing angles and feature emphasis while preserving diagnostic characteristics. This variability better represents the conditions encountered in real endoscopic examinations and addresses the persistent challenge of data scarcity in medical imaging applications.
Our Enhanced StarGAN architecture represents a meaningful advancement in medical image augmentation, particularly for endoscopic applications where viewpoint variations are common. By improving the quality and diversity of synthetic training data, this approach contributes to the development of more accurate and robust computer-aided diagnostic systems, with potential benefits extending across various medical imaging domains.
Future work should explore more advanced geometric transformations within our GAN framework, particularly incorporating controlled shearing and non-rigid deformations to better simulate the elastic properties of gastrointestinal tissues. Developing condition-specific transformation parameters could enhance the clinical relevance of generated samples by applying different degrees of perspective distortion based on anatomical region or pathology. Combining our approach with style-based GANs may also improve control over important visual characteristics like illumination and tissue coloration that vary considerably in real endoscopic examinations, further addressing data scarcity while generating increasingly realistic synthetic images for model training.

Author Contributions

Conceptualization and methodology, L.J. and D.Š.; software, L.J.; writing—original draft preparation, L.J.; visualization, investigation, and editing, L.J. and D.Š.; writing—review, supervision, project administration, and funding acquisition, D.Š. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Kvasir v2 dataset used in this study is available at https://www.kaggle.com/datasets/plhalvorsen/kvasir-v2-a-gastrointestinal-tract-dataset (accessed on 15 May 2025). Any additional data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fedoruk, O.; Klimaszewski, K.; Ogonowski, A.; Kruk, M. Additional Look into GAN-Based Augmentation for Deep Learning COVID-19 Image Classification. Mach. Graph. Vis. 2023, 32, 107–124. [Google Scholar] [CrossRef]
  2. Dash, A.; Swarnkar, T. CoVaD-GAN: An Efficient Data Augmentation Technique for COVID CXR Image Classification. In Proceedings of the 2023 2nd International Conference on Ambient Intelligence in Health Care (ICAIHC), Bhubaneswar, India, 17–18 November 2023; pp. 1–7. [Google Scholar] [CrossRef]
  3. Al-Adwan, A. Evaluating the Effectiveness of Brain Tumor Image Generation Using Generative Adversarial Network with Adam Optimizer. Int. J. Adv. Comput. Sci. Appl. 2024, 15. [Google Scholar] [CrossRef]
  4. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  5. Pogorelov, K.; Randel, K.R.; Griwodz, C.; Eskeland, S.L.; de Lange, T.; Johansen, D.; Spampinato, C.; Dang-Nguyen, D.-T.; Lux, M.; Schmidt, P.T.; et al. KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference; MMSys’17, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: Taipei, Taiwan, 2017; pp. 164–169. [Google Scholar] [CrossRef]
  6. Kancharagunta, K.B.; Nayakoti, R.; Nukarapu, S. Generative Adversarial Networks in Medical Image Analysis: A Comprehensive Survey. In Proceedings of the International Conference On Innovative Computing And Communication, Zhengzhou, China, 18–20 October 2024; Springer: Singapore, 2024; pp. 367–398. [Google Scholar] [CrossRef]
  7. Shaukat, A.; Kahi, C.J.; Burke, C.A.; Rabeneck, L.; Sauer, B.G.; Rex, D.K. ACG Clinical Guidelines: Colorectal Cancer Screening 2021. Am. J. Gastroenterol. 2021, 116, 458–479. [Google Scholar] [CrossRef]
  8. Chen, X.; Wang, X.; Zhang, K.; Fung, K.-M.; Thai, T.C.; Moore, K.; Mannel, R.S.; Liu, H.; Zheng, B.; Qiu, Y. Recent advances and clinical applications of deep learning in medical image analysis. Med. Image Anal. 2022, 79, 102444. [Google Scholar] [CrossRef] [PubMed]
  9. Goceri, E. Medical image data augmentation: Techniques, comparisons and interpretations. Artif. Intell. Rev. 2023, 56, 12561–12605. [Google Scholar] [CrossRef]
  10. Guo, K.; Chen, J.; Qiu, T.; Guo, S.; Luo, T.; Chen, T.; Ren, S. MedGAN: An Adaptive GAN Approach for Medical Image Generation. Comput. Biol. Med. 2023, 163, 107119. [Google Scholar] [CrossRef]
  11. Islam, T.; Hafiz, S.; Jim, J.R.; Kabir, M.; Mridha, M. A Systematic Review of Deep Learning Data Augmentation in Medical Imaging: Recent Advances and Future Research Directions. Healthc. Anal. 2024, 5, 100340. [Google Scholar] [CrossRef]
  12. Park, H.-C.; Hong, I.-P.; Poudel, S.; Choi, C. Data Augmentation Based on Generative Adversarial Networks for Endoscopic Image Classification. IEEE Access 2023, 11, 49216–49225. [Google Scholar] [CrossRef]
  13. Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar] [CrossRef]
  14. Üreten, K.; Maraş, H.H. Automated Classification of Rheumatoid Arthritis, Osteoarthritis, and Normal Hand Radiographs with Deep Learning Methods. J. Digit. Imaging 2022, 35, 193–199. [Google Scholar] [CrossRef]
  15. Zhang, G.; Dang, H.; Xu, Y. Epistemic and aleatoric uncertainties reduction with rotation variation for medical image segmentation with ConvNets. SN Appl. Sci. 2022, 4, 56. [Google Scholar] [CrossRef]
  16. Li, X.; Hu, X.; Qi, X.; Yu, L.; Zhao, W.; Heng, P.-A.; Xing, L. Rotation-Oriented Collaborative Self-Supervised Learning for Retinal Disease Diagnosis. IEEE Trans. Med. Imaging 2021, 40, 2284–2294. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Wang, Q.; Hu, B. Minimalgan: Diverse medical image synthesis for data augmentation using minimal training data. Appl. Intell. 2023, 53, 3899–3916. [Google Scholar] [CrossRef]
  18. He, W.; Liu, M.; Tang, Y.; Liu, Q.; Wang, Y. Differentiable Automatic Data Augmentation by Proximal Update for Medical Image Segmentation. IEEE/CAA J. Autom. Sin. 2022, 9, 1315–1318. [Google Scholar] [CrossRef]
  19. Wang, W.; Yu, X.; Fang, B.; Zhao, Y.; Chen, Y.; Wei, W.; Chen, J. Cross-Modality LGE-CMR Segmentation Using Image-to-Image Translation Based Data Augmentation. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 20, 2367–2375. [Google Scholar] [CrossRef] [PubMed]
  20. Showrov, A.; Aziz, M.; Nabil, H.R.; Jim, J.; Kabir, M.; Mridha, M.F.; Asai, N.; Shin, J. Generative Adversarial Networks (GANs) in Medical Imaging: Advancements, Applications and Challenges. IEEE Access 2024, 12, 35728–35753. [Google Scholar] [CrossRef]
  21. Liu, J.; Li, K.; Dong, H.; Han, Y.; Li, R. Medical Image Processing based on Generative Adversarial Networks: A Systematic Review. Curr. Med. Imaging Former. Curr. Med. Imaging Rev. 2023, 20, 31. [Google Scholar] [CrossRef]
  22. Adhikari, R.; Pokharel, S. Performance Evaluation of Convolutional Neural Network Using Synthetic Medical Data Augmentation Generated by GAN. Int. J. Image Graph. 2021, 23, 2350002. [Google Scholar] [CrossRef]
  23. Deng, B.; Zheng, X.; Chen, X.; Zhang, M. A Swin Transformer Encoder-Based StyleGAN for Unbalanced Endoscopic Image Enhancement. Comput. Biol. Med. 2024, 175, 108472. [Google Scholar] [CrossRef]
  24. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  26. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  27. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
  28. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar] [CrossRef]
Figure 1. Perspective transformation module.
Figure 2. Viewpoint attention module.
Figure 3. Generator comparison: original vs. model with attention.
Figure 4. Comparison of generated images.
Table 1. Hyperparameter settings for models and training.
Component | Parameter | Value
Enhanced GAN | Generator Learning Rate | 0.00007
Enhanced GAN | Discriminator Learning Rate | 0.00005
Enhanced GAN | Image Size | 224 × 224
Classification Models | Batch Size (VGG-16, ResNet-50, DenseNet-121, InceptionNet-V3) | 64
Classification Models | Batch Size (EfficientNet-B7) | 32
Classification Models | Initial Learning Rate | 0.0001
Classification Models | Image Size (VGG-16, ResNet-50, DenseNet-121, EfficientNet-B7) | 224 × 224 × 3
Classification Models | Image Size (InceptionNet-V3) | 299 × 299 × 3
Classification Models | Optimizer (VGG-16, DenseNet-121) | Adam
Classification Models | Optimizer (ResNet-50, InceptionNet-V3, EfficientNet-B7) | RMSprop
Classification Models | Scheduler | CosineAnnealingLR
Classification Models | Scheduler Max Iterations | 50
Classification Models | Scheduler Min Learning Rate | 0
Classification Models | Loss Function | CrossEntropyLoss
Classification Models | Training Epochs | 300
Data Augmentation | Rotation Range | −90° to 90°
Data Augmentation | Flipping | Random Horizontal and Vertical
Data Augmentation | Cropping | Random
Table 2. Experiment results and comparison to previous paper.
Model | Reported Accuracy [12] | Achieved Accuracy | Improvement (Percentage Points) | Confidence Interval
VGG-16 | 0.9343 | 0.9413 | +0.704 | ±0.118
ResNet-50 | 0.9418 | 0.9468 | +0.504 | ±0.106
DenseNet-121 | 0.9450 | 0.9481 | +0.318 | ±0.131
InceptionNet-V3 | 0.9493 | 0.9522 | +0.298 | ±0.082
EfficientNet-B7 | 0.9462 | 0.9525 | +0.636 | ±0.097