FQ-UWF: Unpaired Generative Image Enhancement for Fundus Quality Ultra-Widefield Retinal Images

Ultra-widefield (UWF) retinal imaging stands as a pivotal modality for detecting major eye diseases such as diabetic retinopathy and retinal detachment. However, UWF exhibits a well-documented limitation in terms of low resolution and artifacts in the macular area, thereby constraining its clinical diagnostic accuracy, particularly for macular diseases like age-related macular degeneration. Conventional supervised super-resolution techniques aim to address this limitation by enhancing the resolution of the macular region through the utilization of meticulously paired and aligned fundus image ground truths. However, obtaining such refined paired ground truths is a formidable challenge. To tackle this issue, we propose an unpaired, degradation-aware super-resolution technique for enhancing UWF retinal images. Our approach leverages recent advancements in deep learning, specifically generative adversarial networks and attention mechanisms. Notably, our method excels at enhancing and super-resolving UWF images without relying on paired, clean ground truths. Through extensive experimentation and evaluation, we demonstrate that our approach not only produces visually pleasing results but also establishes state-of-the-art performance in enhancing and super-resolving UWF retinal images. We anticipate that our method will contribute to improving the accuracy of clinical assessments and treatments, ultimately leading to better patient outcomes.


Introduction
Ultra-widefield (UWF) retinal images have emerged as a revolutionary modality in ophthalmology [1,2]. As depicted in Figure 1, UWF provides an extensive field of view that enables the visualization of both central and peripheral retinal areas. This enables early detection and monitoring of peripheral retinal conditions that are often missed in standard fundus images. However, various artifacts, low macular area resolution, large data size, and lack of interpretation standardization act as impediments to widespread clinical use of UWF images.
Image enhancement techniques have the potential to improve UWF image quality, empowering healthcare professionals to make more accurate diagnoses and treatment plans. Ophthalmologists may better detect subtle early changes in the macular area and identify peripheral early signs of disease, leading to better patient outcomes. However, since UWF images contain multiple degradation factors scattered throughout the fundus in a complex manner, image enhancement is a significant challenge. Many recent image enhancement techniques are based on supervised learning and require a ground truth (GT) dataset of well-aligned low- and high-quality image pairs for training. Constructing such a paired dataset is especially difficult for UWF, where precise alignment between image pairs is extremely hard to achieve. The application of deep learning algorithms has facilitated promising results in a wide range of image enhancement tasks, including super-resolution, image denoising, and image deblurring [3]. A variety of methods tailored for the enhancement of retinal fundus images have also been proposed [4,5]. These methods can automatically learn and apply complex transformations to improve the visualization of critical structures such as blood vessels, the optic disc, and the macula. Despite the necessity, there has yet to be a comprehensive deep-learning-based enhancement method for UWF images.
We thus propose a comprehensive image enhancement method for UWF images, with the specific goal of raising their quality to that of conventional fundus images. Figure 2 presents sample results of the proposed method. As image quality can be subjective, we compare manual annotations of drusen from fundus images and from UWF images after applying our enhancement method. Experimental evaluation demonstrates that the similarity between annotations after enhancement is considerably improved compared to annotations made on images before enhancement. Quantitative measurements of image quality are also assessed, demonstrating state-of-the-art results on several datasets. Based on our goal and the experimental findings, we refer to the enhanced images as fundus quality (FQ)-UWF images. We believe that our approach has the potential to improve the accuracy of clinical assessments and treatments, ultimately leading to better patient outcomes.
The proposed method is based on the generative adversarial network (GAN) framework to avoid the requirement of pixel-wise supervision with aligned pairs of high-quality images. We employ a dual-GAN structure to jointly perform enhancement and super-resolution, addressing the low resolution of the macula in UWF, which has a critical impact on clinical practice. As image pairs are not required, training data are acquired by simply collecting sets of UWF and fundus images. We also incorporate appropriate attention mechanisms in the network for enhancement with regard to various degradations such as noise, blurring, and artifacts scattered throughout the UWF image.
We summarize our contributions as follows:
• We establish a method for UWF image enhancement and super-resolution from unpaired UWF and fundus image sets. We evaluate the clinical utility in the context of detecting and localizing drusen in the macula.
• We propose a novel dual-GAN network architecture capable of effectively addressing diverse degradations in the retina while simultaneously enhancing the resolution of UWF images.
• The proposed method is designed to be trained on unpaired sets of UWF and fundus images. We further present a corresponding multi-step training scheme that combines transfer learning and end-to-end dataset adaptation, leading to enhanced performance in both quantitative and qualitative evaluations.

Retinal Image Enhancement
Due to the relatively invariable appearance of retinal images, methods based on traditional image processing techniques continue to be proposed [6,7]. However, the majority of methods leverage deep neural networks, as in [5,8], and GANs in particular [4].
Pham and Shin [9] considered additional factors such as drusen segmentation masks to not only improve image quality but also preserve crucial disease information during the enhancement process, addressing a common challenge in existing image enhancement techniques. To overcome the difficulty of constructing a clean ground truth (GT) dataset for retinal image data, particularly due to factors such as alignment, Yang et al. [4] introduced an unpaired image generation method for enhancing low-quality retinal fundus images. Lee et al. [5] proposed an attention module designed to automatically enhance low-quality retinal fundus images afflicted by complex degradation based on the specific nature of their degradation.

Blind and Unpaired Image Restoration
Blind image restoration is a computational process aimed at enhancing or recovering degraded images without prior knowledge of the degradation model or parameters. Traditionally, methods for blind image restoration have estimated degradation model parameters [10] or degradation kernels [11]. Recently, there has been a trend towards directly generating high-quality images by training deep learning models [12]. Shocher et al. [13] conducted super-resolution without relying on specific training examples of the target resolution during the model's training phase. Yu et al. [14] proposed a blind image restoration toolchain for multiple tasks with reinforcement learning.
Unpaired image restoration focuses on learning the difference between pairs of image domains rather than pairs of individual images. Multiple methods using GAN-based models [15] have been proposed [16,17] to learn the mapping between low-quality and high-quality images while also incorporating a cycle-consistency constraint [18] to improve the quality of the generated images.

Hierarchical or Multi-Structured GAN
Recently, there has been significant progress in mitigating the instability associated with GAN training, leading to the emergence of various approaches that connect two or more GANs for joint learning. Several works showed stable translation between two different image domains using coupled-GAN architectures [19]. Further works extended their usage to multiple domains or modalities [20,21]. Other works extended this approach beyond random image generation to tasks such as image restoration [16], and more complex architectures have also been explored [22].

Transfer Learning for GANs
Pre-trained GAN models have demonstrated considerable efficacy across various computer vision tasks, particularly in scenarios characterized by limited training data [23,24]. Typically trained on extensive datasets comprising millions of annotated images, these models offer a foundation of learned features. Through the process of fine-tuning on novel datasets, one can capitalize on these pre-trained features, leading to the attainment of state-of-the-art performance across a diverse spectrum of tasks.
Early works confirmed successful generation in a new domain by transferring a pre-trained GAN to a new dataset [25,26]. Other works enabled transfer learning for GANs with datasets of limited size [27,28]. Li et al. [29] proposed an optimization method for GAN transfer learning that is free from biases towards specific classes and resilient to mode collapse, achieved by fine-tuning only the class embedding layer of the GAN architecture. Mo et al. [30] proposed a method wherein the lower layers of the discriminator are fixed, the discriminator is partitioned into a feature extractor and a classifier, and only the classifier is subsequently fine-tuned. Fregier and Gouray [26] performed transfer learning for GANs on a new dataset by freezing the low-level layers of the encoder, thereby preserving pre-trained knowledge to the maximum extent possible.

Overview of FQ-UWF Generation
To obtain the final enhanced FQ-UWF result I_FQ-UWF, we split the process of FQ-UWF generation into two steps: (i) degradation enhancement (DE) and (ii) super-resolution (SR). Figure 3 presents a visual overview of the framework. The order of the processes is tailored to maximize the quality of the output FQ-UWF images. The generator networks of each process, which we respectively denote as G_DE and G_SR, are coupled with adversarial discriminator networks D_DE and D_SR that are designed to enforce that the generators' output images have similar image characteristics as the fundus images from the training set. G_DE performs degradation enhancement on the input image I_UWF to obtain I_E-UWF. Training of G_DE is guided by D_DE so that the D_DE output score values are similar for the given pair of I_E-UWF and I_DS-fundus, which is a ×4 bicubically downsampled version of I_fundus. D_DE is trained to make the score values of the given pair of images differ significantly. G_SR performs ×4 super-resolution on I_E-UWF to obtain I_FQ-UWF. G_SR and D_SR are trained in the same manner as G_DE and D_DE, respectively, with the pair of I_FQ-UWF and I_fundus. For D_SR, we also impose cyclic constraints, as in [18,31], by applying the G_SR operation not only to I_E-UWF but to I_DS-fundus as well. For each module, we empirically determined appropriate network architectures. The following subsections describe further details of each module.
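To make the two-step data flow concrete, the following is a minimal PyTorch-style sketch of the inference path and of the bicubic downsampling used to obtain I_DS-fundus; the generator modules passed in are placeholders standing in for G_DE and G_SR, not the authors' implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FQUWFPipeline(nn.Module):
    """Two-stage FQ-UWF generation: degradation enhancement (G_DE), then x4 super-resolution (G_SR)."""
    def __init__(self, g_de: nn.Module, g_sr: nn.Module):
        super().__init__()
        self.g_de = g_de  # U-Net-style enhancement generator (placeholder)
        self.g_sr = g_sr  # SRGAN-style x4 super-resolution generator (placeholder)

    def forward(self, i_uwf: torch.Tensor) -> torch.Tensor:
        i_e_uwf = self.g_de(i_uwf)     # I_UWF   -> I_E-UWF  (same resolution, degradations suppressed)
        i_fq_uwf = self.g_sr(i_e_uwf)  # I_E-UWF -> I_FQ-UWF (x4 up-scaled)
        return i_fq_uwf

def downsample_fundus(i_fundus: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """I_fundus -> I_DS-fundus: x4 bicubic downsampling used as the reference domain for D_DE."""
    return F.interpolate(i_fundus, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
```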

Architecture Details

G_DE
We apply U-net [32] as the base architecture, as U-net has been proven to be effective for medical image enhancement [33]. Within the encoder-decoder structure of U-net, we embed attention modules to better enhance local degradation or artifacts scattered throughout the input image. We apply the attention layer structure proposed by [5], as it has been demonstrated to be effective for retinal image enhancement. The network structure is depicted in the top row of Figure 4.
The Conv box comprises a 3 × 3 convolutional layer that reduces the spatial size of the feature to 1/4 (both the height and the width are halved) and doubles the channel dimension. The Deconv box comprises a 3 × 3 deconvolutional layer that quadruples the spatial size of the feature (both the height and the width are doubled) and halves the channel dimension. The attention (Att) box comprises sequentially connected batch normalization, activation, an operation-wise attention module, and activation, where the operation-wise attention module allows the degradations to be better attended to.
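As an illustration only, the three boxes could be realized as follows in PyTorch; the operation-wise attention of [5] is approximated here by a simple channel-gating placeholder and is not the module actually used in the paper.

```python
import torch.nn as nn

def conv_block(ch):
    # "Conv" box: 3x3 convolution that halves H and W and doubles the channels
    return nn.Conv2d(ch, ch * 2, kernel_size=3, stride=2, padding=1)

def deconv_block(ch):
    # "Deconv" box: 3x3 deconvolution that doubles H and W and halves the channels
    return nn.ConvTranspose2d(ch, ch // 2, kernel_size=3, stride=2,
                              padding=1, output_padding=1)

class AttBlock(nn.Module):
    # "Att" box: BN -> activation -> attention -> activation.
    # The operation-wise attention module of [5] is replaced by a channel-gating stand-in.
    def __init__(self, ch):
        super().__init__()
        self.bn = nn.BatchNorm2d(ch)
        self.act1 = nn.ReLU(inplace=True)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, kernel_size=1), nn.Sigmoid())
        self.act2 = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act1(self.bn(x))
        return self.act2(x * self.gate(x))
```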

G_SR
The network structure is depicted in the middle row of Figure 4. The FeatureExtractor box comprises a 3 × 3 convolutional layer followed by activation. The Conv + BN box comprises a 3 × 3 convolutional layer followed by batch normalization. The Conv + Shuffle box comprises a 3 × 3 convolutional layer followed by a pixel shuffler that doubles both the height and the width of the feature. Channel calibration reduces the channel dimension of the feature to three while maintaining its spatial dimensions. The Residual Block comprises a series of Conv + BN, activation, and Conv + BN, with a residual connection for element-wise summing. We note that this structure is adopted from [15].
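For orientation, a compact SRGAN-style ×4 generator built from these components might look as follows; the channel width and number of residual blocks (64 and 16) are common defaults, not values taken from the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Residual Block: Conv+BN -> activation -> Conv+BN, with an element-wise skip connection
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class SRGenerator(nn.Module):
    # SRGAN-style x4 generator: feature extractor, residual blocks,
    # two pixel-shuffle (x2) stages, and a final channel calibration down to 3 channels.
    def __init__(self, ch=64, n_blocks=16):
        super().__init__()
        self.extract = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.PReLU())
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(n_blocks)])
        self.upsample = nn.Sequential(
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU(),
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU())
        self.calibrate = nn.Conv2d(ch, 3, 3, padding=1)  # channel calibration to RGB

    def forward(self, x):
        feat = self.extract(x)
        return self.calibrate(self.upsample(self.blocks(feat)))
```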

Loss Functions and Training Details
Given that end-to-end training of an architecture composed of multiple networks is highly challenging, we train the full architecture in three steps: (i) G_DE training, (ii) G_SR training, and (iii) overall fine-tuning.

G_DE Training
We first impose an adversarial loss on G_DE and D_DE. An identity mapping loss is also important when performing tasks such as super-resolution or enhancement, as it helps to maintain the style (color, structure, etc.) of the source domain's image while applying the target domain's information [18]; we therefore add an identity mapping loss L_I. We further impose an L2 regularization [34] loss L_R on the weights of G_DE to retain knowledge by preventing abrupt weight changes as much as possible when we start from a G_DE pre-trained on other datasets. Finally, the loss function L_E used to adapt G_DE to the fundus-UWF retinal image dataset combines the adversarial, identity, and regularization terms, where λ_I and λ_R control the relative importance of L_I and L_R, respectively. For more efficient adversarial training, we initialize the network parameters by pre-training using [5]. We then freeze the encoder parameters and only update the decoder parameters.
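As a point of reference only, one plausible instantiation of these terms, assuming a standard GAN objective, an L1 identity term as in [18], and an L2 penalty on the deviation from the pre-trained weights θ_pre, is shown below; the authors' exact formulation (e.g., the GAN variant and the norms used) may differ.

```latex
\begin{align}
\mathcal{L}_{adv}^{DE} &= \mathbb{E}\left[\log D_{DE}(I_{DS\text{-}fundus})\right]
  + \mathbb{E}\left[\log\left(1 - D_{DE}(G_{DE}(I_{UWF}))\right)\right],\\
\mathcal{L}_{I} &= \mathbb{E}\left[\left\lVert G_{DE}(I_{DS\text{-}fundus}) - I_{DS\text{-}fundus} \right\rVert_{1}\right],\\
\mathcal{L}_{R} &= \left\lVert \theta_{G_{DE}} - \theta_{pre} \right\rVert_{2}^{2},\\
\mathcal{L}_{E} &= \mathcal{L}_{adv}^{DE} + \lambda_{I}\,\mathcal{L}_{I} + \lambda_{R}\,\mathcal{L}_{R}.
\end{align}
```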

G_SR Training
In this step, we freeze all trainable parameters in G_DE and use it to generate I_E-UWF from I_UWF. After the adaptation process for G_DE is done, we apply an adversarial loss to G_SR, which takes I_E-UWF from G_DE as input and outputs the FQ-UWF result I_FQ-UWF. We also impose a cycle constraint [18], which maintains consistency between the two domains and results in more realistic and coherent image translations, on the path I_fundus → I_DS-fundus → I_FQ-UWF. As mentioned in [17], applying a one-way cycle loss lets the network learn to handle various degradations by opening up the possibility of a one-to-many generation mapping. The overall loss function for G_SR training combines the adversarial loss with the cycle loss L_C, where λ_C controls the relative importance of L_C.
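Under the same assumptions as above, with the one-way cycle constraint realized as an L1 reconstruction along I_fundus → I_DS-fundus → I_FQ-UWF, a plausible form of this objective is the following sketch; it is not guaranteed to match the authors' exact equations.

```latex
\begin{align}
\mathcal{L}_{adv}^{SR} &= \mathbb{E}\left[\log D_{SR}(I_{fundus})\right]
  + \mathbb{E}\left[\log\left(1 - D_{SR}(G_{SR}(I_{E\text{-}UWF}))\right)\right],\\
\mathcal{L}_{C} &= \mathbb{E}\left[\left\lVert G_{SR}(I_{DS\text{-}fundus}) - I_{fundus} \right\rVert_{1}\right],\\
\mathcal{L}_{SR} &= \mathcal{L}_{adv}^{SR} + \lambda_{C}\,\mathcal{L}_{C}.
\end{align}
```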

Overall Fine-Tuning
In the previous training steps, G_DE and G_SR are trained independently. To ensure stability and integration between the two generators, a final calibration process is performed on the entire architecture. Additionally, to improve the network's performance in clinical situations, where the diagnosis of lesions is mainly based on the macular region rather than the periphery of the fundus, we again employ the same loss combinations, this time using only patches from the macular region to fine-tune the entire model.
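One natural reading of "the same loss combinations" is the sum of the two stage objectives evaluated on macular patches; this is an assumption rather than the authors' stated formula:

```latex
\mathcal{L}_{fine\text{-}tune} = \mathcal{L}_{E} + \mathcal{L}_{SR},
\qquad \text{evaluated on patches cropped from the macular region.}
```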

Experiments

Datasets and Settings
We used 3744 UWF images and 3744 fundus images acquired from the Kangbuk Samsung Medical Center (KBSMC) Ophthalmology Department from 2017 to 2019. Although UWF and fundus images were acquired in pairs, we anonymized and shuffled the image sets and did not use pairing information during training. To train the model proposed in this paper, we used 3370 UWF and 3370 fundus images (unpaired). We set the scaling factor for super-resolution to 4, which is close to the average difference in resolution between the UWF and fundus images. To test the model, we used 374 UWF images that were not used during training.

Implementation Details
We use the AdamW [35] optimizer with learning rate = 1 × 10^-3, β_1 = 0.9, β_2 = 0.999, and ε = 10^-8 to train G_DE and G_SR, with weight decay applied every 100K iterations at a rate of 1 × 10^-2. We halve the learning rate every 200K iterations, set the batch size to 16, and train the model for more than 5 × 10^6 iterations using an NVIDIA RTX 2080Ti GPU. We feed two 128 × 128-sized patches, I_UWF and I_fundus, randomly extracted from the UWF and fundus retinal images, respectively. During training, we apply additional data augmentation using rotation and flipping for I_UWF and I_fundus.
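A minimal sketch of an equivalent optimizer and learning rate schedule in PyTorch is given below; the paper's periodic weight-decay schedule (every 100K iterations) is simplified here to AdamW's constant weight decay, so this is an approximation of the stated setup rather than the authors' training code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

def make_optimizer(generator: torch.nn.Module):
    # AdamW: lr = 1e-3, betas = (0.9, 0.999), eps = 1e-8, weight decay 1e-2 (held constant here).
    opt = AdamW(generator.parameters(), lr=1e-3, betas=(0.9, 0.999),
                eps=1e-8, weight_decay=1e-2)
    # Learning rate halved every 200K iterations; step the scheduler once per iteration.
    sched = StepLR(opt, step_size=200_000, gamma=0.5)
    return opt, sched
```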
We set λ_I, λ_R, and λ_C, which adjust the degree of importance of L_I, L_R, and L_C, to 0.5, 0.1, and 0.5, respectively.

Evaluation Metrics
As we do not assume paired images for training, we avoid the use of reference-based metrics such as the PSNR [38] or SSIM [39] that require paired GTs. Instead, we measure the LPIPS [40] and the FID [41]. Both metrics indicate a closer distance between the two images when their values are smaller.
Additionally, given the nature of retinal images with various degradations, achieving sharp images is also an important consideration. To quantify this, we measure γ [42,43]. A lower value of the γ metric implies a higher level of sharpness in the generated images, and therefore, the model is considered to deliver higher performance. We further substantiate the statistical validity of our comparisons by employing two-sided tests. We first utilize ANOVA [44] to ascertain whether there are significant differences in the means among groups. Subsequently, to identify the specific groups where differences exist, we apply Bonferroni's correction [45]. These analyses are confirmed using p-values.
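As a reference, the following sketch shows how LPIPS, FID, and the group comparisons could be computed with common libraries (lpips, torchmetrics, scipy, statsmodels); the γ sharpness metric of [42,43] is not reproduced here, and the pairwise t-tests below are an assumed instantiation of the post-hoc comparison, not necessarily the authors' exact protocol.

```python
import torch
import lpips                                            # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance
from scipy.stats import f_oneway, ttest_ind
from statsmodels.stats.multitest import multipletests

lpips_fn = lpips.LPIPS(net="alex")                      # perceptual distance, lower is better

def lpips_distance(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    # img_a, img_b: NCHW float tensors scaled to [-1, 1]
    with torch.no_grad():
        return lpips_fn(img_a, img_b).mean().item()

def fid_score(real_uint8: torch.Tensor, fake_uint8: torch.Tensor) -> float:
    # real_uint8, fake_uint8: NCHW uint8 tensors in [0, 255]
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_uint8, real=True)
    fid.update(fake_uint8, real=False)
    return fid.compute().item()

def compare_methods(score_groups):
    # score_groups: list of 1-D arrays, one per method, holding per-image metric values.
    # One-way ANOVA across groups, then pairwise two-sided t-tests with Bonferroni correction.
    f_stat, anova_p = f_oneway(*score_groups)
    pair_p = [ttest_ind(score_groups[i], score_groups[j]).pvalue
              for i in range(len(score_groups)) for j in range(i + 1, len(score_groups))]
    corrected_p = multipletests(pair_p, method="bonferroni")[1]
    return f_stat, anova_p, corrected_p
```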
Furthermore, we attempt to measure the clinical impact of our method by comparative evaluation of the visibility of drusen in the I_UWF images before improvement, the I_FQ-UWF images after improvement, and the I_fundus images. In this process, medical practitioners annotated drusen masks in the order I_UWF → I_FQ-UWF → I_fundus to minimize potential biases that might arise.

Experiments on the KBSMC Dataset
Figure 2 depicts samples of the enhancement by the proposed method. Improved clarity of vessel lines and background patterns can be observed.

Domain Distance Measurement Results
Table 1 shows the γ, LPIPS, and FID results of the baselines for comparison and our method. The proposed method yields the best results in terms of the γ and LPIPS metrics and the second-best results in terms of the FID. Figure 5 shows the corresponding sample results before and after the improvements with the given methods. We can see visible improvements in the patterns of vessels and the macula. This is corroborated by the γ values in Table 1. The low p-values (<0.001) in the table show the statistical significance of our method's improvements in terms of LPIPS, FID, and γ.

Enhancement Results for Severe Degradations
Figure 6 illustrates the comparison between various unpaired super-resolution methods and our method for the challenging scenario wherein the input image is corrupted with the following synthetic degradations: (i) Gaussian blur with σ = 7, where the image is degraded with a Gaussian blur kernel of size σ × σ as in [46]; (ii) Illumination with γ = 0.75, where the brightness of the image is unevenly illuminated by gamma correction with γ as in [47]; (iii) JPEG compression with rate = 0.25, where the compression ratio equals rate as in [48]; (iv) Bicubic downsampling with scale = 0.25, where the image is downsampled by a factor of scale as in [49]. Table 2 presents the corresponding results in terms of the γ, LPIPS, and FID metrics. Considered collectively, these results show that our method delivers the most consistent and effective improvement across the majority of degradation types.
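For orientation, a simple sketch of the four degradations as they are described above is shown below; the exact parameterizations in [46-49] may differ (for example, the JPEG "rate" is interpreted here as a quality setting of 25, and the gamma correction is applied uniformly rather than unevenly).

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image, kind: str) -> Image.Image:
    """Apply one of the four synthetic degradations described in the text."""
    if kind == "blur":            # (i) Gaussian blur with parameter 7
        return img.filter(ImageFilter.GaussianBlur(radius=7))
    if kind == "illumination":    # (ii) gamma correction with gamma = 0.75 (applied uniformly here)
        arr = np.asarray(img).astype(np.float32) / 255.0
        return Image.fromarray((255.0 * arr ** 0.75).astype(np.uint8))
    if kind == "jpeg":            # (iii) JPEG compression, rate 0.25 interpreted as quality 25
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=25)
        buf.seek(0)
        return Image.open(buf).convert("RGB")
    if kind == "downsample":      # (iv) bicubic downsampling to 0.25x the original size
        w, h = img.size
        return img.resize((w // 4, h // 4), Image.BICUBIC)
    raise ValueError(f"unknown degradation: {kind}")
```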

Drusen Detection Results
Figure 7 presents samples of I_UWF, I_FQ-UWF, and I_fundus images with corresponding manually annotated drusen region masks. Quantitative comparative evaluations of the drusen region masks for I_UWF and I_FQ-UWF are presented in Table 3. Taking the I_fundus drusen mask as GT, we measure the mean average precision (mAP) as the intersection over union (IoU) [50] averaged over all images. The increase in mAP highlights the improved diagnostic capability provided by the enhanced I_FQ-UWF images.
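The metric described above (per-image IoU of binary drusen masks, averaged over all images) can be computed as in the sketch below; this is a straightforward reading of the description, and the authors' exact evaluation script may handle edge cases differently.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # pred, gt: boolean drusen masks of identical shape
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def mean_iou(pred_masks, gt_masks) -> float:
    # Average IoU over all images (the mAP-style score reported in Table 3).
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```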

Ablation Study
Table 1 also illustrates the performance of method variations, such as the inclusion of a pre-trained G_DE, its adaptation through L_E for training, the utilization of G_DE and G_SR, and their configuration order. When a pre-trained G_DE was utilized before super-resolution without a separate degradation enhancement adaptation, significantly better results in terms of the γ, LPIPS, and FID metrics were observed compared to cases where only super-resolution was performed. Training G_DE via L_E and utilizing it for super-resolution led to markedly superior results. Moreover, the configuration order of G_DE and G_SR shows a substantial numerical difference, justifying the adopted ordering of the modules.
Table 4 shows the performance changes when specific components of the loss functions are used. According to these results, the most significant performance improvement in our model, which is composed of both G_DE and G_SR, is achieved when fine-tuning G_DE to suit the I_DS-fundus image domain. Furthermore, we observe that utilizing G_DE, even when followed by simple bicubic upsampling, outperforms the results obtained using only the SRM network. This suggests that super-resolution without adequate degradation removal has limitations in enhancing retinal images. Figure 8 illustrates the importance of removing degradations before super-resolution. Using the improved I_E-UWF obtained through G_DE to generate I_FQ-UWF shows a significantly superior enhancement capability compared to generating I_FQ-UWF directly from I_UWF without the prior degradation removal process.

Discussion
The proposed method can be trained on unpaired UWF and fundus image sets. By reducing dependency on paired and annotated data, our method becomes more pragmatic for integration into real-world medical settings, where the acquisition of such data is often a logistical challenge. The enhanced image quality facilitated by our approach holds the potential to significantly improve diagnostic accuracy. The ability to detect subtle changes in the retinal structure, often indicative of early-stage pathologies, is critical for timely interventions and effective disease management.
Despite the promising outcomes, our study prompts further investigation into several critical areas. The robustness and generalizability of our model need to be rigorously examined across a spectrum of imaging conditions, including instances with various ocular pathologies and diverse qualities of image acquisition. The influence of different imaging devices and settings on our model's performance demands scrutiny to ensure broad applicability in clinical settings.
To validate the real-world impact of our enhancement method, collaboration with domain experts and comprehensive clinical validation are imperative. Ophthalmologists' insights will provide essential perspectives on how the enhanced image quality translates into improved diagnostic accuracy and treatment planning. The feasibility of implementation in diverse clinical settings warrants further exploration considering factors such as computational requirements, integration with existing diagnostic workflows, and user-friendly interfaces for healthcare professionals.

Figure 1 .
Figure 1. Conventional fundus image vs. ultra-widefield (UWF) image. (a) UWF images drastically increase the capability to observe the retina and can cover over 80% of it, which is more than a five-fold increase compared to (b) conventional fundus images. The diagrams on the left of (a,b) are reproduced from https://www.optomap.com/optomap-imaging/ (accessed on 1 March 2022).

Figure 2 .
Figure 2. Sample results of the proposed UWF enhancement method. The top row depicts the input UWF images, and the bottom row depicts the FQ-UWF images enhanced by the proposed method. Numbered boxes are enlarged sample views of representative local regions. The clarity of anatomical structures such as vessels is greatly improved in the FQ-UWF images.

Figure 3 .
Figure 3. The overall architecture of the proposed method. I_UWF with severe degradations and artifacts is first enhanced to I_E-UWF via G_DE, the output of which is fed to G_SR to generate the ×4 up-scaled I_FQ-UWF. I_fundus is down-scaled to I_DS-fundus with a scaling factor of 4. D_DE and D_SR measure the similarity between I_E-UWF and I_DS-fundus to train G_DE and the similarity between I_FQ-UWF and I_fundus to train G_SR, respectively.

Figure 4 .
Figure 4. The detailed structure of generators and discriminators. The detailed structure of the generators G_DE and G_SR and of the discriminator architecture shared between D_DE and D_SR is illustrated. Note that even though D_DE and D_SR utilize the same structure, they are fundamentally distinct discriminative networks.

D_DE and D_SR

The structures of the discriminator models D_DE and D_SR are depicted in Figure 4. The FeatureExtractor box comprises a 3 × 3 convolutional layer followed by activation. The Conv + BN box comprises a 3 × 3 convolutional layer followed by batch normalization. The Conv Block comprises a series of Conv + BN and activation. At the final layer of the network, there exists a score function for evaluating the similarity of the input images, accompanied by a Dense layer that reduces the feature to a single scalar score value. We follow the structure of the discriminator in [15] for D_DE. The input images for D_DE are pairs of downsampled real fundus images I_DS-fundus and generated enhanced low-resolution UWF images I_E-UWF. The input images for D_SR are pairs of real fundus images I_fundus and generated FQ-UWF images I_FQ-UWF.

Figure 5 .
Figure 5. The enhanced FQ-UWF results. Input I_UWF images are improved using various methods.

Figure 6 .
Figure 6. The enhanced FQ-UWF results. Different types of degradation are applied to I_UWF images. Degraded images are improved using various methods.

Figure 8 .
Figure 8. The interim improvement results: (a) input image, (b) I_UWF, (c) I_E-UWF, (d) I_FQ-UWF, and (e) direct super-resolution results obtained by applying G_SR to (b).

Table 1 .
Quantitative evaluation on the KBSMC dataset. Values are mean ± standard deviation. For γ, LPIPS, and FID, smaller values indicate better performance. Bold values denote the most effective method corresponding to each evaluation metric.

Table 2 .
Cont. Values are mean ± standard deviation. For γ, LPIPS, and FID, smaller values indicate better performance. Bold values denote the most effective method corresponding to each evaluation metric and each degradation type.
Values are mean ± standard deviation. For γ, LPIPS, and FID, smaller values indicate better performance. Bold values denote the most effective method corresponding to each evaluation metric.