A Residual Dense Attention Generative Adversarial Network for Microscopic Image Super-Resolution

With the development of deep learning, the Super-Resolution (SR) reconstruction of microscopic images has improved significantly. However, the scarcity of microscopic images for training, the underutilization of hierarchical features in original Low-Resolution (LR) images, and the high-frequency noise unrelated to the image structure generated during reconstruction remain challenges in the Single Image Super-Resolution (SISR) field. To address these issues, we first collected sufficient microscopic images through Motic, a company engaged in the design and production of optical and digital microscopes, to establish a dataset. Secondly, we proposed a Residual Dense Attention Generative Adversarial Network (RDAGAN). The network comprises a generator, an image discriminator, and a feature discriminator. The generator includes a Residual Dense Block (RDB) and a Convolutional Block Attention Module (CBAM), focusing on extracting the hierarchical features of the original LR image. Simultaneously, the added feature discriminator enables the network to generate high-frequency features pertinent to the image's structure. Finally, we conducted experimental analysis and compared our model with six classic models. Compared with the best of these models, ours improved PSNR and SSIM by about 1.5 dB and 0.2, respectively.


Introduction
Regardless of the magnification and numerical aperture of the objective lens used, the imaging throughput of current microscopes is typically only on the order of ten megapixels [1]. This forces a compromise between high resolution and Field of View (FOV) when imaging. However, in biomedical research such as histopathology, hematology, and neuroscience, there is a growing need for high-resolution imaging of large samples. To accurately resolve cellular-level life activities at the scale of the entire sample, a balance between the global structure and microscale local details is required, along with quantitative analysis. Faced with this challenge, Super-Resolution (SR) imaging techniques were developed. Single Image Super-Resolution (SISR) reconstruction [2] aims to reconstruct a High-Resolution (HR) image from an input Low-Resolution (LR) image. This technique is widely used in critical fields, including bright-field micrographs [3], fluorescence imaging [4,5], remote sensing images [6,7], and surveillance videos [8]. Traditional SISR reconstruction algorithms can be divided into three categories: interpolation-based [9,10], reconstruction-based [11,12], and learning-based [13]. Interpolation-based algorithms have the advantages of a simple principle and easy implementation; however, they struggle to recover detailed image information, and the reconstructed image is seriously distorted. Reconstruction-based algorithms first derive constraints for SR reconstruction from the imaging process of LR images, establish a corresponding mathematical model, and finally reconstruct HR images; however, at high magnification it remains difficult to recover enough high-frequency information. Learning-based reconstruction algorithms establish the mapping relationship between LR and HR images and then reconstruct according to that mapping. In recent years, with the rapid development of deep learning, the excellent learning ability of convolutional neural networks has brought new opportunities for SR imaging technology.
Specifically, Dong et al. [14] used deep learning in image SR reconstruction by introducing Convolutional Neural Networks (CNNs). This groundbreaking work paved the way for subsequent innovations. Kim et al. [15] introduced the Very Deep Super-Resolution (VDSR) model, which leverages a deeply constructed neural network combined with residual learning for image reconstruction; this approach effectively addresses the image size reduction caused by successive convolutions. Lai et al. [16] proposed the Laplacian Pyramid Super-Resolution Network (LapSRN) model, which employs a Laplacian pyramid to progressively reconstruct the image and performs feature extraction through a residual structure. Ledig et al. [17] advanced the field by incorporating residual blocks within the structure of a Generative Adversarial Network (GAN) to create the Super-Resolution Generative Adversarial Network (SRGAN) model; the use of GANs significantly improves the visual perception quality of the generated images, making them more closely resemble real images. Lim et al. [18] developed the Enhanced Deep Super-Resolution (EDSR) model, which refines the residual structure found in the SRGAN model for improved performance. Zhang et al. [19] introduced the Residual Dense Block (RDB) structure and the Residual Dense Network (RDN) model, further advancing the capabilities of SR reconstruction.
In terms of microscopic image reconstruction, Zhang et al. [20] proposed the Registration-Free GAN Microscopy (RFGANM) workflow, combining an SRGAN network with an optical microscope and a degradation model to achieve deep-learning SR over a large FOV and improve the resolution of wide-field microscopy and light-sheet fluorescence microscopy images. Wang et al. [21] used a GAN to develop SR techniques for cross-modal fluorescence microimaging by learning the mapping relationships between different imaging modalities, such as wide-field to confocal, confocal to STED, and TIRF to TIRF-SIM. Van Sloun [22] applied deep learning to ultrasound microscopy SR and proposed the Deep-ULM model, based on the U-Net architecture, to reduce the effect of diffraction and obtain HR images in real time. Li et al. [23] achieved high-quality reconstruction of ordinary wide-field fluorescence images to SIM SR imaging results using a Deep Fourier Channel Attention Network (DFCAN), greatly contributing to the development of super-resolution in microscopic images.
It can be seen that, although great progress has been made in the SR of microscopic images, most of this work focuses on fluorescence images in specialized settings, and the features of fluorescence microscopy images differ significantly from those of cell images. Additionally, despite the good results achieved by the above SR methods, each of them has shortcomings. These include the absence of specialized datasets for microscopic images, which hampers model training; the inadequate use of hierarchical features in LR images by GAN-based SR models; insufficient utilization of information across convolutional layers due to varying receptive fields; a lack of prioritization in reconstructing feature map information, leading to a suboptimal focus on critical details; and the reliance on pixel-level reconstruction errors as loss functions, which fails to capture high-frequency details and often results in overly smooth and potentially inaccurate images.
To address these challenges, this paper carries out corresponding optimization work on cell microscopy images. The specific contributions are as follows: (1) We produce a dataset of high- and low-resolution images of four cell types using microscope acquisition; the dataset is available at https://github.com/wxsssss/RDAGAN/tree/main (accessed on 15 January 2024). (2) To fully utilize the hierarchical features of the original image, we propose a Residual Dense Attention Generative Adversarial Network (RDAGAN), whose generator uses RDABs with added attention mechanisms. (3) To reconstruct the high-frequency features associated with HR images, we add a feature discriminator to the original discriminator and optimize the loss function. (4) We compare our optimized model with six classic models; compared to the best-performing of them, PSNR and SSIM improve by about 1.5 dB and 0.2, respectively.

Proposed Methodology
Our model is based on the aforementioned RFGANM [20], which has been used for microscopic cell image SR. We construct a generator network architecture composed of four main sections: the feature extraction part, the residual dense attention part, the dense feature fusion part, and the reconstruction part. The input LR image first passes through the feature extraction section, where initial features are extracted. These features are then processed by the residual dense attention section. Here, the RDB extracts rich local features through densely connected convolutional layers, fully leveraging the hierarchical features of the layers. The local feature fusion within the RDB allows for an adaptive and more effective integration of previous and current local features, thus stabilizing the training of the broader network. An attention mechanism is added after each RDB, first applying a channel attention module to obtain a weighted result, followed by a spatial attention module to refine the weighting. This enables the network to concentrate on the information that is most beneficial for image reconstruction. The RDB and the attention module together constitute the Residual Dense Attention Block (RDAB). After extracting dense local features through a series of residual dense attention blocks, dense feature fusion is applied to mine multi-level features from a global perspective. Additionally, a feature discriminator [24] is introduced to differentiate the detailed features of SR and HR images, encouraging the reconstructed image to contain more high-frequency features instead of noise.
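The four-stage pipeline just described can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the block count, channel width, and the plain convolutional stand-ins used in place of full RDABs are ours, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Sketch of the four-part generator: shallow feature extraction,
    a chain of residual blocks (placeholders for the RDABs), dense
    feature fusion with global residual learning, and upsampling."""
    def __init__(self, channels=64, num_blocks=4, scale=2):
        super().__init__()
        self.sfe = nn.Conv2d(3, channels, 3, padding=1)  # extracts F_0
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                           nn.ReLU(inplace=True))
             for _ in range(num_blocks)]                 # stand-in RDABs
        )
        # global feature fusion: 1x1 conv over all block outputs
        self.gff = nn.Conv2d(channels * num_blocks, channels, 1)
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                      # as in SRGAN
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, lr):
        f0 = self.sfe(lr)
        feats, x = [], f0
        for block in self.blocks:
            x = block(x) + x                # local residual per block
            feats.append(x)
        # global feature fusion + global residual learning
        fdf = self.gff(torch.cat(feats, dim=1)) + f0
        return self.upsample(fdf)
```

Note how all fusion happens in LR space; only the final `PixelShuffle` head, borrowed from SRGAN, maps the fused features to the HR image.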

Microscopic Cell Image Dataset
We used a microscope to capture cell images with both 40× and 20× objective lenses simultaneously to establish our dataset. We use the EasyScan NFC 300 microscope from Motic, a high-performance fully automatic digital scanning microscope equipped with advanced optical and imaging technology, to provide high-quality microscopic images. The microscope features fast and precise autofocus to ensure image clarity throughout the scanning process. It employs a high-quality infinity-corrected optical system that delivers clear and distortion-free images. The optical system incorporates advanced correction design to minimize common optical aberrations such as spherical aberration, chromatic aberration, and coma, providing accurate imaging quality. Additionally, the system undergoes rigorous color correction and transmission-curve optimization to ensure uniform light transmission across different wavelengths, reducing color distortion and light loss. We perform center cropping on the captured images: HR images were cropped to a resolution of 1024 × 1024 pixels, while LR images were cropped to 512 × 512 pixels, reflecting the 2× relationship between the objectives. In constructing the dataset, we faced challenges in precisely matching HR and LR images. To address this, we employed Bicubic Downsampling (BD) to generate corresponding LR images from their HR counterparts; these LR images were then used for both reconstruction and performance comparison. Finally, we created 800 image pairs for training and 200 image pairs for validation/reconstruction for each of the four cell types. Relevant characteristics of the microscopic cell images are given in Appendix A.
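The cropping and Bicubic Downsampling step can be sketched as below. This is a hypothetical helper using Pillow; file I/O and any preprocessing beyond the stated crop sizes and scale are assumptions.

```python
from PIL import Image

def make_pair(hr_img, hr_size=1024, scale=2):
    """Center-crop an HR capture to hr_size x hr_size and synthesize
    its LR counterpart by bicubic downsampling (BD)."""
    w, h = hr_img.size
    left, top = (w - hr_size) // 2, (h - hr_size) // 2
    hr = hr_img.crop((left, top, left + hr_size, top + hr_size))
    lr = hr.resize((hr_size // scale, hr_size // scale), Image.BICUBIC)
    return hr, lr
```

Generating the LR image from the cropped HR image guarantees pixel-exact alignment of each pair, which is the point of using BD instead of the raw 20× captures.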

Residual Dense Blocks for Fusion CBAM
The structure of our generator network is shown in Figure 1, consisting of four parts: a shallow feature extraction network, a chain of RDABs, Dense Feature Fusion (DFF), and an upsampling network. The input and output of the generator network are denoted $I_{LR}$ and $I_{SR}$, respectively. A convolutional layer is used to extract the shallow features $F_0$ of the LR image for global residual learning, and $F_0$ also serves as the input to the first RDAB:

$$F_0 = H_{SF}(I_{LR}),$$

where $H_{SF}(\cdot)$ denotes the convolution operation on the LR image. Assuming there are $D$ RDABs in the generator, the output of the $d$th RDAB can be represented as

$$F_d = H_{RDAB,d}(F_{d-1}) = H_{RDAB,d}(H_{RDAB,d-1}(\cdots H_{RDAB,1}(F_0)\cdots)),$$

where $H_{RDAB,d}$ denotes the operation of the $d$th RDAB, a composite function of convolution and ReLU operations. After extracting the hierarchical features with the set of $D$ RDABs, further DFF is performed, including Global Feature Fusion (GFF) and Global Residual Learning (GRL). DFF fully utilizes the features of all previous blocks; $F_{DF}$ is the output feature map of the composite function $H_{DFF}$ of the DFF module:

$$F_{DF} = H_{DFF}([F_1, F_2, \ldots, F_D]).$$

Local and global features are extracted and fused in the LR space to obtain $F_{DF}$, which is used as the input to the upsampling network; the upsampling network used in this work is the same as in SRGAN. The generator as a whole can be represented as

$$I_{SR} = H_{up}(F_{DF}) = H_{RDAGAN}(I_{LR}).$$

The structure of the RDAB is shown below. Denote the input and output of the $d$th RDAB by $F_{d-1}$ and $F_d$, respectively. The outputs of the previous RDAB and of each convolutional layer of the current RDAB are directly connected to all subsequent layers, which not only preserves the feed-forward properties but also extracts local dense features. $F_{d,c}$ denotes the output of the $c$th convolutional layer in that RDAB:

$$F_{d,c} = \sigma(W_{d,c}[F_{d-1}, F_{d,1}, \ldots, F_{d,c-1}]),$$

where $\sigma$ is the ReLU activation and $[\cdot]$ denotes channel-wise concatenation; $F_{d,C}$ denotes the output of the last of the $C$ densely connected layers. The purpose of Local Residual Learning (LRL) is to further improve the information flow, and the final output of the $d$th RDAB is

$$F_d = F_{d-1} + H_{LFF}^d([F_{d-1}, F_{d,1}, \ldots, F_{d,C}]),$$

where $H_{LFF}^d(\cdot)$ denotes the Local Feature Fusion (LFF) of all layer feature information in the $d$th RDAB. DFF then applies feature fusion and residual learning to each RDAB's output to utilize hierarchical features in a global manner; DFF includes GFF and GRL, and the global features are extracted by fusing the features of all RDABs.
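A minimal PyTorch sketch of the RDB computation given by the formulas above: dense concatenation across layers, a 1×1 local feature fusion convolution, and the local residual connection. The growth rate and number of layers are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: C densely connected conv+ReLU layers
    (F_{d,c}), local feature fusion H_LFF (1x1 conv over the
    concatenation [F_{d-1}, F_{d,1..C}]), and local residual learning."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for c in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + c * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
        self.lff = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, f_prev):
        feats = [f_prev]
        for layer in self.layers:
            # each layer sees the block input plus all earlier outputs
            feats.append(layer(torch.cat(feats, dim=1)))
        # LFF over all features, then the local residual connection (LRL)
        return f_prev + self.lff(torch.cat(feats, dim=1))
```

The 1×1 fusion keeps the block's output width equal to its input width, so blocks can be chained and their outputs concatenated for the global fusion step.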
The RDAB introduces a continuous memory mechanism that allows previous RDABs to directly access each layer of the current RDAB. By using LFF, the deep network can be trained stably, while LRL further optimizes the information flow and gradients. In addition, the network uses GFF and GRL to further extract global features. The Convolutional Block Attention Module (CBAM) [25] is a simple and efficient feed-forward convolutional attention module. CBAM consists of two independent sub-modules: the channel attention module and the spatial attention module. The output of the convolutional layer first passes through the channel attention module to obtain a weighted result, which then goes through the spatial attention module for final weighting. This approach not only saves parameters and computational power but also allows CBAM to be integrated as a plug-and-play module into existing network architectures. The diagram of the attention module is shown in Figure 3.
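The sequential channel-then-spatial weighting can be sketched as follows, following the CBAM design of [25]. The reduction ratio and 7×7 spatial kernel are the common defaults from that paper, assumed here rather than taken from our configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied
    sequentially to the feature map as described in Woo et al. [25]."""
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        # shared MLP for the channel attention branch
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        # conv over the 2-channel [avg, max] spatial descriptor
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        # channel attention: shared MLP over avg- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: conv over channel-wise avg and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

Because both attention maps are produced by elementwise multiplication, the module preserves the feature-map shape and can be dropped in directly after each RDB.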

Discriminator and Loss Function
In order to make the reconstructed image contain real high-frequency information, we add a feature discriminator alongside the original image discriminator [21]. The image discriminator, like the discriminators of traditional GAN models, takes the SR image as input and judges whether it is a real HR image or an SR image. The feature discriminator feeds the SR image through a VGG network to extract intermediate feature maps. Since these feature maps contain structural information, the feature discriminator distinguishes SR images from real HR images based not only on high-frequency components but also on structural components. Through adversarial training of the generator against the two discriminators, the generator learns to synthesize realistic structural features rather than random high-frequency noise. The structure of our discriminator is shown in Figure 4. The image discriminator $D_i$ judges real HR images and SR images based on pixel values, while the feature discriminator $D_f$ distinguishes them through the mapping of feature maps. The generator loss function is

$$C_g = C_p + \lambda (C_a^i + C_a^f),$$

where $C_p$ is the perceptual loss [20], $C_a^i$ is the generator's image GAN loss for synthesizing high-frequency details in the pixel domain, $C_a^f$ is the generator's feature GAN loss for synthesizing structural details in the feature domain, and $\lambda$ weights the GAN loss terms. To train the discriminators $D_i$ and $D_f$, we minimize the losses $C_d^i$ and $C_d^f$, which correspond to $C_a^i$ and $C_a^f$, respectively; the generator and the discriminators are trained by alternately minimizing $C_g$, $C_d^i$, and $C_d^f$. The image GAN loss $C_a^i$ and image discriminator loss $C_d^i$ are defined as

$$C_a^i = -\log D_i(I_{SR}), \qquad C_d^i = -\log D_i(I_{HR}) - \log(1 - D_i(I_{SR})),$$

where $D_i(I)$ is the output of the image discriminator. The feature GAN loss $C_a^f$ and feature discriminator loss $C_d^f$ are defined analogously on the VGG feature maps $\Phi_m$:

$$C_a^f = -\log D_f(\Phi_m(I_{SR})), \qquad C_d^f = -\log D_f(\Phi_m(I_{HR})) - \log(1 - D_f(\Phi_m(I_{SR}))),$$

where $D_f(\Phi_m(\cdot))$ is the output of the feature discriminator. During training, this drives the generator to produce realistic structural high-frequency details instead of noise artifacts.
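A hedged sketch of how these loss terms could be computed, assuming the standard non-saturating GAN formulation with logits. The function names and signatures are ours for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_i_sr, d_f_sr, perceptual, lam=1e-3):
    """Sketch of C_g = C_p + lambda * (C_a^i + C_a^f): perceptual loss
    plus image-domain and feature-domain adversarial terms. d_i_sr and
    d_f_sr are discriminator logits on the SR image / its feature map."""
    c_a_i = F.binary_cross_entropy_with_logits(d_i_sr, torch.ones_like(d_i_sr))
    c_a_f = F.binary_cross_entropy_with_logits(d_f_sr, torch.ones_like(d_f_sr))
    return perceptual + lam * (c_a_i + c_a_f)

def discriminator_loss(d_real, d_fake):
    """Standard discriminator loss -log D(real) - log(1 - D(fake)),
    usable for both D_i (pixel domain) and D_f (feature domain)."""
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake
```

In an actual training loop, the generator step minimizes `generator_loss` while the two discriminator steps each minimize `discriminator_loss` on their respective inputs, alternating as described above.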

Experimental Settings
We used cell images obtained with a 40× objective as HR images and downsampled them using BD to obtain the corresponding LR images. Each of the four cell types has 800 pairs of training images and 5 test images; the test datasets are named Cell A, Cell B, Cell C, and Cell D. We use the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) to verify the authenticity of the reconstructed images and detect image distortion across models. Calculating PSNR and SSIM in the YCbCr color space reduces the influence of chroma on the results and aligns better with human visual perception. We also report the number of parameters of each model and the reconstruction time for a single cell image.
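As an illustration of luma-domain evaluation, the following computes PSNR on the Y channel only. The BT.601 luma weights are a common convention for SR benchmarks, assumed here rather than specified by our pipeline.

```python
import numpy as np

def psnr_y(img1, img2, peak=255.0):
    """PSNR computed on the luma (Y) channel of two uint8 RGB images,
    using BT.601 weights (0.299, 0.587, 0.114)."""
    w = np.array([0.299, 0.587, 0.114])
    y1 = img1.astype(np.float64) @ w
    y2 = img2.astype(np.float64) @ w
    mse = np.mean((y1 - y2) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)
```

Computing the metric on Y discards the chroma channels, which is why it correlates better with perceived sharpness than full-RGB PSNR.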
During model training, the input HR images are randomly cropped to a size of 192 × 192, and the quantity is increased by random horizontal flips and 90°, 180°, and 270° rotations. The LR images are obtained by ×2 bicubic downsampling of the corresponding HR images, resulting in a size of 96 × 96. We set the weight λ in Equation (8) to 10⁻³. For Φ_m in Equations (11) and (12), we use the Conv5 layer of VGG-19. We train our model using the ADAM optimizer with β₁ = 0.9 and β₂ = 0.999. The learning rate is initialized to 2 × 10⁻⁴ and multiplied by 0.5 after the 200th epoch. Our model is implemented in the PyTorch framework and trained on an NVIDIA 3090 GPU; training takes approximately 8 to 9 h to complete.
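The augmentation and learning-rate schedule described above can be sketched as follows. The helper names are ours, and the NumPy-based crop/flip/rotate is a simplified stand-in for the actual data loader.

```python
import random
import numpy as np

def augment_patch(hr, patch=192):
    """Random patch x patch HR crop with a random horizontal flip and a
    random 0/90/180/270 degree rotation, as used for data augmentation."""
    h, w, _ = hr.shape
    top = random.randrange(h - patch + 1)
    left = random.randrange(w - patch + 1)
    p = hr[top:top + patch, left:left + patch]
    if random.random() < 0.5:
        p = p[:, ::-1]                      # horizontal flip
    p = np.rot90(p, k=random.randrange(4))  # 0/90/180/270 degree rotation
    return np.ascontiguousarray(p)

def lr_at(epoch, base=2e-4):
    """Learning rate: 2e-4, halved after the 200th epoch."""
    return base * 0.5 if epoch >= 200 else base
```

The matching 96 × 96 LR patch would be taken from the same location in the ×2-downsampled image so the pair stays aligned.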

Comparison with Classic Models
We conducted ablation experiments on our proposed model, training it on the collected microscope image dataset and evaluating it on the Cell A dataset. As the results in Table 1 show, the CBAM, the RDB, and the feature discriminator we added each improve the PSNR and SSIM values. We then compare our model with six classic image SR models: EDSR, the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [26], the very deep Residual Channel Attention Network (RCAN) [27], RDN, SRResNet [17] (SRGAN without the discriminator), and RFGANM. From Table 2, it can be seen that our model achieves the highest PSNR and SSIM for the reconstructed images. This is because the RDABs in our generator allow it to fully utilize the hierarchical features of the original LR image and focus the reconstruction on the main information of the feature maps. To avoid generating artifacts during reconstruction, we added a feature discriminator, which enables the generator to produce high-frequency features relevant to the HR image, and we made corresponding improvements to the loss function. On the LPIPS metric, the best model is ESRGAN, with our model a close second. Regarding model parameters, the smallest model is SRResNet, and ours differs by only 5.91 M. For single-image reconstruction time, EDSR is the fastest, and our model is only 0.51 s slower. In summary, compared to other SR models, our model achieves the highest PSNR and SSIM, with competitive LPIPS, parameter count, and reconstruction time.
In addition, Figure 5 shows reconstruction detail views from our proposed model and two classic models. It can be observed that the cell detail textures and edge regions reconstructed by our model are closer to the HR images. Together, these results demonstrate that our network model achieves excellent performance.

Conclusions
In this paper, we provide a microscopic image dataset, and we improve the SRGAN model by replacing the residual blocks of the generator with RDBs combined with an attention mechanism; the generator uses 16 RDABs in total. Additionally, a feature discriminator is introduced to ensure that the reconstructed images contain high-frequency structural detail rather than noise.

Figure 4. Architecture of our discriminator network.

Figure 5. Comparison of reconstructed detail views for Cell A, Cell B, Cell C, and Cell D using our proposed model and two classic models.

Table 2. Comparison of PSNR, SSIM, LPIPS, model parameters, and reconstruction times for Cells A, B, C, and D using our proposed model and six classic image SR models (red indicates the best result).