GAN-Based Low-Dose Chest X-Ray Super-Resolution with Hybrid Channel-Spatial Attention and Pooling Layer Removal

Li, Wenjia; Yao, Yafeng; Gao, Di; Yi, Ying

doi:10.3390/app16041797

School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 1797; https://doi.org/10.3390/app16041797

Submission received: 23 January 2026 / Revised: 9 February 2026 / Accepted: 9 February 2026 / Published: 11 February 2026

(This article belongs to the Special Issue Application of Machine Vision in Biomechanical Engineering)

Abstract

Chest X-ray (CXR) imaging is one of the most widely used techniques for screening and diagnosing pulmonary diseases. However, discerning subtle structural changes, such as small nodules, disordered pulmonary textures, tiny cavities, pleural thickening, or spiculation, is difficult using low-resolution images. Acquiring high-resolution CXRs typically requires higher radiation doses, posing a risk to patients. We propose a chest X-ray image super-resolution algorithm based on generative adversarial networks (GANs). Through adversarial training, our approach generates high-resolution CXRs with enhanced details and improved realism. We further incorporate a CSA hybrid attention module into the network, strengthening its ability to capture fine structures and improve texture fidelity. Moreover, we remove the pooling layers from the channel attention module to overcome their limitations in super-resolution, thereby preserving spatial information more effectively. Experiments demonstrate our method’s superior performance and robustness, achieving a PSNR of 37.91 and an SSIM of 0.9108 on the internal test set while consistently outperforming other methods on previously unseen external clinical datasets. After adversarial training, the method attains its best visual performance, with LPIPS reduced to 0.0915 and the visual effect improved by 36.4% compared to low-resolution images. Ablation studies further verify the contribution of the proposed components to super-resolution capability. Overall, the results indicate that the proposed method can obtain high-quality chest X-ray images from simulated low-quality inputs.

Keywords: chest X-ray; generative adversarial networks; attention mechanisms; pooling free; super-resolution

1. Introduction

Chest radiography is a fundamental, non-invasive imaging technique that reveals early-stage diseases via subtle structural alterations, such as small nodules, disordered pulmonary textures, minute cavities, pleural thickening, or spiculation. However, low-resolution images are prone to significant detail loss and blurred boundaries, which can easily lead to missed or incorrect diagnoses of early lesions, limiting clinicians’ ability to accurately detect elusive and early-stage abnormalities during routine examinations [1]. In contrast, high-resolution images can present clearer spatial distributions, edge morphology, and internal structural characteristics of lesions in clinical practice, thereby significantly improving the detection rate of small foci, blurred margins, or complex overlapping regions that are otherwise indistinguishable. Nevertheless, routinely acquiring high-resolution chest X-rays requires an increased X-ray dosage, which poses additional cumulative health risks to patients, creating a challenging trade-off between image quality and safety. Super-resolution algorithms enhance images collected at low radiation doses [2], providing high-quality diagnostic images without raising radiation risks associated with hardware-based improvements. This is particularly valuable for patients requiring frequent follow-up to monitor disease progression and for radiation-sensitive populations such as children and pregnant women.

In recent years, methods in the super-resolution reconstruction field have constantly evolved, with deep learning techniques becoming increasingly pivotal in smart healthcare applications [3]. Traditional approaches such as bicubic interpolation [4] and sparse representation [5] are simple to implement but have limited capability to reconstruct complex structures and fine textures, often resulting in blocking artifacts and failing to provide actual additional detail. With the advent of deep learning, SRCNN [6] pioneered the use of convolutional neural networks for end-to-end image super-resolution, achieving notable performance gains; however, its simplistic feed-forward structure restricted its expressiveness. The introduction of residual networks, exemplified by VDSR [7] and EDSR [8], improved performance substantially by enabling deeper network architectures with residual connections, thereby stabilizing the training of deeper networks. Despite these advances, these methods often produce overly smoothed results that lack realistic perceptual quality.

Super-resolution methods based on generative adversarial networks (GAN) [9] employ adversarial training to encourage the generation of more realistic and detail-rich high-resolution images. SRGAN [10] was the first to apply GANs to super-resolution, while ESRGAN [11] further optimized the architecture with dense residual blocks and perceptual loss, improving texture detail restoration. Real-ESRGAN [12] introduced complex degradation modeling and a U-Net [13] discriminator to further enhance generalization and robustness. More recently, GAN-based frameworks have been specifically adapted for various medical modalities to meet precise clinical diagnostic demands. For instance, Cui et al. [14] demonstrated the value of GANs in diffusion-weighted MRI for accurate rectal cancer staging, while Jiang et al. [15] proposed a two-channel GAN specifically for the target reconstruction of pulmonary nodules. Similarly, PCT-GAN [16] was designed to restore fine trabecular bone microstructures from real CT images to handle unique noise characteristics. Nevertheless, medical images demand extremely high structural fidelity, and even minor generator errors may result in misdiagnosis. Although these methods have greatly improved perceptual quality and detail presentation, limitations remain in the recovery of high-frequency features and authentic anatomical structures.

Attention mechanisms have recently shown promising results in super-resolution. Increasing evidence demonstrates their ability to enhance networks’ focus on key features. RCAN [17] uses channel attention to improve the expression and focus on high-frequency features effectively. Benefiting from Swin Transformer [18], SwinIR [19] leverages efficient sliding-window multi-scale attention to capture both local details and global structures. In SwinFIR [20], the authors integrate fast Fourier convolution and residual modules to extend SwinIR, gaining a broader receptive field. HAT [21] combines channel and spatial self-attention, while LKHAT [22] incorporates reparameterization to further enhance local feature extraction, striking a balance between accuracy and efficiency. More recently, region-based attention mechanisms have been proposed to adaptively focus on high-difficulty restoration areas in medical images, reducing interference from irrelevant background noise [23].

Additionally, studies such as EDSR [8] and RepSR [24] indicate that Batch Normalization layers are suboptimal for super-resolution tasks. Batch Normalization standardizes feature statistics, which, while beneficial for many tasks, tends to introduce displeasing artifacts in super-resolution as it changes the distributional properties of features. Pooling layers, which aggregate features within local neighborhoods (e.g., via max or average pooling), similarly affect data distribution. While pooling can simplify features and enhance computational efficiency, it inevitably loses local detail and high-frequency features during downsampling, a significant drawback for medical image reconstruction, where fine structure preservation is critical.

To address these challenges, this paper presents a CSAEGAN-based chest X-ray super-resolution model. Built upon the GAN framework, the model aims to solve the super-resolution reconstruction problem under simulated low-dose imaging conditions through adversarial training. A novel CSA hybrid attention module is incorporated to enable the network to accurately capture critical pathological features and subtle structures in chest X-ray images. We further remove pooling layers from the channel attention modules to effectively preserve spatial details and high-frequency information, thereby enhancing the reconstruction quality of clinically significant features such as small pulmonary nodules, fine textures, and edge structures. Furthermore, we validate the model’s robust generalization ability beyond the training distribution using independent external datasets.

2. Theoretical Method

2.1. Network Framework

The proposed model in this paper adopts the generative adversarial network (GAN) as its basic framework. A GAN consists of a generator (G) and a discriminator (D) that are trained in a competitive manner [9]. The generator is responsible for synthesizing simulated data, while the discriminator determines whether the input data is real or generated. The two components engage in adversarial optimization. Both the generator and discriminator optimize their respective loss functions via backpropagation, updating their parameters using optimization algorithms such as gradient descent to continuously improve their performance. The training procedure can be defined as:

$$\min_{G} \max_{D} L(G, D) = \min_{G} \max_{D} \left\{ \mathbb{E}_{x \sim P_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim P_{z}(z)} [\log (1 - D(G(z)))] \right\} \tag{1}$$

In super-resolution tasks, the generator G aims to generate a super-resolved image from a low-resolution input, i.e., $I_{SR} = G(I_{LR})$, which should approximate the true high-resolution image $I_{HR}$ as closely as possible. The discriminator D computes the probabilities for the generated super-resolved image and the true high-resolution image, $D(I_{SR})$ and $D(I_{HR})$, respectively. The objective function can be formulated as:

$$L_{adv} = \mathbb{E}[\log D(I_{HR})] + \mathbb{E}[\log (1 - D(G(I_{LR})))] \tag{2}$$

Through this adversarial training mechanism, the generator learns to produce super-resolved images with richer details and more realistic textures.
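As a minimal numerical sketch of Equation (2), the snippet below estimates the adversarial objective from a batch of discriminator outputs. It assumes the discriminator returns probabilities in (0, 1); the function name `adversarial_loss`, the guard `eps`, and the sample scores are illustrative and do not come from the paper.

```python
import math

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(I_HR)] + E[log(1 - D(G(I_LR)))].

    d_real: discriminator probabilities for real HR images
    d_fake: discriminator probabilities for generated SR images
    eps is an assumed numerical guard against log(0).
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident discriminator pushes the value towards 0 (its upper bound);
# a fully fooled discriminator (0.5 everywhere) gives 2 * log(0.5).
confident = adversarial_loss([0.99, 0.98], [0.01, 0.02])
fooled = adversarial_loss([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to raise this value, while the generator tries to lower it by moving `d_fake` towards 1.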

2.2. Generator Architecture

As illustrated in Figure 1a, the proposed generator first applies a convolutional layer for preliminary feature extraction from the input low-resolution chest X-ray. The backbone consists of 23 serially connected iterative residual-in-residual dense blocks (CSA-RRDB). Each CSA-RRDB contains a dense block and a CSA hybrid attention module. Feature flow and fusion within each CSA-RRDB are achieved via feed-forward connections. At the output, an upsampling module and a convolutional layer are employed to reconstruct deep feature maps into a high-resolution chest X-ray.

As shown in Figure 1b, the dense block structure employs a series of convolutional units where the input to each layer is formed by concatenating the outputs of all preceding layers [25]. Specifically, let $g_{L}$ be the output of the L-th layer in the dense block; the input to this layer is the concatenation of feature maps from layers 0 to L − 1:

$$g_{L} = H_{L}([g_{0}, g_{1}, \dots, g_{L-1}]) \tag{3}$$

where $[g_{0}, g_{1}, \dots, g_{L-1}]$ denotes the concatenation along the channel dimension, and $H_{L}(\cdot)$ represents the nonlinear transformation (e.g., convolution + activation function) at layer L. This structure enables fusion of both low-level and high-level features, promoting feature reuse and smoother gradient flow, thereby enhancing the model’s ability to represent complex structures and fine textures.
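The concatenation pattern of Equation (3) can be sketched in NumPy as follows. This is a toy illustration only: $H_{L}$ is replaced by a random 1 × 1 convolution with LeakyReLU, and the number of layers, the growth of 8 channels per layer, and the name `dense_block` are assumptions rather than the paper's settings.

```python
import numpy as np

def dense_block(g0, num_layers=4, growth=8, seed=0):
    """Toy dense block: layer L sees the concatenation [g_0, ..., g_{L-1}].

    H_L is stood in for by a random 1x1 convolution plus LeakyReLU
    (an assumption for brevity; the paper only says conv + activation).
    g0 has shape (H, W, C0). Returns the list of all feature maps.
    """
    rng = np.random.default_rng(seed)
    features = [g0]
    for _ in range(num_layers):
        x = np.concatenate(features, axis=-1)          # channel-wise concat
        w = rng.standard_normal((x.shape[-1], growth)) * 0.1
        y = x @ w                                      # 1x1 conv == per-pixel matmul
        features.append(np.where(y > 0, y, 0.2 * y))   # LeakyReLU(0.2)
    return features

feats = dense_block(np.ones((16, 16, 32)))
# Number of input channels seen by layers 1..4
input_channels = [sum(f.shape[-1] for f in feats[:L]) for L in range(1, len(feats))]
```

The input width grows with every layer because each layer reuses all earlier feature maps.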

The CSA Block, depicted in Figure 1c, primarily consists of two components: a channel attention module [26] and a spatial attention module [27]. Traditional channel attention employs pooling layers to aggregate spatial information, which leads to spatial detail loss and reduced super-resolution performance. To address this, as indicated by the faded blocks in Figure 1c, we remove the max-pooling and average-pooling layers, directly computing channel weights from the input features. This design, tailored for medical images, avoids the loss of subtle structural information, effectively preserving key diagnostic features such as micro-nodules and fine textures in the chest X-ray. The channel attention calculation is simplified as:

$$M_{c}(g) = \sigma(2\,\mathrm{MLP}(g)) = \sigma(2 W_{1}(W_{0}(g))) \tag{4}$$

where MLP denotes a multi-layer perceptron, $W_{0}$ and $W_{1}$ are the weights of its two layers, and σ is the Sigmoid activation.
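A minimal NumPy sketch of Equation (4) is given below. Because no pooling is applied, the MLP acts on the channel vector of every pixel, which is equivalent to two 1 × 1 convolutions. The ReLU between the two layers, the reduction ratio `r`, the random weights, and the final multiplication with the input are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pooling_free_channel_attention(g, w0, w1):
    """M_c(g) = sigmoid(2 * W1(W0(g))) applied independently at every pixel.

    g:  (H, W, C) feature map
    w0: (C, C // r) and w1: (C // r, C) are the shared MLP weights.
    A ReLU between the two layers is assumed here.
    """
    hidden = np.maximum(g @ w0, 0.0)      # per-pixel matmul == 1x1 convolution
    return sigmoid(2.0 * (hidden @ w1))   # (H, W, C): one weight per pixel and channel

rng = np.random.default_rng(0)
C, r = 16, 4                              # reduction ratio r is an assumed value
g = rng.standard_normal((8, 8, C))
w0 = rng.standard_normal((C, C // r)) * 0.1
w1 = rng.standard_normal((C // r, C)) * 0.1
m_c = pooling_free_channel_attention(g, w0, w1)
refined = m_c * g                         # assumed fusion: element-wise reweighting
```

Unlike a pooled descriptor, `m_c` keeps the full H × W layout, so the weights can differ from pixel to pixel.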

For spatial attention, the module adaptively adjusts weights for different spatial locations, emphasizing key regions of the image. Specifically, max-pooling and average-pooling are performed across the channel dimension, yielding $g'_{avg}$ and $g'_{max}$ (both of dimension H × W × 1). These are concatenated to form an H × W × 2 feature map, which is processed by a 7 × 7 convolution followed by a Sigmoid activation to obtain the spatial attention weights:

$$M_{s}(g) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(g); \mathrm{MaxPool}(g)])) = \sigma(f^{7 \times 7}([g'_{avg}; g'_{max}])) \tag{5}$$

This enables the network to focus on abnormal regions of chest X-rays, such as tuberculosis lesions, tumor margins, or pleural effusions. The outputs of the two attention modules are fused with the original feature maps, yielding a comprehensively enhanced feature representation. This CSA hybrid attention framework significantly improves the network’s sensitivity to subtle pathological features in chest X-rays and augments its ability to capture fine structural details.
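The spatial attention of Equation (5) can be sketched in NumPy as below. The 7 × 7 kernel is random here, whereas it is learned in practice, and the zero padding and the function name `spatial_attention` are assumptions for illustration.

```python
import numpy as np

def spatial_attention(g, kernel):
    """M_s(g) = sigmoid(conv7x7([AvgPool(g); MaxPool(g)])), pooling over channels.

    g:      (H, W, C) feature map
    kernel: (7, 7, 2) convolution weights (random here, learned in practice)
    Zero padding keeps the output at H x W (padding mode is an assumption).
    """
    g_avg = g.mean(axis=-1, keepdims=True)              # (H, W, 1)
    g_max = g.max(axis=-1, keepdims=True)               # (H, W, 1)
    stacked = np.concatenate([g_avg, g_max], axis=-1)   # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(stacked, ((pad, pad), (pad, pad), (0, 0)))
    h, w = g.shape[:2]
    out = np.zeros((h, w))
    for dy in range(k):                                 # direct cross-correlation
        for dx in range(k):
            out += (padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]).sum(axis=-1)
    return 1.0 / (1.0 + np.exp(-out))                   # (H, W) weights in (0, 1)

rng = np.random.default_rng(0)
g = rng.standard_normal((12, 10, 16))
m_s = spatial_attention(g, rng.standard_normal((7, 7, 2)) * 0.05)
```

Pooling here runs over channels only, so the H × W grid is preserved and each location receives its own weight.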

2.3. Theoretical Motivation

Although pooling layers are fundamental for achieving translation invariance in high-level vision tasks, their application in medical image super-resolution introduces a structural misalignment. Intuitively, Global Average Pooling (GAP) acts as a spatial low-pass filter. By averaging features across the entire image, it effectively “blurs out” critical high-frequency details, such as the sharp edges of micro-nodules or fine reticular patterns in fibrosis, rendering them indistinguishable from the background. This task necessitates translation equivariance rather than invariance to recover fine lesion details. We provide a theoretical proof demonstrating that pooling creates an information loss [28].

From a signal processing perspective, we first analyze how pooling degrades the signal representation. Let $F \in \mathbb{R}^{H \times W \times C}$ be the input feature tensor, where $H$, $W$, and $C$ denote the height, width, and channel dimensions, respectively. The GAP operation computes a scalar descriptor $z_{c}$ for channel $c$ by aggregating spatial information:

$$z_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{i,j,c} \tag{6}$$

In the frequency domain, applying a spatial box filter corresponds to multiplication by a Sinc function. Let $\mathcal{F}(F_{c})(u, v)$ represent the Fourier transform of the feature map at frequency coordinates $(u, v)$. The GAP output captures only the zero-frequency (DC) component:

$$z_{c} \propto \mathcal{F}(F_{c})(0, 0) \tag{7}$$

Critically, this implies that for any frequency $(u, v) \neq (0, 0)$, the spectral information is nullified [29]. Physically, this means the network loses the ability to perceive spatial variations within a channel, treating complex textures essentially as uniform flat regions. Consequently, the computed attention weight $s_{c}$ (derived from the descriptor $z_{c}$ via the MLP) becomes a global scalar, broadcasting a uniform affine transformation across the spatial domain:

$$F'(x, y) = s_{c} \cdot F(x, y) \tag{8}$$

This creates a “blind spot” for high-frequency textures, such as lung markings and subtle nodule edges.
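The relation in Equations (6) and (7) can be checked numerically. The sketch below builds two synthetic single-channel maps, one flat and one checkerboard-textured with the same mean, and compares their GAP descriptors with the zero-frequency FFT coefficient. The test patterns and helper names are illustrative assumptions.

```python
import numpy as np

H, W = 32, 32
smooth = np.full((H, W), 0.5)                            # flat region
yy, xx = np.mgrid[0:H, 0:W]
# Checkerboard texture with the same mean as the flat region
textured = 0.5 + 0.4 * np.where((yy + xx) % 2 == 0, 1.0, -1.0)

def gap(channel):
    """Global average pooling of one channel, Equation (6)."""
    return channel.mean()

def dc_component(channel):
    """Zero-frequency FFT coefficient, normalised by the number of pixels."""
    return np.fft.fft2(channel)[0, 0].real / channel.size

z_smooth, z_textured = gap(smooth), gap(textured)
spectrum = np.abs(np.fft.fft2(textured))
high_freq_energy = spectrum.sum() - spectrum[0, 0]       # everything except DC
```

Both maps yield the same descriptor even though the textured one carries substantial non-DC energy, which is the information a pooled attention weight cannot see.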

We further formalize this loss using the Data Processing Inequality (DPI) [30]. Consider the Markov chain in a standard pooling-based module, $X \xrightarrow{\text{pool}} v \xrightarrow{\text{MLP}} s$, where $X$ represents the local spatial features, $v$ is the pooled vector, and $s$ is the attention scalar. According to DPI, the inequality $I(s; X) \leq I(v; X)$ holds, where $I(\cdot; \cdot)$ denotes the mutual information. By decomposing the feature space $X$ into a global mean component $X_{\mu}$ and a textural residual component $X_{\sigma}$ such that $X = X_{\mu} + X_{\sigma}$, and noting that GAP projects $X$ orthogonally onto the subspace of constant functions, it follows that:

$$I(v; X_{\sigma}) = 0 \tag{9}$$

This theoretically proves that standard channel attention vectors contain zero mutual information regarding spatial texture distribution. In other words, once the features are pooled into vector $v$, the specific spatial arrangement of the texture $X_{\sigma}$ is irreversibly lost and mathematically impossible to reconstruct.
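A small deterministic illustration of this decomposition follows. It is not a mutual-information estimate: it only shows, on random features, that the pooled vector ignores the zero-mean residual and is unchanged when pixel positions are shuffled. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 16, 8))           # (H, W, C) feature map

# Decompose into a per-channel mean and a zero-mean textural residual.
X_mu = X.mean(axis=(0, 1), keepdims=True)
X_sigma = X - X_mu

# Shuffle pixel positions: the texture layout changes, the pooled vector does not.
flat = X.reshape(-1, X.shape[-1])
shuffled = flat[rng.permutation(flat.shape[0])].reshape(X.shape)

v = X.mean(axis=(0, 1))                        # GAP descriptor of the original
v_shuffled = shuffled.mean(axis=(0, 1))        # GAP descriptor after shuffling
v_residual = X_sigma.mean(axis=(0, 1))         # GAP of the residual component
```

Since `v` and `v_shuffled` coincide, no function of the pooled vector can distinguish the two spatial arrangements.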

This forward-pass information loss creates severe pathologies in the backward optimization dynamics. Since the attention weight $s_{c}$ is spatially uniform, the gradient of the loss $L$ with respect to input feature $F_{u,v,c}$ involves a summation over the entire spatial dimension $N = H \times W$:

$$\frac{\partial L}{\partial F_{u,v,c}} = \frac{\partial L}{\partial Y_{u,v,c}} \cdot s_{c} + \frac{1}{N} \sum_{i,j} \frac{\partial L}{\partial Y_{i,j,c}} \cdot F_{i,j,c} \cdot \frac{\partial s_{c}}{\partial z_{c}} \tag{10}$$

where $Y$ denotes the output feature map. This formulation introduces two critical issues: (1) Gradient Dilution, where the term $1/N$ significantly attenuates the gradient magnitude; and (2) Gradient Conflict, where the summation $\sum_{i,j}$ aggregates gradients from all spatial locations. If distinct image regions require opposing adjustments (e.g., enhancement vs. suppression), their gradients will effectively cancel out. As a result, the optimizer receives ambiguous feedback, lacking the localized guidance needed to restore fine details.
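Equation (10) can be verified with finite differences on a single channel. In the sketch below the MLP is replaced by a scalar gain `a` followed by a Sigmoid, and the loss is a fixed linear functional of the output so that `G` plays the role of $\partial L / \partial Y$; both simplifications are assumptions made to keep the check short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(F, G, a):
    """L = sum(G * Y) with Y = s * F and s = sigmoid(a * mean(F)).

    G plays the role of dL/dY; the scalar a is a stand-in for the MLP.
    """
    return float((G * sigmoid(a * F.mean()) * F).sum())

def analytic_grad(F, G, a):
    s = sigmoid(a * F.mean())
    ds_dz = a * s * (1.0 - s)
    local_term = G * s                               # dL/dY_{u,v} * s_c
    global_term = (G * F).sum() * ds_dz / F.size     # (1/N) * sum over all pixels
    return local_term + global_term

rng = np.random.default_rng(0)
F = rng.standard_normal((6, 6))
G = rng.standard_normal((6, 6))
a, eps = 1.5, 1e-6

numeric = np.zeros_like(F)
for idx in np.ndindex(F.shape):                      # central differences per pixel
    Fp, Fm = F.copy(), F.copy()
    Fp[idx] += eps
    Fm[idx] -= eps
    numeric[idx] = (loss(Fp, G, a) - loss(Fm, G, a)) / (2 * eps)

max_error = np.abs(numeric - analytic_grad(F, G, a)).max()
```

The `global_term` is identical at every pixel, which is the shared, spatially summed contribution that the text describes as diluted and potentially conflicting.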

To resolve these limitations, our CSAEGAN replaces pooling with a pixel-wise mapping $M(x, y) = \sigma(\mathrm{MLP}(F(x, y)))$. Since the $1 \times 1$ convolution is locally rank-preserving [31], it maintains the spatial entropy:

$$I(M; X) \approx \mathrm{Ent}(X) \tag{11}$$

where $\mathrm{Ent}(\cdot)$ denotes the entropy function. This ensures the preservation of the full spectral bandwidth. Crucially, this decouples the gradient flow:

$$\frac{\partial L}{\partial F_{u,v,c}} = \sum_{k} \frac{\partial L}{\partial Y_{u,v,k}} \cdot F_{u,v,k} \cdot \frac{\partial m_{u,v,k}}{\partial F_{u,v,c}} + \frac{\partial L}{\partial Y_{u,v,c}} \cdot m_{u,v,c} \tag{12}$$

By eliminating the spatial summation (the $1/N$ term), the network receives conflict-free, localized feedback, directly enabling the precise recovery of fine pathological textures.
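The locality implied by Equation (12) can be illustrated by perturbing the features at a single pixel and counting how many output pixels change under each scheme. In this sketch a single linear layer stands in for the MLP, and the function names and the perturbed location are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gap_attention(F, w):
    """Pooling-based: one scalar weight per channel, shared by all pixels."""
    s = sigmoid(F.mean(axis=(0, 1)) @ w)      # (C,)
    return F * s

def pixelwise_attention(F, w):
    """Pooling-free: M(x, y) = sigmoid(MLP(F(x, y))), here a single linear layer."""
    return F * sigmoid(F @ w)                 # (H, W, C)

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 4))
w = rng.standard_normal((4, 4))               # stand-in for the MLP weights

F_perturbed = F.copy()
F_perturbed[3, 5, :] += 0.5                   # change the features at one pixel only

def changed_pixels(fn):
    diff = np.abs(fn(F_perturbed, w) - fn(F, w)).max(axis=-1)
    return int((diff > 1e-12).sum())

n_gap = changed_pixels(gap_attention)         # coupled through the shared weight s_c
n_pixelwise = changed_pixels(pixelwise_attention)
```

With the pixel-wise mapping only the perturbed location responds, whereas the pooled weight couples that pixel to the rest of the map.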

2.4. Discriminator Architecture

The discriminator is based on the Relativistic average GAN (RaGAN) [32] framework. Unlike traditional GAN discriminators, which simply determine whether an individual input image is real or fake, the RaGAN discriminator evaluates the probability that a generated image is more realistic than the average real image, enhancing its discriminative power and alleviating issues such as gradient vanishing and mode collapse, thereby improving training stability and reconstruction quality. Specifically, the discriminator comprises a multi-layer convolutional neural network including a series of convolutions, activation, and normalization layers for progressive feature extraction and discrimination. After passing either real high-resolution chest X-rays or generator output super-resolved images through the discriminator, the output is a single-channel probability estimate. The core discrimination criterion of RaGAN can be expressed as:

$$D_{Ra}(x_{real}, x_{fake}) = \sigma(C(x_{real}) - \mathbb{E}_{x_{fake}}[C(x_{fake})]) \tag{13}$$

$$D_{Ra}(x_{fake}, x_{real}) = \sigma(C(x_{fake}) - \mathbb{E}_{x_{real}}[C(x_{real})]) \tag{14}$$

where C(·) denotes the raw output of the convolutional network (prior to activation), σ is the Sigmoid function, $x_{real}$ and $x_{fake}$ are the real and generated images, and E[·] denotes the expectation over the batch. Equation (13) computes the probability that a real image is more realistic than the average generated image, while Equation (14) measures the reverse. By leveraging average probabilities, RaGAN enables the discriminator to learn a more informative probability distribution, and encourages the generator to produce images closer to the true data distribution, improving the fidelity and detail of super-resolved images.
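Equations (13) and (14) reduce to a few lines once the expectations are replaced by batch means. The sketch below is illustrative: the function name `relativistic_scores` and the sample critic outputs are assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relativistic_scores(c_real, c_fake):
    """Return (D_Ra(x_real, x_fake), D_Ra(x_fake, x_real)) for one batch.

    c_real, c_fake: raw critic outputs C(x) before the Sigmoid, shape (B,).
    Expectations are replaced by batch means.
    """
    d_real = sigmoid(c_real - c_fake.mean())
    d_fake = sigmoid(c_fake - c_real.mean())
    return d_real, d_fake

c_real = np.array([2.0, 1.5, 2.5])     # illustrative critic outputs
c_fake = np.array([-1.0, 0.0, -0.5])
d_real, d_fake = relativistic_scores(c_real, c_fake)
```

Each score depends on both batches, so the generator receives a training signal from real images as well as from its own outputs.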

3. Experimental Result

3.1. Dataset and Experimental Settings

The chest X-ray (CXR) image dataset used in this study primarily originates from the public dataset published by M. E. H. Chowdhury et al. [33], comprising a total of 3886 images. This dataset covers various clinical scenarios, including 1200 COVID-19 positive images, 1341 normal images, and 1345 viral pneumonia images, providing a rich and diverse foundation for super-resolution reconstruction of chest diseases. To better simulate common image degradation scenarios in clinical practice, we generate paired low-resolution and high-resolution data from original high-resolution X-rays by applying a Gaussian blur in conjunction with bicubic interpolation. Specifically, we apply a Gaussian blur with a standard deviation of $\sigma = 1.0$ to simulate the inevitable optical blur (e.g., point spread function) during image acquisition, followed by $4\times$ bicubic downsampling to mimic the limited spatial resolution of detectors. This numerical preprocessing facilitates improved adaptation of the super-resolution model to simulated degradation types.
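The degradation pipeline described above can be sketched in plain NumPy as follows. This is an approximation under stated assumptions: the Gaussian kernel is truncated at a radius of 3 pixels, borders use edge replication, and the bicubic step uses a plain 4-tap Keys kernel (a = −0.5) without the antialiasing that some image libraries add when downsampling. The paper does not specify these details, and all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(sigma=1.0, radius=3):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def cubic(t, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5 is a common choice)."""
    t = np.abs(t)
    inner = (a + 2) * t**3 - (a + 3) * t**2 + 1
    outer = a * t**3 - 5 * a * t**2 + 8 * a * t - 4 * a
    return np.where(t <= 1, inner, np.where(t < 2, outer, 0.0))

def resample_matrix(n_in, scale):
    """(n_out, n_in) matrix of bicubic weights for downsampling by `scale`."""
    n_out = n_in // scale
    centers = (np.arange(n_out) + 0.5) * scale - 0.5   # output centres in input coords
    m = np.zeros((n_out, n_in))
    for i, c in enumerate(centers):
        base = int(np.floor(c))
        for j in range(base - 1, base + 3):            # 4-tap support
            m[i, min(max(j, 0), n_in - 1)] += cubic(c - j)
    return m

def degrade(hr, sigma=1.0, scale=4):
    """HR -> LR: Gaussian blur followed by bicubic downsampling."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(hr, pad, mode="edge")
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, blurred)
    rows = resample_matrix(hr.shape[0], scale)
    cols = resample_matrix(hr.shape[1], scale)
    return rows @ blurred @ cols.T

lr = degrade(np.full((64, 48), 0.7))                   # constant test image
```

Applying `degrade` to each high-resolution image would give one low-resolution counterpart at a quarter of the height and width.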

For data partitioning, 256 images are randomly selected as a validation set for model tuning and hyperparameter selection, while 30 images are used as an independent test set for initial performance evaluation. To further validate the model’s generalization and clinical applicability, two external test sets [34] are introduced: the Normal test set includes 234 normal chest X-rays, and an additional set contains 390 images exhibiting varying degrees of pulmonary opacity (due to viral and bacterial infections). These external test sets are used solely for final performance evaluation to ensure result objectivity and generalizability. This diversified testing strategy allows for a comprehensive assessment of the proposed method’s performance across different clinical scenarios, particularly in terms of preserving diagnostically critical features.

Three metrics—LPIPS, PSNR, and SSIM—are used to comprehensively evaluate the generated images. PSNR measures the similarity between the original and reconstructed images [35] and is calculated as follows:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX_{I}^{2}}{MSE}\right) = 20 \log_{10}\left(\frac{MAX_{I}}{\sqrt{MSE}}\right) \tag{15}$$

where MSE denotes mean squared error and MAX is the maximum possible pixel value. PSNR quantitatively evaluates image reconstruction quality; higher values indicate greater similarity between the reconstructed and original images. SSIM likewise measures image similarity, considering luminance, contrast, and structural information [36]. The simplified formula is as follows:

$$\mathrm{SSIM}(x, y) = \left(\frac{2 \mu_{x} \mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}}\right) \left(\frac{2 \sigma_{xy} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}}\right) \tag{16}$$

Here, x and y denote sliding window data from two images; $\mu_{x}$ and $\mu_{y}$ are their means, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are their variances, $\sigma_{xy}$ is their covariance, and $C_{1}$, $C_{2}$ are constants to stabilize the calculation and prevent division by zero. LPIPS agrees with human perceptual similarity better than PSNR and SSIM [37], and measures perceptual differences between two images using deep learning models. The calculation is as follows:

$$d(x, x_{0}) = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h,w} \left\| \omega_{l} \odot (\hat{y}_{hw}^{l} - \hat{y}_{0hw}^{l}) \right\|_{2}^{2} \tag{17}$$

where $H_{l}$ and $W_{l}$ are the height and width of the l-th layer feature map, $\hat{y}_{hw}^{l}$ and $\hat{y}_{0hw}^{l}$ are the feature values at spatial location (h, w) in layer l, and $\omega_{l}$ is a learnable weighting parameter that adjusts the importance of different feature layers.
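Equations (15) and (16) can be sketched in NumPy as below. The SSIM function evaluates a single window (the whole array) rather than averaging over sliding windows as is usual in practice, and the constants $C_{1} = (0.01 \cdot MAX)^{2}$ and $C_{2} = (0.03 \cdot MAX)^{2}$ are commonly used defaults assumed here because the paper does not state its values. LPIPS is omitted since it requires a pretrained network.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """PSNR in dB following Equation (15); returns inf for identical images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val**2 / mse))

def ssim_window(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM for a single window (the whole array), following Equation (16).

    C1 = (k1 * MAX)^2 and C2 = (k2 * MAX)^2 are assumed default constants.
    """
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64)).astype(np.float64)   # synthetic reference
noisy = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255)     # mildly corrupted copy
psnr_noisy = psnr(ref, noisy)
ssim_noisy = ssim_window(ref, noisy)
```

Higher PSNR and SSIM indicate a closer match to the reference, whereas for LPIPS a lower distance is better.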

Experimental environment: The Adam optimizer was used for loss function optimization, with a learning rate of $10^{-4}$. The num_workers parameter was set to 3, the batch size was 6, and the model was trained for 400,000 iterations. The super-resolution scaling factor was set to 4×, and all training was conducted on an NVIDIA RTX 4080 GPU.

3.2. Training Results

The training characteristics of the model are illustrated in Figure 2a, which shows the iterative changes in three types of losses for the generator during training. The pixel loss ($L_{1}$) remains consistently low and stable throughout the process, reflecting the model’s sustained ability to reconstruct basic anatomical structures. The perceptual loss ($L_{percep}$) is relatively higher and shows some fluctuations, which mainly arise from continuous optimization of high-level texture and semantic features. The adversarial loss ($L_{adv}$) remains within a lower range but exhibits occasional sharp fluctuations, indicative of the dynamic competition between the generator and discriminator. Figure 2b presents the discriminator’s output scores for real samples ($D(x_{r})$) and generated samples ($D(G(x))$). At the early stage of training, both exhibit significant fluctuations, indicating that the generator and discriminator are undergoing rapid adjustment and continuous confrontation. As training progresses, both scores show a stable upward trend and a gradual convergence, reflecting that the discriminator is achieving effective real-versus-fake discrimination, while the generator is constantly enhancing its ability to “fool” the discriminator. Towards the end of training, the output gap between the two further narrows, suggesting that a new equilibrium has been reached, characteristic of a well-trained GAN, which facilitates the generation of high-quality and indistinguishable super-resolved images. Overall, these training curves demonstrate the model’s good convergence and dynamic stability. The collaborative optimization of multiple loss components ensures a balance of structural accuracy, perceptual quality, and visual effect, laying the groundwork for subsequent high-quality chest X-ray reconstruction.

3.3. Analysis of Typical Case Reconstruction

To comprehensively assess the proposed method’s applicability and generalization to different pulmonary disease scenarios, we selected chest X-ray images covering seven typical lesion categories for super-resolution reconstruction assessment. Figure 3 presents super-resolution results for various clinical cases, including normal lungs, tuberculosis, lung abscess, emphysema, pleural effusion, primary syndrome, and lung cancer. For each case, the images are arranged from top to bottom as follows: low-resolution input, super-resolved output produced by the proposed method, and the corresponding high-resolution reference image.

Experimental results show that for all lesion types, the super-resolved images produced by our method, compared with the low-resolution input images, exhibit much crisper structural edges and significantly enhanced detail in local lesions, lung textures, airways, and nodules, greatly alleviating the blurring and detail loss common in low-resolution images. This is more evident in the locally magnified comparison regions. In cases such as tuberculosis, emphysema, or tumors characterized by heterogeneous shadows or blurred boundaries, the super-resolution output sharpens lesion contours, improves edge delineation, and effectively preserves early features such as subtle foci, spiculation, and texture irregularities, thereby providing stronger imaging support for the detection of complex or early-stage pathologies. In the majority of cases, the super-resolved output closely matches the true high-resolution references, showing excellent restoration of lung field structure, vascular pathways, and thoracic outlines with natural image texture. The proposed method robustly enhances both increased density areas (e.g., pleural effusion, masses) and decreased density areas (e.g., emphysema, cavities) without introducing noticeable artifacts, demonstrating good generalization and robustness to diverse image types.

These results clearly demonstrate that our adversarial training and channel–spatial attention-based super-resolution approach can reliably restore structural and textural details under various chest disease scenarios, providing a solid imaging foundation for subsequent clinical screening.

3.4. Quantitative and Qualitative Comparison with Other Models

To comprehensively evaluate the proposed method’s performance, we systematically compare it with current mainstream super-resolution approaches, including deep learning approaches (SRCNN, EDSR, SwinIR, CRAFT), and various GAN-based architectures (SRGAN, ESRGAN, MedSRGAN [38], Real-ESRGAN). The results are summarized in Table 1, with GAN-based models evaluated without their discriminator modules. The best results are highlighted in bold.

The experimental outcomes indicate that our method achieves the best or near-best performance across all three test sets and evaluation metrics. Its aggregate results surpass those of the other methods, both traditional and state-of-the-art deep learning models, reflecting advantages in both imaging accuracy and perceptual quality.

Figure 4 displays the reconstruction results of different methods on representative chest X-rays. Overall, except for early methods, all mainstream deep learning approaches and our proposed method are able to restore the chest X-ray structures well. Although the differences among advanced methods in terms of global structure, edge continuity, and detailed texture restoration are subtle, local magnifications (highlighted by red boxes) show that our method preserves finer, more natural details with less local blurring and fewer artifacts, exhibiting superior texture restoration and noise suppression abilities.

While deep learning-based super-resolution has reached a mature stage where numerical gains on standard benchmarks appear marginal, our method consistently outperforms strong competitors in PSNR, SSIM, and LPIPS, as shown in Table 1. Although the numerical margins are subtle, in the context of medical imaging, they translate to critical gains in detail fidelity and structural preservation. Even minor improvements in these metrics often correlate with the clearer delineation of tissue boundaries and the retention of subtle pathological textures, which are pivotal for enhancing diagnostic confidence and reducing the risk of misinterpretation. Thus, these quantitative advantages provide a more reliable foundation for clinical observation.

3.5. Comparative Analysis of Generative Adversarial Network Methods

Given that the perceived quality of medical image generation depends not only on structural information but also on perceptual realism and the richness of details, this section introduces a discriminator and employs adversarial training to further enhance the generative model. We compare the performance of SRGAN, ESRGAN, Real-ESRGAN, and our proposed method under a complete GAN architecture.

As shown in Table 2 (where the best results are highlighted in bold), after introducing the discriminator for adversarial training, all GAN-based methods—including ours—exhibit reduced performance on traditional quantitative metrics such as PSNR and SSIM. This decline is a common phenomenon in GAN models, as they inherently pursue higher perceptual quality and realism. Despite this more challenging adversarial training setup, our proposed method still outperforms other GAN-based methods on most or all evaluation metrics. Driven by the discriminator, the model outputs images with higher subjective realism and richer visual details: although there is some loss in certain objective indicators, subjective assessment shows images that are more natural and exhibit greater detail hierarchies and realism. As illustrated in Figure 5, our method demonstrates notably higher visual quality in terms of tissue structure, edge sharpness, and local noise suppression.

This phenomenon reveals the dual requirements of medical image super-resolution tasks: on one hand, traditional quantitative indices reflect overall structure and SNR, which are suitable for measuring the generator’s structural capacity; on the other hand, GANs aided by discriminators can significantly improve subjective quality, restoring details that are closer to the true data distribution. Our proposed method performs excellently in both modes, ensuring structural restoration while also enhancing perceptual quality, demonstrating strong applicability and practical value.
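The trade-off described above can be illustrated with a toy example. The sketch below assumes 8-bit intensities and the usual definition PSNR = 10·log10(255² / MSE); the one-dimensional "texture" signal and the helper names (`mse`, `psnr`) are hypothetical and are not taken from our experiments. A flat, over-smoothed estimate scores a higher PSNR than an estimate that contains the correct texture displaced by one pixel, which is one reason adversarially trained models can look sharper while scoring lower on pixel-wise metrics.

```python
import math

def mse(reference, estimate):
    """Mean squared error between two equally sized intensity sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB; returns infinity when the two signals are identical."""
    err = mse(reference, estimate)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)

# Hypothetical ground-truth "texture": alternating dark and bright pixels.
truth = [0, 255] * 8
# Over-smoothed estimate: the texture is replaced by its mean value.
smooth = [127.5] * 16
# Textured estimate: the correct pattern, displaced by one pixel.
shifted = [255, 0] * 8

psnr_smooth = psnr(truth, smooth)
psnr_shifted = psnr(truth, shifted)
print(f"smooth:  {psnr_smooth:.2f} dB")
print(f"shifted: {psnr_shifted:.2f} dB")
```

The smooth estimate wins on PSNR even though it contains no texture at all, so PSNR and SSIM are best read together with a perceptual measure such as LPIPS.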

3.6. Ablation Study

To validate the contribution of each proposed component, we conducted comparative experiments in this subsection. As illustrated in Figure 6a, we compared the validation-set PSNR trends during training for models with (Proposed w/CSA Block) and without (Proposed w/o CSA Block) the CSA Block. At every stage of training, both early and late, the model with the CSA Block attains higher PSNR, and the margin remains steady throughout. Adding the CSA Block therefore enhances the model’s super-resolution reconstruction ability, and this PSNR advantage persists as training proceeds.

These findings confirm our previous claims that the channel–spatial attention mechanism enables the network to more precisely capture key pathological features and subtle structures present in chest X-ray images. Given the complex pulmonary textures, small nodules, and various overlapping anatomical structures in chest radiographs, the model needs to attend to both local details and global context. The CSA Block simultaneously optimizes channel weighting and spatial attention, thereby improving the network’s perception of clinically significant regions, such as pulmonary margins, cardiopulmonary interfaces, and occult lesion areas, and enabling higher-quality super-resolution reconstructions.

Table 3 presents a quantitative performance comparison on the Datatest (X4) set between models with and without pooling layers. The results indicate that removing the pooling layer from the channel attention module yields consistent improvements across all evaluation metrics: PSNR increases from 35.2200 to 35.4274, SSIM rises from 0.8453 to 0.8463, and LPIPS decreases from 0.0923 to 0.0915. Similarly, the training curves in Figure 6b demonstrate that the pooling-free model maintains a consistent performance lead throughout the training process. While pooling operations are ubiquitous in general computer vision tasks, our findings suggest that in medical image super-resolution, the downsampling inherent to pooling inevitably discards local details and high-frequency features. This is particularly detrimental to the reconstruction of critical clinical characteristics, such as micro-nodules and disordered lung textures. By eliminating the pooling layer, the model more effectively preserves spatial details, thereby preventing the loss of crucial diagnostic information, especially for pathological features that are already extremely faint in low-resolution images.
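To make this argument concrete, the following toy sketch (an illustration only, not the CSA Block itself) shows what a globally average-pooled channel descriptor can and cannot distinguish. The feature maps and the helper `global_avg_pool` are hypothetical. Two single-channel maps, one containing a small bright response and one with the same total activation spread evenly, yield the same pooled value, so any attention weight computed from that value alone would be identical for both, whereas the unpooled maps remain clearly distinguishable.

```python
def global_avg_pool(feature_map):
    """Collapse an H x W map (list of rows) to one scalar, as global average pooling does."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

# Hypothetical 4 x 4 feature maps for a single channel.
# Map A: one small, bright, localized response (e.g., a micro-nodule).
localized = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 16.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Map B: the same total activation spread uniformly over the map.
diffuse = [[1.0] * 4 for _ in range(4)]

pooled_localized = global_avg_pool(localized)
pooled_diffuse = global_avg_pool(diffuse)

# Peak response, which is only visible if spatial positions are kept.
peak_localized = max(max(row) for row in localized)
peak_diffuse = max(max(row) for row in diffuse)

print(pooled_localized, pooled_diffuse)  # the pooled descriptors coincide
print(peak_localized, peak_diffuse)      # the spatial maps differ
```

Max pooling would retain the peak value in this example but still not its location, which is the kind of spatial information the pooling-free design is intended to keep.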

Table 4 further quantifies the computational overhead of this architectural modification. Eliminating the pooling layer yields a marginal increase in floating-point operations (FLOPs), from 36.72 G to 36.91 G, an absolute increase of only 0.19 G (approximately 0.5%). In practice, the model maintains a rapid inference speed of 0.016 s for a 256 × 256 image. The pooling-free strategy therefore does not impose a significant computational burden. Furthermore, since standard pooling layers contain no learnable parameters, removing them preserves full-resolution features without altering the model size: both configurations have 16.75 M parameters. Collectively, Table 3 and Table 4 demonstrate that the proposed strategy achieves a tangible improvement in reconstruction quality at a negligible computational cost.
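The overhead figures quoted above follow directly from Table 4 and the reported inference time. The short check below simply reproduces that arithmetic; the variable names are ours, and the throughput value is a derived quantity rather than a separately measured one.

```python
# FLOPs from Table 4, in units of G (10^9 operations).
flops_with_pooling = 36.72
flops_without_pooling = 36.91

absolute_increase = flops_without_pooling - flops_with_pooling
relative_increase_pct = 100.0 * absolute_increase / flops_with_pooling

# Reported inference time for one 256 x 256 image, in seconds.
inference_time_s = 0.016
images_per_second = 1.0 / inference_time_s

print(f"absolute increase: {absolute_increase:.2f} G")
print(f"relative increase: {relative_increase_pct:.2f} %")
print(f"throughput: {images_per_second:.1f} images/s")
```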

3.7. Robustness Evaluation Under Extreme Low-Dose Noise Conditions

In practical radiography, image quality is compromised not only by limited resolution but also, significantly, by insufficient radiation dose. Particularly in scanning scenarios strictly adhering to the “ALARA” principle, the paucity of X-ray photons reaching the detector results in high-intensity Poisson noise. This signal-dependent noise differs fundamentally from the Gaussian blur used in standard benchmarks; it is highly correlated with tissue density and tends to obscure pathological details in low-contrast regions. To verify the model’s robustness against such extreme physical degradation, this section establishes a high-noise test environment simulating low-dose imaging. Specifically, we use a numerical approximation in which signal-dependent Poisson noise with a dose scale of α = 30.0 is injected to simulate quantum shot noise, superimposed with additive Gaussian noise (σ_n = 10.0) representing electronic thermal noise. Considering that mainstream super-resolution algorithms (e.g., SRGAN, SwinIR) often suffer from severe domain shift when encountering non-Gaussian quantum noise, we fine-tuned the model using data incorporating this specific mixed noise model. This experiment aims to determine whether the model, after learning specific photon noise priors, can effectively balance noise suppression with texture preservation during super-resolution reconstruction.
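The text above gives only the two parameters of the mixed noise model, so the sketch below is one plausible reading of it rather than the exact simulation code. It assumes that intensities lie on a 0 to 255 scale, that the dose scale is the expected photon count at full intensity, and that the Gaussian standard deviation is expressed on the same 0 to 255 scale. The function names (`sample_poisson`, `add_low_dose_noise`) and the flat test regions are hypothetical.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def add_low_dose_noise(pixels, alpha=30.0, sigma_n=10.0, seed=0):
    """Apply mixed Poisson-Gaussian noise to intensities in [0, 255]."""
    rng = random.Random(seed)
    noisy = []
    for value in pixels:
        # Assumption: alpha is the expected photon count at full intensity (255).
        expected_photons = alpha * value / 255.0
        photons = sample_poisson(expected_photons, rng)  # quantum shot noise
        shot = photons * 255.0 / alpha                   # back to the 0-255 scale
        readout = rng.gauss(0.0, sigma_n)                # additive electronic noise
        noisy.append(min(255.0, max(0.0, shot + readout)))
    return noisy

# Hypothetical flat regions: a darker and a brighter tissue area.
clean = [64.0] * 2000 + [128.0] * 2000
noisy = add_low_dose_noise(clean)

dark, bright = noisy[:2000], noisy[2000:]
var_dark = statistics.pvariance(dark)
var_bright = statistics.pvariance(bright)
print(f"dark region:   mean {statistics.mean(dark):.1f}, variance {var_dark:.0f}")
print(f"bright region: mean {statistics.mean(bright):.1f}, variance {var_bright:.0f}")
```

Under these assumptions the noise variance grows with the local intensity, which is the signal-dependent behavior that distinguishes shot noise from a purely additive Gaussian term.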

As illustrated in Figure 7, we compared the reconstruction performance of the model under two distinct degradation mechanisms. In the first row (Low-Dose Degradation Environment), the input image simulates imaging under extremely low photon flux, with strong mixed Poisson and Gaussian noise superimposed on the blurred image. This degradation manifests as diffuse granular speckles across the entire field of view, severely disrupting the continuity of lung markings and rendering high-frequency fine structures, such as trabeculae, indistinguishable. In contrast, the CSAEGAN reconstruction (top right) demonstrates superior visual quality. The model successfully identifies and suppresses granular noise mixed with anatomical structures, resulting in a cleaner background (e.g., soft tissue shadows) without introducing noticeable artifacts or ringing effects. Crucially, a comparison with the results in the second row (Standard Benchmark Environment) reveals that even under high-intensity quantum noise interference, the model avoids the common “over-smoothing” phenomenon. The edges of vascular branches within the lung field remain sharp, and skeletal textures originally submerged in noise are clearly restored. The structural fidelity is highly consistent with the reconstruction results observed in the noise-free environment.

These experimental results compellingly demonstrate the robust capabilities of our proposed pooling-free CSA architecture. It proves effective not only in addressing conventional resolution degradation but also in handling extreme low-dose scenarios characterized by “blur + high quantum noise.” The model exhibits excellent noise resilience and detail reconstruction capabilities, providing strong technical validation for future image enhancement tasks in real-world low-dose clinical imaging.

3.8. The Improvement of Super-Resolution for Downstream Diagnostic Classification Tasks

First, a high-performance deep learning classification model (DenseNet-121), pre-trained on the large-scale public chest X-ray dataset Chest X-ray14, was employed as a proxy for an automated diagnostic tool. Having been trained to identify multiple pulmonary diseases, this classifier serves as a stand-in for a clinical expert system. The CheXNet model, which uses this architecture, was reported to exceed average radiologist performance on the F1 metric for pneumonia detection [39].

To quantitatively validate the diagnostic value, we evaluated the classification performance on the reserved Chest X-ray14 test set, comprising 22,433 images. The Low-Resolution (LR) inputs were generated using Gaussian blur (σ = 1.0) and 4× bicubic downsampling, consistent with our degradation model. Table 5 presents the Area Under the Curve (AUC) scores, a standard metric reflecting the classifier’s discriminative ability, for 14 pulmonary diseases. The results show that the SR images generated by CSAEGAN achieved an average AUC of 0.7970, significantly outperforming the LR baseline (0.7454) and narrowing the gap with the HR ground truth (0.8404). Notably, substantial improvements were observed in pathologies relying on fine structural details, such as Pneumothorax (+0.1581) and Fibrosis (+0.0983), confirming the model’s ability to recover diagnostically relevant features.
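The summary figures in this paragraph can be re-derived from Table 5. The sketch below copies the AUC values from the table, recomputes the three averages and the per-disease gain of SR over LR, and additionally reports the share of the LR-to-HR gap in mean AUC that SR recovers (roughly 54%), a derived quantity that is not stated in the table itself. Variable names are ours.

```python
# AUC values copied from Table 5: disease -> (LR, SR, HR).
auc = {
    "Atelectasis": (0.7490, 0.7849, 0.8293),
    "Cardiomegaly": (0.8850, 0.9014, 0.9142),
    "Effusion": (0.8497, 0.8698, 0.8862),
    "Infiltration": (0.6310, 0.6795, 0.7119),
    "Mass": (0.7766, 0.7996, 0.8548),
    "Nodule": (0.6722, 0.6820, 0.7840),
    "Pneumonia": (0.6619, 0.7164, 0.7739),
    "Pneumothorax": (0.6149, 0.7730, 0.8684),
    "Consolidation": (0.7477, 0.7735, 0.8130),
    "Edema": (0.8243, 0.8657, 0.8920),
    "Emphysema": (0.7760, 0.8745, 0.9199),
    "Fibrosis": (0.6739, 0.7722, 0.8242),
    "Pleural Thickening": (0.6899, 0.7527, 0.7824),
    "Hernia": (0.8839, 0.9135, 0.9110),
}

n = len(auc)
mean_lr = sum(lr for lr, _, _ in auc.values()) / n
mean_sr = sum(sr for _, sr, _ in auc.values()) / n
mean_hr = sum(hr for _, _, hr in auc.values()) / n

# Gain of SR over LR for each disease, largest first.
gains = sorted(((sr - lr, name) for name, (lr, sr, _) in auc.items()), reverse=True)

# Share of the LR-to-HR gap in mean AUC that SR recovers.
gap_recovered = (mean_sr - mean_lr) / (mean_hr - mean_lr)

print(f"mean AUC  LR {mean_lr:.4f}  SR {mean_sr:.4f}  HR {mean_hr:.4f}")
print(f"gap recovered: {100 * gap_recovered:.1f} %")
for gain, name in gains[:3]:
    print(f"{name}: +{gain:.4f}")
```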

Second, to interpret these quantitative gains, this study selected Gradient-weighted Class Activation Mapping (Grad-CAM) as the core analytical tool [40]. By analyzing the gradient flow within the model during a specific prediction, Grad-CAM generates a “saliency map.” This map, presented as a heatmap, visually highlights the regions in the input image that contribute most significantly to the classification decision, thereby providing a window into the model’s “reasoning” process and revealing its decision-making basis.
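The computation behind a Grad-CAM map [40] can be sketched on toy data. The example below is not our evaluation pipeline: in practice the activation maps and gradients would be taken from a convolutional layer of the classifier, and which layer is used is not specified here. The function name `grad_cam` and the 2 × 2 inputs are hypothetical. Each channel is weighted by the spatial average of its gradient, the weighted maps are summed, and a ReLU keeps only locations that support the predicted class.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K activation maps and the class-score gradients w.r.t. them.

    Both arguments are lists of K maps, each an H x W list of rows.
    """
    # Channel weights: spatial average of each gradient map.
    weights = []
    for grad_map in gradients:
        values = [g for row in grad_map for g in row]
        weights.append(sum(values) / len(values))

    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, act_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act_map[i][j]

    # ReLU: keep only locations that raise the class score.
    return [[max(0.0, v) for v in row] for row in cam]

# Hypothetical toy inputs: two channels on a 2 x 2 grid.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel 0 supports the class (weight 0.5)
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 opposes it (weight -1.0)
]

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

Only the two positions where the supporting channel is active remain non-zero; the resulting map is typically upsampled to the input size and overlaid as a heatmap.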

To examine visually whether super-resolution restores the classifier’s ability to localize pathology, representative cases of pulmonary diseases were selected. The low-resolution (LR), super-resolved (SR), and high-resolution (HR) versions of these images were independently fed into the pre-trained classification model to generate corresponding Grad-CAM saliency maps, as depicted in Figure 8. When the classifier processed the LR images, the resulting attention maps were commonly diffuse, scattered, and poorly localized. The activated regions in the heatmaps were broad and lacked a clear focus, often extending over non-pathological lung parenchyma or beyond the lung fields. This indicates that, because high-frequency information is lost in LR images, the classifier was unable to pinpoint definitive pathological indicators, leading to an ambiguous and unreliable decision-making process.

In contrast, the Grad-CAM heatmaps corresponding to the SR images generated by the proposed CSAEGAN model demonstrated a qualitative leap in performance. The activated regions in these heatmaps became sharp, intense, and highly focused. Most importantly, these highlighted areas showed a high degree of spatial correspondence with the visible abnormal lesions in the chest radiographs. This improvement confirms that the SR process successfully restored the critical high-frequency details necessary for the classifier to make high-confidence decisions, while maintaining strong consistency with the HR images. The classifier was able to precisely “see” and localize the specific features driving its diagnosis, such as the texture of pulmonary infiltrates or the margins of nodules. This result directly validates the effectiveness of our proposed model architecture in preserving and enhancing the fine structures that are crucial for diagnosis.

4. Conclusions

To address the clinical demand for enhancing the spatial resolution of low-dose chest X-ray (CXR) images, this study proposes a generative adversarial super-resolution model with CSA hybrid attention. The approach incorporates residual dense blocks and the CSA hybrid attention module into the backbone of the generator, while eliminating pooling operations in the channel attention to effectively preserve high-frequency and local structural details in medical images. In 4× super-resolution reconstruction tasks on public datasets, the proposed method achieved the best or near-best objective evaluation metrics across all three test sets. Notably, on the independent external dataset (comprising both ‘Normal’ and ‘Opacity’ subsets), our method maintains its lead, demonstrating strong robustness against domain shifts common in clinical deployment. Qualitative analysis further demonstrates superior edge definition and texture restoration for seven typical types of pulmonary lesions (such as nodules, cavities, and pleural effusion). In summary, this method enables the generation of higher-quality chest X-ray images without increasing radiation dose, thereby providing a solid imaging foundation for early screening and follow-up of pulmonary diseases.

Despite these promising results, this study has limitations. First, although we simulated low-dose noise using Poisson distributions, the training data relies on synthetic degradation from high-quality images, whereas real-world clinical degradation may involve more complex scattering and sensor artifacts. Second, while we relied on quantitative metrics (PSNR, SSIM) and perceptual proxies (LPIPS, downstream classification) to evaluate image quality, a large-scale subjective study involving radiologists was not conducted. However, the robust performance on independent external cohorts and the significant improvements observed in downstream diagnostic classification provide strong indirect evidence of the method’s clinical usefulness and generalizability. Future work will focus on validating the method with raw clinical data and expert observers.

Author Contributions

Conceptualization, W.L.; methodology, W.L.; software, Y.Y. (Yafeng Yao); validation, Y.Y. (Yafeng Yao); formal analysis, D.G.; investigation, D.G.; resources, Y.Y. (Yafeng Yao); data curation, W.L.; writing—original draft preparation, W.L.; writing—review and editing, Y.Y. (Ying Yi); visualization, D.G.; supervision, Y.Y. (Ying Yi); project administration, Y.Y. (Ying Yi). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Umirzakova, S.; Ahmad, S.; Khan, L.U.; Whangbo, T. Medical image super-resolution for smart healthcare applications: A comprehensive survey. Inf. Fusion 2024, 103, 102075.
2. Chi, J.; Sun, Z.; Meng, L.; Wang, S.; Yu, X.; Wei, X.; Yang, B. Low-Dose CT Image Super-Resolution With Noise Suppression Based on Prior Degradation Estimator and Self-Guidance Mechanism. IEEE Trans. Med. Imaging 2025, 44, 601–617.
3. He, J.; Ma, H.; Guo, M.; Wang, J.; Wang, Z.; Fan, G. Research into super-resolution in medical imaging from 2000 to 2023: Bibliometric analysis and visualization. Quant. Imaging Med. Surg. 2024, 14, 5109–5130.
4. Hou, H.; Andrews, H. Cubic splines for image interpolation and digital filtering. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 508–517.
5. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010, 19, 2861–2873.
6. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
7. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 1646–1654.
8. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: New York, NY, USA, 2017; pp. 1132–1140.
9. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In NIPS’14, Proceedings of the 27th International Conference on Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; pp. 2672–2680.
10. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 105–114.
11. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops; Springer: Cham, Switzerland, 2018; pp. 63–79.
12. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 1905–1914.
13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
14. Cui, J.; Miao, S.; Wang, J.; Chen, J.; Dong, C.; Hao, D.; Li, J. The super-resolution reconstruction in diffusion-weighted imaging of preoperative rectal MR using generative adversarial network (GAN): Image quality and T-stage assessment. Clin. Radiol. 2024, 79, e1530–e1538.
15. Jiang, Q.; Sun, H.; Deng, W.; Chen, L.; Li, Q.; Xie, J.; Pan, X.; Cheng, Y.; Chen, X.; Wang, Y.; et al. Super resolution of pulmonary nodules target reconstruction using a Two-Channel GAN models. Acad. Radiol. 2024, 31, 3427–3437.
16. Zhao, M.; Meng, N.; Cheung, J.P.Y.; Zhang, T. PCT-GAN: A Real CT Image Super-Resolution Model for Trabecular Bone Restoration. In Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging, Cartagena, Colombia, 18–21 April 2023; pp. 1–5.
17. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2018; pp. 286–301.
18. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 1833–1844.
19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 10012–10022.
20. Zhang, D.; Huang, F.; Liu, S.; Wang, X.; Jin, Z. SwinFIR: Revisiting the SwinIR with fast Fourier convolution and improved training for image super-resolution. arXiv 2022, arXiv:2208.11247.
21. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 22367–22377.
22. Ma, Z.; Liu, Z.; Wang, K.; Lian, S. Hybrid attention transformer with re-parameterized large kernel convolution for image super-resolution. Image Vis. Comput. 2024, 149, 105162.
23. Wang, Z.; Chen, H.; Qian, Z.; Zhou, Y.; Zhang, H.; Zhao, D.; Wei, B.; Xu, Y. Region Attention Transformer for Medical Image Restoration. In Proceedings of the Medical Image Computing and Computer Assisted Intervention; Springer: Cham, Switzerland, 2024; pp. 603–613.
24. Wang, X.; Dong, C.; Shan, Y. RepSR: Training efficient VGG-style super-resolution networks with structural re-parameterization and batch normalization. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 2556–2564.
25. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 2261–2269.
26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 7132–7141.
27. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In NIPS’15, Proceedings of the 29th International Conference on Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 2017–2025.
28. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. In Proceedings of the International Conference on Learning Representations; ICLR: San Juan, PR, USA, 2016.
29. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 763–772.
30. Nalmpantis, C.; Lentzas, A.; Vrakas, D. A theoretical analysis of pooling operation using information theory. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 1729–1733.
31. Kingma, D.P.; Dhariwal, P. Glow: Generative flow with invertible 1 × 1 convolutions. In NIPS’18, Proceedings of the 32nd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 10236–10245.
32. Jolicoeur-Martineau, A. The relativistic discriminator: A key element missing from standard GAN. arXiv 2018, arXiv:1807.00734.
33. Chowdhury, M.E.H.; Rahman, T.; Khandakar, A.; Mazhar, R.; Kadir, M.A.; Mahbub, Z.B.; Islam, K.R.; Khan, M.S.; Iqbal, A.; Al Emadi, N.; et al. Can AI Help in Screening Viral and COVID-19 Pneumonia? IEEE Access 2020, 8, 132665–132676.
34. Haghanifar, A.; Majdabadi, M.M.; Choi, Y.; Deivalakshmi, S.; Ko, S. COVID-CXNet: Detecting COVID-19 in frontal chest X-ray images using deep learning. Multimed. Tools Appl. 2022, 81, 30615–30645.
35. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition; IEEE: New York, NY, USA, 2010; pp. 2366–2369.
36. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
37. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
38. Gu, Y.; Zeng, Z.; Chen, H.; Wei, J.; Zhang, Y.; Chen, B.; Li, Y.; Qin, Y.; Xie, Q.; Jiang, Z.; et al. MedSRGAN: Medical images super-resolution using generative adversarial networks. Multimed. Tools Appl. 2020, 79, 21815–21840.
39. Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv 2017, arXiv:1711.05225.
40. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.

Figure 1. Overview of the proposed network architecture. (a) Architecture of the proposed network generator; (b) Structure of the Dense Block; (c) Network architecture of the CSA Block. Note that the faded blocks (MaxPool and AvgPool) indicate standard operations that are removed in our design to preserve spatial details.

Figure 2. Visualization of the training process. (a) Generator loss components (L_1, L_percep, L_adv), with the inset zooming in on the L_1 convergence. (b) Discriminator scores for real images D(x_r) and generated images D(G(x)), indicating the adversarial learning process.

Figure 3. Super-resolution reconstruction results for seven typical pulmonary diseases. From top to bottom: low-resolution input image (LR), partial enlargement of a low-resolution image (LR’), partial enlargement of a super-resolution image (SR’), super-resolution output generated by the proposed method (SR), and the corresponding high-resolution reference image (HR). The red frames highlight specific areas with rich details.

Figure 4. Visual comparison with other models.

Figure 5. Comparison before and after using the discriminator. Models trained adversarially with the discriminator are marked with *; generator-only models are unmarked.

Figure 6. PSNR (dB) comparison curves for ablation studies over 400,000 iterations. (a) Performance comparison between the model with and without the integrated CSA Block. (b) Performance comparison showing the effect of removing pooling layers from the channel attention module.

Figure 7. Robustness evaluation against low-dose noise (top) and Gaussian blur (bottom). The SR results (right) show a clear improvement over the degraded LR inputs (left) when judged against the HR reference (center). The red rectangles mark the areas selected for magnification.

Figure 8. Qualitative comparison of classifier attention using Grad-CAM. For each case, we present the low-resolution (LR) input, the super-resolved (SR) output generated by our proposed CSAEGAN model, and the high-resolution (HR) ground truth. The heatmaps overlaid on the images indicate the regions of highest importance for the classifier’s prediction.

Table 1. Quantitative comparison with other models.

Model	Normal (X4)			Opacity (X4)			Datatest (X4)
Model	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
SRCNN	33.45	0.8172	0.2785	33.08	0.8211	0.3501	34.83	0.8370	0.1742
SRGAN	35.95	0.9055	0.1813	35.62	0.9096	0.2666	37.80	0.9086	0.1032
EDSR	35.89	0.9033	0.1791	35.48	0.9056	0.2654	37.79	0.9078	0.1017
MedSRGAN	35.65	0.9092	0.1819	35.88	0.9058	0.2610	37.81	0.9091	0.1045
ESRGAN	36.10	0.9084	0.1771	35.74	0.9120	0.2602	37.89	0.9104	0.1039
Real-ESRGAN	33.90	0.8590	0.2554	33.73	0.8606	0.3476	35.10	0.8608	0.1843
SwinIR	36.04	0.9073	0.1774	35.67	0.9110	0.2638	37.86	0.9097	0.1028
CRAFT	35.92	0.9048	0.1827	35.58	0.9089	0.2689	37.77	0.9084	0.1038
Proposed	36.11	0.9087	0.1766	35.74	0.9121	0.2594	37.91	0.9108	0.1035

Table 2. Comparison of different generative adversarial network models, where an asterisk (*) indicates models trained with a discriminator (adversarial training).

Model	Normal (X4)			Opacity (X4)			Datatest (X4)
Model	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
SRGAN *	34.0256	0.8550	0.2185	33.9428	0.8625	0.3097	35.4160	0.8544	0.0945
MedSRGAN *	34.0845	0.8647	0.1878	34.0165	0.8669	0.2843	35.3435	0.8637	0.1336
Real-ESRGAN *	32.1557	0.7961	0.2439	32.0890	0.8108	0.3358	33.0966	0.8018	0.1696
ESRGAN *	33.8724	0.8521	0.1810	34.1876	0.8725	0.2783	34.9612	0.8410	0.0945
Proposed *	34.1476	0.8548	0.1791	34.3327	0.8733	0.2810	35.4274	0.8463	0.0915

Table 3. Data comparison before and after removing the pooling layer.

Model	Datatest (X4)
Model	PSNR↑	SSIM↑	LPIPS↓
Proposed w/pooling layer	35.2200	0.8453	0.0923
Proposed w/o pooling layer	35.4274	0.8463	0.0915

Table 4. FLOPs comparison after removing the pooling layer.

Model	FLOPs
Proposed w/pooling layer	36.72 G
Proposed w/o pooling layer	36.91 G

Table 5. Disease classification AUC comparison across resolution modalities.

Disease Type	LR Data	SR Data	HR Data
Atelectasis	0.7490	0.7849	0.8293
Cardiomegaly	0.8850	0.9014	0.9142
Effusion	0.8497	0.8698	0.8862
Infiltration	0.6310	0.6795	0.7119
Mass	0.7766	0.7996	0.8548
Nodule	0.6722	0.6820	0.7840
Pneumonia	0.6619	0.7164	0.7739
Pneumothorax	0.6149	0.7730	0.8684
Consolidation	0.7477	0.7735	0.8130
Edema	0.8243	0.8657	0.8920
Emphysema	0.7760	0.8745	0.9199
Fibrosis	0.6739	0.7722	0.8242
Pleural Thickening	0.6899	0.7527	0.7824
Hernia	0.8839	0.9135	0.9110
Average AUC	0.7454	0.7970	0.8404

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, W.; Yao, Y.; Gao, D.; Yi, Y. GAN-Based Low-Dose Chest X-Ray Super-Resolution with Hybrid Channel-Spatial Attention and Pooling Layer Removal. Appl. Sci. 2026, 16, 1797. https://doi.org/10.3390/app16041797
