1. Introduction
Chest radiography is a fundamental, non-invasive imaging technique that reveals early-stage diseases via subtle structural alterations, such as small nodules, disordered pulmonary textures, minute cavities, pleural thickening, or spiculation. However, low-resolution images are prone to significant detail loss and blurred boundaries, which can easily lead to missed or incorrect diagnoses of early lesions, limiting clinicians’ ability to accurately detect elusive and early-stage abnormalities during routine examinations [1]. In contrast, high-resolution images can present clearer spatial distributions, edge morphology, and internal structural characteristics of lesions in clinical practice, thereby significantly improving the detection rate of small foci, blurred margins, or complex overlapping regions that are otherwise indistinguishable. Nevertheless, routinely acquiring high-resolution chest X-rays requires an increased X-ray dosage, which poses additional cumulative health risks to patients, creating a challenging trade-off between image quality and safety. Super-resolution algorithms enhance images collected at low radiation doses [2], providing high-quality diagnostic images without raising radiation risks associated with hardware-based improvements. This is particularly valuable for patients requiring frequent follow-up to monitor disease progression and for radiation-sensitive populations such as children and pregnant women.
In recent years, methods in the super-resolution reconstruction field have constantly evolved, with deep learning techniques becoming increasingly pivotal in smart healthcare applications [3]. Traditional approaches such as bicubic interpolation [4] and sparse representation [5] are simple to implement but have limited capability to reconstruct complex structures and fine textures, often resulting in blocking artifacts and failing to provide actual additional detail. With the advent of deep learning, SRCNN [6] pioneered the use of convolutional neural networks for end-to-end image super-resolution, achieving notable performance gains; however, its simplistic feed-forward structure restricted its expressiveness. The introduction of residual networks, exemplified by VDSR [7] and EDSR [8], improved performance substantially by enabling deeper network architectures with residual connections, thereby stabilizing the training of deeper networks. Despite these advances, these methods often produce overly smoothed results that lack realistic perceptual quality.
Super-resolution methods based on generative adversarial networks (GAN) [9] employ adversarial training to encourage the generation of more realistic and detail-rich high-resolution images. SRGAN [10] was the first to apply GANs to super-resolution, while ESRGAN [11] further optimized the architecture with dense residual blocks and perceptual loss, improving texture detail restoration. Real-ESRGAN [12] introduced complex degradation modeling and a U-Net [13] discriminator to further enhance generalization and robustness. More recently, GAN-based frameworks have been specifically adapted for various medical modalities to meet precise clinical diagnostic demands. For instance, Cui et al. [14] demonstrated the value of GANs in diffusion-weighted MRI for accurate rectal cancer staging, while Jiang et al. [15] proposed a two-channel GAN specifically for the target reconstruction of pulmonary nodules. Similarly, PCT-GAN [16] was designed to restore fine trabecular bone microstructures from real CT images to handle unique noise characteristics. Nevertheless, medical images demand extremely high structural fidelity, and even minor generator errors may result in misdiagnosis. Although these methods have greatly improved perceptual quality and detail presentation, limitations remain in the recovery of high-frequency features and authentic anatomical structures.
Attention mechanisms have recently shown promising results in super-resolution. Increasing evidence demonstrates their ability to enhance networks’ focus on key features. RCAN [17] uses channel attention to improve the expression of, and focus on, high-frequency features. Benefiting from Swin Transformer [18], SwinIR [19] leverages efficient shifted-window self-attention to capture both local details and global structures. In SwinFIR [20], the authors integrate fast Fourier convolution and residual modules to extend SwinIR, gaining a broader receptive field. HAT [21] combines channel and spatial self-attention, while LKHAT [22] incorporates reparameterization to further enhance local feature extraction, striking a balance between accuracy and efficiency. More recently, region-based attention mechanisms have been proposed to adaptively focus on high-difficulty restoration areas in medical images, reducing interference from irrelevant background noise [23].
Additionally, studies such as EDSR [8] and RepSR [24] indicate that Batch Normalization layers are suboptimal for super-resolution tasks. Batch Normalization standardizes feature statistics, which, while beneficial for many tasks, tends to introduce displeasing artifacts in super-resolution as it changes the distributional properties of features. Pooling layers, which aggregate features within local neighborhoods (e.g., via max or average pooling), similarly affect data distribution. While pooling can simplify features and enhance computational efficiency, it inevitably loses local detail and high-frequency features during downsampling, a significant drawback for medical image reconstruction, where fine structure preservation is critical.
To address these challenges, this paper presents a CSAEGAN-based chest X-ray super-resolution model. Built upon the GAN framework, the model aims to solve the super-resolution reconstruction problem under simulated low-dose imaging conditions through adversarial training. A novel CSA hybrid attention module is incorporated to enable the network to accurately capture critical pathological features and subtle structures in chest X-ray images. We further remove pooling layers from the channel attention modules to effectively preserve spatial details and high-frequency information, thereby enhancing the reconstruction quality of clinically significant features such as small pulmonary nodules, fine textures, and edge structures. Furthermore, we validate the model’s robust generalization ability beyond the training distribution using independent external datasets.
3. Experimental Results
3.1. Dataset and Experimental Settings
The chest X-ray (CXR) image dataset used in this study primarily originates from the public dataset published by M. E. H. Chowdhury et al. [33], comprising a total of 3886 images. This dataset covers various clinical scenarios, including 1200 COVID-19 positive images, 1341 normal images, and 1345 viral pneumonia images, providing a rich and diverse foundation for super-resolution reconstruction of chest diseases. To better simulate common image degradation scenarios in clinical practice, we generate paired low-resolution and high-resolution data from original high-resolution X-rays by applying a Gaussian blur in conjunction with bicubic interpolation. Specifically, we apply a Gaussian blur with a standard deviation of σ = 1.0 to simulate the inevitable optical blur (e.g., point spread function) during image acquisition, followed by 4× bicubic downsampling to mimic the limited spatial resolution of detectors. This numerical preprocessing facilitates improved adaptation of the super-resolution model to simulated degradation types.
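The degradation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' preprocessing code: σ = 1.0 and the 4× factor follow the values stated for the degradation model in Section 3.8, while the 3σ kernel truncation, the reflect padding, the Keys cubic kernel with a = −0.5, and all function names are assumptions made for the example. No extra anti-aliasing is applied during resampling, since the Gaussian blur already acts as a pre-filter.

```python
import numpy as np

def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma (assumed cutoff)."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with reflect padding at the borders."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img.astype(np.float64), r, mode="reflect")
    # Filter along rows, then along columns; "valid" strips the padding again.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def cubic_weight(t: np.ndarray, a: float = -0.5) -> np.ndarray:
    """Keys cubic convolution kernel, the usual choice for 'bicubic' resampling."""
    t = np.abs(t)
    w = np.zeros_like(t)
    near = t <= 1
    far = (t > 1) & (t < 2)
    w[near] = (a + 2) * t[near] ** 3 - (a + 3) * t[near] ** 2 + 1
    w[far] = a * t[far] ** 3 - 5 * a * t[far] ** 2 + 8 * a * t[far] - 4 * a
    return w

def resize_matrix(n_in: int, scale: int) -> np.ndarray:
    """Matrix of shape (n_in // scale, n_in) that resamples a 1-D signal."""
    n_out = n_in // scale
    # Centers of the output pixels expressed in input pixel coordinates.
    centers = (np.arange(n_out) + 0.5) * scale - 0.5
    base = np.floor(centers).astype(int)
    m = np.zeros((n_out, n_in))
    for tap in range(-1, 3):  # four taps per output sample
        idx = base + tap
        w = cubic_weight(centers - idx)
        np.add.at(m, (np.arange(n_out), np.clip(idx, 0, n_in - 1)), w)
    return m / m.sum(axis=1, keepdims=True)

def degrade(hr: np.ndarray, sigma: float = 1.0, scale: int = 4) -> np.ndarray:
    """Blur a 2-D HR image, then downsample it bicubically by `scale`."""
    blurred = gaussian_blur(hr, sigma)
    rows = resize_matrix(hr.shape[0], scale)
    cols = resize_matrix(hr.shape[1], scale)
    return rows @ blurred @ cols.T

hr = np.random.default_rng(0).uniform(0.0, 255.0, size=(64, 48))
lr = degrade(hr)
print(lr.shape)  # (16, 12)
```

Each HR image and its `degrade` output would then form one training pair; in practice a library resampler could replace the hand-written kernel.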
For data partitioning, 256 images are randomly selected as a validation set for model tuning and hyperparameter selection, while 30 images are used as an independent test set for initial performance evaluation. To further validate the model’s generalization and clinical applicability, two external test sets [34] are introduced: the Normal test set includes 234 normal chest X-rays, and an additional set contains 390 images exhibiting varying degrees of pulmonary opacity (due to viral and bacterial infections). These external test sets are used solely for final performance evaluation to ensure result objectivity and generalizability. This diversified testing strategy allows for a comprehensive assessment of the proposed method’s performance across different clinical scenarios, particularly in terms of preserving diagnostically critical features.
Three metrics (LPIPS, PSNR, and SSIM) are used to comprehensively evaluate the generated images. PSNR measures the similarity between the original and reconstructed images [35] and is calculated as follows:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$

where MSE denotes mean squared error and MAX is the maximum possible pixel value. PSNR quantitatively evaluates image reconstruction quality; higher values indicate greater similarity between the reconstructed and original images. SSIM likewise measures image similarity, considering luminance, contrast, and structural information [36]. The simplified formula is as follows:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

Here, $x$ and $y$ denote sliding window data from two images; $\mu_x$ and $\mu_y$ are their means, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$, $C_2$ are constants to stabilize the calculation and prevent division by zero. LPIPS agrees with human perceptual similarity judgments more closely than PSNR and SSIM [37], and measures perceptual differences between two images using deep learning models. The calculation is as follows:

$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\|_2^2$$

where $x$ and $x_0$ are the two images being compared, $H_l$ and $W_l$ are the height and width of the $l$-th layer feature map, $\hat{y}^{l}_{hw}$ and $\hat{y}^{l}_{0hw}$ are the feature values at spatial location ($h$, $w$) in layer $l$, and $w_l$ is a learnable weighting parameter that adjusts the importance of different feature layers.
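As a concrete reading of the PSNR and SSIM definitions above, the following sketch evaluates both on NumPy arrays. It is illustrative only: `ssim_window` treats the whole array as a single window instead of averaging over sliding windows, and the constants $C_1 = (0.01 \cdot \mathrm{MAX})^2$ and $C_2 = (0.03 \cdot \mathrm{MAX})^2$ are the conventional defaults, which the text does not specify. LPIPS is omitted because it requires a pretrained feature network.

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_window(x: np.ndarray, y: np.ndarray, max_val: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """SSIM of one window; k1 and k2 are assumed conventional defaults."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

ref = np.full((8, 8), 100.0)
shifted = ref + 10.0            # constant error of 10 gray levels, so MSE = 100
print(round(psnr(ref, shifted), 2))

patch = np.random.default_rng(0).uniform(0.0, 255.0, size=(8, 8))
print(ssim_window(patch, patch))  # identical windows give an SSIM of 1
```

A full SSIM score would average `ssim_window` over local (typically Gaussian-weighted) windows across the image.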
Experimental environment: The Adam optimizer was used for loss function optimization, with a learning rate of . The num_workers parameter was set to 3, the batch size was 6, and the model was trained for 400,000 iterations. The super-resolution scaling factor was set to 4×, and all training was conducted on an NVIDIA RTX 4080 GPU.
3.2. Training Results
The training characteristics of the model are illustrated in Figure 2a, which shows the iterative changes in three types of losses for the generator during training. The pixel loss remains consistently low and stable throughout the process, reflecting the model’s sustained ability to reconstruct basic anatomical structures. The perceptual loss is relatively higher and shows some fluctuations, which mainly arise from continuous optimization of high-level texture and semantic features. The adversarial loss remains within a lower range but exhibits occasional sharp fluctuations, indicative of the dynamic competition between the generator and discriminator.
Figure 2b presents the discriminator’s output scores for real samples and generated samples. At the early stage of training, both exhibit significant fluctuations, indicating that the generator and discriminator are undergoing rapid adjustment and continuous confrontation. As training progresses, both scores show a stable upward trend and a gradual convergence, reflecting that the discriminator is achieving effective real-versus-fake discrimination, while the generator is constantly enhancing its ability to “fool” the discriminator. Towards the end of training, the output gap between the two further narrows, suggesting that a new equilibrium has been reached, characteristic of a well-trained GAN, which facilitates the generation of high-quality and indistinguishable super-resolved images. Overall, these training curves demonstrate the model’s good convergence and dynamic stability. The collaborative optimization of multiple loss components ensures a balance of structural accuracy, perceptual quality, and visual effect, laying the groundwork for subsequent high-quality chest X-ray reconstruction.
3.3. Analysis of Typical Case Reconstruction
To comprehensively assess the proposed method’s applicability and generalization to different pulmonary disease scenarios, we selected chest X-ray images covering seven typical lesion categories for super-resolution reconstruction assessment.
Figure 3 presents super-resolution results for various clinical cases, including normal lungs, tuberculosis, lung abscess, emphysema, pleural effusion, primary syndrome, and lung cancer. For each case, the images are arranged from top to bottom as follows: low-resolution input, super-resolved output produced by the proposed method, and the corresponding high-resolution reference image.
Experimental results show that for all lesion types, the super-resolved images produced by our method, compared with the low-resolution input images, exhibit much crisper structural edges and significantly enhanced detail in local lesions, lung textures, airways, and nodules, greatly alleviating the blurring and detail loss common in low-resolution images. This is more evident in the locally magnified comparison regions. In cases such as tuberculosis, emphysema, or tumors characterized by heterogeneous shadows or blurred boundaries, the super-resolution output sharpens lesion contours, improves edge delineation, and effectively preserves early features such as subtle foci, spiculation, and texture irregularities, thereby providing stronger imaging support for the detection of complex or early-stage pathologies. In the majority of cases, the super-resolved output closely matches the true high-resolution references, showing excellent restoration of lung field structure, vascular pathways, and thoracic outlines with natural image texture. The proposed method robustly enhances both increased density areas (e.g., pleural effusion, masses) and decreased density areas (e.g., emphysema, cavities) without introducing noticeable artifacts, demonstrating good generalization and robustness to diverse image types.
These results clearly demonstrate that our adversarial training and channel–spatial attention-based super-resolution approach can reliably restore structural and textural details under various chest disease scenarios, providing a solid imaging foundation for subsequent clinical screening.
3.4. Quantitative and Qualitative Comparison with Other Models
To comprehensively evaluate the proposed method’s performance, we systematically compare it with current mainstream super-resolution approaches, including deep learning approaches (SRCNN, EDSR, SwinIR, CRAFT) and various GAN-based architectures (SRGAN, ESRGAN, MedSRGAN [38], Real-ESRGAN). The results are summarized in Table 1, with GAN-based models evaluated without their discriminator modules. The best results are highlighted in bold.
The experimental outcomes indicate that our method achieves the best or near-best performance across all three test sets and evaluation metrics. On aggregate, it surpasses the compared methods, including both earlier and state-of-the-art deep learning models, which reflects advantages in both imaging accuracy and perceptual quality.
Figure 4 displays the reconstruction results of different methods on representative chest X-rays. Overall, except for early methods, all mainstream deep learning approaches and our proposed method are able to restore the chest X-ray structures well. Although the differences among advanced methods in terms of global structure, edge continuity, and detailed texture restoration are subtle, local magnifications (highlighted by red boxes) show that our method preserves finer, more natural details with less local blurring and fewer artifacts, exhibiting superior texture restoration and noise suppression abilities.
While deep learning-based super-resolution has reached a mature stage where numerical gains on standard benchmarks appear marginal, our method consistently outperforms strong competitors in PSNR, SSIM, and LPIPS, as shown in Table 1. Although the numerical margins are subtle, in the context of medical imaging, they translate to critical gains in detail fidelity and structural preservation. Even minor improvements in these metrics often correlate with the clearer delineation of tissue boundaries and the retention of subtle pathological textures, which are pivotal for enhancing diagnostic confidence and reducing the risk of misinterpretation. Thus, these quantitative advantages provide a more reliable foundation for clinical observation.
3.5. Comparative Analysis of Generative Adversarial Network Methods
Given that the perceived quality of medical image generation depends not only on structural information but also on perceptual realism and the richness of details, this section introduces a discriminator and employs adversarial training to further enhance the generative model. We compare the performance of SRGAN, ESRGAN, Real-ESRGAN, and our proposed method under a complete GAN architecture.
As shown in Table 2 (where the best results are highlighted in bold), after introducing the discriminator for adversarial training, all GAN-based methods, including ours, exhibit reduced performance on traditional quantitative metrics such as PSNR and SSIM. This decline is a common phenomenon in GAN models, as they inherently pursue higher perceptual quality and realism. Despite this more challenging adversarial training setup, our proposed method still outperforms other GAN-based methods on most or all evaluation metrics. Driven by the discriminator, the model outputs images with higher subjective realism and richer visual details: although there is some loss in certain objective indicators, subjective assessment shows images that are more natural and exhibit greater detail hierarchies and realism. As illustrated in Figure 5, our method demonstrates notably higher visual quality in terms of tissue structure, edge sharpness, and local noise suppression.
This phenomenon reveals the dual requirements of medical image super-resolution tasks: on one hand, traditional quantitative indices reflect overall structure and SNR, which are suitable for measuring the generator’s structural capacity; on the other hand, GANs aided by discriminators can significantly improve subjective quality, restoring details that are closer to the true data distribution. Our proposed method performs excellently in both modes, ensuring structural restoration while also enhancing perceptual quality, demonstrating strong applicability and practical value.
3.6. Ablation Study
To validate the actual impact of the proposed model components, we conducted comprehensive comparative experiments in this subsection. As illustrated in Figure 6a, we compared the PSNR trends during training for models with (Proposed w/CSA Block) and without (Proposed w/o CSA Block) the CSA Block on the validation set. The results show that, at every stage of training, both early and late, the inclusion of the CSA Block leads to higher PSNR, and the improvement remains steady throughout. Compared to the version without the module, adding the CSA Block significantly enhances the model’s super-resolution reconstruction ability; moreover, with ongoing iterations, this advantage in PSNR persists.
These findings confirm our previous claims that the channel–spatial attention mechanism enables the network to more precisely capture key pathological features and subtle structures present in chest X-ray images. Given the complex pulmonary textures, small nodules, and various overlapping anatomical structures in chest radiographs, the model needs to attend to both local details and global context. The CSA Block simultaneously optimizes channel weighting and spatial attention, thereby improving the network’s perception of clinically significant regions, such as pulmonary margins, cardiopulmonary interfaces, and occult lesion areas, and enabling higher-quality super-resolution reconstructions.
Table 3 presents a quantitative performance comparison on the Datatest (X4) set between models with and without pooling layers. The results indicate that removing the pooling layer from the channel attention module yields consistent improvements across all evaluation metrics: PSNR increases from 35.2200 to 35.4274, SSIM rises from 0.8453 to 0.8463, and LPIPS decreases from 0.0923 to 0.0915. Similarly, the training curves in Figure 6b demonstrate that the pooling-free model maintains a consistent performance lead throughout the training process. While pooling operations are ubiquitous in general computer vision tasks, our findings suggest that in medical image super-resolution, the downsampling inherent to pooling inevitably discards local details and high-frequency features. This is particularly detrimental to the reconstruction of critical clinical characteristics, such as micro-nodules and disordered lung textures. By eliminating the pooling layer, the model more effectively preserves spatial details, thereby preventing the loss of crucial diagnostic information, especially for pathological features that are already extremely faint in low-resolution images.
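To make the pooled versus pooling-free contrast concrete, the sketch below implements a generic squeeze-and-excitation style channel attention in NumPy with a switch for the global average pooling step. It is an assumption-laden illustration and not the authors' CSA block: the bottleneck layout, the weight shapes, and the name `channel_attention` are all hypothetical. With pooling, each channel receives a single gate shared by every pixel; without pooling, the same two 1×1 projections are evaluated at every spatial position, so the gate can vary across the image.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x: np.ndarray, w1: np.ndarray, w2: np.ndarray,
                      use_pooling: bool) -> np.ndarray:
    """Gate the channels of x, shape (C, H, W), through a two-layer bottleneck.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); both act as 1x1
    convolutions. This layout is a hypothetical stand-in for the CSA block.
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w)                     # one column per pixel
    if use_pooling:
        # Global average pooling: one descriptor per channel, spatial layout lost.
        feats = feats.mean(axis=1, keepdims=True)   # (C, 1)
    hidden = np.maximum(w1 @ feats, 0.0)            # ReLU bottleneck
    gates = sigmoid(w2 @ hidden)                    # (C, 1) or (C, H*W)
    if use_pooling:
        gates = gates.reshape(c, 1, 1)              # broadcast over all pixels
    else:
        gates = gates.reshape(c, h, w)              # one gate per pixel and channel
    return x * gates

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=(8, 6, 6))           # toy feature map
w1 = rng.uniform(0.1, 1.0, size=(2, 8))             # toy weights, reduction r = 4
w2 = rng.normal(size=(8, 2))

pooled = channel_attention(x, w1, w2, use_pooling=True)
free = channel_attention(x, w1, w2, use_pooling=False)
print(pooled.shape, free.shape)
```

Under these assumptions both variants share the same weights, so the parameter count is unchanged, while the pooling-free path performs the projections once per pixel instead of once per feature map and therefore costs somewhat more computation.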
Table 4 further quantifies the overhead of this architectural modification in terms of computational complexity. The data reveal that eliminating the pooling layer results in a marginal increase in Floating Point Operations (FLOPs), rising from 36.72 G to 36.91 G, an absolute increase of only 0.19 G (approximately 0.5%). Practically, the model maintains a rapid inference speed of 0.016 s for a 256 × 256 image. This indicates that the proposed pooling-free strategy does not impose a significant computational burden. Furthermore, since standard pooling layers contain no learnable parameters, their removal preserves the full-resolution features without altering the model size, with both configurations maintaining 16.75 M parameters. Collectively, Table 3 and Table 4 demonstrate that the proposed strategy achieves a tangible improvement in reconstruction quality at a negligible computational cost.
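The cost-benefit figures quoted from Table 3 and Table 4 can be checked with simple arithmetic. The snippet below only restates numbers given in the text; the variable names are illustrative.

```python
# Values quoted in the text for the pooled and pooling-free configurations.
flops_with_pool = 36.72      # GFLOPs
flops_without_pool = 36.91   # GFLOPs

abs_increase = flops_without_pool - flops_with_pool
rel_increase = 100.0 * abs_increase / flops_with_pool

# (with pooling, without pooling) for each metric on the Datatest (X4) set.
metrics = {
    "PSNR": (35.2200, 35.4274),
    "SSIM": (0.8453, 0.8463),
    "LPIPS": (0.0923, 0.0915),   # lower is better
}
deltas = {name: after - before for name, (before, after) in metrics.items()}

print(f"FLOPs increase: {abs_increase:.2f} G ({rel_increase:.2f}%)")
for name, delta in deltas.items():
    print(f"{name} change: {delta:+.4f}")
```

The relative FLOPs increase evaluates to roughly 0.52%, in line with the "approximately 0.5%" stated above, against changes of +0.2074 dB in PSNR, +0.0010 in SSIM, and −0.0008 in LPIPS.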
3.7. Robustness Evaluation Under Extreme Low-Dose Noise Conditions
In practical radiography, image quality is compromised not only by limited resolution but also significantly by insufficient radiation dose. Particularly in scanning scenarios strictly adhering to the “ALARA” principle, the paucity of X-ray photons reaching the detector results in high-intensity Poisson noise. This signal-dependent noise differs fundamentally from the additive Gaussian blur used in standard benchmarks; it is highly correlated with tissue density and tends to obscure pathological details in low-contrast regions. To verify the model’s robustness against such extreme physical degradation, this section establishes a high-noise test environment simulating low-dose imaging. Specifically, we utilize a numerical approximation where signal-dependent Poisson noise with a dose scale of is injected to simulate quantum shot noise, superimposed with additive Gaussian noise () representing electronic thermal noise. Considering that mainstream super-resolution algorithms (e.g., SRGAN, SwinIR) often suffer from severe domain shift when encountering non-Gaussian quantum noise, we fine-tuned the model using data incorporating this specific mixed noise model. This experiment aims to determine whether the model, after learning specific photon noise priors, can effectively balance noise suppression with texture preservation during super-resolution reconstruction.
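A mixed Poisson and Gaussian noise model of the kind described above can be sketched as follows. The interpretation of the dose scale as the expected photon count at full intensity, the function name `simulate_low_dose`, and the numeric values used in the example (a dose scale of 100 and a Gaussian standard deviation of 0.01) are hypothetical placeholders chosen for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def simulate_low_dose(img: np.ndarray, dose_scale: float, read_sigma: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Add signal-dependent Poisson noise and additive Gaussian noise.

    img is expected in [0, 1]. dose_scale is interpreted here (an assumption)
    as the expected photon count at intensity 1.0, so a smaller value means a
    lower dose and stronger quantum noise.
    """
    clean = np.clip(img, 0.0, 1.0)
    # Quantum shot noise: photon counts are Poisson with mean proportional to signal.
    photons = rng.poisson(clean * dose_scale)
    noisy = photons / dose_scale
    # Electronic (thermal) noise: zero-mean Gaussian, independent of the signal.
    noisy = noisy + rng.normal(0.0, read_sigma, size=clean.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
# Illustrative flat patches; the parameter values are placeholders only.
bright = simulate_low_dose(np.full((128, 128), 0.7), 100.0, 0.01, rng)
dark = simulate_low_dose(np.full((128, 128), 0.2), 100.0, 0.01, rng)
print(bright.var() > dark.var())  # noise variance grows with the signal level
```

Because the Poisson variance equals its mean, the brighter patch ends up noisier in absolute terms, which is the signal dependence that distinguishes this degradation from purely additive noise.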
As illustrated in Figure 7, we compared the reconstruction performance of the model under two distinct degradation mechanisms. In the first row (Low-Dose Degradation Environment), the input image simulates imaging conditions under extremely low photon flux, superimposed with significant mixed Poisson and Gaussian noise. This degradation manifests as diffuse granular speckles across the entire field of view, severely disrupting the continuity of lung markings and rendering high-frequency fine structures, such as trabeculae, indistinguishable. In contrast, the CSAEGAN reconstruction (top right) demonstrates superior visual quality. The model successfully identifies and suppresses granular noise mixed with anatomical structures, resulting in a cleaner background (e.g., soft tissue shadows) without introducing noticeable artifacts or ringing effects. Crucially, a comparison with the results in the second row (Standard Benchmark Environment) reveals that even under high-intensity quantum noise interference, the model avoids the common “over-smoothing” phenomenon. The edges of vascular branches within the lung field remain sharp, and skeletal textures originally submerged in noise are clearly restored. The structural fidelity is highly consistent with the reconstruction results observed in the noise-free environment.
These experimental results compellingly demonstrate the robust capabilities of our proposed pooling-free CSA architecture. It proves effective not only in addressing conventional resolution degradation but also in handling extreme low-dose scenarios characterized by “blur + high quantum noise.” The model exhibits excellent noise resilience and detail reconstruction capabilities, providing strong technical validation for future image enhancement tasks in real-world low-dose clinical imaging.
3.8. The Improvement of Super-Resolution for Downstream Diagnostic Classification Tasks
First, a high-performance deep learning classification model (DenseNet-121), pre-trained on the large-scale public chest X-ray dataset Chest X-ray14, was employed as a proxy for an automated diagnostic tool. This classifier, having been extensively trained to identify multiple pulmonary diseases, serves as a simulation of a clinical expert system in its decision-making logic. The CheXNet model, which utilizes this architecture, was reported to exceed average radiologist performance on the F1 metric for pneumonia detection [39].
To quantitatively validate the diagnostic value, we evaluated the classification performance on the reserved Chest X-ray14 test set, comprising 22,433 images. The Low-Resolution (LR) inputs were generated using Gaussian blur (σ = 1.0) and 4× bicubic downsampling, consistent with our degradation model.
Table 5 presents the Area Under the Curve (AUC) scores, a standard metric reflecting the classifier’s discriminative ability, for 14 pulmonary diseases. The results show that the SR images generated by CSAEGAN achieved an average AUC of 0.7970, significantly outperforming the LR baseline (0.7454) and narrowing the gap with the HR ground truth (0.8404). Notably, substantial improvements were observed in pathologies relying on fine structural details, such as Pneumothorax (+0.1581) and Fibrosis (+0.0983), confirming the model’s ability to recover diagnostically relevant features.
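For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison and then restates the average AUC values reported above; the toy labels and scores are illustrative, and the "share of the gap recovered" is a derived quantity computed here, not a figure from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The pairwise loop is quadratic, which
    is fine for illustration but slow for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75

# Average AUCs quoted in the text for LR, SR, and HR inputs.
auc_lr, auc_sr, auc_hr = 0.7454, 0.7970, 0.8404
gain = auc_sr - auc_lr
recovered = gain / (auc_hr - auc_lr)
print(f"SR gain over LR: {gain:.4f}; share of the LR-to-HR gap recovered: {recovered:.1%}")
```

With the reported averages, super-resolution adds about 0.0516 AUC over the LR baseline, which corresponds to a little over half of the gap between LR and HR inputs.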
Second, to interpret these quantitative gains, this study selected Gradient-weighted Class Activation Mapping (Grad-CAM) as the core analytical tool [40]. By analyzing the gradient flow within the model during a specific prediction, Grad-CAM generates a “saliency map.” This map, presented as a heatmap, visually highlights the regions in the input image that contribute most significantly to the classification decision, thereby providing a window into the model’s “reasoning” process and revealing its decision-making basis.
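The core Grad-CAM computation is compact once the feature maps of the last convolutional layer and the gradients of the class score with respect to them are available. The sketch below assumes those two arrays have already been extracted (for example with framework hooks, which are not shown); the function name and the toy inputs are illustrative.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from arrays of shape (K, H, W).

    activations: feature maps A_k of the chosen convolutional layer.
    gradients:   d(class score) / d(A_k), same shape as activations.
    """
    # Channel weights: global average of the gradients over spatial positions.
    weights = gradients.mean(axis=(1, 2))                # (K,)
    # Weighted sum of the feature maps, keeping only positive evidence (ReLU).
    cam = np.tensordot(weights, activations, axes=1)     # (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam               # normalize to [0, 1]

# Toy input: channel 0 supports the class at the top-left position,
# channel 1 argues against it at the bottom-right position.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 2.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heatmap = grad_cam(acts, grads)
print(heatmap)
```

The resulting low-resolution map is normally upsampled to the input size and overlaid on the radiograph to produce the heatmaps discussed in this section.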
To examine visually whether super-resolution restores the features on which the classifier relies, representative cases of pulmonary diseases were selected. The low-resolution (LR), super-resolved (SR), and high-resolution (HR) versions of these images were independently fed into the pre-trained classification model to generate corresponding Grad-CAM saliency maps, as depicted in Figure 8. When the classifier processed the LR images, the resulting attention maps commonly exhibited diffuse, scattered, and poorly localized characteristics. The activated regions in the heatmaps were broad and lacked a clear focus, often extending over non-pathological lung parenchyma or beyond the lung fields. This phenomenon clearly indicates that due to the loss of high-frequency information in LR images, the classifier was unable to pinpoint definitive pathological indicators, leading to an ambiguous and unreliable decision-making process.
In contrast, the Grad-CAM heatmaps corresponding to the SR images generated by the proposed CSAEGAN model demonstrated a qualitative leap in performance. The activated regions in these heatmaps became sharp, intense, and highly focused. Most importantly, these highlighted areas showed a high degree of spatial correspondence with the visible abnormal lesions in the chest radiographs. This improvement confirms that the SR process successfully restored the critical high-frequency details necessary for the classifier to make high-confidence decisions, while maintaining strong consistency with the HR images. The classifier was able to precisely “see” and localize the specific features driving its diagnosis, such as the texture of pulmonary infiltrates or the margins of nodules. This result directly validates the effectiveness of our proposed model architecture in preserving and enhancing the fine structures that are crucial for diagnosis.
4. Conclusions
To address the clinical demand for enhancing the spatial resolution of low-dose chest X-ray (CXR) images, this study proposes a generative adversarial super-resolution model with CSA hybrid attention. The introduced approach incorporates residual dense blocks and the CSA hybrid attention module into the backbone of the generator, while eliminating pooling operations in the channel attention to effectively preserve high-frequency and local structural details in medical images. In 4× super-resolution reconstruction tasks on public datasets, the proposed method achieved the best or near-best objective evaluation metrics across three independent test sets. Notably, on the independent external dataset (comprising both ‘Normal’ and ‘Opacity’ subsets), our method maintains its lead, demonstrating strong robustness against domain shifts common in clinical deployment. Qualitative analysis further demonstrates superior edge definition and texture restoration for seven typical types of pulmonary lesions (such as nodules, cavities, and pleural effusion). In summary, this method enables the generation of higher-quality chest X-ray images without increasing radiation dose, thereby providing a solid imaging foundation for early screening and follow-up of pulmonary diseases.
Despite these promising results, this study has limitations. First, although we simulated low-dose noise using Poisson distributions, the training data relies on synthetic degradation from high-quality images, whereas real-world clinical degradation may involve more complex scattering and sensor artifacts. Second, while we relied on quantitative metrics (PSNR, SSIM) and perceptual proxies (LPIPS, downstream classification) to evaluate image quality, a large-scale subjective study involving radiologists was not conducted. However, the robust performance on independent external cohorts and the significant improvements observed in downstream diagnostic classification provide strong indirect evidence of the method’s clinical usefulness and generalizability. Future work will focus on validating the method with raw clinical data and expert observers.