1. Introduction
Lung cancer remains one of the leading causes of cancer-related mortality worldwide, and its prognosis is highly dependent on early detection [1]. In the Philippine context, the burden of life-threatening diseases continues to pose a major public health concern. Recent data indicate that neoplastic diseases rank among the top five leading causes of death nationwide [2]. This highlights an urgent need for improved diagnostic approaches, particularly for lung cancer, where early identification is crucial for improving survival outcomes.
Chest radiography (CXR) is the most widely used imaging modality in clinical practice due to its accessibility, low cost, and relatively low radiation dose compared to computed tomography (CT) [3]. CXR serves primarily as a first-line screening tool. However, detecting pulmonary nodules (small growths typically less than 3 cm in diameter) on CXRs is highly challenging, because nodules are small, low in contrast, and frequently obscured by overlapping anatomical structures.
Consequently, CXR interpretation is highly susceptible to diagnostic limitations and false-negative findings. A prospective study by Miki et al. demonstrated that radiologists tend to miss lung nodules located in anatomically complex areas, such as the bilateral hilar regions, revealing inherent limitations in human visual search behavior [4]. Digumarthy et al. further used simulation-based approaches to show that, even with targeted training and education, substantial diagnostic blind spots persist in complex anatomical locations [5].
While CT offers superior spatial resolution and sensitivity for nodule detection, it is considerably more expensive, delivers a substantially higher radiation dose, and requires specialized infrastructure that is often unavailable in low- and middle-income settings [3]. Nevertheless, the majority of high-performing deep learning Computer-Aided Detection (CAD) systems in the literature have been developed and evaluated primarily on CT or low-dose CT (LDCT) datasets [6,7]. CXR-based nodule detection has received comparatively less attention, even though CXR remains the predominant (and in many rural or resource-constrained facilities, the only) available imaging modality [8]. This creates a significant translational gap: state-of-the-art CT-focused models are not clinically deployable in settings where CT infrastructure is absent, yet the CXR-focused literature reports considerably lower sensitivity benchmarks. Critically, the goal of a CXR-based CAD system in such environments is not to replace CT diagnosis but to serve as a low-cost, computationally lightweight second opinion that flags suspicious regions a fatigued radiologist might overlook, thereby triaging patients who warrant further investigation.
To contextualize these limitations within current real-world diagnostic practices, we conducted expert interviews with radiologists at the Lung Center of the Philippines (LCP). The interviews revealed that radiologists face a substantial daily workload, interpreting approximately 100 to 200 radiographic images alongside other imaging modalities. This workload contributes to severe cognitive load and professional burnout, which are known risk factors for perceptual and interpretive errors. Furthermore, the experts noted a lack of locally developed medical imaging technologies tailored to these specific systemic constraints.
These insights reveal a critical diagnostic gap that contributes to delayed diagnoses. Given the widespread reliance on CXR screening, there is a vital need for intelligent, automated Computer-Aided Detection (CAD) systems. Deep convolutional neural networks, specifically ResNet-50, have demonstrated strong performance in extracting fine-grained features for medical imaging using residual learning [9]. While standard ResNet-50 performs well on binary classification, its limited ability to capture multi-scale features and model long-range dependencies reduces its sensitivity to small nodules [10]. Furthermore, standard deep vision models often lack the spatial attention required to isolate subtle, overlapping abnormalities [11].
To address these limitations, this study proposes RNNet-MST, an enhanced ResNet-50 architecture for pulmonary nodule classification on chest X-rays, with attention-based weak localization used to highlight disease-relevant regions. While hybrid CNN–Transformer models have been explored previously, the architectural novelty of RNNet-MST lies in its deep, stage-wise integration strategy paired with a dual-branch output. Standard hybrid models typically append Transformer blocks at the very end of a CNN backbone, whereas pure Transformer architectures rely on either patch tokenization (e.g., ViT [12]) or hierarchical tokenization (e.g., Swin [13]). In contrast, RNNet-MST integrates Multi-Scale Light Transformer (MST) blocks sequentially at every stage (Stages 1 through 4) of the ResNet-50 backbone. This means that at each distinct semantic level (low, mid, high, and top-level feature maps), the local feature extraction of the ResNet block is immediately enriched by the global contextual modeling of the MST block before being passed to the next stage.
Furthermore, unlike standard classification networks, the proposed architecture uses these enriched multi-scale features for a dual-purpose objective. The final feature maps are routed not only to a Classification Output branch but also to a dedicated Spatial Attention Module. This module generates spatial attention maps, giving the model weak region-localization capability without requiring labor-intensive bounding-box or pixel-level annotations. Rather than serving as an autonomous diagnostic system, the model is designed to assist radiologists by highlighting potential nodules, thereby improving workflow efficiency and reducing cognitive burden.
The main aim of this work is to develop a more sensitive computer-aided screening model that may help reduce false-negative interpretations in resource-constrained settings. Experimental results showed improved model performance relative to the baseline configuration. Most notably, the proposed RNNet-MST system achieved a mean Nodule Recall of 91.55 ± 1.41%, a 3.53 percentage-point improvement over the baseline (88.02 ± 1.92%), alongside a mean Nodule F1-Score of 90.99 ± 0.39%.
While hybrid CNN–Transformer architectures have been widely explored in medical imaging, their application to pulmonary nodule analysis on chest X-rays remains relatively limited, particularly in resource-constrained screening settings. This study focuses on adapting such hybrid architectures specifically for CXR-based nodule detection, emphasizing sensitivity to small nodules and reduction in false-negative findings.
2. Materials and Methods
2.1. Dataset
This study utilized the NODE21 public dataset, a benchmark repository for pulmonary nodule detection on frontal-view chest radiographs. The dataset aggregates images from multiple sources, including JSRT, PadChest, ChestX-ray14, and Open-I, comprising 4882 images in total: 1134 positive cases containing 1476 annotated pulmonary nodules with radiologist-provided bounding boxes, and 3748 negative (nodule-free) cases. The dataset exhibits significant class imbalance, with the non-nodule class substantially overrepresented relative to the nodule class. All images were resized to 224 × 224 pixels to conform to the input layer specifications of the ResNet-50 backbone. The dataset was partitioned into training (70%), validation (15%), and test (15%) sets using a stratified patient-level split, with stratification by class label to ensure balanced class distribution across all partitions. Each unique patient appears exclusively in one partition, with zero patient-level overlap verified programmatically across all split pairs. Data augmentation was applied exclusively to the training set, while validation and test sets retained only original images.
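The patient-level stratified partitioning described above can be illustrated with a minimal sketch. It is not the authors' actual splitting code: the `patient_id` field, the toy records, and the rule that a patient counts as positive if any of their images is positive are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def patient_level_split(records, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split (patient_id, label) image records into train/val/test by patient.

    Patients are grouped by label, shuffled, and cut per class so that each
    partition keeps roughly the same class ratio (stratification).
    """
    # Assumption: a patient is positive if any of their images is positive.
    patient_label = defaultdict(int)
    for patient_id, label in records:
        patient_label[patient_id] = max(patient_label[patient_id], label)

    by_class = defaultdict(list)
    for patient_id, label in sorted(patient_label.items()):
        by_class[label].append(patient_id)

    rng = random.Random(seed)
    assignment = {}
    for patients in by_class.values():
        rng.shuffle(patients)
        n_train = round(fractions[0] * len(patients))
        n_val = round(fractions[1] * len(patients))
        for i, patient_id in enumerate(patients):
            if i < n_train:
                assignment[patient_id] = "train"
            elif i < n_train + n_val:
                assignment[patient_id] = "val"
            else:
                assignment[patient_id] = "test"

    # Every image follows its patient, so no patient spans two partitions.
    splits = {"train": [], "val": [], "test": []}
    for patient_id, label in records:
        splits[assignment[patient_id]].append((patient_id, label))
    return splits

# Hypothetical toy data: 100 patients with one or two images each.
toy = [(f"p{i:03d}", int(i % 13 < 3)) for i in range(100) for _ in range(1 + i % 2)]
splits = patient_level_split(toy)
ids = {name: {pid for pid, _ in rows} for name, rows in splits.items()}
```

Checking that the three sets in `ids` are pairwise disjoint corresponds to the programmatic zero-overlap verification mentioned in the text.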
2.2. Data Preprocessing and Augmentation
To improve generalization while preserving the intrinsic radiographic characteristics of chest X-ray images, controlled augmentation was applied only to the training set. The pipeline comprised resizing to 224 × 224 pixels, random horizontal flipping, small-angle rotation, mild translation, limited brightness/contrast perturbation, and normalization using the standard ImageNet mean and standard deviation. The nodule class was further upsampled to reduce class imbalance during training. Data were loaded using PyTorch 2.6 (CUDA 12.4) DataLoader objects configured with a batch size of 16 and num_workers = 4.
2.3. Baseline Model: ResNet-50
The baseline model is a ResNet-50 deep convolutional neural network pretrained on ImageNet [9]. The ResNet family introduced residual learning through skip connections, enabling the training of 50-layer networks without suffering from the vanishing gradient problem. Its final fully connected layer was replaced with a 2-class output layer (Nodule/No Nodule) for binary classification. The baseline was trained for 25 epochs using AdamW (lr = 1 × 10⁻⁴, weight decay = 0.01), a Cosine Annealing learning rate scheduler, and Cross-Entropy Loss.
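To make the learning-rate schedule concrete, the following sketch evaluates the standard cosine annealing formula over the 25 training epochs. The annealing period (25 epochs) and the minimum learning rate (0, the PyTorch default for `CosineAnnealingLR`) are assumptions, since the text does not state them.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=25, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (no restarts)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)

# Learning rate at the start of each epoch, plus the final value at epoch 25.
schedule = [cosine_annealing_lr(epoch) for epoch in range(26)]
for epoch in (0, 5, 12, 20, 25):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.2e}")
```

Under these assumptions the rate starts at 1 × 10⁻⁴, decays slowly at first, falls fastest around the midpoint, and reaches zero at the end of training.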
Despite strong general classification capability, ResNet-50 has two documented limitations addressed in this study: (1) its local receptive fields cannot model long-range spatial dependencies, causing misclassification of smaller-diameter nodules; and (2) its ImageNet-pretrained weights are optimized for natural-image features that differ substantially from the subtle, texture-dependent patterns of pulmonary nodules on grayscale CXRs, resulting in reduced sensitivity in the medical domain.
2.4. Proposed Architecture: RNNet-MST
RNNet-MST extends the baseline ResNet-50 by hierarchically integrating Multi-Scale Transformer (MST) blocks across all four backbone stages (Figure 1). This design was intended to improve contextual feature modeling across scales and to better adapt pretrained convolutional features to grayscale chest radiographs. The resulting architecture combines convolutional feature extraction with transformer-based global context modeling in a single classification pipeline.
Each MST block utilizes 2 attention heads, an MLP ratio of 1.0, and a dropout rate of 0.1. To manage computational complexity while maintaining multi-scale depth, the block employs adaptive spatial downsampling; specifically, feature maps exceeding a resolution of 14 × 14 are pooled to a 14 × 14 grid before entering the attention mechanism, then bilinearly upsampled back to their original dimensions to maintain spatial consistency.
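The effect of the adaptive downsampling can be quantified with a short calculation. The stage output shapes below are the standard ResNet-50 values for a 224 × 224 input; the paper does not list them explicitly, so they are stated here as an assumption about the backbone.

```python
# Standard ResNet-50 stage outputs for a 224 x 224 input: (height, width, channels).
STAGE_SHAPES = {1: (56, 56, 256), 2: (28, 28, 512), 3: (14, 14, 1024), 4: (7, 7, 2048)}
GRID = 14  # feature maps larger than GRID x GRID are pooled before attention

def attention_tokens(height, width, grid=GRID):
    """Number of tokens entering self-attention after the optional pooling."""
    if height > grid or width > grid:
        return grid * grid
    return height * width

savings = {}
for stage, (h, w, _channels) in STAGE_SHAPES.items():
    raw = h * w
    pooled = attention_tokens(h, w)
    # Self-attention builds a tokens x tokens map, so cost scales quadratically.
    savings[stage] = (raw * raw) // (pooled * pooled)
    print(f"Stage {stage}: {raw} -> {pooled} tokens, attention map {savings[stage]}x smaller")
```

Under this assumption, pooling matters only for Stages 1 and 2 (3136 and 784 tokens reduced to 196), while Stages 3 and 4 already fit within the 14 × 14 grid and pass through unchanged.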
For final classification, a decision threshold of 0.5 is applied to the sigmoid output of the network. This threshold serves as the standard for distinguishing between positive and negative classes across all evaluated models in this study.
To capture long-range dependencies and global contextual information, MST blocks were hierarchically integrated at each of the four stages of the ResNet-50 backbone. Feature maps are extracted at four distinct stages: Stage 1—Low-Level Features; Stage 2—Mid-Level Features; Stage 3—High-Level Features; and Stage 4—Top-Level Features.
The output of each ResNet stage is passed into a corresponding Lightweight Transformer Block consisting of Layer Normalization (LN), Multi-Head Self-Attention (MHSA), and a Multi-Layer Perceptron (MLP) with GELU activation. Given an input image $I \in \mathbb{R}^{H \times W \times C}$, the feature map at stage $s$ is produced by the ResNet stage function $f_s(\cdot)$ applied to the output of the previous stage:

$$X_s = f_s(X_{s-1})$$

where $X_s \in \mathbb{R}^{H_s \times W_s \times C_s}$ is the feature map at stage $s$, with $H_s$, $W_s$, and $C_s$ denoting its spatial height, width, and number of channels, respectively. Each $X_s$ is then passed into a corresponding Lightweight Transformer block $T_s(\cdot)$:

$$Y_s = T_s(X_s)$$

where $Y_s \in \mathbb{R}^{H_s \times W_s \times C_s}$ is the transformer-enhanced feature map. To manage computational cost, $X_s$ is downsampled to 14 × 14 before the attention operation and upsampled back to $H_s \times W_s$ afterward. Let $S_{\mathrm{MLP}} \in \mathbb{R}^{H_s \times W_s \times C_s}$ denote the output of the MLP sub-layer within $T_s(\cdot)$, upsampled and reshaped back to the spatial dimensions of $X_s$. The final output feature map $F_{\mathrm{out}}$ is then obtained by combining $S_{\mathrm{MLP}}$ with the original feature map $X_s$ via a residual connection to preserve low-level spatial information:

$$F_{\mathrm{out}} = X_s + S_{\mathrm{MLP}}$$

where $F_{\mathrm{out}} \in \mathbb{R}^{H_s \times W_s \times C_s}$ is the final enriched feature map passed to the next stage or classification head.
This hierarchical design ensures that global contextual information is captured continuously across all scales—from fine-grained textures in Stage 1 to high-level semantic structures in Stage 4—overcoming ResNet-50’s inherent local-receptive-field constraint.
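A single MST block of the kind described above might be sketched in PyTorch as follows. This is an illustrative reading of the text, not the authors' implementation: the class name `LightTransformerBlock`, the use of average pooling, the pre-normalization layout, and the inner residual around the attention sub-layer are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightTransformerBlock(nn.Module):
    """Hypothetical MST block: LN -> MHSA -> LN -> MLP, plus an outer residual."""

    def __init__(self, channels, num_heads=2, mlp_ratio=1.0, dropout=0.1, grid=14):
        super().__init__()
        hidden = int(channels * mlp_ratio)
        self.grid = grid
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, dropout=dropout,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, channels), nn.Dropout(dropout),
        )

    def forward(self, x):                       # x: (B, C, H, W), i.e. X_s
        b, c, h, w = x.shape
        z = x
        if h > self.grid or w > self.grid:      # pool only maps larger than 14 x 14
            z = F.adaptive_avg_pool2d(x, (self.grid, self.grid))
        gh, gw = z.shape[-2:]
        tokens = z.flatten(2).transpose(1, 2)   # (B, gh*gw, C)
        q = self.norm1(tokens)
        attn_out, _ = self.attn(q, q, q, need_weights=False)
        tokens = tokens + attn_out              # assumed inner residual
        s_mlp = self.mlp(self.norm2(tokens))    # output of the MLP sub-layer
        s_mlp = s_mlp.transpose(1, 2).reshape(b, c, gh, gw)
        if (gh, gw) != (h, w):                  # restore the original resolution
            s_mlp = F.interpolate(s_mlp, size=(h, w), mode="bilinear",
                                  align_corners=False)
        return x + s_mlp                        # F_out = X_s + S_MLP
```

In use, one such block would follow each of the four ResNet-50 stages, with `channels` set to that stage's output width; the output keeps the input shape, so the next stage can consume it unchanged.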
In addition, the same MST integration simultaneously addresses the domain gap between ImageNet pretraining and medical radiographs. ResNet-50’s convolutional filters are optimized for the distinct edges and RGB textures of natural scenes, not the subtle grayscale texture patterns of pulmonary nodules on CXR. By applying self-attention across the entire CXR image at each feature scale, the transformer blocks enable the model to contextualize local convolutional features within the global thoracic structure, learning CXR-specific representations that compensate for ResNet-50’s natural-image inductive bias.
The MST blocks were integrated at all four stages to ensure that global contextual information is captured across multiple feature scales, from low-level textures to high-level semantic representations. Downsampling to 14 × 14 was used to balance computational efficiency with sufficient spatial resolution for attention modeling. Freezing the ResNet-50 backbone stabilizes training and preserves pretrained low-level feature representations, allowing the MST blocks to focus on adapting features to the target domain.
To further differentiate RNNet-MST from pure Transformer architectures such as ViT and Swin Transformer, and from existing hybrid CNN–Transformer designs, three specific architectural distinctions are noted. First, rather than replacing the convolutional backbone entirely as in ViT or Swin, RNNet-MST hierarchically integrates lightweight transformer blocks at all four stages of a pretrained ResNet-50, preserving pretrained low-level feature representations while adapting higher-level features to the CXR domain. Second, the integration at all four stages ensures that global contextual information is captured continuously across multiple feature scales, from fine-grained textures in Stage 1 to high-level semantic structures in Stage 4, which is distinct from architectures that apply attention only at the final feature stage. Third, unlike general-purpose transformer architectures, RNNet-MST incorporates a custom spatial attention module, inspired by the multiplicative spatial attention mechanism of CBAM [14], that produces clinically interpretable localization maps aligned with radiologist-annotated nodule regions; these maps are used for weak localization analysis rather than dense detector-style supervision. This combination of hierarchical multi-scale transformer integration and spatially supervised attention constitutes the specific architectural contribution of this work relative to existing hybrid approaches.
2.5. Training Configuration
To assess the stability and reproducibility of the models, all experiments were repeated across three independent runs using different random seeds (42, 444, and 916). Results are reported as mean ± standard deviation across runs. All experiments and model development were conducted using a personal computing system. The primary workstation was an Acer Nitro ANV15-51 laptop (Acer Inc., Xizhi, New Taipei City, Taiwan), equipped for deep learning model training and evaluation tasks.
During training, the ResNet-50 backbone weights (previously fine-tuned on NODE21) were frozen. Only the added MST blocks and the classification head were trainable. The model was trained with the following configuration:
Optimizer: AdamW (lr = 1 × 10⁻⁴, weight decay = 0.01);
Scheduler: Cosine Annealing Learning Rate;
Loss Function: Weighted Binary Cross-Entropy (WBCE).
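$$\mathcal{L}_{\mathrm{WBCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[ w_1\, y_i \log \hat{y}_i + w_0\,(1 - y_i)\log(1 - \hat{y}_i) \right]$$

where $y_i$ is the ground-truth label, $\hat{y}_i$ is the predicted probability, $N$ is the number of training samples, and $w_1$, $w_0$ are class weights computed as:

$$w_c = \frac{N}{2 N_c}, \quad c \in \{0, 1\}$$

with $N_c$ the number of training samples in class $c$, yielding $w_1 = 1.7684$ for the Nodule class and $w_0 = 0.6971$ for the Non-Nodule class. Training ran for 25 epochs with a batch size of 16, saving model checkpoints whenever the total validation loss improved.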
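As a sanity check on the class weights, the sketch below assumes they follow the common "balanced" heuristic, in which each class weight is the total sample count divided by twice the class count (the same rule scikit-learn uses for `class_weight="balanced"` with two classes). Under that assumption the reciprocals of twice the reported weights should be the two class fractions and sum to one. The function names are illustrative, and the loss is a plain-Python rendering of a weighted binary cross-entropy, not the authors' training code.

```python
import math

def balanced_class_weights(n_pos, n_neg):
    """Return (w1, w0) under the assumed rule w_c = N / (2 * N_c)."""
    total = n_pos + n_neg
    return total / (2 * n_pos), total / (2 * n_neg)

def weighted_bce(y_true, y_prob, w1, w0, eps=1e-7):
    """Mean weighted binary cross-entropy over a batch of probabilities."""
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)           # clamp to avoid log(0)
        loss += w1 * y * math.log(p) + w0 * (1 - y) * math.log(1 - p)
    return -loss / len(y_true)

# Class fractions implied by the reported weights under the assumed rule.
implied_pos_fraction = 1 / (2 * 1.7684)
implied_neg_fraction = 1 / (2 * 0.6971)
print(f"implied nodule fraction:     {implied_pos_fraction:.4f}")
print(f"implied non-nodule fraction: {implied_neg_fraction:.4f}")
print(f"sum: {implied_pos_fraction + implied_neg_fraction:.4f}")
```

The two implied fractions sum to one within rounding, which is consistent with the balanced rule; they correspond to a training set in which roughly 28% of samples belong to the Nodule class.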
Model selection was based on the highest validation F1-score. For reproducibility, the hardware configuration, software library versions, random seeds, and the decision threshold used for positive-class prediction are reported. All results are reported from a fixed train–validation–test split.
2.6. Evaluation Metrics
Model performance was assessed using the following metrics:
In these definitions, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Performance metrics are reported as mean ± standard deviation across three independent runs with different random seeds to assess result stability.
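For reference, the standard confusion-matrix definitions of the metrics reported in this paper (precision, recall, F1-score) together with accuracy can be computed as in the sketch below. The exact metric set used by the authors is assumed rather than confirmed here, and the counts in the example are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only; they are not results from the paper.
metrics = classification_metrics(tp=90, tn=180, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall computed on the positive class corresponds to the Nodule Recall emphasized throughout this study, since each false negative is a missed nodule case.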
To evaluate performance on small nodules, a dedicated small-nodule subset was isolated from the test set, comprising 171 CXR images whose radiologist-annotated bounding boxes measured below 70 × 70 pixels in either dimension (Figure 2), targeting small and irregular nodules.
To better visualize these cases, Figure 3 presents representative samples categorized by size. The subset isolated for this dedicated evaluation consists of the small and medium categories, where nodules are most susceptible to being overlooked. By isolating these 171 images for dedicated evaluation, we can better assess the model’s ability to localize subtle pathological features that are traditionally difficult to distinguish from complex anatomical noise in a standard chest radiograph.
Performance on this subset was measured by detection rate and false-negative count. To evaluate Objective 2 (the domain gap between ImageNet pretraining and CXR features), Nodule Recall improvement was measured on the full test set, reflecting the reduction in domain-gap-related false negatives. Both evaluations use the same trained model per run; the distinction is analytical, not architectural. Results are reported as mean ± standard deviation across three independent runs.
2.7. Comparative Statistical Evaluation
To rigorously assess whether the performance improvements of the proposed RNNet-MST architecture over the baseline ResNet-50 were statistically significant, McNemar’s test was utilized. Because both models were evaluated on the exact same test set of X-ray images, their predictions represent paired nominal data. McNemar’s test evaluates the discordant pairs—instances where one model predicted the nodule classification correctly while the other failed—to determine if the marginal frequencies are homogeneous.
To account for the discrete nature of the data and reduce Type I errors, Edwards’ continuity correction was applied. The test statistic is calculated as follows:

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$

where $b$ represents the number of images correctly classified by RNNet-MST but misclassified by the baseline ResNet-50, and $c$ represents the number of images correctly classified by the baseline but misclassified by RNNet-MST. A significance level $\alpha$ of 0.05 was established to reject the null hypothesis of equal model performance.
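The continuity-corrected test can be computed with the standard library alone, as in the sketch below. The discordant counts in the example are hypothetical, and the p-value uses the fact that the survival function of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math

def mcnemar_edwards(b, c):
    """McNemar's test with Edwards' continuity correction.

    b: cases the proposed model classified correctly and the baseline did not.
    c: cases the baseline classified correctly and the proposed model did not.
    Returns (chi-square statistic, two-sided p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0                          # no discordant pairs to test
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical discordant counts for illustration; not results from the paper.
statistic, p_value = mcnemar_edwards(b=15, c=5)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```

Only the discordant pairs enter the statistic, which is why the test suits paired predictions on a shared test set: cases on which both models agree carry no information about which model is better.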
4. Discussion
4.1. Interpretation of Performance Gains
The primary objective of this study was to address the critical diagnostic gap in pulmonary nodule detection, characterized by high rates of false-negative interpretations due to the anatomical complexity of CXRs and radiologist fatigue. The proposed RNNet-MST architecture improved Nodule Recall by 3.53 percentage points relative to the baseline ResNet-50 model.
These findings are consistent with the view that standard convolutional neural networks are limited by local receptive fields when subtle abnormalities must be interpreted within broader anatomical context. By integrating Multi-Scale Transformer (MST) blocks, the proposed model is better able to capture the long-range spatial dependencies required to contextualize subtle radiographic abnormalities. This improvement was most pronounced in the targeted small-nodule subset, where the RNNet-MST model increased the detection rate by 12.3 percentage points and reduced false negatives from 33 to 12 cases. These results align with Raghu et al. [18], who demonstrated that Vision Transformers preserve spatial location information more effectively than pure CNNs, and corroborate the findings of Fu et al. [17] regarding the superiority of hybrid architectures in overcoming CNN style-over-content biases in medical imaging.
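The small-nodule figures quoted above can be cross-checked with simple arithmetic. The sketch assumes that every one of the 171 subset images contains a nodule and that each false negative corresponds to one missed image, so the detection rate is the share of images not missed.

```python
SUBSET_SIZE = 171                      # small-nodule test images
FN_BASELINE, FN_PROPOSED = 33, 12      # false negatives reported in the text

def detection_rate(false_negatives, total=SUBSET_SIZE):
    """Share of subset images that were correctly flagged as containing a nodule."""
    return (total - false_negatives) / total

rate_baseline = detection_rate(FN_BASELINE)
rate_proposed = detection_rate(FN_PROPOSED)
gain_points = 100 * (rate_proposed - rate_baseline)
fn_reduction = 100 * (FN_BASELINE - FN_PROPOSED) / FN_BASELINE

print(f"baseline detection rate: {100 * rate_baseline:.1f}%")
print(f"proposed detection rate: {100 * rate_proposed:.1f}%")
print(f"gain: {gain_points:.1f} percentage points")
print(f"relative reduction in false negatives: {fn_reduction:.1f}%")
```

Under these assumptions the detection rate rises from about 80.7% to about 93.0%, a gain of about 12.3 percentage points, which matches the reported improvement and corresponds to roughly a 64% relative reduction in missed small nodules.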
4.2. Comparison with Related CXR-Based Nodule Detection Work
To situate the performance of RNNet-MST within the broader literature, Table 9 summarizes representative deep learning methods for pulmonary nodule detection on chest radiographs. A consistent observation across the surveyed works is that CXR-based detection methods report substantially lower sensitivity benchmarks than their CT counterparts, underscoring the inherent difficulty of the task and the importance of continued research on this modality. Behrendt et al. [20] achieved the highest reported sensitivity in the CXR domain through an ensemble of four state-of-the-art object detectors trained on the same NODE21 dataset, winning the NODE21 challenge; however, their approach carries a substantially higher computational cost due to the multi-model ensemble strategy. Against this backdrop, RNNet-MST achieved a Nodule Recall of 91.55% on the NODE21 test set, representing one of the strongest recall values among the CXR-based methods summarized here, although direct cross-study comparison remains limited by differences in datasets, task definitions, and evaluation protocols. Extended comparative benchmarking details are provided in Appendix B.
4.3. Clinical Significance of the Precision–Recall Trade-Off
A critical point of interpretation in this study is the observed precision–recall trade-off. While Nodule Recall improved to 91.55%, Nodule Precision decreased from 0.94 to 0.90. We acknowledge that in a real-world screening workflow this reduction in precision is not trivial: it translates directly into more false alarms, which adds to the radiologist’s cognitive load because each flagged region that does not contain a nodule must be visually reviewed and dismissed.
However, the clinical acceptability of this trade-off must be evaluated concretely within the model’s intended deployment setting: serving as an automated second-reader triage tool in high-volume, resource-constrained environments. As emphasized by Luo et al. [15], prioritizing sensitivity is paramount in cancer screening because the clinical penalties are highly asymmetrical. The clinical “cost” of a false positive is the additional seconds required for a radiologist to overrule the CAD system or, at worst, the scheduling of a secondary CT scan. Conversely, the cost of a false negative (a missed malignant nodule) is a severely delayed diagnosis that can prove fatal.
At the same time, maintaining a high baseline of precision remains vital to the practical success of any Computer-Aided Detection (CAD) system. If a model’s precision falls too low, the resulting influx of false positives can induce “alarm fatigue,” a phenomenon in which overwhelmed clinicians become desensitized to automated alerts, thereby neutralizing the tool’s clinical value. Furthermore, excessive false alarms can cause unwarranted psychological distress for patients and lead to unnecessary, resource-intensive follow-up imaging. Therefore, while RNNet-MST sacrifices a fraction of its precision for a gain in sensitivity, maintaining a precision of 0.90 helps ensure that the system does not overwhelm the diagnostic workflow with spurious findings.
Because RNNet-MST is explicitly designed to mitigate the diagnostic blind spots and perceptual errors of fatigued radiologists interpreting hundreds of images daily [4,5], prioritizing a 91.55% recall rate represents a deliberate and clinically motivated compromise. By flagging a broader set of suspicious regions while keeping the false-positive rate manageable, the model reduces the risk of missed nodules on first-line chest radiographs, supporting its intended role as a triage tool.
4.4. Workflow Optimization in Resource-Constrained Settings
Furthermore, these performance gains carry significant implications for the Philippine healthcare system. Expert interviews conducted at the Lung Center of the Philippines highlighted that radiologists routinely interpret up to 200 images daily, leading to intense cognitive load. By compensating for the domain gap between natural-image pretraining and grayscale radiographs, RNNet-MST provides attention maps that may support sensitive visual assessment of suspicious regions. This functions not as an autonomous diagnostic replacement, but as an intelligent visual aid designed to reduce perceptual errors, streamline workflow, and alleviate professional burnout in resource-constrained environments.
It is important to note that the proposed system operates as a classification-based computer-aided detection (CAD) model with weak localization via attention maps, rather than a fully supervised object detection framework. While object detection models such as Faster R-CNN or DETR provide explicit bounding box predictions, they typically require dense annotations and higher computational resources. In contrast, the proposed approach prioritizes computational efficiency relative to full object-detection frameworks while still providing clinically useful localization guidance for screening applications in resource-constrained clinical environments. Additional pre-clinical assessment details and supporting analyses are provided in Appendix C.
4.5. Study Limitations and Future Research Directions
This study has several limitations. First, the model was assessed on a single public dataset, which limits conclusions about generalizability across institutions and acquisition settings. Second, the localization analysis was attention-based and therefore should be interpreted as limited localization guidance rather than detector-level lesion localization. More established localization endpoints such as pointing game accuracy, center-hit rate, and thresholded IoU were not employed, as these metrics are designed for dedicated object detection frameworks with explicit localization outputs. Third, the dataset primarily consists of frontal-view radiographs and may not fully capture the variability present in real-world clinical settings, particularly in local Philippine hospitals. Fourth, the current architecture is also limited to binary classification (nodule vs. no nodule) and does not differentiate between benign and malignant nodules, which restricts its direct clinical interpretability. Fifth, source labels are not available in the publicly released NODE21 dataset, precluding source-stratified analysis or leave-one-source-out validation. Future work will prioritize datasets with explicit source annotations to enable more rigorous evaluation of generalization across acquisition settings.
Future research directions include extending the proposed hybrid backbone into object detection frameworks, such as Faster R-CNN or DETR, to enable precise bounding box prediction rather than relying on classification and attention-based localization. Additionally, incorporating histopathologically confirmed datasets would allow the model to perform multi-class malignancy classification. Finally, validation on larger, independent clinical cohorts—particularly using localized data from Philippine medical institutions—will be essential to ensure clinical robustness and deployment readiness.
5. Conclusions
This study proposed RNNet-MST, a hybrid deep learning architecture for pulmonary nodule classification on chest X-ray (CXR) images, developed by enhancing ResNet-50 with Multi-Scale Transformer (MST) blocks. The proposed model was designed to address two documented limitations of the baseline ResNet-50: its limited ability to capture long-range dependencies relevant to small-nodule classification, and its reduced sensitivity associated with the domain gap between ImageNet pretraining and CXR-specific features.
Experimental results on the NODE21 dataset showed improvements in recall and F1-score across three independent runs. Most critically, the model achieved a mean Nodule Recall of 91.55 ± 1.41%, a 3.53 percentage-point improvement over the baseline (88.02 ± 1.92%), corresponding to fewer false-negative classifications. Mean Nodule F1-Score improved from 90.73 ± 1.52% to 90.99 ± 0.39%, with the reduced standard deviation indicating more stable performance across runs. On the isolated small-nodule subset, the proposed model achieved a 12.3 percentage-point improvement in sensitivity over the baseline.
These findings suggest that combining convolutional feature extraction with multi-scale transformer-based contextual modeling can improve sensitivity in CXR-based pulmonary nodule classification. This may be particularly valuable in resource-constrained clinical settings, where high radiologist workloads and limited access to CT imaging increase the need for assistive screening tools. Future work should extend this architecture toward full bounding box detection using frameworks such as Faster R-CNN or DETR, incorporate benign-to-malignant nodule classification using histopathologically confirmed datasets, and validate the model across additional large-scale and independent datasets to strengthen generalizability.