Article

COPD Multi-Task Diagnosis on Chest X-Ray Using CNN-Based Slot Attention

by Wangsu Jeon 1,†, Hyeonung Jang 2,†, Hongchang Lee 2 and Seongjun Choi 3,4,*
1 Department of Computer Engineering, Kyungnam University, Changwon 51767, Republic of Korea
2 Haewootech Co., Ltd., Busan 46742, Republic of Korea
3 Department of Otolaryngology-Head and Neck Surgery, Cheonan Hospital, College of Medicine, Soonchunhyang University, Cheonan 31151, Republic of Korea
4 MDAI Co., Ltd., Cheonan 31151, Republic of Korea
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2026, 16(1), 14; https://doi.org/10.3390/app16010014
Submission received: 9 November 2025 / Revised: 16 December 2025 / Accepted: 16 December 2025 / Published: 19 December 2025

Abstract

This study proposes a unified deep-learning framework for the concurrent classification of Chronic Obstructive Pulmonary Disease (COPD) severity and regression of the FEV1/FVC ratio from chest X-ray (CXR) images. We integrated a ConvNeXt-Large backbone with a Slot Attention mechanism to effectively disentangle and refine disease-relevant features for multi-task learning. Evaluation on a clinical dataset demonstrated that the proposed model with a 5-slot configuration achieved superior performance compared to standard CNN and Vision Transformer baselines. On the independent test set, the model attained an Accuracy of 0.9107, Sensitivity of 0.8603, and Specificity of 0.9324 for three-class severity stratification. Simultaneously, it achieved a Mean Absolute Error (MAE) of 8.2649, a Mean Squared Error (MSE) of 151.4704, and an R² of 0.7591 for FEV1/FVC ratio estimation. Qualitative analysis using saliency maps also suggested that the slot-based approach contributes to attention patterns that are more constrained to clinically relevant pulmonary structures. These results suggest that our slot-attention-based multi-task model offers a robust solution for automated COPD assessment from standard radiographs.

1. Introduction

Chronic obstructive pulmonary disease (COPD) is a progressive respiratory disease characterized by persistent airway obstruction and respiratory symptoms, and it is one of the leading causes of death worldwide [1,2]. According to the World Health Organization (WHO), COPD is projected to become the third most common cause of death in the world by 2030; as of 2021, it caused approximately 3.5 million deaths worldwide, accounting for approximately 5% of all deaths [3,4]. According to the Global Burden of Disease study, there were approximately 213.39 million COPD patients worldwide in 2021, and this number is steadily increasing [5,6].
Early diagnosis of COPD and adequate assessment of its severity are crucial for improving patient prognosis and disease management [2,7]. However, approximately 70–85% of COPD patients worldwide remain undiagnosed, a problem that is more acute in low- and middle-income countries [8,9]. Undiagnosed COPD patients face higher health risks than the general population, including acute exacerbations, pneumonia, and elevated mortality [10,11].
Currently, the standard method for diagnosing COPD is pulmonary function testing (spirometry): according to the Global Initiative for Chronic Obstructive Lung Disease (GOLD) criteria, COPD is diagnosed when the ratio of forced expiratory volume in 1 s (FEV1) to forced vital capacity (FVC) is less than 0.7 after bronchodilator administration [2,12]. The GOLD classification system distinguishes disease severity into four levels according to the percentage of predicted FEV1: mild (≥80%), moderate (50–79%), severe (30–49%), and very severe (<30%) [13,14]. However, these pulmonary function tests are difficult to perform in all medical settings due to limitations such as equipment availability, the technical difficulty of performing the test, and cost [15,16].
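As an illustrative sketch (the function name and return strings are ours, not part of any clinical software), the GOLD diagnostic and staging thresholds described above can be expressed as:

```python
def gold_stage(fev1_fvc: float, fev1_pct_predicted: float) -> str:
    """Classify severity from post-bronchodilator spirometry using the
    GOLD thresholds quoted in the text.

    fev1_fvc           -- FEV1/FVC ratio on a 0-1 scale
    fev1_pct_predicted -- FEV1 as a percentage of the predicted value
    """
    if fev1_fvc >= 0.7:
        return "no airflow obstruction"  # COPD not diagnosed under GOLD
    if fev1_pct_predicted >= 80:
        return "GOLD 1 (mild)"
    if fev1_pct_predicted >= 50:
        return "GOLD 2 (moderate)"
    if fev1_pct_predicted >= 30:
        return "GOLD 3 (severe)"
    return "GOLD 4 (very severe)"
```

Note that both inputs are required: the FEV1/FVC ratio establishes the diagnosis, while the FEV1 percent-predicted determines the stage.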
Recent advances in artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL) technologies, have brought about revolutionary changes in the field of medical image analysis [17,18]. Chest computed tomography (CT) imaging is a powerful tool for visualizing structural changes in COPD, capturing pathological features such as emphysema, airway thickening, and vascular changes [19,20]. Deep learning models, especially convolutional neural networks (CNNs), have shown excellent performance in diagnosing COPD and classifying its severity by automatically extracting and learning features from CT images [21,22].
Several previous studies have utilized pre-trained CNN models such as ResNet, VGG, DenseNet, and Inception to significantly improve the accuracy of COPD diagnosis, and some studies have achieved classification accuracy of more than 90% [23,24,25]. In addition, the introduction of Multiple Instance Learning (MIL) and Transfer Learning techniques has made it possible to achieve high performance even with limited medical data [26,27]. However, the heterogeneous nature of COPD and the accurate classification of its different severity stages remain challenging [28,29].
Crucially, the radiographic diagnosis of COPD relies not on a single global feature, but on the synthesis of multiple, distinct localized signs such as hyperinflation, diaphragm flattening, and bronchial wall thickening. Standard CNNs, which tend to aggregate global context, often struggle to disentangle these heterogeneous local features effectively. This limitation highlights the need for an attention mechanism capable of explicitly separating and attending to these distinct anatomical components to improve diagnostic precision.
Therefore, this study addresses these limitations by proposing a unified multi-task deep learning framework utilizing chest X-rays (CXRs), which are more widely available than CTs. We introduce a novel architecture integrating a ConvNeXt-Large backbone with a Slot Attention decoder. By leveraging the object-centric nature of Slot Attention, our model is designed to disentangle the complex, overlapping radiographic signs of COPD into distinct feature slots, enabling more accurate concurrent performance of COPD severity classification (Non COPD, Mild COPD, and Severe COPD) and continuous FEV1/FVC ratio regression. We present a comprehensive evaluation against multiple baseline models, analyzing both quantitative performance and qualitative saliency maps to validate the effectiveness of the proposed approach.

2. Related Works

2.1. COPD Classification and Severity Assessment System

A standardized classification system for COPD is essential for effective disease management and treatment planning. The Global Initiative for Chronic Obstructive Lung Disease (GOLD) was established in 1997 and provides international guidelines for the diagnosis, management, and prevention of COPD [2]. The core of the GOLD classification system is the assessment of the severity of airway obstruction through pulmonary function tests, diagnosing COPD when the FEV1/FVC ratio is less than 0.7 after bronchodilator administration [12,30].
The traditional GOLD classification is divided into four levels based on the percentage of FEV1 predicted. Stage 1 (mild) is defined as FEV1 greater than 80% of predicted, Stage 2 (moderate) is 50–79%, Stage 3 (severe) is 30–49%, and Stage 4 (very severe) is less than 30% [2,31]. These classifications reflect the physiological severity of the disease and are important criteria for clinical decision-making and treatment intensity determination [13].
Recent studies have pointed out the limitations of FEV1 alone indicators and proposed a new classification system. Bodduluri et al. [14] developed a STaging of Airflow obstruction using Ratio (STAR) classification system based on the FEV1/FVC ratio, which was found to be less sensitive to racial and ethnic characteristics and better differentiate between survival (a key outcome in major trials such as the TORCH study [32]) and disease burden. The STAR classification is divided into Stage 1 (≥0.60 to <0.70), Stage 2 (≥0.50 to <0.60), Stage 3 (≥0.40 to <0.50), and Stage 4 (<0.40) according to the FEV1/FVC ratio [33].
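For comparison with the GOLD scheme, the STAR cut-points quoted above can be sketched as a small function (the name and None-for-no-obstruction convention are ours, chosen for illustration):

```python
from typing import Optional

def star_stage(fev1_fvc: float) -> Optional[int]:
    """STAR stage from the post-bronchodilator FEV1/FVC ratio,
    per the cut-points quoted in the text [33]. Returns None when
    the ratio is >= 0.70 (no obstruction under the 0.70 criterion)."""
    if fev1_fvc >= 0.70:
        return None
    if fev1_fvc >= 0.60:
        return 1
    if fev1_fvc >= 0.50:
        return 2
    if fev1_fvc >= 0.40:
        return 3
    return 4
```

Unlike GOLD staging, STAR requires only the ratio itself, which is part of its claimed robustness to demographic reference equations.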
In addition, GOLD has introduced a comprehensive assessment system since 2011, including symptom assessment and the risk of acute exacerbations. Symptoms are assessed through the modified British Medical Research Council Dyspnea Scale (mMRC) and the COPD Assessment Test (CAT), which are combined with pulmonary function test results and exacerbation history to classify patients into four groups: A, B, C, and D [34,35]. This multidimensional approach goes beyond simple lung function measurements and enables the establishment of patient-centered treatment strategies.

2.2. Machine Learning-Based COPD Diagnosis

Machine learning technology is emerging as an important tool in COPD diagnosis and prognosis prediction. Zeng et al. [36] developed a machine learning model to predict severe COPD exacerbation in the next 1 year using data from 43,576 patients, achieving excellent performance with an area under the ROC curve (AUC) of 0.866. This model showed a sensitivity of 56.6%, a specificity of 91.17%, and an accuracy of 90.33%, proving to be useful for the application of management programs through the identification of high-risk patients.
A systematic literature review and meta-analysis study by Smith et al. [37] confirmed that machine learning and deep learning models performed similarly or better than conventional disease severity scoring systems in predicting the long-term prognosis of COPD. However, the need for strict adherence to reporting guidelines and the importance of improving study reproducibility to reduce the risk of bias were also emphasized.
The application of machine learning for early diagnosis of COPD is also being actively studied. Lin et al. [38] reported that machine learning-based risk prediction models are effective in the early detection of COPD even in the context of imbalanced data. In particular, because the subtlety of early COPD symptoms delays diagnosis by an average of 3.6 years, the early-detection capability of machine learning models is of great clinical significance [39,40].
Nikolaou et al. [41] conducted a study to classify COPD phenotypes by combining cluster analysis and machine learning. They identified five COPD phenotypes through cluster analysis of electronic health record data, and characterized the demographic characteristics, comorbidities, risk of death, and exacerbation patterns of each phenotype. This approach may contribute to the development of personalized treatment strategies.

2.3. Deep Learning-Based CT Image Analysis

Chest CT imaging is a powerful tool for visualizing structural changes in COPD, and advancements in deep learning technology have significantly improved the accuracy of CT-based COPD diagnosis. A systematic literature review by Wu et al. [21] confirmed that AI has excellent performance in identifying and quantifying emphysema, airway dynamics, and vascular structure in CT images. CNNs can be trained to automatically identify structural changes such as emphysema, thickened airways, and mucus plugs in COPD patients.
González et al. [42] trained a 2D CNN model using four standard CT slices and achieved an accuracy of 77.3% in the COPDGene test cohort. This marked an important milestone in early CNN-based COPD diagnostic research. Tang et al. [43] achieved an AUC of 0.889 in the PanCAN dataset in a study using a deep residual network, demonstrating that effective COPD detection is possible even with low-dose CT scans.
Advances in 3D CNN models have made it possible to make more effective use of spatial information from CT images. Ho et al. [44] developed a model that integrates Parametric Response Mapping (PRM) techniques and 3D CNNs, achieving a classification accuracy of 89.3% and a sensitivity of 88.3%. This study quantified the rate of functional small airway disease and emphysema, effectively distinguishing between COPD and Non COPD.
Wu et al. [45] proposed a method for identifying COPD through the integration of multi-view snapshots of 3D airway trees and lung regions. Using the ResNet-26 model, features extracted from nine different views were integrated by a majority voting method, achieving a high accuracy of 94.7%. This shows that integrating different perspectives of 3D structures contributes to improving COPD diagnostic performance.
Zhang et al. [46] developed a CNN model that combines inspiratory and expiratory dual-phase CT images with clinical information. The model, which integrates Residual Feature Extracting Blocks Network (RFEBNet) and Fully Connected Network (FCNet), achieved an AUC of 0.925 in internal tests and an AUC of 0.896 in external tests, demonstrating high generalization performance.
Recent studies have combined multi-instance learning (MIL) with attention mechanisms to achieve high performance even with limited medical data. Xue et al. [47] proposed an MIL method with two-stage attention, achieving an accuracy of 88.1%, and the VGG-16 backbone outperformed other CNN architectures. The local–global deep learning framework by Cai et al. [48] utilized group attention and slice-aware loss to achieve an accuracy of 96.08% and a sensitivity of 94.12%.

2.4. Utilizing Transfer Learning and Pre-Trained Models

In the field of medical imaging, data scarcity is one of the main obstacles to the development of deep learning models. Transfer learning can address this problem by transferring knowledge from models pre-trained on large natural image datasets such as ImageNet to medical image analysis [49]. A systematic literature review by Shorten et al. [50] analyzed 121 medical image classification studies and confirmed that using deep models such as ResNet and Inception as feature extractors can save computational cost and time while maintaining predictive performance.
Polat et al. [51] determined COPD severity on chest CT imaging using a deep transfer learning network. By fine-tuning a pre-trained deep learning model, high classification performance was achieved even with a limited COPD dataset. Sørensen et al. [52] proposed a weighted multi-instance classifier in a transfer learning study using multicenter datasets and developed a model that generalizes well across data acquired by different scanners and protocols.
Wang et al. [53] conducted a transfer learning-based COPD screening study using chest X-ray imaging. By combining the EfficientNet model with transfer learning, excellent diagnostic performance was achieved in a multicenter retrospective study, and the model’s decision-making process was visualized through Gradient-weighted Class Activation Mapping (Grad-CAM). This contributed to improving the explainability of AI models.
Recent studies have emphasized the importance of pre-training strategies specific to the medical imaging domain. Projects such as Med3D have shown that pre-training 3D CNNs on a variety of medical image data yields feature representations more suitable for medical image analysis than ImageNet-based transfer learning [54]. In the study of Li et al. [55], a comprehensive diagnostic model was developed that fused radiological features and deep learning features for the diagnosis of COPD; high diagnostic performance was achieved by combining feature extraction using a Variational Autoencoder with classification through a Multi-Layer Perceptron.
The introduction of new architectures, such as Graph Neural Networks (GNNs), is also noteworthy. Liu et al. [56] utilized a graph convolutional network (GCN) to effectively detect COPD even with limited data and weak labeling. This demonstrated that spatially heterogeneous COPD lesions can be better detected by utilizing topological structure information between randomly selected regions of interest (ROI).

3. Proposed Methods

In this study, we developed a multi-task deep learning framework designed to concurrently perform classification and regression analysis on CXR images. The proposed architecture integrates a high-capacity convolutional neural network backbone with a slot-based attention mechanism to extract and subsequently refine disease-relevant features from high-resolution medical images. The overall pipeline encompasses data pre-processing, model training utilizing specialized loss functions for each task, and subsequent validation procedures.

3.1. Overall Architecture

The complete architecture of the proposed model is schematically illustrated in Figure 1. The model is configured to process input CXR images with predetermined dimensions of 512 pixels in both height and width, across three RGB color channels. Data flows through the network in batches, denoted by the batch size B, resulting in an input tensor shape represented as B × 3 × 512 × 512.
As shown in the global view of Figure 1, the processing pipeline begins with a backbone network that encodes the raw high-dimensional image data into a compact feature representation. Unlike conventional architectures that directly utilize this backbone output for final predictions, our approach introduces an intermediate Slot Attention module. This module functions as a semantic bottleneck, decomposing the monolithic feature vector from the backbone into a set of distinct component vectors, referred to as slots. These refined slots are subsequently concatenated to form a unified rich representation, which is then fed into parallel output heads. Each head is specialized for its respective task: the classification head outputs probabilities for three diagnostic categories while the regression head simultaneously predicts the continuous FEV1/FVC ratio.

3.2. Feature Extraction Backbone

For the primary encoder, we employed the ConvNeXt-Large architecture [57]. ConvNeXt modernizes the standard ResNet design by incorporating architectural choices inspired by Vision Transformers, such as larger convolutional kernel sizes (7 × 7), replacing ReLU with GELU activations, and utilizing fewer normalization layers. These adjustments allow the model to achieve competitive performance and scalable capacity while maintaining the inductive biases of convolutional networks, which remain advantageous for visual data processing.
The detailed structure of the backbone utilized in our framework is depicted in Figure 2. The network consists of a stem layer followed by four progressive stages. The input image of shape B × 3 × 512 × 512 is first processed by the stem layer, which applies a 4 × 4 convolution with a stride of 4, effectively downsampling the spatial resolution to 128 × 128 while expanding the channel depth to 192. The subsequent four stages are composed of varying numbers of ConvNeXt blocks—specifically 3, 3, 27, and 3 blocks, respectively, for the ‘Large’ variant. Between these stages, downsampling layers further reduce spatial dimensions by a factor of 2 while doubling the channel capacity. Upon completing Stage 4, the resulting feature maps have dimensions of B × 1536 × 16 × 16. A global average pooling operation is then applied to collapse the spatial dimensions entirely, effectively summarizing the image content into a single feature vector of 1536 dimensions for each sample in the batch. A final flattening step ensures the output is a two-dimensional tensor with shape B × 1536, ready for the subsequent attention-based refinement.
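The shape bookkeeping above can be traced with a small stdlib-only sketch (purely illustrative shape arithmetic, not the torchvision implementation):

```python
def convnext_large_shapes(batch, height=512, width=512):
    """Trace (B, C, H, W) through the ConvNeXt-Large stem and four
    stages as described in the text; returns (name, shape) pairs."""
    shapes = []
    # Stem: 4x4 conv, stride 4 -> spatial /4, channels 3 -> 192
    c, h, w = 192, height // 4, width // 4
    shapes.append(("stem", (batch, c, h, w)))
    # Stages 1-4 contain 3, 3, 27, 3 ConvNeXt blocks respectively;
    # the blocks keep shapes unchanged, while the downsampling layers
    # between stages halve H and W and double C.
    depths = (3, 3, 27, 3)
    for i, _num_blocks in enumerate(depths, start=1):
        if i > 1:  # downsampling sits between consecutive stages
            c, h, w = c * 2, h // 2, w // 2
        shapes.append((f"stage{i}", (batch, c, h, w)))
    # Global average pooling + flatten -> (B, C)
    shapes.append(("pooled", (batch, c)))
    return shapes
```

Running this for a batch of 8 reproduces the dimensions stated in the text: stage 4 yields 8 × 1536 × 16 × 16, and pooling yields 8 × 1536.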

3.3. Slot Attention Module

To enable more granular feature processing and disentanglement of semantic concepts, we incorporated a Slot Attention module [58] as a trainable bottleneck. The internal mechanism of this module is detailed in Figure 3. The module interfaces with the aggregated features from the backbone and distributes the contained information into a fixed number of five distinct slots (K = 5).
The process initiates by expanding the backbone’s output vector B × 1536 to B × 1 × 1536 to introduce a sequence dimension (N = 1), allowing it to function as visual keys and values within the attention mechanism. Simultaneously, five slot vectors, each maintaining the dimension D = 1536, are initialized using learnable parameters derived from a Gaussian distribution with trained mean and variance. These slots serve as queries that iteratively compete for information from the input feature vector.
As illustrated in the iterative loop of Figure 3, this refinement occurs over three recurrent update steps. In each iteration, a dot-product attention mechanism calculates similarity scores between the current slot states (queries) and the input features (keys). These scores are normalized via a softmax function across the distinct slots, ensuring a competitive allocation of information. The weighted sum of input values is then used to update the slots via a Gated Recurrent Unit (GRU) [59], followed by a Multi-Layer Perceptron (MLP) with residual connections. Through this iterative process, the initially random slots progressively diverge and specialize to capture different latent aspects of the input image representation.
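The competitive update loop can be sketched in plain Python as follows. This is a deliberately minimal sketch: the softmax is taken across slots (the defining property of Slot Attention), but the GRU and MLP of the original module [58] are simplified here to a residual average, so the numerical behavior is illustrative only.

```python
import math, random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def slot_attention(inputs, num_slots=5, iters=3, seed=0):
    """Minimal Slot Attention sketch: slots compete for input features
    via a softmax taken ACROSS slots, then are updated from their
    attention-weighted input means. `inputs` is a list of feature
    vectors (lists of floats)."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Slots initialized from a Gaussian (learnable in the real module)
    slots = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_slots)]
    scale = dim ** -0.5
    for _ in range(iters):
        # Scaled dot-product scores between every slot and every input
        logits = [[scale * sum(s[d] * x[d] for d in range(dim)) for x in inputs]
                  for s in slots]
        # Normalize over slots, per input: competitive allocation
        attn_cols = [softmax([logits[k][n] for k in range(num_slots)])
                     for n in range(len(inputs))]
        new_slots = []
        for k in range(num_slots):
            w = [attn_cols[n][k] for n in range(len(inputs))]
            total = sum(w) or 1.0
            upd = [sum(w[n] * inputs[n][d] for n in range(len(inputs))) / total
                   for d in range(dim)]
            # Simplified residual update in place of the GRU + MLP
            new_slots.append([0.5 * slots[k][d] + 0.5 * upd[d]
                              for d in range(dim)])
        slots = new_slots
    return slots
```

In the proposed model the "inputs" sequence has length N = 1 (the pooled backbone vector), K = 5 slots, D = 1536, and three iterations.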

3.4. Multi-Task Output Heads

The final stage of the network diverges into separate pathways to handle the disparate objectives of classification and regression simultaneously. The five refined slot vectors from the attention module, each with a dimensionality of 1536, are first flattened and concatenated into a single unified representation vector with a total dimensionality of 7680 (5 × 1536).
This combined vector serves as the common input to two distinct linear layers functioning as task-specific heads. The classification head projects this 7680-dimensional vector down to three output nodes (B × 3). These nodes correspond to the un-normalized logits for the three target classes: Non COPD, Mild COPD, and Severe COPD. During inference, a Softmax function is applied to these logits to yield the final predicted probabilities for each severity level. In parallel, the regression head maps the same feature vector to a single scalar output node (B × 1), which represents the predicted continuous value of the FEV1/FVC ratio, a critical spirometric indicator of airflow limitation. By sharing the entire feature extraction and slot refinement pipeline up to these final layers, the model is encouraged to learn generalized representations that are robust enough to support both discrete categorization and continuous variable prediction tasks without needing separate backbone networks.
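The head layout described above (concatenate K = 5 slots of dimension 1536 into a 7680-dimensional vector, then apply two independent linear projections) can be sketched as follows; the helper names and random initialization are ours, for illustration only:

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Weight matrix + bias for a plain linear layer (illustrative)."""
    w = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return w, b

def forward_heads(slots, cls_head, reg_head):
    """Concatenate the K refined slots and run both task-specific
    heads, mirroring the layout described in the text."""
    x = [v for slot in slots for v in slot]  # (K * D)-dim shared vector

    def linear(x, wb):
        w, b = wb
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(w, b)]

    return linear(x, cls_head), linear(x, reg_head)  # logits(3), ratio(1)
```

With the paper's dimensions, `cls_head` would map 7680 → 3 (Non/Mild/Severe logits, Softmax applied at inference) and `reg_head` 7680 → 1 (standardized FEV1/FVC).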

4. Experimental Environment

The proposed model was implemented using the PyTorch framework version 1.12 (Meta Platforms, Inc., Menlo Park, CA, USA) and the Python programming language version 3.8.13 (Python Software Foundation, Wilmington, DE, USA). All model training and evaluation processes were conducted on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corp., Santa Clara, CA, USA).

4.1. Dataset

This retrospective study was conducted in accordance with the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Cheonan Soonchunhyang Hospital (SCH; SCHCA 14 June 2025, approved on 4 July 2025). The dataset consists of de-identified CXR images in PNG format. To ensure consistency in anatomical presentation, only standard Anteroposterior (AP) and Posteroanterior (PA) projection views were included in the study cohort.
Ground truth labels for the classification task were established based on spirometric assessment following the Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines [2]. To avoid confusion with existing guidelines, we explicitly defined the diagnostic criteria for our three-class stratification system.
The Non COPD group included subjects with normal pulmonary function, defined as a post-bronchodilator ratio of forced expiratory volume in 1 s to forced vital capacity (FEV1/FVC) of 0.70 or higher.
The Mild COPD category corresponded to GOLD Stage 1 (Mild) and GOLD Stage 2 (Moderate). Patients in this group exhibited airflow limitations characterized by an FEV1/FVC ratio below 0.70 and an FEV1 of 50% or more of the predicted value.
The Severe COPD category merged patients from GOLD Stage 3 (Severe) and GOLD Stage 4 (Very Severe). This group represented advanced disease states, defined by severe airflow limitation with an FEV1/FVC ratio below 0.70 and an FEV1 of less than 50% of the predicted value [60].
This stratification strategy was adopted to balance clinical relevance with model training stability, effectively grouping patients requiring similar levels of medical intervention.
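The three-class labeling rule defined above can be summarized in one function (the function and label strings are illustrative, not part of the study's labeling pipeline):

```python
def study_class(fev1_fvc: float, fev1_pct_predicted: float) -> str:
    """Three-class stratification used in this study: GOLD stages 1-2
    are merged into 'Mild COPD' and stages 3-4 into 'Severe COPD',
    using the post-bronchodilator thresholds stated in the text."""
    if fev1_fvc >= 0.70:
        return "Non COPD"
    if fev1_pct_predicted >= 50:
        return "Mild COPD"
    return "Severe COPD"
```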
To minimize class imbalance inherent in raw clinical data, the dataset collection followed a prospective protocol established prior to data extraction. The inclusion criteria were designed to acquire a targeted number of high-quality cases for each severity class, aiming for a balanced distribution (approximately 1599 Normal, 1200 Mild, and 1200 Severe cases) suitable for robust model training. This curated collection strategy ensures that the dataset reflects a controlled distribution for experimental validation rather than a random sample of hospital admissions.
The baseline demographic and clinical characteristics of the study population are summarized in Table 1. The dataset included a total of 5000 subjects. Statistical analysis (One-way ANOVA for age and FEV1/FVC; Chi-square test for sex) revealed significant differences in age and gender distribution among the three groups (p < 0.001). This reflects the natural prevalence of COPD, which is strongly associated with older age and male sex in the Korean population.
The study cohort was partitioned into training, validation, and test sets. To ensure rigorous evaluation and prevent data leakage, this split was performed strictly at the patient level. Consequently, images from the same subject were assigned exclusively to one subset, ensuring that no patient appeared in multiple splits. The detailed distribution of image samples across these three classes for each dataset split is presented in Table 2.
For the regression task, the ratio of forced expiratory volume in 1 s to forced vital capacity (FEV1/FVC) was utilized as the target variable. This ratio is the primary spirometric index used for diagnosing obstructive lung diseases, representing the fraction of vital capacity that can be expired in the first second of a forced exhalation [61].
The distribution of FEV1/FVC ratios across the training, validation, and test splits is illustrated in Figure 4. To facilitate direct comparison of data distribution among splits, histograms are overlaid with distinct colors. A single extreme outlier with a recorded value of 1114 was identified in the training set; this data point was determined to be physiologically impossible and was permanently removed from the dataset to ensure data integrity and training stability. Descriptive statistics for the FEV1/FVC ratio, calculated from the cleaned dataset for each individual split, are summarized in Table 3.

4.2. Data Preprocessing

Input CXR images were preprocessed to align with the initialization requirements of the backbone network pre-trained on ImageNet. Specifically, pixel intensity values were normalized using the standard ImageNet RGB mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225) after rescaling to the [0, 1] range. While medical images typically exhibit different intensity distributions compared to natural images, we maintained the ImageNet normalization statistics to preserve the feature distribution expected by the pre-trained weights, thereby ensuring stable convergence during transfer learning. All images were resized to a uniform spatial resolution of 512 × 512 pixels via bicubic interpolation. To improve model generalization and robustness against variations in image quality, we employed RandAugment [62], an automated data augmentation strategy. For this study, the RandAugment parameters were set to N = 2 (number of augmentations to apply sequentially) and M = 28 (magnitude of the augmentation operations).
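The per-pixel normalization step can be written out explicitly (a minimal sketch of the standard ImageNet transform, operating on a single 8-bit RGB pixel for clarity):

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_255):
    """Rescale an 8-bit (R, G, B) pixel to [0, 1], then apply the
    channel-wise ImageNet normalization described in the text."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb_255, IMAGENET_MEAN, IMAGENET_STD))
```

In practice this is applied to every pixel of the resized 512 × 512 image, keeping the input distribution close to what the ImageNet-pretrained backbone expects.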
Target variables were processed according to their respective tasks. Diagnostic class labels for classification were converted using one-hot encoding. The continuous FEV1/FVC ratio labels for regression underwent standardization (Z-score normalization) utilizing the global mean and standard deviation derived from the training dataset. This step ensured that the regression targets had a comparable scale to the classification outputs, preventing either task from dominating the joint gradient descent process.
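The Z-score standardization of the regression targets, and the inverse transform used later at evaluation time, can be sketched as (stdlib-only, with statistics fitted on the training split only, as the text specifies):

```python
def zscore_fit(train_values):
    """Mean and (population) standard deviation from the TRAINING
    split only, as described in the text."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n
    return mean, var ** 0.5

def zscore(value, mean, std):
    """Standardize a raw FEV1/FVC target before training."""
    return (value - mean) / std

def inverse_zscore(z, mean, std):
    """Restore a model output to the original physiological scale."""
    return z * std + mean
```

The same mean/std pair must be reused for validation and test targets; fitting them per split would leak information and distort the inverse transform.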

4.3. Experimental Setup and Metric

The model was trained for a total of 250 epochs with a batch size of 32. Network parameters were optimized using the AdamW optimizer [63], which decouples weight decay from gradient updates. The initial learning rate was set to 1 × 10−4 with a weight decay coefficient of 1 × 10−4. To stabilize the early training phase, a flat warmup strategy was applied for the first 5 epochs, followed by a Cosine Annealing learning rate scheduler [64] for the remainder of the training process. The multi-task objective function combined distinct losses for each head. For the multi-class classification task, Categorical Cross-Entropy loss was utilized with label smoothing [65] of 0.1 to mitigate overconfidence in predictions. For the regression task, the Log-Cosh loss function was employed, as it approximates Mean Squared Error for small errors and Mean Absolute Error for larger errors, making it robust to outliers. To ensure a balanced optimization process where neither task dominates the gradient updates, we adopted a fixed weighting scheme, assigning an equal weight of 1.0 to both the classification and regression losses. The overall hyperparameters are summarized in Table 4.
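The Log-Cosh regression loss mentioned above can be sketched as follows; the overflow-safe identity log(cosh(e)) = |e| + log(1 + e^(−2|e|)) − log 2 is used so that large errors do not overflow `exp`:

```python
import math

def log_cosh_loss(preds, targets):
    """Mean Log-Cosh loss over a batch: behaves like 0.5 * e^2 (MSE)
    for small errors and like |e| - log 2 (MAE) for large errors,
    which is what makes it robust to outliers."""
    total = 0.0
    for p, t in zip(preds, targets):
        e = abs(p - t)
        # Numerically stable form of log(cosh(e))
        total += e + math.log1p(math.exp(-2.0 * e)) - math.log(2.0)
    return total / len(preds)
```

The two regimes are easy to verify numerically: for an error of 0.01 the loss is essentially 0.5 × 0.01², while for an error of 100 it is essentially 100 − log 2.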
Performance evaluation was stratified into classification and regression components to align with the multi-task learning objective. For the multi-class classification task, performance was assessed using macro-averaged metrics derived from one-vs-all confusion matrices for each diagnostic category. This approach ensures equal weight is given to all three classes regardless of their prevalence in the test set. The primary classification metrics include Sensitivity, Specificity, and Accuracy. These are calculated as the arithmetic mean of the respective per-class metrics, providing a balanced view of model performance across all severity levels. The mathematical definitions of these metrics are presented in Equations (1)–(3):
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \quad (1)$$

$$\mathrm{Sensitivity} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i} \quad (2)$$

$$\mathrm{Specificity} = \frac{1}{N}\sum_{i=1}^{N}\frac{TN_i}{TN_i + FP_i} \quad (3)$$
where N denotes the number of diagnostic classes (here N = 3), and $TP_i$, $TN_i$, $FP_i$, and $FN_i$ represent the number of True Positives, True Negatives, False Positives, and False Negatives for class i, respectively.
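Equations (1)–(3) can be computed directly from a confusion matrix via one-vs-all counts, as this stdlib-only sketch shows (the row/column convention is ours, chosen for illustration):

```python
def macro_metrics(confusion):
    """Macro-averaged Accuracy, Sensitivity and Specificity from an
    N x N confusion matrix (rows = true class, cols = predicted),
    using one-vs-all TP/TN/FP/FN per class as in Equations (1)-(3)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    acc = sens = spec = 0.0
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[r][i] for r in range(n)) - tp
        tn = total - tp - fn - fp
        acc += (tp + tn) / total
        sens += tp / (tp + fn)
        spec += tn / (tn + fp)
    return acc / n, sens / n, spec / n
```

Because each class contributes equally to the average, a rare class degrades the macro scores as much as a common one, which is the balancing property described above.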
For the regression task, the predictive accuracy of the continuous FEV1/FVC ratio was evaluated using Mean Absolute Error, Mean Squared Error, and the Coefficient of Determination. Prior to calculating these metrics, the raw output values of the model were inversely standardized using the global mean of 74.83 and standard deviation of 29.18 from the original dataset. This step restores the predicted values to their original physiological scale for clinically meaningful validation. Mean Absolute Error measures the average magnitude of errors in a set of predictions without considering their direction. Mean Squared Error measures the average of the squares of the errors, giving more weight to larger differences. The Coefficient of Determination provides an indication of goodness-of-fit, representing the proportion of variance in the dependent variable that is predictable from the independent variable. These metrics are formally defined in Equations (4)–(6):
$$\mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} \left| y_j - \hat{y}_j \right| \quad (4)$$
$$\mathrm{MSE} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2 \quad (5)$$
$$R^2 = 1 - \frac{\sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2}{\sum_{j=1}^{n} \left( y_j - \bar{y} \right)^2} \quad (6)$$
where $n$ is the total number of test samples, $y_j$ is the actual ground-truth FEV1/FVC value for sample $j$, $\hat{y}_j$ is the corresponding predicted value after inverse standardization, and $\bar{y}$ is the mean of the ground-truth values.
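The inverse standardization and the regression metrics in Equations (4)–(6) can be sketched as below. The global mean (74.83) and standard deviation (29.18) are taken from the text; the example values are illustrative, not study results.

```python
import numpy as np

MEAN, STD = 74.83, 29.18  # global dataset statistics reported in the text

def inverse_standardize(z):
    """Map standardized model outputs back to the physiological FEV1/FVC scale."""
    return z * STD + MEAN

def regression_metrics(y_true, y_pred):
    """MAE, MSE, and R^2 (Equations (4)-(6)) on the original scale."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(mae), float(mse), float(r2)

# Illustrative example: standardized predictions restored before scoring
y_true = np.array([64.0, 48.0, 80.0])
z_pred = np.array([(63.79 - MEAN) / STD, (48.54 - MEAN) / STD, (79.0 - MEAN) / STD])
mae, mse, r2 = regression_metrics(y_true, inverse_standardize(z_pred))
```

Scoring after inverse standardization, as done here, is what makes the reported MAE directly interpretable in FEV1/FVC percentage points.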
This dual-evaluation framework ensures a balanced optimization of the multi-task objective, verifying that the model achieves reliable performance across both qualitative disease staging and quantitative pulmonary function estimation.

5. Experimental Results

5.1. Quantitative Analysis

5.1.1. Validation Performance and Model Selection

All models were trained for a maximum of 250 epochs. The validation set was used to monitor training dynamics and determine the optimal checkpoint that balanced classification accuracy and regression precision.
Table 5 summarizes the performance of all models on the validation set. Among the baseline architectures, ViT-L/16 showed the most competitive performance with a Macro Accuracy of 0.9120 and an R 2 of 0.7021. The EfficientNet-V2-L, which previously exhibited instability, successfully converged after data cleaning and hyperparameter adjustment, achieving a Macro Accuracy of 0.9027 and an R 2 of 0.6363.
However, the proposed ConvNeXt Large with Slot Attention (Slot 5) consistently outperformed all baselines. It demonstrated stable convergence, peaking at epoch 132, with the highest validation metrics across the board: a Macro Accuracy of 0.9267, an MAE of 8.2010, and an R 2 of 0.7581. This indicates that the 5-slot configuration effectively captures the validation data distribution without overfitting.

5.1.2. Test Set Performance Analysis

We evaluated the generalization capability of the trained models using the held-out test set. The comprehensive quantitative results are presented in Table 6.
Standard CNN baselines generally established the lower bound of performance. ResNet152 and DenseNet201 achieved R 2 values of 0.6363 and 0.6505, respectively. While GoogLeNet and EfficientNet-V2-L showed decent classification accuracy (around 0.89), their regression performance remained moderate ( R 2 < 0.69). ViT-L/16 and the vanilla ConvNeXt-Large proved to be strong baselines, achieving R 2 values of 0.7173 and 0.7073, respectively, highlighting the efficacy of modern architectures in medical imaging.
The integration of the Slot Attention mechanism yielded significant performance gains. The proposed Slot 5 model achieved the best overall results with a Macro Accuracy of 0.9107, Macro Sensitivity of 0.8603, Macro Specificity of 0.9324, and an R 2 of 0.7591. This represents a substantial improvement over the vanilla ConvNeXt-Large backbone, confirming that the slot-based feature refinement is critical for the multi-task objective.
To provide a more granular view of the classification performance, we visualize the confusion matrices of our proposed model on the test set in Figure 5. The figure is structured to offer both a holistic and a class-specific perspective. The top-left panel (0, 0) displays the overall 3 × 3 multi-class confusion matrix, illustrating the distribution of predicted versus actual labels across all three categories. To further dissect the proposed model’s discriminative ability for each specific severity level, the subsequent panels present the one-vs-all binary confusion matrices: the top-right panel (0, 1) details the performance for the Non-COPD class against all others; the bottom-left panel (1, 0) focuses on the Mild COPD class; and the bottom-right panel (1, 1) showcases the results for the Severe COPD class. These detailed matrices confirm that our method maintains high true positive rates and low false positive rates consistently across all diagnostic categories, rather than achieving high overall accuracy by biasing towards the majority class.

5.1.3. Ablation Study: Impact of Slot Configuration

To determine the optimal representational capacity for the multi-task objective, we conducted an ablation study by varying the number of slots (K) from 3 to 7. The detailed performance comparison on the test set is presented in Table 7.
The baseline ConvNeXt-Large model without the slot mechanism achieved a Macro Accuracy of 0.9013 and an R 2 of 0.7073. Integrating the Slot Attention decoder generally improved performance, validating the efficacy of the feature disentanglement approach. The model with 3 slots showed a substantial improvement in the regression task, achieving an R 2 of 0.7534. This suggests that even a limited number of slots can effectively aggregate features for pulmonary function estimation.
However, the proposed configuration with 5 slots demonstrated the most robust performance across both tasks. It achieved the highest Macro Accuracy of 0.9107 and R 2 of 0.7591, surpassing both the baseline and other slot variants. Increasing the number of slots beyond five resulted in performance saturation or degradation; the models with 6 and 7 slots recorded lower R 2 values of 0.7272 and 0.7328, respectively. This trend indicates that while a sufficient number of slots is required to capture distinct radiographic signs such as hyperinflation and bronchial thickening, an excessive number may lead to the over-segmentation of semantic features or the inclusion of background noise. Consequently, 5 slots were identified as the optimal hyperparameter for balancing model complexity and generalization capability in this unified framework.

5.1.4. Cross-Validation Analysis

To further validate the robustness and generalization capability of the proposed method, we performed a 5-fold cross-validation. The entire dataset, comprising training, validation, and test sets, was merged and randomly partitioned into five equal-sized folds at the patient level. In each iteration, one fold was used for validation while the remaining four folds were used for training. This process ensures that every sample in the dataset is evaluated exactly once, minimizing the bias associated with a fixed train–test split.
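The patient-level partitioning described above can be sketched with scikit-learn's `GroupKFold`, which guarantees that all images from one patient fall into the same fold. The patient IDs and two-images-per-patient layout here are illustrative assumptions, not the study's actual data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

images = np.arange(20)                  # stand-ins for 20 CXR samples
patients = np.repeat(np.arange(10), 2)  # assumed: two images per patient

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(images, groups=patients)):
    # Patient-level splitting: no patient appears in both partitions,
    # which prevents leakage of patient-specific features across folds.
    assert set(patients[train_idx]).isdisjoint(patients[val_idx])
```

Splitting at the patient level rather than the image level is what prevents optimistic bias from near-duplicate radiographs of the same individual.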
Table 8 summarizes the performance metrics for each fold along with the mean and standard deviation across all folds. The model demonstrated consistent performance stability, with a mean Macro Accuracy of 0.9195 (±0.061) and a mean $R^2$ of 0.7341 (±0.0148). The low standard deviations across both classification and regression metrics indicate that the proposed architecture is not sensitive to specific data partitions and can reliably learn disentangled features for COPD assessment regardless of the subset distribution.

5.2. Qualitative Analysis

To interpret the model’s decision-making process, we generated Saliency Maps for all test-set images. These maps visualize the regions that most influenced the model’s predictions. To enhance visualization clarity and suppress insignificant background noise, a Gaussian filter with a sigma of 5 was applied as a post-processing step, followed by a threshold of 0.5 to filter out low-activation regions.
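The post-processing described above (Gaussian smoothing with sigma = 5, then a 0.5 threshold) can be sketched as follows. The normalization step and the random input map are illustrative assumptions; the source states only the filter sigma and the threshold.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def postprocess_saliency(saliency, sigma=5.0, threshold=0.5):
    """Smooth a saliency map, rescale it to [0, 1] (assumed step),
    and zero out activations below the threshold."""
    smoothed = gaussian_filter(saliency.astype(np.float64), sigma=sigma)
    rng = smoothed.max() - smoothed.min()
    norm = (smoothed - smoothed.min()) / rng if rng > 0 else smoothed * 0.0
    # Suppress weak background activations for cleaner visualization
    return np.where(norm >= threshold, norm, 0.0)

gen = np.random.default_rng(0)
raw = gen.random((512, 512))  # stand-in for a raw gradient-based saliency map
clean = postprocess_saliency(raw)
```

The smoothing step merges scattered per-pixel gradients into coherent regions, and the threshold keeps only the dominant half of the activation range.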
Figure 6 presents representative examples of these visualized Saliency Maps across different diagnostic classes. Baseline models occasionally exhibited diffuse attention, focusing on irrelevant areas outside the lung fields, such as the clavicles or soft tissue. In comparative analysis, the proposed ConvNeXt Large with Slot Attention model demonstrated a tendency to produce activation patterns that were more distinctly constrained within the lung parenchyma. The baseline models’ focus on non-pulmonary structures was significantly reduced in our proposed model, confirming that the Slot Attention mechanism effectively drives the model to learn disease-specific radiographic features.
This qualitative observation is further supported by the quantitative FEV1/FVC predictions displayed in the figure. For the Mild COPD case, the proposed model’s prediction of 63.79 is closer to the ground truth of 64.00 than predictions from other models, such as the vanilla ConvNeXt’s 64.32. A similar result is seen in the Severe COPD case, where the proposed model predicted 48.54 against the ground truth of 48.00, compared with larger errors from models such as DenseNet201 (49.26) and, more markedly, vanilla ConvNeXt (55.70).
This subtle qualitative trend, combined with the demonstrated accuracy of the regression outputs, aligns with the overall quantitative findings. It suggests that the Slot Attention mechanism may contribute to a modest reduction in noise from irrelevant features, rather than a complete localization, thereby guiding the model toward more disease-relevant representations for both tasks.

6. Discussion

This study proposed a unified deep learning framework integrating a ConvNeXt backbone with Slot Attention to simultaneously classify COPD severity and estimate the FEV1/FVC ratio from chest radiographs. Our findings demonstrate that this multi-task approach achieves high diagnostic accuracy and precise functional quantification, outperforming standard CNN and Vision Transformer baselines.

6.1. Clinical Implications and Diagnostic Performance

The primary clinical contribution of this work is the demonstration that deep learning can effectively extract spirometric surrogates from standard CXR images. The model achieved a Test R 2 of 0.7591 for FEV1/FVC estimation, a figure that aligns with recent studies reporting strong correlations between radiographic features and pulmonary function [21,42]. While structural changes on X-rays do not always linearly map to functional impairment, our results suggest that deep convolutional networks can identify complex, non-linear patterns—such as subtle parenchymal texture variations and micro-architectural distortions—that are often imperceptible to human observers but highly indicative of airflow limitation.
In resource-limited settings where spirometry is unavailable or contra-indicated, this automated tool could serve as a valuable screening mechanism. By providing an immediate, low-cost assessment of COPD severity, it could help prioritize high-risk patients for confirmatory pulmonary function testing, thereby optimizing healthcare resource allocation.

6.2. Role of Slot Attention in Feature Disentanglement

A key methodological innovation of this study is the application of Slot Attention to medical image analysis. COPD diagnosis relies on the synthesis of various localized radiographic signs, such as hyperinflation, diaphragm flattening, and bronchial wall thickening. Standard CNNs often struggle to disentangle these heterogeneous features from the global context.
Our ablation study (Table 7) and qualitative analysis (Figure 6) provide empirical evidence that the Slot Attention mechanism addresses this limitation. The 5-slot configuration yielded the highest performance, suggesting it provides the optimal capacity to separate distinct disease-relevant concepts. Visually, the slot-based model produced more constrained and localized attention maps compared to the diffuse activation seen in baseline models. This indicates that the slots effectively “compete” for different anatomical regions, allowing the model to aggregate specific local evidence for a more robust global prediction.

6.3. Limitations and Future Directions

Despite the promising results, this study has several limitations. First, our experiments relied on data collected from a single institution, utilizing a specific set of scanner types and imaging protocols. This creates a potential risk of the model overfitting to local acquisition parameters and the specific demographic characteristics of the patient population at our center. Consequently, the generalizability of our proposed model to external datasets acquired from different institutions or scanner manufacturers has not yet been fully verified. Future studies should aim to validate the model using large-scale, multi-center cohorts to ensure robustness across diverse clinical environments and imaging conditions.
Second, the input resolution was standardized to 512 × 512 pixels due to computational constraints. While sufficient for capturing major structural changes, higher resolutions might be necessary to detect finer details of mild COPD, such as early vascular pruning. Future research should explore high-resolution training strategies or multi-scale architecture.
Third, our “Severe” class merges GOLD stages 3 and 4. While this simplification is clinically practical for identifying patients requiring urgent intervention, a more granular classification aligning strictly with all four GOLD stages would be beneficial for precise disease staging.
Finally, as shown in Table 2, significant demographic differences were observed between classes, with the Severe COPD group being older and predominantly male compared to the Non-COPD group. This raises the concern that the model might rely on demographic shortcuts (e.g., predicting age or bone density) rather than pathological features. However, our qualitative analysis using saliency maps (Figure 6) demonstrates that the model’s attention is consistently focused on lung parenchymal structures and airway markers rather than irrelevant features such as bone structure or soft tissue. This suggests that while demographic biases exist in the dataset, the proposed Slot Attention mechanism successfully drives the model to learn disease-specific radiographic patterns. Future work could explicitly integrate tabular data (age, sex, smoking history) into the model to further enhance predictive accuracy and mitigate potential confounding factors.
In conclusion, this study validates the potential of Slot Attention-enhanced deep learning for comprehensive COPD assessment from chest X-rays, offering a promising path toward accessible and interpretable AI-aided diagnosis.

7. Conclusions

This study presented a unified deep learning framework designed for the concurrent multi-task analysis of COPD from standard chest radiographs. By integrating a high-capacity ConvNeXt-Large backbone with a Slot Attention mechanism, the proposed architecture addressed the challenge of simultaneously learning discriminative features for discrete severity classification and continuous functional quantification.
Empirical evaluations on a clinical dataset demonstrated that the proposed model consistently outperformed established CNN baselines (ResNet, DenseNet, EfficientNet families) and Vision Transformer (ViT-L/16) across both tasks. The ablation study specifically highlighted the efficacy of the Slot Attention decoder in refining standard backbone features for multi-task learning. The 5-slot configuration proved optimal, achieving the highest test set performance with an Accuracy of 0.9107 for three-class severity stratification and an $R^2$ of 0.7591 for FEV1/FVC ratio estimation.
Furthermore, qualitative analysis using saliency maps suggested that the slot-based approach contributes to attention patterns that are more constrained to clinically relevant pulmonary structures, compared to baseline models. These findings suggest that incorporating explicit feature disentanglement mechanisms like Slot Attention can effectively enhance both the predictive performance and interpretability of deep learning models in complex medical imaging tasks. Future work may involve validating this approach on larger, multi-center cohorts to further confirm its generalizability in diverse clinical environments.

Author Contributions

Conceptualization and supervision, S.C. and W.J.; data curation, methodology, and writing—original draft preparation, W.J.; formal analysis and writing—review and editing, H.J.; methodology, software, and validation, H.J. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Soonchunhyang Research Fund. This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP)—Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2024-00436773, 50%).

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Cheonan Soonchunhyang Hospital (Soonchunhyang University Cheonan Hospital, Cheonan, Korea, SCHCA 2025-06-014, date: 4 July 2025).

Informed Consent Statement

Patient consent was waived due to the retrospective design of this study.

Data Availability Statement

The original contributions presented in the study are included in the article.

Conflicts of Interest

Authors Hyeonung Jang and Hongchang Lee were employed by the company Haewootech Co., Ltd. Author Seongjun Choi was employed by the company MDAI Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Soriano, J.B.; Kendrick, P.J.; Paulson, K.R.; Gupta, V.; Abrams, E.M.; Adedoyin, R.A.; Adhikari, T.B.; Advani, S.M.; Agrawal, A.; Ahmadian, E.; et al. Prevalence and attributable health burden of chronic respiratory diseases, 1990-2017: A systematic analysis for the Global Burden of Disease Study 2017. Lancet Respir. Med. 2020, 8, 585–596. [Google Scholar] [CrossRef]
  2. Global Initiative for Chronic Obstructive Lung Disease (GOLD). Global Strategy for the Diagnosis, Management, and Prevention of Chronic Obstructive Pulmonary Disease (2025 Report), 2025 ed.; GOLD: Fontana, WI, USA, 2025. [Google Scholar]
  3. World Health Organization. Chronic Obstructive Pulmonary Disease (COPD). Available online: https://www.who.int/news-room/fact-sheets/detail/chronic-obstructive-pulmonary-disease-(copd) (accessed on 6 November 2024).
  4. Mathers, C.D.; Loncar, D. Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med. 2006, 3, e442. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, Z.; Lin, J.; Liang, L.; Huang, F.; Yao, X.; Peng, K.; Gao, Y.; Zheng, J. Global, regional, and national burden of chronic obstructive pulmonary disease and its attributable risk factors from 1990 to 2021: An analysis for the Global Burden of Disease Study 2021. Respir. Res. 2025, 26, 2. [Google Scholar] [CrossRef] [PubMed]
  6. Adeloye, D.; Song, P.; Zhu, Y.; Campbell, H.; Sheikh, A.; Rudan, I. Global, regional, and national prevalence of, and risk factors for, chronic obstructive pulmonary disease (COPD) in 2019: A systematic review and modelling analysis. Lancet Respir. Med. 2022, 10, 447–458. [Google Scholar] [CrossRef]
  7. Boers, E.; Barrett, M.; Su, J.G.; Benjafield, A.V.; Sinha, S.; Kaye, L.; Zar, H.J.; Vuong, V.; Tellez, D.; Gondalia, R.; et al. Global Burden of Chronic Obstructive Pulmonary Disease Through 2050. JAMA Netw. Open 2023, 6, e2346598. [Google Scholar] [CrossRef]
  8. Bednarek, M.; Maciejewski, J.; Wozniak, M.; Kuca, P.; Zielinski, J. Prevalence, severity and underdiagnosis of COPD in the primary care setting. Thorax 2008, 63, 402–407. [Google Scholar] [CrossRef]
  9. Halpin, D.M.G.; Celli, B.R.; Criner, G.J.; Frith, P.; López–Varela, M.V.L.; Salvi, S.; Vogelmeier, C.F.; Chen, R.; Mortimer, K.; Montes de Oca, M.; et al. The GOLD Summit on chronic obstructive pulmonary disease in low- and middle-income countries. Int. J. Tuberc. Lung Dis. 2019, 23, 1131–1141. [Google Scholar] [CrossRef]
  10. Lamprecht, B.; McBurnie, M.A.; Vollmer, W.M.; Gudmundsson, G.; Welte, T.; Nizankowska-Mogilnicka, E.; Studnicka, M.; Bateman, E.; Anto, J.M.; Burney, P.; et al. COPD in never smokers: Results from the population-based burden of obstructive lung disease study. Chest 2011, 139, 752–763. [Google Scholar] [PubMed]
  11. Miravitlles, M.; Vogelmeier, C.; Roche, N.; Halpin, D.; Cardoso, J.; Chuchalin, A.G.; Kankaanranta, H.; Sandström, T.; Śliwiński, P.; Zatloukal, J.; et al. A review of national guidelines for management of COPD in Europe. Eur. Respir. J. 2016, 47, 625–637. [Google Scholar] [CrossRef]
  12. Pellegrino, R.; Viegi, G.; Brusasco, V.; Crapo, R.O.; Burgos, F.; Casaburi, R.; Coates, A.; Van Der Grinten, C.P.M.; Gustafsson, P.; Hankinson, J.; et al. Interpretative strategies for lung function tests. Eur. Respir. J. 2005, 26, 948–968. [Google Scholar] [CrossRef]
  13. Celli, B.R.; Decramer, M.; Wedzicha, J.A.; Wilson, K.C.; Agustí, A.; Criner, G.J.; MacNee, W.; Make, B.J.; Rennard, I.S.; A Stockley, R.; et al. An official American Thoracic Society/European Respiratory Society statement: Research questions in COPD. Eur. Respir. J. 2015, 45, 879–905. [Google Scholar] [CrossRef]
  14. Bodduluri, S.; Bhatt, S.P.; Nakhmani, A.; Fortis, S.; Strand, M.J.; Silverman, E.K.; Sciurba, F.C. FEV1/FVC Severity Stages for Chronic Obstructive Pulmonary Disease. Am. J. Respir. Crit. Care Med. 2023, 208, 676–684. [Google Scholar] [CrossRef]
  15. Lin, C.H.; Cheng, S.L.; Chen, C.Z.; Chen, C.H.; Lin, S.-H.; Wang, H.C. Current progress of COPD early detection: Key points and novel strategies. Int. J. Chron. Obstruct. Pulmon. Dis. 2023, 18, 1511–1524. [Google Scholar] [CrossRef]
  16. Talker, L.; Neville, D.; Wiffen, L.; Dogan, C.; Lim, R.H.; Broomfield, H.; Lambert, G.; Selim, A.; Brown, T.; Carter, J.; et al. Diagnosis and Severity Assessment of COPD Using a Novel Fast-Response Capnometer and Interpretable Machine Learning. Chronic Obstr. Pulm. Dis. 2024, 11, 374–389. [Google Scholar] [CrossRef] [PubMed]
  17. Aggarwal, R.; Sounderajah, V.; Martin, G.; Ting, D.S.W.; Karthikesalingam, A.; King, D.; Ashrafian, H.; Darzi, A. Diagnostic accuracy of deep learning in medical imaging: A systematic review and meta-analysis. NPJ Digit. Med. 2021, 4, 65. [Google Scholar] [CrossRef] [PubMed]
  18. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [PubMed]
  19. Lynch, D.A.; Austin, J.H.; Hogg, J.C.; Grenier, P.A.; Kauczor, H.U.; Bankier, A.A.; Barr, R.G.; Colby, T.V.; Galvin, J.R.; Gevenois, P.A.; et al. CT-Definable Subtypes of Chronic Obstructive Pulmonary Disease: A Statement of the Fleischner Society. Radiology 2015, 277, 192–205. [Google Scholar] [CrossRef]
  20. Sui, H.; Mo, Z.; Wei, Y.; Shi, F.; Cheng, K.; Liu, L. Diagnosis and Severity Assessment of COPD Based on Machine Learning of Chest CT Images. Int. J. Chronic Obstr. Pulm. Dis. 2024, 189, 2853–2867. [Google Scholar] [CrossRef]
  21. Wu, Y.; Xia, S.; Liang, Z.; Chen, R.; Qi, S. Artificial intelligence in COPD CT images: Identification, staging, and quantitation. Respir. Res. 2024, 25, 319. [Google Scholar] [CrossRef]
  22. Patel, P.J.; Diwan, D.; Patel, K.A.; Ranga, S.; Modi, N.J.; Dumasia, S. Multi Feature fusion for COPD Classification using Deep Learning Algorithms. J. Integr. Sci. Technol. 2024, 12, 780. [Google Scholar] [CrossRef]
  23. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  25. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  26. Xue, Z.; You, D.; Candemir, S.; Jaeger, S.; Antani, S.; Long, L.R.; Thoma, G.R. Chest X-ray Image View Classification. In Proceedings of the 2015 IEEE 28th International Symposium on Computer-Based Medical Systems, São Carlos/Ribeirão Preto, Brazil, 22–25 June 2015. [Google Scholar]
  27. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118, Erratum in Nature 2017, 546, 686.. [Google Scholar] [CrossRef]
  28. Pikoula, M.; Kallis, C.; Madjiheurem, S.; Quint, J.K.; Bafadhel, M.; Denaxas, S. Evaluation of data processing pipelines on real-world electronic health records data for the purpose of measuring patient similarity. PLoS ONE 2023, 18, e0287264. [Google Scholar] [CrossRef]
  29. Janssens, W.; Bouillon, R.; Claes, B.; Carremans, C.; Lehouck, A.; Buysschaert, I.; Coolen, J.; Mathieu, C.; Decramer, M.; Lambrechts, D. Vitamin D deficiency is highly prevalent in COPD and correlates with variants in the vitamin D-binding gene. Thorax 2010, 65, 215–220. [Google Scholar] [CrossRef]
  30. Tashkin, D.P.; Celli, B.; Senn, S.; Burkhart, D.; Kesten, S.; Menjoge, S.; Decramer, M. A 4-year trial of tiotropium in chronic obstructive pulmonary disease. N. Engl. J. Med. 2008, 359, 1543–1554. [Google Scholar] [CrossRef]
  31. Jenkins, C.R.; Jones, P.W.; Calverley, P.M.; Celli, B.; A Anderson, J.; Ferguson, G.T.; Yates, J.C.; Willits, L.R.; Vestbo, J. Efficacy of salmeterol/fluticasone propionate by GOLD stage of chronic obstructive pulmonary disease: Analysis from the randomised, placebo-controlled TORCH study. Respir. Res. 2009, 10, 59. [Google Scholar] [CrossRef] [PubMed]
  32. Calverley, P.M.; Anderson, J.A.; Celli, B.; Ferguson, G.T.; Jenkins, C.; Jones, P.W.; Yates, J.C.; Vestbo, J. Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. N. Engl. J. Med. 2007, 356, 775–789. [Google Scholar] [CrossRef]
  33. Calverley, P.M.A. A STAR Is Born: A New Approach to Assessing Chronic Obstructive Pulmonary Disease Severity. Am. J. Respir. Crit. Care Med. 2023, 208, 647–648. [Google Scholar] [CrossRef] [PubMed]
  34. Jones, P.W.; Harding, G.; Berry, P.; Wiklund, I.; Chen, W.H.; Kline Leidy, N. Development and first validation of the COPD Assessment Test. Eur. Respir. J. 2009, 34, 648–654. [Google Scholar] [CrossRef]
  35. Agusti, A.; Calverley, P.M.; Celli, B.; Coxson, H.O.; Edwards, L.D.; Lomas, D.A.; MacNee, W.; Miller, B.E.; Rennard, S.; Silverman, E.K. Characterisation of COPD heterogeneity in the ECLIPSE cohort. Respir. Res. 2010, 11, 122. [Google Scholar] [CrossRef] [PubMed]
  36. Zeng, S.; Arjomandi, M.; Tong, Y.; Liao, Z.C.; Luo, G. Developing a Machine Learning Model to Predict Severe Chronic Obstructive Pulmonary Disease Exacerbations: Retrospective Cohort Study. J. Med. Internet. Res. 2022, 24, e28953. [Google Scholar] [CrossRef]
  37. Smith, L.A.; Oakden-Rayner, L.; Bird, A.; Zeng, M.; To, M.S.; Mukherjee, S.; Palmer, L.J. Machine learning and deep learning predictive models for long-term prognosis in patients with chronic obstructive pulmonary disease: A systematic review and meta-analysis. Lancet Digit. Health 2023, 5, e872–e882. [Google Scholar] [CrossRef]
  38. Shen, X.; Liu, H. Using machine learning for early detection of chronic obstructive pulmonary disease: A narrative review. Respir. Res. 2024, 25, 336. [Google Scholar] [CrossRef]
  39. Decramer, M.; Janssens, W.; Miravitlles, M. Chronic obstructive pulmonary disease. Lancet 2012, 379, 1341–1351. [Google Scholar] [CrossRef] [PubMed]
  40. Gershon, A.S.; Warner, L.; Cascagnette, P.; Victor, J.C.; To, T. Lifetime risk of developing chronic obstructive pulmonary disease: A longitudinal population study. Lancet 2011, 378, 991–996. [Google Scholar] [CrossRef]
  41. Nikolaou, V.; Massaro, S.; Fakhimi, M.; Stergioulas, L.; Price, D. COPD phenotypes and machine learning cluster analysis: A systematic review and future research agenda. Respir. Med. 2020, 171, 106093. [Google Scholar] [CrossRef] [PubMed]
  42. González, G.; Ash, S.Y.; Vegas-Sánchez-Ferrero, G.; Onieva, J.O.; Rahaghi, F.N.; Ross, J.C.; Díaz, A.; Estépar, R.S.J.; Washko, G.R.; Copdgene, F.T. Disease Staging and Prognosis in Smokers Using Deep Learning in Chest Computed Tomography. Am. J. Respir. Crit. Care Med. 2018, 197, 193–203. [Google Scholar] [CrossRef] [PubMed]
  43. Tang, L.Y.; Coxson, H.O.; Lam, S.; Leipsic, J.; Tam, R.C.; Sin, D.D. Towards large-scale case-finding: Training and validation of residual networks for detection of chronic obstructive pulmonary disease using low-dose CT. Lancet Digit. Health 2020, 2, e259–e267. [Google Scholar] [CrossRef]
  44. Ho, T.T.; Gwak, J. Multiple Feature Integration for Classification of Thoracic Disease in Chest Radiography. Appl. Sci. 2019, 9, 4130. [Google Scholar] [CrossRef]
  45. Wu, Y.; Du, R.; Feng, J.; Qi, S.; Pang, H.; Xia, S.; Qian, W. Deep CNN for COPD identification by Multi-View snapshot integration of 3D airway tree and lung field. Biomed. Signal Process. Control 2022, 79, 104162. [Google Scholar] [CrossRef]
  46. Zhang, Z.; Wu, F.; Zhou, Y.; Yu, D.; Sun, C.; Xiong, X.; Situ, Z.; Liu, Z.; Gu, A.; Huang, X.; et al. Detection of chronic obstructive pulmonary disease with deep learning using inspiratory and expiratory chest computed tomography and clinical information. J. Thorac. Dis. 2024, 16, 6101–6111. [Google Scholar] [CrossRef] [PubMed]
  47. Xue, M.; Jia, S.; Chen, L.; Huang, H.; Yu, L.; Zhu, W. CT-based COPD identification using multiple instance learning with two-stage attention. Comput. Methods Progr. Biomed. 2023, 230, 107340. [Google Scholar] [CrossRef]
  48. Cai, N.; Xie, Y.; Cai, Z.; Liang, Y.; Zhou, Y.; Wang, P. Deep Learning Assisted Diagnosis of Chronic Obstructive Pulmonary Disease Based on a Local-to-Global Framework. Electronics 2024, 13, 4443. [Google Scholar] [CrossRef]
  49. Shin, H.C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298. [Google Scholar] [CrossRef]
  50. Shorten, C.; Khoshgoftaar, T.M.; Furht, B. Transfer learning for medical image classification: A literature review. BMC Med. Imaging 2022, 22, 69. [Google Scholar]
  51. Polat, Ö.; Şalk, İ.; Doğan, Ö.T. Determination of COPD severity from chest CT images using deep transfer learning network. Multimed. Tools Appl. 2022, 81, 21903–21917. [Google Scholar] [CrossRef]
  52. Cheplygina, V.; Peña, I.P.; Pedersen, J.H.; Lynch, D.A.; Sørensen, L.; de Bruijne, M. Transfer Learning for Multicenter Classification of Chronic Obstructive Pulmonary Disease. IEEE J. Biomed. Health Inform. 2018, 22, 1486–1496. [Google Scholar] [CrossRef]
  53. Wang, Q.; Wang, H.; Wang, L.; Yu, F. Diagnosis of chronic obstructive pulmonary disease based on transfer learning. IEEE Access 2020, 8, 47370–47383. [Google Scholar] [CrossRef]
  54. Chen, S.; Ma, K.; Zheng, Y. Med3D: Transfer Learning for 3D Medical Image Analysis. IEEE Trans. Med. Imaging 2021, 40, 1097–1109. [Google Scholar]
  55. Zhu, Z.; Zhao, S.; Li, J.; Wang, Y.; Xu, L.; Jia, Y.; Li, Z.; Li, W.; Chen, G.; Wu, X.; et al. Development and application of a deep learning-based comprehensive early diagnostic model for chronic obstructive pulmonary disease. Respir. Res. 2024, 25, 167. [Google Scholar] [CrossRef]
  56. Li, Z.; Huang, K.; Liu, L.; Zhang, Z. Early detection of COPD based on graph convolutional network and small and weakly labeled data. Med. Biol. Eng. Comput. 2022, 60, 2321–2333. [Google Scholar] [CrossRef] [PubMed]
  57. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  58. Locatello, F.; Weissenborn, D.; Unterthiner, T.; Mahendran, A.; Heigold, G.; Uszkoreit, J.; Dosovitskiy, A.; Kipf, T. Object-Centric Learning with Slot Attention. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 11525–11538. [Google Scholar]
  59. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  60. Celli, B.R.; MacNee, W. Standards for the diagnosis and treatment of patients with COPD: A summary of the ATS/ERS position paper. Eur. Respir. J. 2004, 23, 932–946. [Google Scholar] [CrossRef] [PubMed]
  61. Miller, M.R.; Hankinson, J.; Brusasco, V.; Burgos, F.; Casaburi, R.; Coates, A.; Crapo, R.; Enright, P.; van der Grinten, C.P.M.; Gustafsson, P.; et al. Standardisation of spirometry. Eur. Respir. J. 2005, 26, 319–338. [Google Scholar] [CrossRef] [PubMed]
  62. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 702–703. [Google Scholar]
  63. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  64. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  65. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
Figure 1. Schematic overview of the proposed multi-task framework.
Figure 2. The ConvNeXt-Large backbone.
Figure 3. Schematic illustration of the Slot Attention module architecture.
Figure 4. FEV1/FVC ratio distributions for the (a) Training, (b) Validation, and (c) Test datasets. An extreme outlier in the Training set with a value of 1114 was permanently removed from the dataset prior to training to ensure data integrity.
Figure 5. Confusion matrix analysis of the proposed model on the test set. The figure includes: (a) the overall multi-class confusion matrix, and (b–d) the one-vs-all binary confusion matrices for the Non COPD, Mild COPD, and Severe COPD classes, respectively.
Figure 6. Comparative visualization of Saliency Maps. The original CXR images are overlaid with post-processed saliency maps. To enhance visualization clarity, a Gaussian filter (sigma = 5) was applied, and a threshold of 0.5 was utilized to suppress low-activation regions. Warmer colors indicate regions with higher influence on the model’s prediction. Compared to baseline models, the proposed model’s attention patterns show a tendency to be more constrained within pulmonary structures. The corresponding FEV1/FVC ground truth or predicted value is displayed below each respective image.
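The saliency post-processing described in the Figure 6 caption (Gaussian smoothing with sigma = 5, then suppression of activations below 0.5) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the min–max normalization step before thresholding are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def postprocess_saliency(saliency, sigma=5.0, threshold=0.5):
    """Smooth a raw saliency map, rescale it to [0, 1], and zero out
    low-activation regions below the threshold (as in Figure 6)."""
    smoothed = gaussian_filter(saliency.astype(np.float64), sigma=sigma)
    # Min-max normalize so the 0.5 threshold is comparable across images
    # (an assumed step; the paper does not state its normalization).
    rng = smoothed.max() - smoothed.min()
    if rng > 0:
        smoothed = (smoothed - smoothed.min()) / rng
    smoothed[smoothed < threshold] = 0.0
    return smoothed

# Toy example: a single bright blob survives smoothing and thresholding.
demo = np.zeros((64, 64))
demo[28:36, 28:36] = 1.0
out = postprocess_saliency(demo)
```

The resulting map would then be overlaid on the original CXR with a warm colormap, as in the figure.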
Table 1. Baseline demographic and clinical characteristics of the study population.
| | Total | Non COPD ¹ | Mild COPD | Severe COPD | p-Value |
|---|---|---|---|---|---|
| Age (years), Mean ± Std | 66.4 ± 13.3 | 61.4 ± 16.0 | 71.5 ± 9.8 | 68.2 ± 9.7 | <0.001 |
| Sex–Male, n (%) | 3717 (74.4%) | 1048 (52.4%) | 1354 (90.3%) | 1315 (87.7%) | <0.001 |
| Sex–Female, n (%) | 1282 (25.6%) | 951 (47.6%) | 146 (9.7%) | 185 (12.3%) | <0.001 |
| FEV1/FVC (%), Mean ± Std | 74.6 ± 25.2 | 102.4 ± 10.7 | 60.8 ± 10.5 | 51.4 ± 9.8 | <0.001 |
¹ COPD, Chronic obstructive pulmonary disease.
Table 2. Distribution of CXR images across diagnostic classes in training, validation, and test sets.
| | Training Set | Validation Set | Test Set | Total |
|---|---|---|---|---|
| Non COPD | 1599 | 199 | 201 | 1999 |
| Mild COPD | 1200 | 151 | 150 | 1500 |
| Severe COPD | 1200 | 149 | 150 | 1500 |
| Total | 3999 | 500 | 500 | 4999 |
Table 3. Descriptive statistics (Mean and Standard Deviation) of the FEV1/FVC ratio across dataset splits.
| | Training Set | Validation Set | Test Set | Overall |
|---|---|---|---|---|
| FEV1/FVC Mean | 75.02 | 74.06 | 74.08 | 74.83 |
| FEV1/FVC Std | 30.20 | 24.28 | 25.07 | 29.18 |
Table 4. Hyperparameter setup.
| Parameter | Value |
|---|---|
| Batch Size | 32 |
| Total Epochs | 250 |
| Optimizer | AdamW [63] |
| Initial Learning Rate | 1 × 10⁻⁴ |
| Weight Decay | 1 × 10⁻⁴ |
| Learning Rate Scheduler | Flat Warmup (5 Epochs) + Cosine Annealing [64] |
| Classification Loss | Categorical Cross-Entropy (Label Smoothing 0.1) [65] |
| Regression Loss | Log-Cosh |
| Loss Weighting | λ_cls = 1.0, λ_reg = 1.0 (Fixed 1:1 ratio) |
| Image Resolution | 512 × 512 |
| Normalization | ImageNet Mean & Std |
| Data Augmentation | RandAugment (N = 2, M = 28) [62] |
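The learning-rate schedule and multi-task objective in Table 4 can be sketched as below. This is a minimal framework-agnostic illustration, not the authors' training code; all function and variable names are illustrative.

```python
import math

BASE_LR, WARMUP_EPOCHS, TOTAL_EPOCHS = 1e-4, 5, 250

def lr_at_epoch(epoch):
    """Flat warmup for 5 epochs at the initial rate, then cosine
    annealing toward zero over the remaining epochs (Table 4)."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))

def log_cosh(pred, target):
    """Numerically stable Log-Cosh regression loss, averaged over samples:
    log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2)."""
    def lc(x):
        return abs(x) + math.log1p(math.exp(-2.0 * abs(x))) - math.log(2.0)
    return sum(lc(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(cls_loss, reg_loss, lam_cls=1.0, lam_reg=1.0):
    """Fixed 1:1 weighting of the classification and regression losses."""
    return lam_cls * cls_loss + lam_reg * reg_loss
```

Log-Cosh behaves like a squared error near zero but grows only linearly for large residuals, which makes the FEV1/FVC regression head less sensitive to outliers than plain MSE.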
Table 5. Comparative performance on the Validation Set. (Best results are highlighted in bold).
| Model | Accuracy | Sensitivity | Specificity | MAE | MSE | R² |
|---|---|---|---|---|---|---|
| ResNet152 | 0.8453 | 0.7656 | 0.8859 | 11.2322 | 216.0663 | 0.6335 |
| DenseNet201 | 0.8813 | 0.8080 | 0.9070 | 12.2186 | 243.8991 | 0.5863 |
| GoogLeNet | 0.9000 | 0.8411 | 0.9228 | 11.1703 | 214.3926 | 0.6364 |
| EfficientNet B7 | 0.8893 | 0.8251 | 0.9152 | 10.2290 | 181.8150 | 0.6916 |
| EfficientNet-V2-L | 0.9027 | 0.8466 | 0.9258 | 11.1888 | 214.4376 | 0.6363 |
| ViT-L/16 | 0.9120 | 0.8668 | 0.9340 | 8.6324 | 175.6289 | 0.7021 |
| ConvNeXt Large (Vanilla) | 0.9040 | 0.8531 | 0.9281 | 11.0822 | 229.9111 | 0.6100 |
| Proposed Model | **0.9267** | **0.8842** | **0.9437** | **8.2010** | **142.6246** | **0.7600** |
Table 6. Comparative classification and regression performance on the Test Set. (Best results are highlighted in bold).
| Model | Accuracy | Sensitivity | Specificity | MAE | MSE | R² |
|---|---|---|---|---|---|---|
| ResNet152 | 0.8413 | 0.7577 | 0.8828 | 11.3632 | 228.6505 | 0.6363 |
| DenseNet201 | 0.8507 | 0.7561 | 0.8821 | 11.4808 | 219.7487 | 0.6505 |
| GoogLeNet | 0.8933 | 0.8288 | 0.9182 | 11.3204 | 213.6317 | 0.6602 |
| EfficientNet B7 | 0.8747 | 0.8027 | 0.9045 | 10.4290 | 188.4063 | 0.7003 |
| EfficientNet-V2-L | 0.8973 | 0.8361 | 0.9214 | 10.9367 | 199.0232 | 0.6834 |
| ViT-L/16 | 0.9053 | 0.8548 | 0.9286 | 8.6045 | 177.7447 | 0.7173 |
| ConvNeXt Large (Vanilla) | 0.9013 | 0.8498 | 0.9272 | 10.1813 | 183.9983 | 0.7073 |
| Proposed Model | **0.9107** | **0.8603** | **0.9324** | **8.2649** | **151.4704** | **0.7591** |
Table 7. Ablation Study on Slot Configuration (Test Set Performance; best results in bold).
| Model | Accuracy | Sensitivity | Specificity | MAE | MSE | R² |
|---|---|---|---|---|---|---|
| ConvNeXt Large (Slot 3) | 0.9067 | 0.8570 | 0.9302 | 8.4515 | 155.0428 | 0.7534 |
| ConvNeXt Large (Slot 4) | 0.9000 | 0.8418 | 0.9234 | 10.1425 | 202.0207 | 0.6787 |
| Proposed Model (Slot 5) | **0.9107** | **0.8603** | **0.9324** | **8.2649** | **151.4704** | **0.7591** |
| ConvNeXt Large (Slot 6) | 0.9093 | 0.8563 | 0.9306 | 9.6933 | 171.5325 | 0.7272 |
| ConvNeXt Large (Slot 7) | 0.9053 | 0.8536 | 0.9292 | 8.8356 | 167.9667 | 0.7328 |
Table 8. Performance results of 5-Fold Cross-Validation for the proposed model.
| Fold | Accuracy | Sensitivity | Specificity | MAE | MSE | R² |
|---|---|---|---|---|---|---|
| 1 | 0.9190 | 0.8775 | 0.9388 | 8.5196 | 175.8950 | 0.7323 |
| 2 | 0.9273 | 0.8872 | 0.9443 | 7.8683 | 159.5204 | 0.7393 |
| 3 | 0.9133 | 0.8647 | 0.9330 | 8.4990 | 177.9585 | 0.7110 |
| 4 | 0.9240 | 0.8794 | 0.9408 | 8.2628 | 174.6524 | 0.7361 |
| 5 | 0.9139 | 0.8635 | 0.9341 | 8.0868 | 156.5844 | 0.7519 |
| Mean ± Std | 0.9195 ± 0.0061 | 0.8744 ± 0.0101 | 0.9384 ± 0.0045 | 8.2473 ± 0.2771 | 168.9221 ± 10.0460 | 0.7341 ± 0.0148 |
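As a quick sanity check on the cross-validation summary, recomputing the mean and sample standard deviation of the five per-fold accuracies reproduces 0.9195 with a sample standard deviation of about 0.0061:

```python
import statistics

# Per-fold test accuracies from the 5-fold cross-validation (Table 8).
fold_accuracy = [0.9190, 0.9273, 0.9133, 0.9240, 0.9139]

mean_acc = statistics.mean(fold_accuracy)
std_acc = statistics.stdev(fold_accuracy)  # sample std (n - 1 denominator)
print(f"{mean_acc:.4f} ± {std_acc:.4f}")  # → 0.9195 ± 0.0061
```

The same calculation applied to the other columns recovers the remaining summary values.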