Multi-Label Disease Detection in Chest X-Ray Imaging Using a Fine-Tuned ConvNeXtV2 with a Customized Classifier
Abstract
1. Introduction
- Directed Fine-Tuning for Enhanced Multi-Scale Features: We freeze the first 77.08% of layers in the ConvNeXtV2 backbone, fine-tuning only its upper stages and the classification head. This approach preserves pre-trained multi-scale representations while significantly reducing trainable parameters and improving convergence speed. Under a five-epoch training constraint, this strategy achieves an average ROC–AUC of 0.8523, a 3.97% improvement over a baseline with a linear classifier (0.8197).
- Attention-Driven Spatial–Channel Fusion: We introduce a spatial attention pooling layer followed by a multi-head self-attention module with residual connections at the beginning of the classification head. These components sequentially perform spatial weighting and channel recalibration. Ablation studies reveal a 1.22% ROC–AUC drop when the spatial attention layer is removed and a 0.37% drop without the self-attention fusion module, confirming the importance of this fusion design for modeling spatial dispersion and label correlations.
- Deep MLP for Non-Linear Discrimination: A two-layer multilayer perceptron (MLP) with activation, batch normalization, and dropout layers is appended to the attention module to support deep non-linear mapping of global features. Removing this component results in a 6.65% decline in ROC–AUC (from 0.8523 to 0.7958), emphasizing its critical role in multi-label classification.
2. Related Works
3. Model Architecture: CONVFCMAE
3.1. Overview of the CONVFCMAE Pipeline
3.2. Image Preprocessing
- Random horizontal flipping with a probability of 50% to simulate bilateral variability;
- Random rotation within a small angular range to mimic variations in patient posture;
- Random affine transformations with translation to enforce spatial invariance;
- Color jittering with maximum brightness, contrast, and saturation variation of 0.2 to account for imaging condition discrepancies.
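The augmentation pipeline above can be sketched as follows. This is a minimal NumPy stand-in, not the paper's implementation (which would typically use torchvision.transforms); the affine/rotation step is approximated by a small integer-pixel shift, and the shift bound and jitter factors mirror the stated 0.2 maximum variation.

```python
import numpy as np

def augment(img, rng):
    """Minimal sketch of the stochastic augmentations on an H x W float
    image in [0, 1]. A production pipeline would use torchvision.transforms."""
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # Crude stand-in for a small rotation/affine translation.
    shift = int(rng.integers(-4, 5))
    img = np.roll(img, shift, axis=1)
    # Brightness/contrast jitter with maximum variation 0.2.
    brightness = rng.uniform(0.8, 1.2)
    contrast = rng.uniform(0.8, 1.2)
    mean = img.mean()
    img = (img - mean) * contrast + mean * brightness
    return np.clip(img, 0.0, 1.0)
```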
3.3. Feature Extraction and Enhancement
Algorithm 1 Pretraining and Fine-Tuning of ConvNeXtV2 Feature Extractor.
Require: Target dataset D, batch size B, learning rate η, epochs T
Ensure: Fine-tuned feature extractor f_θ
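The backbone-freezing step of the fine-tuning procedure can be sketched as follows. This is an illustrative helper, not the authors' code: it assumes that parameter registration order tracks network depth (true for sequential-style backbones such as ConvNeXtV2) and freezes the stated 77.08% fraction.

```python
import torch.nn as nn

def freeze_fraction(backbone: nn.Module, frac: float = 0.7708):
    """Freeze the first `frac` of the backbone's parameter tensors in
    registration order, leaving the upper stages and the classification
    head trainable. Returns (frozen, trainable) tensor counts."""
    params = list(backbone.parameters())
    n_frozen = int(round(len(params) * frac))
    for p in params[:n_frozen]:
        p.requires_grad = False
    return n_frozen, len(params) - n_frozen
```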
3.4. Three-Stage Classification Head: Attentive Pooling, Self-Attention Fusion, and Deep MLP
- Stage One—Attentive Global Pooling: Replaces fixed pooling with a learnable spatial attention mechanism to aggregate features adaptively across the spatial domain. This enhances the model’s ability to focus on irregular and dispersed lesion regions.
- Stage Two—Self-Attention Fusion: Incorporates a lightweight multi-head self-attention module to capture long-range dependencies and recalibrate inter-channel features. This facilitates modeling of spatially distant but semantically correlated abnormalities.
- Stage Three—Deep MLP Classifier: A two-layer MLP with nonlinear activation, batch normalization, and dropout increases the expressive capacity of the classification head, enabling improved handling of complex label co-occurrence patterns.
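The three stages above can be assembled into a single PyTorch module, sketched below. Hidden width, head count, activation, and dropout rate are illustrative assumptions (the defaults echo the paper's best configuration of 4 heads and dropout 0.25); the attentive pooling here is a 1x1 convolutional score map with sigmoid gating, and the pooled vector is treated as a one-token sequence for the self-attention stage.

```python
import torch
import torch.nn as nn

class ThreeStageHead(nn.Module):
    """Sketch of the three-stage classification head: attentive pooling,
    residual self-attention fusion, and a two-layer MLP classifier."""
    def __init__(self, in_ch: int, num_classes: int, n_heads: int = 4,
                 hidden: int = 512, p_drop: float = 0.25):
        super().__init__()
        # Stage 1: learnable spatial attention scores (1x1 conv).
        self.attn_score = nn.Conv2d(in_ch, 1, kernel_size=1)
        # Stage 2: lightweight multi-head self-attention with residual + norm.
        self.self_attn = nn.MultiheadAttention(in_ch, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(in_ch)
        # Stage 3: two-layer MLP with batch norm and dropout.
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, hidden), nn.BatchNorm1d(hidden),
            nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) backbone feature map.
        scores = torch.sigmoid(self.attn_score(feat))               # (B, 1, H, W)
        num = (feat * scores).flatten(2).sum(-1)                    # (B, C)
        pooled = num / scores.flatten(2).sum(-1)                    # weighted average
        tok = pooled.unsqueeze(1)                                   # (B, 1, C)
        attn_out, _ = self.self_attn(tok, tok, tok)
        fused = self.norm(tok + attn_out).squeeze(1)                # residual fusion
        return self.mlp(fused)                                      # per-label logits
```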
3.4.1. Stage One: Attentive Global Pooling
3.4.2. Stage Two: Self-Attention Fusion
3.4.3. Stage Three: MLP-Based Classifier
3.5. Training Procedure of Three-Stage Classifier
Algorithm 2 Joint Training of Three-Stage Classifier.
Require: Frozen feature extractor f_θ, classifier parameters φ, training data D, batch size B, learning rate η
Ensure: Trained classifier g_φ
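The joint training loop of Algorithm 2 can be sketched as follows. The optimizer choice (AdamW) and plain BCE-with-logits loss are placeholder assumptions; the essential structure is that the extractor runs under no_grad and only the head's parameters are optimized.

```python
import torch
import torch.nn as nn

def train_head(extractor, head, loader, epochs=1, lr=1e-3):
    """Train only the classification head on features from a frozen extractor."""
    for p in extractor.parameters():
        p.requires_grad = False                    # keep backbone frozen
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()               # one sigmoid per disease label
    head.train()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = extractor(x)               # no gradients into the backbone
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```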
3.6. Grad-CAM-Based Interpretability
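A minimal Grad-CAM implementation is sketched below. This follows the standard Grad-CAM recipe (spatially averaged gradients weighting the target layer's activation maps), not code from this paper; the choice of target layer is left to the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Standard Grad-CAM: weight the target layer's activation maps by the
    spatially averaged gradient of the chosen class logit, then ReLU."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    logits = model(x)
    model.zero_grad()
    logits[:, class_idx].sum().backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)       # GAP over spatial dims
    cam = F.relu((w * acts["a"]).sum(dim=1))            # (B, H, W) heatmap
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```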
4. Experimental Evaluation
4.1. Dataset
4.2. Evaluation Metrics
- True Positive (TP): the model correctly predicts the presence of a disease that actually exists.
- False Positive (FP): the model predicts a disease to be present when it is not (a false alarm).
- True Negative (TN): the model correctly predicts the absence of a disease.
- False Negative (FN): the model fails to detect a disease that is actually present (a missed diagnosis).
Actual/Predicted | Predicted Positive | Predicted Negative |
---|---|---|
Actual Positive | True Positive (TP) | False Negative (FN) |
Actual Negative | False Positive (FP) | True Negative (TN) |
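From the confusion-matrix counts above, the usual per-label metrics follow directly. A small helper, with zero-division guarded:

```python
def label_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one disease label from TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 8 true positives with 2 false alarms and 2 missed diagnoses gives precision, recall, and F1 of 0.8 each.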
4.3. Experimental Setup
4.4. Comparison with Other Systems
- Adaptive Attention. Attentive Pooling dynamically reweights spatial locations, focusing on faint lesion regions, while Self-Attention fusion integrates long-range channel dependencies to combine global context with local detail.
- Focused Augmentation. Beyond standard flips and rotations, we apply contrast stretching to amplify lesion–background differences, edge enhancement to sharpen blurred boundaries, and color jitter to bolster robustness against imaging variations.
- Class-Balanced Loss. We employ Class-Balanced Focal Loss to upweight rare and hard-to-classify examples, directing training toward subtle, underrepresented findings.
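The class-balanced focal loss can be sketched as below. The re-weighting follows the effective-number scheme of Cui et al. (weight (1 − β)/(1 − β^n_c) per class); the β and γ values are illustrative defaults, not settings confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class,
                              beta=0.9999, gamma=2.0):
    """Multi-label focal loss with class-balanced weights: rarer diseases
    receive larger weights, and the (1 - p_t)**gamma factor down-weights
    easy examples so training focuses on hard, subtle findings."""
    weights = (1.0 - beta) / (1.0 - beta ** samples_per_class)
    weights = weights / weights.sum() * weights.numel()   # normalize to mean 1
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0.5, p, 1.0 - p)           # prob of the true label
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return ((1.0 - pt) ** gamma * bce * weights).mean()
```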
4.5. Comparison with State-of-the-Art Heads
4.6. Hyperparameter Sensitivity Analysis
4.7. Ablation Study
- Configuration Descriptions:
- Baseline Linear Head: A minimal setup using only a fully connected layer followed by sigmoid activation.
- Advanced Head (Full): Incorporates all three components—AttnPool, SelfAttn, and MLP—designed to enhance spatial localization, inter-channel feature interaction, and non-linear decision boundaries.
- Ablation A (w/o AttnPool): Replaces attention pooling with standard global average pooling; retains SelfAttn and MLP.
- Ablation B (w/o SelfAttn): Removes self-attention fusion; retains AttnPool and MLP.
- Ablation C (w/o MLP): Eliminates the MLP classifier; retains only AttnPool and SelfAttn for feature aggregation.
- Ablation D (w/o AttnPool and SelfAttn): Retains only MLP; removes both attention modules.
- Ablation E (w/o AttnPool and MLP): Retains only SelfAttn; removes AttnPool and MLP.
- Ablation F (w/o SelfAttn and MLP): Retains only AttnPool; removes SelfAttn and MLP.
- Summary of Contributions:
- MLP: Most critical component; enables expressive, non-linear classification boundaries.
- AttnPool: Substantially contributes by enhancing spatial localization and attention to discriminative regions.
- SelfAttn: Moderately improves channel interaction but has limited standalone utility.
4.8. Experimental Summary and Final Configuration
- Strategic backbone freezing to preserve hierarchical features;
- Learnable spatial–channel recalibration via attentive pooling and self-attention;
- Deep nonlinear mapping through the MLP head.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Tarekegn, A.N.; Giacobini, M.; Michalak, K. A review of methods for imbalanced multi-label classification. Pattern Recognit. 2021, 118, 107965.
- Turkoglu, M. COVIDetectioNet: COVID-19 diagnosis system based on X-ray images using features selected from pre-learned deep features ensemble. Appl. Intell. 2021, 51, 1213–1226.
- Punn, N.S.; Agarwal, S. Automated diagnosis of COVID-19 with limited posteroanterior chest X-ray images using fine-tuned deep neural networks. Appl. Intell. 2021, 51, 2689–2702.
- Al-antari, M.A.; Hua, C.H.; Bang, J.; Lee, S. Fast deep learning computer-aided diagnosis of COVID-19 based on digital chest X-ray images. Appl. Intell. 2021, 51, 2890–2907.
- Rai, H.M.; Chatterjee, K. Hybrid CNN-LSTM deep learning model and ensemble technique for automatic detection of myocardial infarction using big ECG data. Appl. Intell. 2022, 52, 5366–5384.
- Li, X.; Zhou, Y.; Du, P.; Lang, G.; Xu, M.; Wu, W. A deep learning system that generates quantitative CT reports for diagnosing pulmonary Tuberculosis. Appl. Intell. 2021, 51, 4082–4093.
- Afify, H.M.; Mohammed, K.K.; Hassanien, A.E. Novel prediction model on OSCC histopathological images via deep transfer learning combined with Grad-CAM interpretation. Biomed. Signal Process. Control 2023, 83, 104704.
- Li, Q.; Yao, N.; Zhao, J.; Zhang, Y. Self attention mechanism of bidirectional information enhancement. Appl. Intell. 2022, 52, 2530–2538.
- Shen, X.; Han, D.; Guo, Z.; Chen, C.; Hua, J.; Luo, G. Local self-attention in transformer for visual question answering. Appl. Intell. 2023, 53, 16706–16723.
- Wang, Y.; Yang, G.; Li, S.; Li, Y.; He, L.; Liu, D. Arrhythmia classification algorithm based on multi-head self-attention mechanism. Biomed. Signal Process. Control 2023, 79, 104206.
- Wang, J.; Zang, J.; Yao, S.; Zhang, Z.; Xue, C. Multiclassification for heart sound signals under multiple networks and multi-view feature. Measurement 2024, 225, 114022.
- Kang, H.; Wang, X.; Sun, Y.; Li, S.; Sun, X.; Li, F.; Hou, C.; Lam, S.K.; Zhang, W.; Zheng, Y.P. Automatic Transcranial Sonography-Based Classification of Parkinson’s Disease Using a Novel Dual-Channel CNXV2-DANet. Bioengineering 2024, 11, 889.
- Gour, N.; Khanna, P. Multi-class multi-label ophthalmological disease detection using transfer learning based convolutional neural network. Biomed. Signal Process. Control 2021, 66, 102329.
- Huang, Y.; Zhang, R.; Li, H.; Xia, Y.; Yu, X.; Liu, S.; Yang, Y. A multi-label learning prediction model for heart failure in patients with atrial fibrillation based on expert knowledge of disease duration. Appl. Intell. 2023, 53, 20047–20058.
- Li, Y.; Chen, Z.; Zhang, F.; Wei, Z.; Huang, Y.; Chen, C.; Zheng, Y.; Wei, Q.; Sun, H.; Chen, F. Research on detection of potato varieties based on spectral imaging analytical algorithm. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 311, 123966.
- Ren, R.; Niu, S.; Jin, J.; Zhang, J.; Ren, H.; Zhao, X. Multi-scale attention context-aware network for detection and localization of image splicing. Appl. Intell. 2023, 53, 18219–18238.
- Roshan, S.; Tanha, J.; Zarrin, M.; Babaei, A.F.; Nikkhah, H.; Jafari, Z. A deep ensemble medical image segmentation with novel sampling method and loss function. Comput. Biol. Med. 2024, 172, 108305.
- Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 2022, 506, 158–167.
- Qin, X.; Cai, R.; Yu, J.; He, C.; Zhang, X. An efficient self-attention network for skeleton-based action recognition. Sci. Rep. 2022, 12, 4111.
- Yu, X.; Wang, J.; Hong, Q.Q.; Teku, R.; Wang, S.H.; Zhang, Y.D. Transfer learning for medical images analyses: A survey. Neurocomputing 2022, 489, 230–254.
- Yeung, M.; Sala, E.; Schönlieb, C.B.; Rundo, L. Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Comput. Med. Imaging Graph. 2022, 95, 102026.
- Priya, K.V.; Peter, J.D. A federated approach for detecting the chest diseases using DenseNet for multi-label classification. Complex Intell. Syst. 2022, 8, 3121–3129.
- Albahli, S.; Rauf, H.T.; Algosaibi, A.; Balas, V.E. AI-driven deep CNN approach for multi-label pathology classification using chest X-Rays. PeerJ Comput. Sci. 2021, 7, e495.
- Perumal, V.; Narayanan, V.; Rajasekar, S.J.S. Detection of COVID-19 using CXR and CT images using Transfer Learning and Haralick features. Appl. Intell. 2021, 51, 341–358.
- Chen, K.; Wang, X.; Zhang, S. Thorax Disease Classification Based on Pyramidal Convolution Shuffle Attention Neural Network. IEEE Access 2022, 10, 85571–85581.
- Irtaza, M.; Ali, A.; Gulzar, M.; Wali, A. Multi-Label Classification of Lung Diseases Using Deep Learning. IEEE Access 2024, 12, 124062–124080.
- Li, D.; Huo, H.; Jiao, S.; Sun, X.; Chen, S. Automated thorax disease diagnosis using multi-branch residual attention network. Sci. Rep. 2024, 14, 11865.
- Kufel, J.; Bielówka, M.; Rojek, M.; Mitręga, A.; Lewandowski, P.; Cebula, M.; Krawczyk, D.; Bielówka, M.; Kondoł, D.; Bargieł-Łączek, K.; et al. Multi-Label Classification of Chest X-ray Abnormalities Using Transfer Learning Techniques. J. Pers. Med. 2023, 13, 1426.
- Chehade, H.; Abdallah, N.; Marion, J.M.; Hatt, M.; Oueidat, M.; Chauvet, P. Reconstruction-based approach for chest X-ray image segmentation and enhanced multi-label chest disease classification. Artif. Intell. Med. 2025, 165, 103135.
- Dong, N.; Kampffmeyer, M.; Su, H.; Xing, E. An exploratory study of self-supervised pre-training on partially supervised multi-label classification on chest X-ray images. Appl. Soft Comput. 2024, 163, 111855.
- Chae, G.; Lee, J.; Kim, S.B. Contrastive learning with hard negative samples for chest X-ray multi-label classification. Appl. Soft Comput. 2024, 165, 112101.
- Verma, S.S.; Prasad, A.; Kumar, A. CovXmlc: High performance COVID-19 detection on X-ray images using multi-model classification. Biomed. Signal Process. Control 2022, 71, 103272.
- Cheng, Y.C.; Hung, Y.C.; Huang, G.H.; Chen, T.B.; Lu, N.H.; Liu, K.Y.; Lin, K.H. Deep Learning–Based Object Detection Strategies for Disease Detection and Localization in Chest X-Ray Images. Diagnostics 2024, 14, 2636.
- Fan, W.; Yang, Y.; Qi, J.; Zhang, Q.; Liao, C.; Wen, L.; Wang, S.; Wang, G.; Xia, Y.; Wu, Q.; et al. A Deep-Learning-Based Framework for Identifying and Localizing Chest Abnormalities. Nat. Commun. 2024, 15, 5599.
- Iqbal, H.; Khan, A.; Nepal, N.; Khan, F.; Moon, Y.K. Deep Learning Approaches for Chest Radiograph Interpretation: A Survey. Electronics 2024, 13, 4688.
- Oltu, B.; Güney, S.; Esen Yuksel, S.; Dengiz, B. Automated classification of chest X-rays: A deep learning approach with attention mechanisms. BMC Med. Imaging 2025, 25, 71.
- Alam, M.S.; Wang, D.; Sowmya, A. DLA-Net: Dual lesion attention network for classification of pneumoconiosis using chest X-ray images. Sci. Rep. 2024, 14, 11616.
- Zhang, N.; Liu, Z.; Zhang, E.; Chen, Y.; Yue, J. An ESG-ConvNeXt network for steel surface defect classification based on hybrid attention mechanism. Sci. Rep. 2025, 15, 10926.
- Li, J.; Li, P.; Hu, X.; Yu, K. Learning common and label-specific features for multi-Label classification with correlation information. Pattern Recognit. 2022, 121, 108259.
- Han, M.; Wu, H.; Chen, Z.; Li, M.; Zhang, X. A survey of multi-label classification based on supervised and semi-supervised learning. Int. J. Mach. Learn. Cybern. 2023, 14, 697–724.
- Liu, Y.; Xing, W.; Zhao, M.; Lin, M. A new classification method for diagnosing COVID-19 pneumonia based on joint CNN features of chest X-ray images and parallel pyramid MLP-Mixer module. Neural Comput. Appl. 2023, 35, 17187–17199.
- Azad, R.; Kazerouni, A.; Heidari, M.; Aghdam, E.K.; Molaei, A.; Jia, Y.; Jose, A.; Roy, R.; Merhof, D. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review. Med. Image Anal. 2024, 91, 103000.
- Chen, J.; Er, M.J. Dynamic YOLO for Small Underwater Object Detection. Artif. Intell. Rev. 2024, 57, 165.
Disease | CONVFCMAE AUC (Ours) | MobileNetV1 AUC [26] | EfficientNet AUC [28] | PCSANet AUC [25] |
---|---|---|---|---|
Atelectasis | 0.84 | 0.88 | 0.81 | 0.80 |
Consolidation | 0.81 | 0.89 | 0.81 | 0.80 |
Infiltration | 0.73 | 0.87 | 0.71 | 0.69 |
Pneumothorax | 0.92 | 0.79 | 0.89 | 0.85 |
Edema | 0.89 | 0.82 | 0.90 | 0.88 |
Emphysema | 0.92 | 0.69 | 0.93 | 0.89 |
Fibrosis | 0.81 | 0.80 | 0.82 | 0.81 |
Effusion | 0.89 | 0.88 | 0.87 | 0.87 |
Pneumonia | 0.78 | 0.80 | 0.76 | 0.75 |
Pleural Thickening | 0.82 | 0.90 | 0.81 | 0.76 |
Cardiomegaly | 0.91 | 0.73 | 0.91 | 0.91 |
Nodule | 0.81 | 0.75 | 0.77 | 0.75 |
Mass | 0.88 | 0.78 | 0.85 | 0.82 |
Hernia | 0.92 | 0.72 | 0.89 | 0.91 |
Mean AUC | 0.852 | 0.807 | 0.837 | 0.820 |
Model | Mean AUC |
---|---|
DenseNet-121 (CheXNet) | 0.810 |
EfficientNet-B1 | 0.840 |
PCSANet | 0.825 |
MobileNetV1 | 0.810 |
CONVFCMAE (Ours) | 0.852 |
Epoch | Train Loss | Valid Loss | Accuracy | F1 Score | ROC–AUC | Time |
---|---|---|---|---|---|---|
1 | 0.153 | 0.150 | 0.949 | 0.118 | 0.818 | 19:13 |
2 | 0.153 | 0.149 | 0.949 | 0.159 | 0.829 | 19:24 |
3 | 0.144 | 0.142 | 0.951 | 0.129 | 0.847 | 19:25 |
4 | 0.130 | 0.139 | 0.951 | 0.183 | 0.852 | 19:34 |
5 | 0.128 | 0.137 | 0.950 | 0.181 | 0.851 | 19:27 |
Epoch | MLP-Mixer [41] | Transformer [42] | Dynamic YOLO [43] | Advanced Head (Ours) | Baseline Linear |
---|---|---|---|---|---|
1 | 0.7939 | 0.7326 | 0.7821 | 0.8176 | 0.7927 |
2 | 0.8047 | 0.7465 | 0.7930 | 0.8286 | 0.8086 |
3 | 0.8116 | 0.7659 | 0.8006 | 0.8467 | 0.8167 |
4 | 0.8160 | 0.7768 | 0.8056 | 0.8522 | 0.8196 |
5 | 0.8166 | 0.7789 | 0.8058 | 0.8523 | 0.8197 |
Configuration | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 |
---|---|---|---|---|---|
kernel = 1, activ = sigmoid, heads = 4, drop = 0.25 | 0.8176 | 0.8286 | 0.8467 | 0.8523 | 0.8513
kernel = 1, activ = tanh, heads = 4, drop = 0.25 | 0.7971 | 0.8163 | 0.8251 | 0.8346 | 0.8360
kernel = 1, activ = softmax, heads = 4, drop = 0.25 | 0.8068 | 0.8157 | 0.8276 | 0.8371 | 0.8401
kernel = 3, activ = sigmoid, heads = 4, drop = 0.25 | 0.7982 | 0.8188 | 0.8241 | 0.8318 | 0.8327
kernel = 1, activ = sigmoid, heads = 1, drop = 0.25 | 0.8019 | 0.8181 | 0.8252 | 0.8366 | 0.8397
kernel = 1, activ = sigmoid, heads = 2, drop = 0.25 | 0.8045 | 0.8165 | 0.8267 | 0.8317 | 0.8361
kernel = 1, activ = sigmoid, heads = 8, drop = 0.25 | 0.8063 | 0.8178 | 0.8308 | 0.8398 | 0.8379
kernel = 1, activ = sigmoid, heads = 4, drop = 0 | 0.8039 | 0.8204 | 0.8281 | 0.8336 | 0.8355
kernel = 1, activ = sigmoid, heads = 4, drop = 0.5 | 0.7947 | 0.8150 | 0.8170 | 0.8256 | 0.8295
Configuration | ROC–AUC | Gain over Baseline |
---|---|---|
1. Baseline Linear Head | 0.8197 | - |
2. Advanced Head (Full) | 0.8523 | +3.26% |
3. w/o AttnPool | 0.8401 | +2.04% |
4. w/o SelfAttn | 0.8486 | +2.89% |
5. w/o MLP | 0.7958 | –2.39% |
6. w/o AttnPool and SelfAttn | 0.8506 | +3.09% |
7. w/o AttnPool and MLP | 0.8260 | +0.63% |
8. w/o SelfAttn and MLP | 0.7850 | –3.47% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Xiong, K.; Tu, Y.; Rao, X.; Zou, X.; Du, Y. Multi-Label Disease Detection in Chest X-Ray Imaging Using a Fine-Tuned ConvNeXtV2 with a Customized Classifier. Informatics 2025, 12, 80. https://doi.org/10.3390/informatics12030080