LMeRAN: Label Masking-Enhanced Residual Attention Network for Multi-Label Chest X-Ray Disease Aided Diagnosis
Abstract
1. Introduction
- To the best of our knowledge, this work is the first to incorporate a label mask training strategy into CXR image classification, enabling the model to effectively capture inter-disease correlations and thereby improve both the precision and robustness of predictions (a minimal sketch of this strategy follows this list).
- We design a novel label-specific residual attention mechanism that simultaneously emphasizes disease-relevant features and retains crucial global image context. Furthermore, we provide visual explanations by highlighting image regions most influential in the model’s decisions, enhancing interpretability and transparency.
- We perform comprehensive evaluations on the publicly available ChestX-ray14 dataset to demonstrate the superiority of LMeRAN and assess the individual contributions of its components through ablation studies.
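To make the label mask training idea concrete, below is a minimal PyTorch-style sketch, assuming a model whose forward pass accepts image features together with per-label state codes (unknown/known-negative/known-positive). The function name `label_mask_training_step`, the state encoding, and `logits_fn` are illustrative assumptions rather than the paper's exact interface; the default mask ratio of 0.25 is also illustrative (the sensitivity analysis in Section 4.5.3 peaks at R = 0.25, which we read as the masking ratio, though that reading is an assumption).

```python
import torch
import torch.nn.functional as F

def label_mask_training_step(logits_fn, image_feats, labels, mask_ratio=0.25):
    """One label mask training (LMT) step: hide a random subset of
    ground-truth label states and train the model to recover them from
    the image and from the remaining, visible labels."""
    batch, num_labels = labels.shape
    # Randomly select the labels to mask for each sample.
    masked = torch.rand(batch, num_labels, device=labels.device) < mask_ratio
    # State codes: 1 = known-negative, 2 = known-positive, 0 = masked/unknown.
    states = labels.long() + 1
    states[masked] = 0
    # The model predicts every label from image features plus the visible
    # label states, so it can exploit inter-disease correlations.
    logits = logits_fn(image_feats, states)
    # Binary cross-entropy only on the masked positions.
    loss = F.binary_cross_entropy_with_logits(
        logits[masked], labels[masked].float())
    return loss
```

Here `logits_fn` stands in for LMeRAN's forward pass; in the full model the visible label states are embedded (Section 3.2) and fused with image features through the label-specific residual attention of Section 3.3.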
2. Related Work
2.1. Deep Learning-Based CXR Image Classification
2.2. Attention in CXR Image Classification
2.3. Label Dependency in Multi-Label Image Classification
3. Proposed Method
3.1. Image Feature Extraction
3.2. Embedding Disease Label and State
3.3. Label-Specific Residual Attention
3.4. Label Mask Training Loss
4. Experiments
4.1. Dataset
4.2. Evaluation Metrics
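Results in Section 4.5 are reported as per-disease AUC together with their unweighted mean (mAUC). A minimal sketch of this computation with scikit-learn, assuming `y_true` and `y_score` are N × 14 arrays of binary ground truth and sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_auc(y_true: np.ndarray, y_score: np.ndarray):
    """Per-disease ROC-AUC and their unweighted mean (mAUC)."""
    aucs = np.array([roc_auc_score(y_true[:, c], y_score[:, c])
                     for c in range(y_true.shape[1])])
    return aucs, aucs.mean()
```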
4.3. Experimental Setting
4.4. Baselines
- Wang et al. [16] use a pre-trained CNN as a feature extractor and focus on training only the transition and classification layers in the model for weakly supervised classification and localization of common thorax diseases.
- Yao et al. [21] employ a multi-resolution analysis approach combined with weakly supervised learning. The model integrates features extracted at multiple resolutions to improve the accuracy of medical diagnosis and localization tasks.
- Ma et al. [26] design a multi-attention network for thoracic disease classification and localization, utilizing multiple attention mechanisms to better capture relevant features. The model enhances both classification accuracy and localization precision by focusing on critical areas in the images.
- Peng et al. [32] propose the Conformer model, which merges the local feature extraction capabilities of CNNs with the global representation power of vision transformers. The model effectively captures both local details and long-distance feature dependencies by combining convolution operations with self-attention mechanisms, enhancing representation learning.
- Wu et al. [33] introduce CTransCNN, a model that combines CNNs and Transformers for multi-label medical image classification. It includes a multi-head attention feature module, a multi-branch residual module, and an information interaction module, which together improve label correlation exploration, model optimization, and feature transmission.
4.5. Results and Analysis
4.5.1. Performance Comparison with Baselines
4.5.2. Ablation Experiment
- LMeRAN (w/o LMT): LMeRAN model with the label mask training excluded.
- LMeRAN (w/o LSRA): LMeRAN model with the label-specific residual attention excluded.
- LMeRAN (w/o LMT + LSRA): LMeRAN model with both the label mask training and label-specific residual attention excluded.
- LMeRAN (Complete): The complete LMeRAN model, including all components.
- LMeRAN (w/o LMT + LSRA): When both components are excluded, the mAUC is 0.792, serving as the baseline performance.
- LMeRAN (w/o LMT): Including only the LSRA component raises the mAUC to 0.813. This result indicates that LSRA effectively sharpens the model’s focus on discriminative image regions, thereby improving feature representations of disease-relevant areas.
- LMeRAN (w/o LSRA): When only the LMT component is included, the mAUC increases to 0.804. This enhancement suggests that LMT effectively captures the interdependencies among disease labels, allowing the model to leverage label correlations during training to refine its predictions.
- LMeRAN (Complete): When both LMT and LSRA components are included, the mAUC reaches its highest value of 0.825. This confirms that the joint contribution of both components leads to superior classification performance, effectively combining label dependency modeling and enhanced image feature representation to maximize predictive accuracy.
Model | LMT | LSRA | mAUC
---|---|---|---
LMeRAN (w/o LMT + LSRA) | × | × | 0.792
LMeRAN (w/o LMT) | × | √ | 0.813
LMeRAN (w/o LSRA) | √ | × | 0.804
LMeRAN (Complete) | √ | √ | 0.825
4.5.3. Parameter Sensitivity Experiment
4.5.4. Interpretability Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Weiss, J.; Raghu, V.K.; Bontempi, D.; Christiani, D.C.; Mak, R.H.; Lu, M.T.; Aerts, H.J. Deep Learning to Estimate Lung Disease Mortality from Chest Radiographs. Nat. Commun. 2023, 14, 2797.
- Wei, Y.J.; Pan, N.; Chen, Y.; Lv, P.J.; Gao, J.B. A Study Using Deep Learning-Based Computer-Aided Diagnostic System with Chest Radiographs-Pneumothorax and Pulmonary Nodules Detection. J. Clin. Radiol. 2021, 40, 252–257.
- Guo, H.; Li, M.Y.; Zhu, P.Z.; Wang, X.M.; Zhou, X. An Early Screening System for Lung Lesions in Chest X-Ray Images Based on AI Algorithms. Imaging Sci. Photochem. 2025, 43, 134–144.
- Çallı, E.; Sogancioglu, E.; van Ginneken, B.; van Leeuwen, K.G.; Murphy, K. Deep Learning for Chest X-Ray Analysis: A Survey. Med. Image Anal. 2021, 72, 102125.
- Chen, B.; Zhang, Z.; Lin, J.; Chen, Y.; Lu, G. Two-Stream Collaborative Network for Multi-Label Chest X-Ray Image Classification with Lung Segmentation. Pattern Recognit. Lett. 2020, 135, 221–227.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Guan, Q.; Huang, Y.; Luo, Y.; Liu, P.; Xu, M.; Yang, Y. Discriminative Feature Learning for Thorax Disease Classification in Chest X-Ray Images. IEEE Trans. Image Process. 2021, 30, 2476–2487.
- Sanida, T.; Dasygenis, M. A Novel Lightweight CNN for Chest X-Ray-Based Lung Disease Identification on Heterogeneous Embedded System. Appl. Intell. 2024, 54, 4756–4780.
- Saednia, K.; Jalalifar, A.; Ebrahimi, S.; Sadeghi-Naini, A. An Attention-Guided Deep Neural Network for Annotating Abnormalities in Chest X-Ray Images: Visualization of Network Decision Basis. In Proceedings of the 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Montreal, QC, Canada (Online), 20–24 July 2020; pp. 1258–1261.
- Guan, Q.; Huang, Y.; Zhong, Z.; Zheng, Z.; Zheng, L.; Yang, Y. Thorax Disease Classification with Attention Guided Convolutional Neural Network. Pattern Recognit. Lett. 2020, 131, 38–45.
- Jiang, X.; Zhu, Y.; Cai, G.; Zheng, B.; Yang, D. MXT: A New Variant of Pyramid Vision Transformer for Multi-Label Chest X-Ray Image Classification. Cogn. Comput. 2022, 14, 1362–1377.
- Khater, O.H.; Shuaib, A.S.; Haq, S.U.; Siddiqui, A.J. AttCDCNet: Attention-Enhanced Chest Disease Classification Using X-Ray Images. In Proceedings of the 22nd IEEE International Multi-Conference on Systems, Signals & Devices, Monastir, Tunisia, 17–20 February 2025; pp. 891–896.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Lanchantin, J.; Wang, T.; Ordonez, V.; Qi, Y. General Multi-Label Image Classification with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16478–16488.
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2097–2106.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
- Seibold, C.; Reiß, S.; Sarfraz, M.S.; Stiefelhagen, R.; Kleesiek, J. Breaking with Fixed-Set Pathology Recognition through Report-Guided Contrastive Training. In Proceedings of the 25th International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 690–700.
- Yao, L.; Prosky, J.; Poblenz, E.; Covington, B.; Lyman, K. Weakly Supervised Medical Diagnosis and Localization from Multiple Resolutions. arXiv 2018, arXiv:1803.07703.
- Yang, M.; Tanaka, H.; Ishida, T. Performance Improvement in Multi-Label Thoracic Abnormality Classification of Chest X-Rays with Noisy Labels. Int. J. Comput. Assist. Radiol. Surg. 2023, 18, 181–189.
- Chen, Y.; Wan, Y.; Pan, F. Enhancing Multi-Disease Diagnosis of Chest X-Rays with Advanced Deep-Learning Networks in Real-World Data. J. Digit. Imaging 2023, 36, 1332–1347.
- Ishwerlal, R.D.; Agarwal, R.; Sujatha, K.S. Lung Disease Classification Using Chest X-Ray Image: An Optimal Ensemble of Classification with Hybrid Training. Biomed. Signal Process. Control 2024, 91, 105941.
- Öztürk, Ş.; Turalı, M.Y.; Çukur, T. HydraViT: Adaptive Multi-Branch Transformer for Multi-Label Disease Classification from Chest X-Ray Images. Biomed. Signal Process. Control 2025, 100, 106959.
- Ma, Y.; Zhou, Q.; Chen, X.; Lu, H.; Zhao, Y. Multi-Attention Network for Thoracic Disease Classification and Localization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 1378–1382.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Guan, Q.; Huang, Y. Multi-Label Chest X-Ray Image Classification via Category-Wise Residual Attention Learning. Pattern Recognit. Lett. 2020, 130, 259–266.
- Wang, H.; Wang, S.; Qin, Z.; Zhang, Y.; Li, R.; Xia, Y. Triple Attention Learning for Classification of 14 Thoracic Diseases Using Chest Radiography. Med. Image Anal. 2021, 67, 101846.
- Taslimi, S.; Taslimi, S.; Fathi, N.; Salehi, M.; Rohban, M.H. SwinCheX: Multi-Label Classification on Chest X-Ray Images with Transformers. arXiv 2022, arXiv:2206.04246.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
- Peng, Z.; Guo, Z.; Huang, W.; Wang, Y.; Xie, L.; Jiao, J. Conformer: Local Features Coupling Global Representations for Recognition and Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9454–9468.
- Wu, X.; Feng, Y.; Xu, H.; Lin, Z.; Chen, T.; Li, S.; Qiu, S.; Liu, Q.; Ma, Y.; Zhang, S. CTransCNN: Combining Transformer and CNN in Multilabel Medical Image Classification. Knowl.-Based Syst. 2023, 281, 111030.
- Song, L.; Liu, J.; Qian, B.; Sun, M.; Yang, K.; Sun, M. A Deep Multi-Modal CNN for Multi-Instance Multi-Label Image Classification. IEEE Trans. Image Process. 2018, 27, 6025–6038.
- Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; Xu, W. CNN-RNN: A Unified Framework for Multi-Label Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2285–2294.
- Lee, Y.W.; Huang, S.K.; Chang, R.F. CheXGAT: A Disease Correlation-Aware Network for Thorax Disease Diagnosis from Chest X-Ray Images. Artif. Intell. Med. 2022, 132, 102382.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
- Ali, M.; Khan, S. CLIP-Decoder: Zeroshot Multilabel Classification Using Multimodal CLIP Aligned Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 4675–4679.
- Wang, A.; Chen, H.; Lin, Z.; Ding, Z.; Liu, P.; Bao, Y.; Yan, W.; Ding, G. Hierarchical Prompt Learning Using CLIP for Multi-Label Classification with Single Positive Labels. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5594–5604.
Label | Quantity | Frequency
---|---|---
No Finding | 60,361 | 42.65%
Infiltration | 19,894 | 14.06%
Effusion | 13,317 | 9.41%
Atelectasis | 11,559 | 8.17%
Nodule | 6,331 | 4.47%
Mass | 5,782 | 4.09%
Pneumothorax | 5,302 | 3.75%
Consolidation | 4,667 | 3.30%
Pleural Thickening | 3,385 | 2.39%
Cardiomegaly | 2,776 | 1.96%
Emphysema | 2,516 | 1.78%
Edema | 2,303 | 1.63%
Fibrosis | 1,686 | 1.19%
Pneumonia | 1,431 | 1.01%
Hernia | 227 | 0.16%
Parameter | Value
---|---
optimizer | Adam
learning_rate | 0.00001
dropout_rate | 0.1
batch_size | 64
epochs | 50
number of layers | 4
number of heads | 4
λ | 0.2
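For orientation, a minimal sketch of how the settings in the table above might be wired up in PyTorch; the stand-in encoder, its embedding width, and the role of λ are assumptions, since the exact training script is not reproduced here.

```python
import torch
import torch.nn as nn

# Stand-in transformer encoder mirroring the table: 4 layers, 4 heads,
# dropout 0.1. The real LMeRAN architecture is described in Section 3;
# this only shows where the listed hyperparameters plug in.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4,  # d_model is assumed
                                   dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-5)  # lr = 0.00001

EPOCHS, BATCH_SIZE = 50, 64
LAMBDA = 0.2  # λ from the table; assumed to weight a term of the training loss
```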
Disease | Wang et al. [16] | Yao et al. [21] | Ma et al. [26] | Conformer [32] | CTransCNN [33] | LMeRAN (Ours) |
---|---|---|---|---|---|---|
Atelectasis | 0.700 | 0.733 | 0.763 | 0.727 | 0.748 | 0.777 |
Cardiomegaly | 0.810 | 0.865 | 0.884 | 0.907 | 0.900 | 0.886 |
Effusion | 0.759 | 0.806 | 0.816 | 0.806 | 0.837 | 0.836 |
Infiltration | 0.661 | 0.673 | 0.679 | 0.697 | 0.707 | 0.703 |
Mass | 0.693 | 0.718 | 0.801 | 0.761 | 0.781 | 0.830 |
Nodule | 0.669 | 0.777 | 0.729 | 0.728 | 0.742 | 0.783 |
Pneumonia | 0.658 | 0.684 | 0.710 | 0.623 | 0.630 | 0.755 |
Pneumothorax | 0.799 | 0.805 | 0.838 | 0.831 | 0.847 | 0.885 |
Consolidation | 0.703 | 0.711 | 0.744 | 0.700 | 0.731 | 0.759 |
Edema | 0.805 | 0.806 | 0.841 | 0.828 | 0.858 | 0.860 |
Emphysema | 0.833 | 0.842 | 0.884 | 0.815 | 0.856 | 0.931 |
Fibrosis | 0.786 | 0.743 | 0.801 | 0.804 | 0.778 | 0.830 |
Pleural Thickening | 0.684 | 0.724 | 0.754 | 0.681 | 0.690 | 0.794 |
Hernia | 0.872 | 0.775 | 0.876 | 0.786 | 0.881 | 0.918 |
mAUC | 0.745 | 0.761 | 0.794 | 0.764 | 0.785 | 0.825 |
 | H = 1 | H = 2 | H = 4 | H = 6 | H = 8
---|---|---|---|---|---
mAUC | 0.819 | 0.822 | 0.825 | 0.824 | 0.824
 | R = 0 | R = 0.25 | R = 0.50 | R = 0.75
---|---|---|---|---
mAUC | 0.818 | 0.825 | 0.821 | 0.817