Article

Convolutional Neural Network–Vision Transformer Architecture with Gated Control Mechanism and Multi-Scale Fusion for Enhanced Pulmonary Disease Classification

1 Department of Human Ecology & Technology, Handong Global University, Pohang 37554, Republic of Korea
2 School of Global Entrepreneurship and Information Communication Technology, Handong Global University, Pohang 37554, Republic of Korea
* Author to whom correspondence should be addressed.
Diagnostics 2024, 14(24), 2790; https://doi.org/10.3390/diagnostics14242790
Submission received: 5 November 2024 / Revised: 9 December 2024 / Accepted: 11 December 2024 / Published: 12 December 2024
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Background/Objectives: Vision Transformers (ViTs) and convolutional neural networks (CNNs) have demonstrated remarkable performance in image classification, especially in the domain of medical imaging analysis. However, ViTs struggle to capture high-frequency components of images, which are critical in identifying fine-grained patterns, while CNNs have difficulties in capturing long-range dependencies due to their local receptive fields, which makes it difficult to fully capture the spatial relationships across lung regions. Methods: In this paper, we propose a hybrid architecture that integrates ViTs and CNNs within modular component blocks to leverage both local feature extraction and global context capture. In each component block, the CNN is used to extract the local features, which are then passed through the ViT to capture the global dependencies. We implemented a gated attention mechanism that combines channel-, spatial-, and element-wise attention to selectively emphasize the important features, thereby enhancing overall feature representation. Furthermore, we incorporated a multi-scale fusion module (MSFM) into the proposed framework to fuse the features at different scales for a more comprehensive feature representation. Results: Our proposed model achieved an accuracy of 99.50% in the classification of four pulmonary conditions. Conclusions: Through extensive experiments and ablation studies, we demonstrated the effectiveness of our approach in improving medical image classification performance, while achieving good calibration results. This hybrid approach offers a promising framework for reliable and accurate disease diagnosis in medical imaging.

1. Introduction

In recent years, global health has witnessed a notable increase in the prevalence of lung and respiratory-related conditions, affecting a significant proportion of the population. These conditions pose a substantial challenge, necessitating the swift and precise detection and diagnosis of various lung diseases, such as Pneumonia, COVID-19, and tuberculosis (TB). The timely administration of effective treatment is crucial, making the use of advanced diagnostic tools essential [1]. Pulmonary diseases are particularly challenging to detect and diagnose in the early stages due to their deceptive initial symptoms, requiring rigorous and often extensive diagnostic procedures [2]. The accurate diagnosis of pulmonary diseases in the early stages is vital for patient management and care, yet the process is often complicated and prone to errors. The World Health Organization (WHO) recommends the use of chest radiography, such as chest X-ray (CXR), magnetic resonance imaging (MRI), and computed tomography (CT) images, as the principal modalities for diagnosing and screening these diseases because of their notable sensitivity [3]. However, interpreting these images is labor-intensive, subject to individual bias, lacks specificity, and is prone to misdiagnosis due to similarities in the radiologic patterns among various lung diseases [2].
To address these challenges, AI-based computer-aided diagnosis (CAD) systems have been developed to automatically diagnose and detect pulmonary diseases using chest radiography [4]. These CAD systems employ advanced deep learning models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, in the analysis, segmentation, and classification of pulmonary diseases.
The emergence of deep learning models marks a transformative period in the field of medical image analysis, holding considerable promise in the detection of pulmonary and other diseases [5]. These models make medical imaging analysis faster and easier, offering a promising avenue for improving the accuracy and efficiency of pulmonary disease detection [6]. However, CNNs and transformer models have their inherent limitations. CNNs are limited in capturing long-range dependencies [7], while transformer models, such as Vision Transformers (ViTs) [8], have a limited capability of capturing low-level features and are data-hungry [9]. Recent surveys [9,10,11] on hybrid CNN-ViT architectures, such as ResNet-ViT and UNet-ViT, highlighted the growing importance of hybrid architectures that combined the strengths of CNNs and ViTs to address these limitations. These hybrid vision transformers (HVTs) integrated CNN layers to capture local features, while leveraging the transformers’ ability to learn long-range dependencies, thereby providing more comprehensive feature representation. HVTs have demonstrated significant potential in computer vision, especially in medical imaging tasks, including segmentation and classification, by overcoming challenges such as data inefficiency and a lack of image-related inductive bias [10,11] through incorporating both the global and local contexts. However, the existing models mostly integrated ViTs with a single CNN model, limiting the sufficient extraction of various features from medical images.
To overcome these limitations, we proposed a gated hybrid framework that fuses the feature extraction capabilities of multiple component blocks consisting of CNN and ViT encoder architectures. In the framework, CNN is used to extract the local features, while the ViT is used to extract the long-range dependencies. Different component blocks use different CNN models to sufficiently extract various features from medical images. Then, the significant features are selectively extracted through a gated control mechanism and are then fused with a multi-scale fusion module for more comprehensive feature representation. The main contributions of this study are outlined as follows:
  • We propose a hybrid architecture that allows for the integration of any CNN and ViT encoder within component blocks. Each component block is carefully designed to capture local features and long-range dependencies effectively.
  • We introduce a gated attention control mechanism, which selectively emphasizes the important features through channel-, spatial-, and element-wise attention. This mechanism modulates and refines the feature representations by dynamically controlling the flow of relevant information.
  • We present a multi-scale fusion module that captures single-level and multi-level features. This module uses an Inception-style design to combine fine-grained, medium-scale, and large-scale features across multiple branches, ensuring more comprehensive feature representation.

2. Related Work

2.1. Pulmonary Disease Detection Based on CNN Architecture

Mousavi et al. [12] proposed a COVID-19 detection framework using respiratory sounds (coughing) and medical images within the Internet of Health Things (IoHT). The authors employed two datasets of CXR and CT images to fine-tune pre-trained InceptionResNetV2, InceptionV3, and EfficientNetB4 models for three-class classification tasks. Rajaraman and Antani [13] introduced a modality-specific deep learning model ensemble to enhance TB detection using CXR images. The authors combined custom-built CNNs and pre-trained CNNs to learn the modality-specific features. The predictions from the best-performing models were combined using ensemble methods, which improved the classification performance compared with an individual model. Vinayakumar et al. [14] proposed a multichannel ensemble framework using EfficientNet-based models (EfficientNetB0, EfficientNetB1, and EfficientNetB2) to extract features. These features were then fused and passed into a stacked ensemble learning classifier, enhancing the overall classification performance. Sasikaladevi and Revathi [15] developed a custom deep learning framework for the early detection and prognosis of TB from CXR images. Their framework employed a deep Fused Linear Triangulation (FLT) approach to handle intraclass variation and interclass similarities, accurately visualizing the infected regions in CXR images without requiring segmentation. Urooj et al. [16] presented a stochastic learning-based Artificial Neural Network (ANN) model using CXR images to detect TB. The method introduced random variations into the network by assigning stochastic transfer functions or weights, effectively detecting abnormalities in the CXR images across various levels of TB complexity. While conventional CNN-based models effectively capture local features, they lack the mechanisms for emphasizing relevant information or integrating multi-scale features. None of these models incorporated a gated attention control mechanism or a multi-scale fusion module, which limited their capabilities in prioritizing significant features or combining fine-grained, medium-scale, and large-scale features.

2.2. Pulmonary Disease Detection Based on Transformer Architecture

Mabrouk et al. [17] proposed an ensemble learning method for Pneumonia detection in CXR images, leveraging three pre-trained models: DenseNet169, MobileNetV2, and Vision Transformer (ViT). The authors argued that combining the features from these models using a probability-based ensemble approach significantly enhanced the classification performance. The ensemble method used by the authors inherently increased the computational complexity during inference, which could limit its deployment in resource-constrained environments. Sun et al. [18] introduced a convolutional transformer model for lung disease classification based on CXR images. They modified the transformer encoder’s attention mechanism by replacing it with an axial attention module and assigning a position offset term, resulting in an improved classification performance. Though the model demonstrated its effectiveness on a small dataset, the model lacked generalizability to more diverse datasets. Real-world medical imaging datasets often include a broader spectrum of conditions, varying image quality, and class imbalance. The model’s effectiveness in such scenarios remains untested, which limits its clinical applicability. Ukwuoma et al. [19] proposed a ViT-based model for lung disease classification in CXR images. The model used an ensemble technique to derive features, followed by global second-order pooling to extract the higher-order global features. This approach combined deep feature extraction with global feature representation for enhanced classification accuracy. The combination of multiple features from different CNN models using an ensemble approach could potentially enhance the model’s performance by leveraging diverse feature representation. However, there is no mechanism or control (gate) to filter or prioritize the relevant features during the concatenation process. This lack of feature relevance control could result in the inclusion of less-relevant or redundant features, thereby limiting the model’s overall performance. Moreover, the indiscriminate concatenation of features increased the dimensionality of the feature space, potentially introducing noise and making the model more prone to overfitting, especially on a small dataset. Ren et al. [20] introduced the ResNet-50 merged transformer (RMT-Net), a model combining ResNet-50 with ViT architecture. This approach aimed to leverage the CNN’s ability to extract local features and the transformer’s capability to capture long-range features, thereby reducing the computational cost and accelerating detection. The RMT-Net featured a four-stage block design, with global self-attention applied in the first three stages and residual blocks in the fourth stage for feature extraction. While this approach successfully captured both the global and local features, a significant limitation was the absence of adaptive feature selection mechanisms. The model did not incorporate strategies, such as attention gating or feature pruning, to control the features prioritized for classification. As a result, irrelevant or redundant features might be involved in the decision-making process, potentially degrading the model’s generalization and performance on unseen datasets.

2.3. Gated Mechanisms

Gated mechanisms facilitate easier gradient back-propagation through depth or time [21]. The primary idea behind a gating module is to control information flow based on the learned parameters, prioritizing the most relevant features for subsequent layers. Zhang et al. [22] introduced a framework for palmprint recognition that integrates CNNs and a transformer, leveraging the local extraction capabilities of CNNs and the global modeling strengths of transformers. This framework includes a gating mechanism and an adaptive feature fusion module, which filter and integrate the features extracted by the backbone network, ensuring a robust palmprint feature extraction and recognition performance. Schlemper et al. [23] proposed an attention gate (AG) model for medical image analysis designed to focus on target structures of varying shapes and sizes. AGs automatically suppress the irrelevant regions and highlight the salient features without requiring explicit tissue or organ localization modules. This mechanism can be seamlessly integrated into standard CNNs like VGG or U-Net, enhancing the model’s sensitivity and prediction accuracy with a minimal computational overhead. Fang and Han [24] developed an attention-modulated network based on the U-Net architecture, embedding spatial and channel attention modules. These modules highlight the interdependent channel maps and focus on the discriminant regions, adaptively emphasizing the relevant features and neglecting the irrelevant information. The authors also proposed aggregation approaches to integrate learned attention with raw feature maps, further enhancing the network’s ability to highlight the salient features and suppress noise. Valanarasu et al. [25] introduced a gated axial attention model, extending the existing architectures with an additional control mechanism in the self-attention module. This gating mechanism prioritizes relevant features during the attention process. Additionally, the authors proposed a local–global training strategy for medical images, operating on whole images to capture the global features and on patches for the local features, thereby improving the model’s overall performance. Despite their advancements, the existing gated mechanisms did not incorporate a unified gated attention control mechanism combining channel-, spatial-, and element-wise attention to modulate feature flow.

2.4. Attention Mechanisms

Woo et al. [26] proposed the convolutional block attention module (CBAM). The CBAM sequentially infers attention maps along two dimensions: channel and spatial. The channel attention module applies both global average pooling and global max pooling, followed by a shared multi-layer perceptron (MLP), while the spatial attention module applies a convolution over the concatenated average-pooled and max-pooled features along the channel axis. Hu et al. [27] proposed the squeeze-and-excitation (SE) attention block. SE attention adaptively recalibrates the channel-wise feature responses by explicitly modeling interdependencies between channels. The mechanism involves a squeeze block, which applies global average pooling to generate channel-wise statistics, and an excitation block, which contains a fully connected layer followed by a non-linearity (ReLU) and another fully connected layer followed by a sigmoid function. This generates weights for each channel. Recalibration is achieved by channel-wise multiplication of the original feature map with the generated weights. While attention mechanisms like CBAM and SE blocks improve feature representation, they are limited to static attention strategies and do not leverage dynamic control mechanisms. Furthermore, these modules do not integrate features at multiple scales, restricting their ability to handle multi-resolution data effectively.

2.5. Graph-Based Hybrid Models

Matlock et al. [28] introduced wave networks to address the limitation of the traditional graph convolutional networks (GCNs) in propagating long-range information across graphs. The authors demonstrated the superiority of wave networks over the traditional GCNs across three tasks: labelling the paths in graphs, solving mazes, and computing the voltages in circuits. The core idea was propagating information in waves across the graph via a breadth-first search, which allowed for more efficient long-range information propagation. Though the wave networks achieved a good performance, while requiring fewer parameters and computational resources than the traditional GCNs, the spectral computations required for wave networks can be computationally intensive, especially for large graphs. This limits their scalability to high-dimensional data, such as large-scale medical image datasets. Dong et al. [29] proposed a Dual-GCN framework for image captioning that integrates an object-level GCN and an image-level GCN. The object-level GCN extracts the spatial relationships between objects within an image, while the image-level GCN utilizes the similarities among multiple images to enhance global feature representation. These embeddings were combined and passed to a transformer-based linguistic decoder, enabling detailed and accurate image captioning. Additionally, the authors introduced a curriculum learning strategy to train the model by progressively incorporating more complex data samples, enhancing robustness and generalization. The Dual-GCN framework explicitly uses graph structures to model relationships, which is powerful, but computationally intensive, especially when generating global embedding from similar images due to the high computational demands of graph construction and similarity calculations.

3. Methods

3.1. Model Architecture

The proposed model architecture, as shown in Figure 1, was designed to leverage the strengths of CNNs and the ViT for image classification. The architecture consists of multiple CNN-ViT component blocks, an attention gate mechanism, a multi-scale fusion module, and a classification layer. The architecture is modular, flexible, and aims to extract robust multi-scale features, while allowing for feature selection and fusion.

3.2. Component Blocks

Each component block begins with a base CNN model designed to extract low-level spatial features. The input image of shape (H, W, C) (height, width, and channels) is passed to one of the CNN models to extract feature maps. The extracted feature maps are then reshaped into a sequence of (H/P) × (W/P) patches (with patch size P) with position embeddings, making the shape compatible with the ViT encoder. The proposed architecture allows for flexibility in the number of component blocks, and each block can use a different CNN architecture depending on the requirements and available computational resources. The sequential patches are then passed to the ViT encoder to capture long-range dependencies and global context information. The ViT encoder consists of a sequence of transformer encoder layers, each containing layer normalization, which standardizes the feature map inputs, reducing internal covariate shift and improving the stability of the learning process. The normalized features are then passed through a multi-head self-attention (MHA) mechanism, where the input sequence is split across multiple attention heads, each learning unique dependencies. For each head, the input features are projected into query, key, and value vectors, which are used to compute attention scores. Following the MHA, another normalization layer and then a multi-layer perceptron (MLP) layer are applied. The MLP layer comprises two fully connected layers with the Gaussian Error Linear Unit (GELU) activation, which introduces non-linearity and further enhances feature representation. To prevent overfitting, dropout layers are applied after each fully connected layer. Finally, residual connections are added around both the MHA and MLP layers. These connections help retain the original feature information, support gradient flow, and prevent the degradation of performance over multiple layers.
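To make the component block concrete, the following Keras sketch wires a pre-trained CNN backbone to a small ViT encoder stack with the hyperparameters listed later in Table 1. It is an illustrative approximation rather than the authors' exact implementation: the choice of EfficientNetB3, the learnable position-embedding layer, and tokenizing the CNN feature map directly (instead of re-patching it with a patch size of 8) are assumptions made for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers


class AddPositionEmbedding(layers.Layer):
    """Adds a learnable position embedding to a token sequence."""
    def build(self, input_shape):
        self.pos_emb = self.add_weight(name="pos_emb",
                                       shape=(1, input_shape[1], input_shape[2]),
                                       initializer="random_normal")

    def call(self, x):
        return x + self.pos_emb


def transformer_encoder(x, num_heads=2, key_dim=32, mlp_dim=64, dropout=0.5):
    # Multi-head self-attention sub-block with layer normalization and a residual connection
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim, dropout=dropout)(h, h)
    x = layers.Add()([x, h])
    # MLP sub-block (two dense layers with GELU and dropout) with a residual connection
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dropout(dropout)(h)
    h = layers.Dense(x.shape[-1])(h)
    h = layers.Dropout(dropout)(h)
    return layers.Add()([x, h])


def component_block(inputs, hidden_dim=32, num_layers=6):
    # 1) CNN backbone extracts local feature maps (pre-trained weights, all layers trainable)
    backbone = tf.keras.applications.EfficientNetB3(include_top=False, weights="imagenet",
                                                    input_tensor=inputs)
    fmap = backbone.output                                  # e.g., (7, 7, 1536) for a 224x224 input
    h, w, c = fmap.shape[1], fmap.shape[2], fmap.shape[3]
    # 2) Flatten spatial positions into a token sequence and project to the ViT hidden dimension
    tokens = layers.Reshape((h * w, c))(fmap)
    tokens = layers.Dense(hidden_dim)(tokens)
    tokens = AddPositionEmbedding()(tokens)
    # 3) ViT encoder stack captures long-range dependencies across the sequence
    for _ in range(num_layers):
        tokens = transformer_encoder(tokens)
    return tokens


inputs = layers.Input(shape=(224, 224, 3))
block_out = component_block(inputs)      # output of one CNN-ViT component block
```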

3.3. Gated Mechanism with Attention

The gated mechanism (Figure 2) in the proposed architecture refines and controls the flow of important features extracted from the component blocks. It integrates channel attention, element-wise attention, and spatial attention mechanisms, each of which captures different aspects of the input feature to ensure that only the most relevant information is passed forward. These attention mechanisms operate independently, and their results are concatenated to create a comprehensive and enhanced feature representation.

3.3.1. Channel-Wise Attention

The channel attention mechanism leverages global average pooling (AvgPool) and global max pooling (MaxPool) to highlight the most important channels in the input feature map X . The pooling operations summarize the feature information across all spatial locations for each channel. After pooling, the concatenated (Concat) feature map is passed through two fully connected layers, first with tanh activation, and then sigmoid ( σ ) activation, generating attention weights that emphasize the significant channels. These weights are then broadcast across the spatial locations, allowing for the model to scale each channel’s feature map by its importance.
Mathematically, the channel attention is computed as follows:
A_{\text{channel}} = \sigma\left(\text{Dense}_2\left(\tanh\left(\text{Dense}_1\left(\text{Concat}\left(\text{AvgPool}(X), \text{MaxPool}(X)\right)\right)\right)\right)\right)
A_channel is applied to the input feature map through element-wise multiplication, highlighting the important channels.
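The following is a minimal, eager-mode sketch of the channel gate in the equation above, assuming the gate operates on the (batch, tokens, channels) sequence produced by a component block; the reduction ratio of the first dense layer is an illustrative assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, reduction=4):
    # x: (batch, tokens, channels) features from a component block
    c = x.shape[-1]
    avg_pool = tf.reduce_mean(x, axis=1)                               # AvgPool over positions -> (batch, c)
    max_pool = tf.reduce_max(x, axis=1)                                # MaxPool over positions -> (batch, c)
    pooled = tf.concat([avg_pool, max_pool], axis=-1)                  # Concat(AvgPool(X), MaxPool(X))
    hidden = layers.Dense(c // reduction, activation="tanh")(pooled)   # Dense1 with tanh
    weights = layers.Dense(c, activation="sigmoid")(hidden)            # Dense2 with sigmoid -> A_channel
    return x * weights[:, tf.newaxis, :]                               # broadcast and multiply element-wise

x = tf.random.normal((2, 49, 32))
print(channel_attention(x).shape)   # (2, 49, 32)
```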

3.3.2. Element-Wise Attention

Element-wise attention is applied directly to each feature vector in the input feature map X . This mechanism generates attention scores for individual features within each spatial location. These scores are computed using two fully connected layers with tanh and sigmoid activations, like the channel attention mechanism. Element-wise attention generates attention scores by first projecting the feature map into a lower-dimensional space using linear transformation. Tanh activation emphasizes the most significant relationships by mapping the feature values within a symmetric range of [−1, 1]. This output is further scaled to [0, 1] using the sigmoid activation, which provides a probabilistic interpretation for the attention scores. This approach allows for the model to selectively emphasize or suppress specific elements in each feature vector.
A_{\text{element}} = \sigma\left(\text{Dense}_2\left(\tanh\left(\text{Dense}_1\left(W_e X + b_e\right)\right)\right)\right)
The input feature map X is first linearly transformed using a weight matrix W_e and bias b_e, capturing the relationships between the feature elements. The transformed features are then passed through the tanh function, which introduces non-linearity and ensures the output lies in the range [−1, 1]. The sigmoid function is applied to scale the values to [0, 1], making them suitable as attention weights. Element-wise attention is then applied via element-wise multiplication to the input feature map for high-level control of feature importance.
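The element-wise gate can be sketched in the same style; the width of the lower-dimensional projection is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def element_wise_attention(x, proj_dim=16):
    # x: (batch, tokens, channels)
    c = x.shape[-1]
    projected = layers.Dense(proj_dim)(x)                              # W_e X + b_e (lower-dimensional projection)
    hidden = layers.Dense(proj_dim, activation="tanh")(projected)      # Dense1 with tanh, values in [-1, 1]
    scores = layers.Dense(c, activation="sigmoid")(hidden)             # Dense2 with sigmoid, values in [0, 1]
    return x * scores                                                  # per-element modulation of the feature map

x = tf.random.normal((2, 49, 32))
print(element_wise_attention(x).shape)   # (2, 49, 32)
```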

3.3.3. Spatial-Wise Attention

Spatial attention captures the relationships between different spatial regions in the input feature map by reshaping it into its original 2D spatial form. The input feature map is reshaped, and global average pooling and global max pooling are applied across the channel dimension. The resulting pooled features are concatenated and processed by a 7 × 7 convolutional (Conv) layer with sigmoid activation to generate spatial attention weights. These weights are applied to the feature map to emphasize the important spatial regions.
A_{\text{spatial}} = \sigma\left(\text{Conv}_{7 \times 7}\left(\text{Concat}\left(\text{AvgPool}(X), \text{MaxPool}(X)\right)\right)\right)
The attended spatial features are then reshaped back into their original form and multiplied element-wise with the input feature map.
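A corresponding sketch of the spatial gate, assuming the token sequence maps back to a 7 × 7 grid (as it would for a 224 × 224 input with the backbones used here):

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x, height=7, width=7):
    # x: (batch, tokens, channels); tokens are assumed to form a height x width grid
    c = x.shape[-1]
    fmap = tf.reshape(x, (-1, height, width, c))                       # back to the 2D spatial form
    avg_pool = tf.reduce_mean(fmap, axis=-1, keepdims=True)            # pool across the channel axis
    max_pool = tf.reduce_max(fmap, axis=-1, keepdims=True)
    pooled = tf.concat([avg_pool, max_pool], axis=-1)                  # (batch, H, W, 2)
    attn = layers.Conv2D(1, kernel_size=7, padding="same",
                         activation="sigmoid")(pooled)                 # 7x7 conv -> A_spatial
    attended = fmap * attn                                             # emphasize important spatial regions
    return tf.reshape(attended, (-1, height * width, c))               # back to token-sequence form

x = tf.random.normal((2, 49, 32))
print(spatial_attention(x).shape)   # (2, 49, 32)
```

In the full gated mechanism, the outputs of the three gates above would then be concatenated along the feature dimension, as described in Section 3.3.4.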

3.3.4. Combining Attention

Finally, the outputs of the channel attention, element-wise attention, and spatial attention mechanisms are concatenated along the feature dimension to form the final, enhanced feature representation. The combined attention map ensures that the most significant channels, elements, and spatial regions are retained, enabling the model to emphasize the critical features, while suppressing less-relevant information.

3.4. Multi-Scale Fusion Module

The multi-scale fusion module in Figure 3 was designed to fuse multi-scale features by using an Inception [30]-style architecture. This module consists of several branches that capture information at different scales. Branch 1 uses 1 × 1 convolution to capture fine-grained and localized details from the feature maps. These operations reduce dimensionality and mitigate the computational cost. Branch 2 uses 3 × 3 convolution to capture medium-scale features, effectively balancing spatial resolution and contextual information. Branch 3 uses 5 × 5 convolution to capture broader and larger-scale features, which is crucial for identifying patterns spanning larger regions in the image. Branch 4 applies hybrid pooling [31], a combination of max pooling and Hartley spectral pooling [32], followed by 1 × 1 convolution to capture and preserve very large-scale features and the global context. The Hartley pooling technique transforms the input feature maps into the frequency domain using the discrete Hartley transform. By operating in the frequency domain, Hartley spectral pooling captures the global spatial structures, while filtering out high-frequency noise. This method retains more spatial information than max pooling, alleviating the resolution loss inherent in max pooling. In each of the branches, the ReLU activation function was applied to the convolution layers. These branches were concatenated along the channel axis, resulting in comprehensive feature representation across the different scales.
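The four branches can be sketched as follows. Branch 4 is simplified: the paper's hybrid max + Hartley spectral pooling is replaced here by plain max pooling as a stand-in, and the per-branch filter count is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_scale_fusion(fmap, filters=32):
    # fmap: (batch, H, W, C) feature map entering the fusion module
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(fmap)   # Branch 1: fine-grained detail
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(fmap)   # Branch 2: medium-scale features
    b3 = layers.Conv2D(filters, 5, padding="same", activation="relu")(fmap)   # Branch 3: large-scale patterns
    # Branch 4: pooling followed by 1x1 convolution for global context. The paper uses hybrid
    # max + Hartley spectral pooling; plain max pooling is used here as a simplified stand-in.
    b4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(fmap)
    b4 = layers.Conv2D(filters, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])                      # fuse along the channel axis

fmap = tf.random.normal((2, 7, 7, 96))
print(multi_scale_fusion(fmap).shape)   # (2, 7, 7, 128)
```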

3.5. Classification Layer

After the multi-scale fusion module, the output features were flattened and passed through two fully connected layers. Dropout layers were used to prevent overfitting during training. The final fully connected layer employed the SoftMax activation function to produce class probabilities, corresponding to the number of target classes for the classification tasks in this study.

4. Experimental Results and Discussion

4.1. Dataset Description

The dataset used consists of posteroanterior-view chest X-ray images, which were sourced from the publicly available databases described as follows:
  • TB class: TBX11k [33], the National Library of Medicine (NLM) dataset [34], the NIAID TB dataset [35], and the Diagnox and PRORAD URS datasets [36].
  • COVID-19 class: the COVIDx-CXR-3 dataset [37] and the Extensive COVID-19 X-Ray and CT Chest Images dataset [38].
  • Pneumonia class: the Pneumonia dataset [39] and the RSNA Pneumonia dataset [40].
  • Normal class: the normal class was assembled from the various datasets mentioned above.
For the training and evaluation of the proposed model, a combined dataset was used for each class to ensure more diverse and representative sample images, which contributed to the robustness and generalizability of the trained model. Additionally, pixel-to-pixel comparison was conducted to ensure no redundancy or repetition of images when combining the datasets. The combined dataset may have introduced biases due to the differences in annotation protocols across repositories, which may lead to variable image quality and inconsistent labels, impacting the model’s ability to generalize to unseen data. To mitigate these biases, we employed pixel normalization, which scales the pixel values of an image to a range of [0, 1] by dividing the pixel values by 255 (the maximum intensity value for an 8-bit image). The custom dataset comprises 19,621 samples for the COVID-19 class, 17,952 samples for healthy people, 10,285 samples for the Pneumonia class, and 2851 samples for the TB class. To enhance the model’s performance and prevent overfitting because of the data imbalance, we applied data augmentation during the data processing stage, as described in Section 4.5.

4.2. Experimental Parameters and Environment

For the experiments, we used two CNN-ViT component blocks, consisting of two base CNNs, EfficientNetB3 [41] and DenseNet-121 [42], respectively. We achieved the best results when we set all the CNN layers to be trainable and started with pre-trained weights, which helped leverage prior knowledge for better feature extraction. The hyperparameter configurations of the ViT encoder are summarized in Table 1. The choice of six ViT encoder layers and two multi-head self-attention blocks strikes a balance between capturing long-range dependencies and maintaining computational efficiency. The hidden dimension of 32 and the MLP dimension of 64 ensure compact, yet expressive representations, which help prevent overfitting. A dropout rate of 0.5 was applied consistently to mitigate overfitting by randomly disabling neurons during training. A patch size of eight and an image resolution of 224 × 224 pixels were selected to preserve the spatial details. In the classification layer, we used dense units of 1024 and 128 for the first and second layers, respectively, with a dropout rate of 0.5 after each layer. This configuration effectively balances feature dimensionality reduction with enhanced discriminative power. SoftMax activation was used in the classification layer, as it is suitable for multi-class classification. When compiling the model, we used the Adam optimizer and a categorical cross-entropy loss. For the learning rate, we set its initial value to 1 × 10−5 and then applied the ReduceLROnPlateau callback (factor = 0.1, patience = 5) from Keras to decrease the learning rate and enhance the convergence of the model. The experiments were performed on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
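The classification head and training setup described above translate roughly into the following sketch; the ReLU activations in the two dense layers and the quantity monitored by the learning-rate callback are assumptions not stated in the text, and fused_features, train_ds, and val_ds are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def classification_head(fused_features, num_classes=4):
    # Two fully connected layers (1024 and 128 units) with dropout 0.5, then SoftMax
    x = layers.Flatten()(fused_features)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    return layers.Dense(num_classes, activation="softmax")(x)

# Compilation and learning-rate schedule matching the settings reported in the text:
# model = tf.keras.Model(inputs, classification_head(fused_features))
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#               loss="categorical_crossentropy", metrics=["accuracy"])
# reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5)
# model.fit(train_ds, validation_data=val_ds, epochs=51, callbacks=[reduce_lr])
```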

4.3. Evaluation Metrics

The performance of our proposed model was evaluated using the following evaluation metrics:
Accuracy represents the ratio of correctly classified cases to the total number of cases:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
where true positives (TPs) represent the number of cases correctly classified into the class they belong to, and true negatives (TNs) represent the number of cases correctly classified as not corresponding to a class. False positives (FPs) represent the number of cases incorrectly classified into a class, and false negatives (FNs) represent the number of cases incorrectly classified as not belonging to a class.
Precision measures the ratio of positive, correctly predicted cases to the total number of positive classification predictions [43]:
\text{Precision} = \frac{TP}{TP + FP}
Recall measures the ratio of the actual positive, correctly predicted cases:
\text{Recall} = \frac{TP}{TP + FN}
The F1-score is the harmonic mean of precision and recall [41]:
F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
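For reference, all four metrics can be computed per class directly from a confusion matrix; the matrix below is a hypothetical example that simply mirrors the counts discussed in Section 4.4.

```python
import numpy as np

def per_class_metrics(cm):
    """Compute precision, recall, F1 per class, and overall accuracy from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as the class but actually another class
    fn = cm.sum(axis=1) - tp          # belong to the class but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()    # overall accuracy
    return precision, recall, f1, accuracy

# Hypothetical 4-class confusion matrix (COVID-19, Healthy, Pneumonia, TB)
cm = np.array([[99, 0, 1, 0],
               [0, 100, 0, 0],
               [1, 0, 99, 0],
               [0, 0, 0, 100]])
print(per_class_metrics(cm))
```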

4.4. Classification Results of the Proposed Model

The performance of the proposed model is shown in Table 2. The model achieves an overall accuracy of 99.5% on the multi-class classification task when evaluated on the test set, with an individual class accuracy of 99.0% for the COVID-19 class, 100.0% for the healthy samples, 99.0% for the Pneumonia class, and 100.0% for the TB class. The precision, recall, and F1-score for each class range from 0.99 to 1.00. The high performance across all the classes demonstrates the effectiveness of the proposed model in accurately classifying various types of abnormalities in medical imaging. In Figure 4, the confusion matrix provides a detailed view of the classification performance of the proposed model across the four classes. For the COVID-19 and Pneumonia classes, 99 out of 100 samples were correctly classified, with only 1 sample being misclassified for each. This slight misclassification may be caused by the overlapping features between the COVID-19 and Pneumonia chest radiographs, potentially due to cases of simultaneous COVID-19 and Pneumonia infection. The healthy and TB classes show perfect classification, with 100 correct predictions for each. The ability to correctly classify most class samples indicates that the model effectively captures the unique patterns associated with these diseases.

4.5. Ablation Studies

Table 3 presents the results of our investigation on the impact of various data augmentation techniques on the performance of the model. In our experiment, we observed that CutMix [44] and RandAugment [45] are effective in improving the model’s performance, achieving an accuracy of 99.50%. As illustrated in Figure 5, CutMix involves combining patches from two images and mixing their labels proportionally to the area of the patches, which enables the model to focus on the less-discriminative parts of an object. RandAugment, on the other hand, applies a fixed number of randomly chosen transformations with adjustable magnitudes. The model without any data augmentation achieves an accuracy of 98.25%.
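A minimal NumPy sketch of CutMix on a single image pair (assuming one-hot labels); the Beta parameter and box placement follow the original CutMix formulation rather than any setting reported here.

```python
import numpy as np

def cutmix(image_a, image_b, label_a, label_b, alpha=1.0, seed=None):
    """Paste a random patch of image_b into image_a and mix the one-hot labels
    in proportion to the area kept from each image."""
    rng = np.random.default_rng(seed)
    h, w = image_a.shape[:2]
    lam = rng.beta(alpha, alpha)
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = image_a.copy()
    mixed[y1:y2, x1:x2] = image_b[y1:y2, x1:x2]
    lam_adjusted = 1 - (y2 - y1) * (x2 - x1) / (h * w)    # actual fraction kept from image_a
    mixed_label = lam_adjusted * label_a + (1 - lam_adjusted) * label_b
    return mixed, mixed_label

img_a, img_b = np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)
lbl_a, lbl_b = np.eye(4)[0], np.eye(4)[2]                 # one-hot labels for 4 classes
mixed_img, mixed_lbl = cutmix(img_a, img_b, lbl_a, lbl_b)
print(mixed_lbl)
```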
To further investigate the effectiveness of the proposed components, we analyzed the impact of the gated mechanism and the multi-scale fusion module on the overall performance of the model. Table 4 presents the results of this ablation study. When the gated mechanism was removed from the model, the accuracy dropped to 99.25%, indicating that the gated mechanism plays a crucial role in enabling the model to effectively control and select important features extracted from the CNN and ViT components. Similarly, removing the multi-scale fusion module resulted in a decrease in accuracy to 99.00%, indicating the importance of the multi-scale fusion module in enabling the model to capture features at multiple scales and improve the overall classification performance. Lastly, when both the gated mechanism and the multi-scale fusion module were added, the accuracy was 99.50%, indicating the positive effect of these proposed components in enhancing the model’s performance. Figure 6 illustrates the effectiveness of the proposed model using LIME [46] explainability analysis. The highlighted regions in the images (outlined in yellow) correspond to the most important areas identified by the model for classification. The figure showcases results across various samples, with the clear localization of critical regions relevant to the classification task. These visualizations confirm that the model successfully focuses on pertinent features, such as abnormal regions in chest X-rays, reinforcing the significance of the gated mechanism and the multi-scale fusion module in driving the attention towards diagnostically relevant areas.
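The LIME visualizations can be reproduced along these lines with the lime package; model and image are placeholders for the trained classifier and a preprocessed chest X-ray, and the superpixel and sample counts are illustrative.

```python
from lime import lime_image
from skimage.segmentation import mark_boundaries

def explain_prediction(model, image):
    """Return the image with LIME-highlighted regions for the top predicted class."""
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image.astype("double"),
        classifier_fn=model.predict,      # must return class probabilities for a batch of images
        top_labels=4, hide_color=0, num_samples=1000)
    # Outline the superpixels that contributed most to the top predicted class
    temp, mask = explanation.get_image_and_mask(
        explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
    return mark_boundaries(temp, mask)
```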
Next, we examined the impact of increasing the number of component blocks in our model architecture, as shown in Table 5. This experiment clearly demonstrates that as the number of component blocks increases, the accuracy improves significantly. Starting with a single component block (CNN model: EfficientNetB3 [41]), the model achieves an accuracy of 96.50%. With two component blocks (CNN models: EfficientNetB3 and DenseNet-121 [42]), the classification accuracy increases to 99.50%. With three component blocks (CNN models: EfficientNetB3, DenseNet121, and MobileNet [47]), the accuracy slightly increases to 99.55%. This indicates that the inclusion of additional component blocks allows for the model to capture more diverse and complex features. The key contributor to performance improvement as the number of component blocks increases is the use of different CNN models in each component block. By employing a variety of CNN architectures, each block can learn distinct representations from the input data. This approach introduces complementary features from different CNNs, enhancing the model’s overall capacity to generalize well across various data patterns.

4.6. Comparison with Existing Models

The proposed CNN-ViT model achieves a classification accuracy of 99.50%, outperforming the existing hybrid approaches listed in Table 6. This performance improvement can be attributed to the integration of gated attention mechanisms and multi-scale fusion, which enable the effective combination of local and global features, addressing challenges like long-range dependencies and contextual learning. Despite its high performance, our model has inherent limitations. The reliance on extensively labeled datasets may limit its generalizability, particularly in scenarios with data scarcity or varying annotation standards. These limitations underscore the need for future work exploring self-supervised learning (SSL) to reduce the dependency on labeled data, while maintaining accuracy.

5. Conclusions

The results of this study demonstrate that the proposed CNN-ViT hybrid model, incorporating a gated mechanism with attention and a multi-scale fusion module, outperforms the state-of-the-art studies in pulmonary disease classification. Achieving an accuracy of 99.50%, the model’s superior performance is attributed to the seamless integration of CNN and ViT encoder layers, which balances local feature extraction with long-range dependency capture. This hybrid approach addresses the individual limitations of CNN and ViT architectures by effectively combining the CNN’s strength in capturing local features with the ViT’s capacity for global feature representation. The success of this model suggests that a hybrid CNN-ViT architecture with enhanced feature selection can serve as a robust solution for complex, multi-class medical image classification tasks.
The proposed model uses a different CNN architecture in different component blocks, such as EfficientNetB3 and DenseNet-121, for feature extraction. EfficientNetB3 captures the features efficiently with fewer parameters, while DenseNet-121 focuses on feature reuse through densely connected layers, enhancing representation learning. Feeding the output of these CNN architectures to the ViT allows for the model to leverage various feature maps from different CNN architectures and utilize the strength of the ViT to capture long-range dependencies across these features to learn the subtle disease patterns in medical images.
The broader impacts of this research can be extended to real-world applications, particularly in resource-constrained clinical settings. The model’s efficiency and high accuracy make it suitable for the rapid diagnosis of pulmonary diseases, which is critical for early detection and treatment in underserved areas. Additionally, this framework can be adapted for other medical imaging tasks, providing a generalized approach for disease diagnosis.
To ensure repeatability of the proposed method, we conducted all the experiments multiple times under consistent conditions, including random seed initialization, fixed dataset splits, and the hyperparameter configurations. The results reported represent the average performance of our model across different runs, ensuring the reproducibility and robustness of the findings.
A significant challenge to this research is the availability of labeled medical imaging data, as annotating large datasets is labor-intensive and requires domain expertise. We plan to address this limitation by leveraging SSL techniques, such as SimCLR [51] (Simple Contrastive Learning of Representation) and BYOL [52] (Bootstrap Your Own Latent). We will implement the SSL technique using a teacher–student concept, where a teacher model performs a pre-text task to extract meaningful features, such as predicting the image rotations, the patch orders, or colorization, and a student model learns from these features for downstream tasks like classification. By using SSL, we aim to reduce the dependency on labeled data, while enabling the model to achieve competitive accuracy.
Furthermore, reducing model complexity remains a priority, as attention mechanisms like MHA add considerable computational demands. In future work, exploring attention mechanisms that lower the computational load, such as Linformer [53], Performer [54], or ProbSparse [55] attention, could streamline the architecture without sacrificing the model’s performance. Additionally, we observed that the model tends to confuse some images between the Pneumonia and COVID-19 classes. This may be due to the fact that COVID-19 patients are often complicated with Pneumonia [56], leading to overlapping features in chest X-ray images. The datasets used for training and evaluation were sourced from different repositories, increasing the likelihood that some images labeled as COVID-19 might also exhibit signs of Pneumonia. Refining the model to address these misclassifications will also involve curating more refined datasets that emphasize subtle distinctions between these classes or applying transfer learning with domain-specific datasets to enhance the model’s ability to distinguish overlapping visual features. Such improvements would enhance the model’s practicality in clinical settings, advancing its applicability to a broader range of diagnostic challenges.
Lastly, despite the strong performance of the proposed model, one limitation is that it was not calibrated during this study. Model calibration is crucial, particularly in clinical applications, as it ensures that the predicted probabilities reflect the true likelihood of correctness. Calibration addresses the issue of overconfidence, where the model's high confidence in a prediction does not always correspond to a correct prediction. In future work, we aim to incorporate temperature scaling, a post hoc calibration technique that aligns the predicted probabilities with the actual accuracy, to bolster the reliability of the model's predictions and strengthen its suitability for practical deployment in clinical settings.
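As a pointer for that future work, temperature scaling can be fitted on held-out predictions as sketched below; because the current model ends in a SoftMax layer, the logits would need to be taken from the layer before it, and the data here are random placeholders.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Find the temperature T minimizing negative log-likelihood on a validation set.

    logits: (n_samples, n_classes) pre-softmax outputs; labels: integer class indices."""
    def nll(t):
        scaled = logits / t
        scaled = scaled - scaled.max(axis=1, keepdims=True)            # numerical stability
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        return -np.mean(log_probs[np.arange(len(labels)), labels])
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

# Placeholder validation logits/labels; in practice these come from the held-out set
val_logits = np.random.randn(200, 4) * 3.0
val_labels = np.random.randint(0, 4, size=200)
T = fit_temperature(val_logits, val_labels)
calibrated = np.exp(val_logits / T) / np.exp(val_logits / T).sum(axis=1, keepdims=True)
```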

Author Contributions

Conceptualization, O.C. and X.Y.; methodology, O.C. and X.Y.; software, O.C.; validation, O.C. and X.Y.; formal analysis, O.C.; investigation, O.C. and X.Y.; resources, X.Y.; data curation, O.C.; writing—original draft preparation, O.C.; writing—review and editing, O.C. and X.Y.; visualization, O.C. and X.Y.; supervision, X.Y.; project administration, X.Y.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by No. 202400490001 of Handong Global University Research Grants and the National Research Foundation (NRF), Korea, under project BK21 FOUR (No.5199990314060).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. World Health Organization [WHO]. Global Tuberculosis Report 2022; WHO Press, World Health Organization: Geneva, Switzerland, 2022; Available online: https://www.who.int/publications/i/item/9789240061729 (accessed on 3 November 2024).
  2. Showkatian, E.; Salehi, M.; Ghaffari, H.; Reiazi, R.; Sadighi, N. Deep learning-based automatic detection of tuberculosis disease in chest X-ray images. Pol. J. Radiol. 2022, 87, 118–124. [Google Scholar] [CrossRef] [PubMed]
  3. World Health Organization [WHO]. Chest Radiography in Tuberculosis Detection: Summary of Current WHO Recommendations and Guidance on Programmatic Approaches; WHO Press, World Health Organization: Geneva, Switzerland, 2016; Available online: https://www.who.int/publications/i/item/9789241511506 (accessed on 3 November 2024).
  4. Acharya, V.; Dhiman, G.; Prakasha, K.; Bahadur, P.; Choraria, A.; M, S.; J, S.; Prabhu, S.; Chadaga, K.; Viriyasitavat, W.; et al. AI-assisted tuberculosis detection and classification from chest X-rays using a deep learning normalization-free network model. Comput. Intell. Neurosci. 2022, 2022, 2399428. [Google Scholar] [CrossRef] [PubMed]
  5. Kotei, E.; Thirunavukarasu, R. Ensemble technique coupled with deep transfer learning framework for automatic detection of tuberculosis from chest X-ray radiographs. Healthcare 2022, 10, 2335. [Google Scholar] [CrossRef] [PubMed]
  6. Alshmrani, G.M.M.; Ni, Q.; Jiang, R.; Pervaiz, H.; Elshennawy, N.M. A deep learning architecture for multi-class lung diseases classification using chest X-ray (CXR) images. Alex. Eng. J. 2022, 64, 923–935. [Google Scholar] [CrossRef]
  7. Lin, A.; Chen, B.; Xu, J.; Zheng, Z.; Lu, G. DS-TransUNET: Dual SWIN Transformer U-Net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15. [Google Scholar] [CrossRef]
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  9. Khan, A.; Rauf, Z.; Sohail, A.; Rehman, A.; Asif, H.M.; Asif, A.; Farooq, U. A survey of the vision transformers and its CNN-transformer based variants. arXiv 2023, arXiv:2305.09880. [Google Scholar]
  10. Khan, A.; Rauf, Z.; Khan, A.R.; Rathore, S.; Khan, S.H.; Shah, S.; Farooq, U.; Asif, H.; Asif, A.; Zahoora, U.; et al. A recent survey of vision transformers for medical image segmentation. arXiv 2023, arXiv:2312.00634. [Google Scholar]
  11. Yunusa, H.; Qin, S.; Chukkol, A.H.A.; Yusuf, A.A.; Bello, I.; Lawan, A. Exploring the Synergies of Hybrid CNNs and ViTs Architectures for Computer Vision: A survey. arXiv 2024, arXiv:2402.02941. [Google Scholar]
  12. Mousavi, M.; Hosseini, S. A deep convolutional neural network approach using medical image classification. BMC Med. Inform. Decis. Mak. 2024, 24, 239. [Google Scholar] [CrossRef]
  13. Rajaraman, S.; Antani, S. Modality-specific deep learning model ensembles toward improving TB detection in chest radiographs. IEEE Access 2020, 8, 27318–27326. [Google Scholar] [CrossRef] [PubMed]
  14. Vinayakumar, R.; Acharya, V.; Alazab, M. A multichannel EfficientNet deep learning-based stacking ensemble approach for lung disease detection using chest X-ray images. Clust. Comput. 2022, 26, 1181–1203. [Google Scholar]
  15. Sasikaladevi, N.; Revathi, A. Deep learning framework for the robust prognosis of tuberculosis from radiography images based on fused linear triangular interpolation. Res. Sq. 2022. [Google Scholar] [CrossRef]
  16. Urooj, S.; Suchitra, S.; Krishnasamy, L.; Sharma, N.; Pathak, N. Stochastic learning-based artificial neural network model for an automatic tuberculosis detection system using chest X-ray images. IEEE Access 2022, 10, 103632–103643. [Google Scholar] [CrossRef]
  17. Mabrouk, A.; Redondo, R.P.D.; Dahou, A.; Elaziz, M.A.; Kayed, M. Pneumonia detection on chest X-ray images using ensemble of deep convolutional neural networks. Appl. Sci. 2022, 12, 6448. [Google Scholar] [CrossRef]
  18. Sun, W.; Pang, Y.; Zhang, G. CCT: Lightweight Compact Convolutional Transformer for lung disease CT image classification. Front. Physiol. 2022, 13, 1066999. [Google Scholar] [CrossRef]
  19. Ukwuoma, C.C.; Qin, Z.; Heyat, M.B.B.; Akhtar, F.; Smahi, A.; Jackson, J.; Qadri, S.F.; Muaad, A.Y.; Monday, H.N.; Nneji, G.U. Automated lung-related pneumonia and COVID-19 detection based on novel feature extraction framework and vision transformer approaches using chest X-ray images. Bioengineering 2022, 9, 709. [Google Scholar] [CrossRef]
  20. Ren, K.; Hong, G.; Chen, X.; Wang, Z. A COVID-19 medical image classification algorithm based on transformer. Sci. Rep. 2023, 13, 5359. [Google Scholar] [CrossRef]
  21. Gu, A.; Gulcehre, C.; Paine, T.; Hoffman, M.; Pascanu, R. Improving the gating mechanism of recurrent neural networks. arXiv 2019, arXiv:1910.09890. [Google Scholar]
  22. Zhang, K.; Xu, G.; Jin, Y.K.; Qi, G.; Yang, X.; Bai, L. Palmprint recognition based on gating mechanism and adaptive feature fusion. Front. Neurorobotics 2023, 17, 1203962. [Google Scholar] [CrossRef]
  23. Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207. [Google Scholar] [CrossRef] [PubMed]
  24. Fang, W.; Han, X. Spatial and channel attention modulated network for medical image segmentation. In Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, Singapore, 20–23 May 2021; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2021; pp. 3–17. [Google Scholar]
  25. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2021; pp. 36–46. [Google Scholar]
  26. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  27. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
  28. Matlock, M.K.; Datta, A.; Dang, N.L.; Jiang, K.; Swamidass, S.J. Deep learning long-range information in undirected graphs with wave networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
  29. Dong, X.; Long, C.; Xu, W.; Xiao, C. Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021. [Google Scholar]
  30. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar]
  31. Punn, N.S.; Agarwal, S. Inception U-Net architecture for semantic segmentation to identify nuclei in microscopy cell images. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1551–6857. [Google Scholar] [CrossRef]
  32. Zhang, H.; Ma, J. Hartley Spectral pooling for deep learning. arXiv 2018, arXiv:1810.04028. [Google Scholar] [CrossRef]
  33. Liu, Y.; Wu, Y.; Ban, Y.; Wang, H.; Cheng, M. Rethinking computer-aided tuberculosis diagnosis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2646–2655. [Google Scholar]
  34. Jaeger, S.; Candemir, S.; Antani, S.; Wang, Y.J.; Lu, P.; Thoma, G.R. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 2014, 4, 475–477. [Google Scholar]
  35. Rahman, T.; Khandakar, A.; Kadir, M.A.; Islam, K.R.; Islam, K.F.; Mazhar, R.; Hamid, T.; Islam, M.T.; Kashem, S.; Mahbub, Z.B.; et al. Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 2020, 8, 191586–191601. [Google Scholar] [CrossRef]
  36. Chauhan, A.; Chauhan, D.; Rout, C. Role of Gist and PHOG Features in computer-aided diagnosis of tuberculosis without segmentation. PLoS ONE 2014, 9, e112980. [Google Scholar] [CrossRef]
  37. Wang, L.; Lin, Z.Q.; Wong, A. COVID-Net: A Tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 2020, 10, 19549. [Google Scholar] [CrossRef]
  38. El-Shafai, W.; El-Samie, F.A. Extensive COVID-19 X-Ray and CT Chest Images Dataset (Version V3, Vol. 3) [Dataset]. Mendeley Data. 2020. Available online: https://data.mendeley.com/datasets/8h65ywd2jr/3 (accessed on 5 November 2024).
  39. Kermany, D.; Goldbaum, M.H.; Cai, W.; Valentim, C.C.S.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 2018, 172, 1122–1131.e9. [Google Scholar] [CrossRef] [PubMed]
  40. Stein, A.; Wu, C.; Carr, C.; Shih, G.; Dulkowski, J.; Chen, L.; Prevedello, L.; Kohli, M.; McDonald, M.; Kalpathy, P.; et al. RSNA pneumonia detection challenge [Dataset]. Kaggle. 2018. Available online: https://kaggle.com/competitions/rsna-pneumonia-detection-challenge (accessed on 5 November 2024).
  41. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  42. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. arXiv 2016, arXiv:1608.06993. [Google Scholar]
  43. Margarat, G.S.; Hemalatha, G.; Mishra, A.; Shaheen, H.; Maheswari, K.; Tamijeselvan, S.; Kumar, U.P.; Banupriya, V.; Ferede, A.W. Early diagnosis of tuberculosis using deep learning approach for IoT-based healthcare applications. Comput. Intell. Neurosci. 2022, 303, 1–9. [Google Scholar] [CrossRef]
  44. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. arXiv 2019, arXiv:1905.04899. [Google Scholar]
  45. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical automated data augmentation with a reduced search space. arXiv 2019, arXiv:1909.13719. [Google Scholar]
  46. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. arXiv 2016, arXiv:1602.04938. [Google Scholar]
  47. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  48. Barhoumi, Y.; Rasool, G. ScopeFormer: N-CNN-VIT Hybrid Model for Intracranial Hemorrhage Classification. arXiv 2021, arXiv:2107.04575. [Google Scholar]
  49. Chen, J.; Wu, P.; Zhang, X.; Xu, R.; Liang, J. Add-Vit: CNN-Transformer Hybrid Architecture for small data paradigm processing. Neural Process. Lett. 2024, 56, 198. [Google Scholar] [CrossRef]
  50. Shah, S.A.; Taj, I.; Usman, S.M.; Shah, S.N.H.; Imran, A.S.; Khalid, S. A hybrid approach of vision transformers and CNNs for detection of ulcerative colitis. Sci. Rep. 2024, 14, 24771. [Google Scholar] [CrossRef] [PubMed]
  51. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  52. Grill, J.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv 2020, arXiv:2006.07733. [Google Scholar]
  53. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  54. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
  55. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. arXiv 2020, arXiv:2012.07436. [Google Scholar] [CrossRef]
  56. Ufuk, F.; Savaş, R. COVID-19 pneumonia: Lessons learned, challenges, and preparing for the future. Diagn. Interv. Radiol. 2022, 28, 576–585. [Google Scholar] [CrossRef]
Figure 1. The proposed hybrid architecture.
Figure 2. Gated mechanism with attention.
Figure 3. Inception-styled multi-scale fusion module proposed in this study.
Figure 4. A confusion matrix for the proposed model.
Figure 5. Impact of different augmentation methods on original images.
Figure 6. Impact of gated mechanism and multi-scale fusion using LIME explainability analysis.
Table 1. Hyperparameter configurations of ViT encoder.
Hyperparameters | Value
Number of ViT encoder layers | 6
Hidden dimension | 32
Multi-layer perceptron dimension | 64
Number of multi-head self-attention blocks | 2
Dropout rate | 0.5
Patch size | 8
Image channels | 3
Image size | 224 × 224
Epochs | 51
Table 2. Classification results of proposed model.
Category | Precision | Recall | F1-Score | Accuracy | Overall Accuracy
COVID-19 | 0.99 | 0.99 | 0.99 | 99.0% | 99.5%
Healthy | 0.99 | 1.00 | 1.00 | 100.0% |
Pneumonia | 1.00 | 0.99 | 0.99 | 99.0% |
Tuberculosis | 1.00 | 1.00 | 1.00 | 100.0% |
Table 3. Impact of data augmentation on classification accuracy.
Data Augmentation | Classification Accuracy
CutMix [44] | 99.50%
RandAugment [45] | 99.50%
Without augmentation | 98.25%
Table 4. Impact of gated mechanism and multi-scale fusion module on classification accuracy.
Models | Classification Accuracy
Without the gated mechanism | 99.25%
Without the multi-scale fusion module | 99.00%
With both modules | 99.50%
Table 5. Impact of increasing number of component blocks on classification accuracy.
Models | Classification Accuracy
1 component block (CNN model: EfficientNetB3 [41]) | 96.50%
2 component blocks (CNN models: EfficientNetB3 and DenseNet-121 [42]) | 99.50%
3 component blocks (CNN models: EfficientNetB3, DenseNet-121, and MobileNet [47]) | 99.55%
Table 6. Comparison with existing hybrid models on multi-class classification tasks.
Studies | Classification Accuracy
Barhoumi and Rasool [48] | 98.04%
Chen et al. [49] | 96.60%
Shah et al. [50] | 90.00%
The proposed model | 99.50%
