Comparison Between Early and Intermediate Fusion of Multimodal Techniques: Lung Disease Diagnosis
Abstract
1. Introduction
2. Background
2.1. Medical Part
2.1.1. Medical Data Collection
- Image: Medical imaging techniques, including computed tomography (CT), X-rays, magnetic resonance imaging (MRI), and digital pathology, provide visual depictions of internal structures and abnormalities. These images are essential components of a diagnosis, revealing the fine details required for detecting and describing various medical disorders [11].
- Text: Textual data, including clinical notes, medical literature, and electronic health records, provide a narrative thread that intertwines the patient’s medical history, medical journey, and contextual information that are essential for an accurate diagnosis [11].
- Speech: Speech recordings offer a special way to comprehend the symptoms and feelings of patients. This modality gives the diagnostic process a qualitative component by capturing subtleties such as tone, pace, and articulation [11].
- Genetic data: A molecular layer is introduced by genetic data to clarify the innate predispositions, vulnerabilities, and genetic markers that may affect the manifestation of diseases [11].
- Physiological signals: Real-time snapshots of cardiac and brain activity are provided by signal data. This dynamic modality provides important insights into the anomalies and patterns linked to neurological or cardiac disorders by efficiently capturing temporal variations [11].
2.1.2. Lung Disease Symptoms
- COVID-19: Coronaviruses (such as the virus that causes COVID-19) are enveloped, positive-sense, nonsegmented, single-stranded ribonucleic acid (RNA) viruses belonging to the Coronaviridae family. Under an electron microscope, the viruses have a distinctive shape, with viral spike peplomers protruding from the viral envelope in a crown-like pattern. Coronaviruses are widely spread among humans and mammals [12]. COVID-19’s most common symptoms include cough, fever, fatigue, dyspnea, and myalgia; headaches, hemoptysis, sputum, and gastrointestinal symptoms are less common [13]. In chest imaging, patients with COVID-19 exhibit typical radiological features, such as bilateral and multifocal ground-glass opacities and consolidations with peripheral and basal predominance. Less frequently observed features include cavitation, pleural effusion, bronchiectasis, lymphadenopathy, and septal thickening [14].
- Pneumonia: Pneumonia is an infection that causes inflammation of the alveoli (air sacs) in one or both lungs, resulting in fluid or pus buildup that hinders oxygen exchange [15]. Symptoms include fever, chills, a persistent cough (often producing phlegm), fatigue, weakness, and chest pain that worsens when coughing or breathing deeply. In chest imaging, pneumonia usually appears as patches of lung consolidation that resemble dense or patchy opacities. These opacities may be localized (lobar pneumonia) or spread over both lungs (bronchopneumonia). Air bronchograms (visible air-filled bronchi against a backdrop of consolidated lung tissue), pleural effusion (fluid around the lungs), and blurring of the normal lung and diaphragm boundaries are other possible findings [16].
- Lung cancer: Lung cancer (LC) is a cancer that starts in the cells of the lungs that exchange oxygen and carbon dioxide during breathing. It occurs when abnormal lung cells grow uncontrollably, forming tumors that can impair lung function [17]. Symptoms can include a persistent cough; chest pain; shortness of breath; hemoptysis (coughing up blood); and unexplained weight loss, fatigue, and recurring infections. Chest imaging may also show pleural effusion, enlarged lymph nodes, abnormal shadows, lung masses or nodules, and regions of lung collapse (atelectasis) [18].
2.2. Technical Aspects
2.2.1. Multimodal Definition
2.2.2. Multimodal Data Fusion
- Early fusion: Early fusion concatenates input data or features from heterogeneous sources before a deep learning network processes them. Early fusion multimodal techniques tend to outperform single-modal deep learning methods. However, early fusion is only appropriate when the features are homogeneous across modalities; its performance becomes unreliable when the modalities, such as text and images, are heterogeneous. The best choices for direct modeling in early fusion are the convolutional neural network (CNN), the recurrent neural network (RNN), and the fully connected neural network (FCN) [4].
- Intermediate fusion: Intermediate fusion uses various mapping techniques in a fusion or shared representation layer to convert input data from different sources into higher-level representations. Intermediate fusion enables the model to learn mappings between diverse data, which is beneficial for the decision-making process [4].
- Late fusion: In late fusion, a separate model is trained independently for each modality. The outcomes of the models are then combined by averaging the softmax probabilities produced by each model, under the assumption that the separate models contribute equally. Late fusion strategies can handle heterogeneous modalities since an imbalance in the number of features across modalities has no impact on the decision. However, some latent relationships may remain hidden because late fusion may not learn the correlations between features of different modalities. Thus, late fusion can be helpful in situations where the correlation between modalities is low (a minimal sketch of this strategy follows this list) [4].
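To make the contrast concrete, the following minimal Python sketch illustrates late fusion as described above: two single-modality models are assumed to have been trained independently, and their predicted class probabilities are averaged with equal weight. The arrays and the four-class setup are illustrative assumptions, not outputs of the models developed later in this paper.

```python
import numpy as np

def late_fusion_predict(prob_a: np.ndarray, prob_b: np.ndarray) -> np.ndarray:
    """Late fusion: average the class-probability outputs of two
    independently trained single-modality models (equal weighting,
    as the strategy described above assumes)."""
    assert prob_a.shape == prob_b.shape
    return (prob_a + prob_b) / 2.0

# Hypothetical softmax outputs of two single-modality models
# for one sample over four classes.
p_image = np.array([[0.70, 0.10, 0.10, 0.10]])
p_audio = np.array([[0.40, 0.30, 0.20, 0.10]])

fused = late_fusion_predict(p_image, p_audio)
print(fused.argmax(axis=1))  # index of the predicted class
```

Note that no cross-modal correlation is learned here: each model sees only its own modality, which is exactly why late fusion can miss latent relationships between modalities.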
3. Literature Review
3.1. Literature Selection Methodology
3.1.1. Keyword Filtering Stage
3.1.2. Publisher Filtering Stage
3.1.3. Year Filtering Stage
3.1.4. Abstract Filtering Stage
3.2. Literature Selection Taxonomy
3.2.1. Single-Modal
- 1. DL Model for the Diagnosis of Lung Diseases using CXR Images: The research of [21] provides a grid search-optimized deep learning framework for multi-class lung disease classification. The lung diseases included COVID-19, pneumonia (PNEU), tuberculosis (TB), lung cancer (LC), and lung opacity (LO), which were diagnosed using chest X-ray (CXR) images. It presents two models: one for lung disease detection and the other for disease classification. Using a dataset of about 98,991 images, the models achieved accuracies of 99.82% and 98.75% in the detection and classification tasks, respectively. However, the study’s dependence on hyperparameter tuning and dataset balance raises concerns regarding its generalizability to diverse real-world clinical data.
- 2. DL Model for the Diagnosis of Lung Diseases using CT Scan Images: The research of [23] uses a deep learning system to tackle the problem of detecting pneumonia and COVID-19 through CT imaging. The approach preprocesses CT scans with Contrast Limited Adaptive Histogram Equalization to improve contrast and then segments the lungs and infection areas using a U-Net model. For classification, a three-layered CNN architecture was used with fourfold cross-validation. The database contained 3138 annotated images from 20 CT scan cases. According to the experimental results, lung segmentation achieved a score of 98%, infection segmentation achieved a score of 91%, and the classification accuracy was 98%. The article’s main limitation was its small dataset size.
- 3. DL Model for the Diagnosis of Lung Diseases using Cough Sounds: The study of [24] explores the detection of COVID-19 using convolutional neural networks (CNNs), converting cough sounds into mel-spectrogram images. Using 121 cough audio recordings from the Virufy dataset, six CNN models (VGG-16, VGG-19, LeNet-5, AlexNet, ResNet-50, and ResNet-152) were tested with different input sizes. The best classification performance was demonstrated by the AlexNet model, which achieved the highest accuracy of 86.5% at an input size of 227 × 227. One of the limitations was the dependence on a limited dataset.
3.2.2. Multi-Modal
- 1. DL Model for the Diagnosis of Lung Diseases using Different Medical Data: A novel transformer-based deep learning model was proposed in the article of [26], which diagnosed tuberculosis (TB) by combining imaging and clinical data. The issue addressed was the necessity of employing multimodal data for early and precise TB identification, since depending only on imaging or clinical data might not be enough. A convolutional neural network (CNN) was used to extract visual features, while a denoising autoencoder was used to analyze the clinical data and extend the feature dimensions. A cross-modal transformer module was then used for fusion, creating a single feature representation. The dataset included 3256 patients’ X-ray images and 3558 matched clinical records, which were gathered at a government hospital in Uttarakhand, India. According to the experimental results, the model outperforms conventional fusion procedures with a classification accuracy of 95%. The comparatively limited number of clinical features is one of the study’s limitations.
3.3. Discussion
- Most studies use the intermediate fusion strategy since it achieves high accuracy on different input formats. Our challenge, and the contribution of this research, was to apply and compare both the intermediate and early fusion strategies to achieve high accuracy.
- To the best of our knowledge, most researchers use a multimodal approach with only two types of medical data. Our study will use three types of medical data.
- To the best of our knowledge, only a few studies have implemented and compared a single-modal approach for each medical data type. Our research will implement a single-modal approach for each data type and compare the results with the proposed approach to illustrate the impact of combining diverse data types. These results will also be compared with other works.
- Some studies use a small dataset. Our work will address this limitation by using multiple datasets for each data type.
- Some studies have issues with data imbalance. This work will handle class imbalance using augmentation techniques.
- Most researchers use convolutional neural network (CNN) architectures, which are common and powerful classification architectures in deep learning. This study also used a CNN architecture to classify lung diseases.
4. Research Methodology
4.1. Dataset Collection
4.1.1. Lung Disease CXR Datasets
4.1.2. Lung Disease CT Scan Datasets
4.1.3. Lung Disease Cough Sound Datasets
4.2. Data Pre-Processing
4.2.1. Converting Cough Sounds into an Image Using a Mel-Spectrogram
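As a rough illustration of this step, the sketch below uses the librosa library to turn a cough recording into a mel-spectrogram image. The file name, sampling rate, and number of mel bands are assumptions, since the section does not fix them.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a cough recording (the path and sampling rate are placeholder assumptions).
y, sr = librosa.load("cough.wav", sr=22050)

# Compute a mel-scaled spectrogram and convert power to decibels,
# the usual representation fed to image-based CNNs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save the spectrogram as an image (CSI) for the downstream classifier.
fig, ax = plt.subplots()
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
ax.set_axis_off()
fig.savefig("cough_melspec.png", bbox_inches="tight", pad_inches=0)
```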
4.2.2. Handling an Imbalanced Class Dataset
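Since the discussion in Section 3.3 states that class imbalance is handled with augmentation techniques, the following Keras-based sketch shows one plausible way to oversample a minority class with mild random transforms. The specific transform ranges are assumptions, not values stated in the text.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation pipeline: mild geometric transforms that
# preserve diagnostic content in chest images and spectrograms.
augmenter = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.1,
    horizontal_flip=False,  # flips can be anatomically misleading for CXR/CT
)

def oversample_class(images: np.ndarray, target_count: int) -> np.ndarray:
    """Augment a minority-class image array (samples, H, W, C)
    until it reaches target_count samples."""
    out = list(images)
    flow = augmenter.flow(images, batch_size=1, shuffle=True)
    while len(out) < target_count:
        out.append(next(flow)[0])  # one augmented image per draw
    return np.stack(out)
```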
4.2.3. Image Enhancement
- The CXR, CT scans, and CSI were transformed into grayscale images.
- The CXR, CT scans, and CSI were cropped to remove any undesired items, text, and patient information.
- Any CSI that was entirely black or white was removed, as such images contain no useful information.
- Each image was resized to be compatible with the model.
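The four steps above map onto routine OpenCV operations; the sketch below is a minimal version, assuming 224 × 224 model inputs (consistent with the 196 patches of size 16 listed in the training parameter table) and a caller-supplied crop box, both of which are assumptions rather than values stated here.

```python
import cv2
import numpy as np

# 196 patches of size 16 in the parameter table imply 224 x 224 inputs (assumed).
TARGET_SIZE = (224, 224)

def enhance(path, crop_box):
    """Grayscale conversion, cropping, blank-image filtering, and resizing."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # step 1: grayscale
    x, y, w, h = crop_box                           # step 2: crop unwanted regions
    img = img[y:y + h, x:x + w]
    if img.std() < 1.0:                             # step 3: drop all-black/white CSIs
        return None
    return cv2.resize(img, TARGET_SIZE)             # step 4: resize for the model
```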
4.3. Proposed Model
4.3.1. Feature Extraction and Classification
- 1. Feature extraction: Xception (Extreme Inception) was proposed by François Chollet in 2016; the name reflects the fact that it expands and improves the Inception architecture. The Xception model improves performance and parameter efficiency by using depthwise separable convolutions instead of the standard convolutional operations used in the Inception model [38]. It was used in this study with fine-tuned hyperparameters, and we also modified some layers, such as adapting the Dense layers to fit the number of lung disease classes in this research and adding a Dropout layer between layers to avoid overfitting. The architecture of Xception is shown in Figure 9. The Xception model was chosen because it offers the following advantages [39]:
- Parameter efficiency—depthwise separable convolutions use far fewer parameters than traditional convolutions, making the model lighter and more efficient in parameter utilization.
- Performance—in a variety of visual recognition tasks (e.g., difficult images), the Xception model outperforms many advanced models of its time, including the original Inception model.
- Adaptability—the Xception model is more suited for deployment on devices with limited computational resources because it has fewer parameters.
- 2. Classification: Convolutional neural networks (CNNs) are a family of deep learning models designed specifically to analyze data with a grid-like topology, such as audio spectrograms (2D time–frequency grids) or images (2D pixel grids) [40]. CNNs have demonstrated state-of-the-art performance in many fields, such as image classification [41], object identification [42], and medical image analysis [1]. In this study, a CNN was employed as the classifier. The CNN’s convolutional and fully connected layers were used to train the model to assign the input data from each modality directly to the appropriate disease class. In particular, the network’s output features were passed through a fully connected layer with 256 units and ReLU activation. A dense output layer with four neurons and a softmax activation function was then used to predict the probability distribution over the four target classes (a sketch of this architecture follows this list).
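The following Keras sketch is one plausible realization of the single-modality pipeline just described: an Xception backbone with a Dropout layer, a 256-unit ReLU layer, and a four-way softmax head. The input size, pooling choice, and optimizer settings (dropout rate 0.1 and learning rate 0.0001, taken from the training parameter table) are assumptions where the text leaves them open.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam

def build_single_modal_model(num_classes: int = 4) -> Model:
    """Xception backbone plus the classifier head of Section 4.3.1."""
    inputs = Input(shape=(224, 224, 3))  # 3 input channels per the parameter table
    features = Xception(weights="imagenet", include_top=False)(inputs)
    x = GlobalAveragePooling2D()(features)
    x = Dropout(0.1)(x)                   # Dropout between layers (rate from the table)
    x = Dense(256, activation="relu")(x)  # fully connected layer with 256 units
    outputs = Dense(num_classes, activation="softmax")(x)  # four-way softmax
    return Model(inputs, outputs)

model = build_single_modal_model()
model.compile(optimizer=Adam(learning_rate=1e-4),  # learning rate from the table
              loss="categorical_crossentropy", metrics=["accuracy"])
```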
4.3.2. Multimodal Techniques
- 1. Multimodal approach using early fusion: We used an early fusion approach to combine several modalities within a single classifier. Raw inputs are preprocessed for each modality and, if needed, linearly projected to a common dimensionality $d$. The modality-specific vectors are produced as $h_A = W_A x_A$ and $h_B = W_B x_B$. They are then concatenated to create a joint representation $z = [h_A; h_B]$. The shared classifier head receives this fused vector and is made up of a fully connected layer with 256 units and ReLU activation, followed by a four-way softmax output layer that generates the class posterior distribution. This design enables the classifier to directly learn decision boundaries across the joint input space while keeping the fusion process simple and computationally efficient (e.g., plain concatenation without cross-attention); see the sketch following this list.
- 2. Multimodal approach using intermediate fusion: To produce modality-specific representations, we used an intermediate fusion technique in which each modality is first processed independently through its own encoder. Let $h_A = f_A(x_A)$ and $h_B = f_B(x_B)$ represent the intermediate features derived from Modalities A and B, respectively. Instead of directly concatenating the raw features, these intermediate representations are further transformed into a shared latent space via modality-specific mappings $\phi_A$ and $\phi_B$ and then fused as $z = [\phi_A(h_A); \phi_B(h_B)]$. Compared with early fusion, this method preserves modality-specific processing while enabling the model to learn cross-modal interactions at a deeper level (see the sketch following this list).
- 3. Tri-modal data construction: In this study, a tri-modal instance was constructed by associating chest X-ray images, CT scan images, and cough sound samples that shared the same disease label. Due to the absence of patient-level identifiers across the publicly available datasets, strict patient-level pairing across modalities was not feasible. Therefore, each tri-modal sample was formed through label-based alignment, where one sample from each modality corresponding to the same disease class was grouped to create a single training instance. This strategy enables the model to learn complementary representations across heterogeneous modalities while maintaining consistent diagnostic labels. The resulting framework captures cross-modal correlations at the label level rather than at the individual patient level. This design choice reflects a practical scenario in which different diagnostic tests may be independently available but still jointly contribute to clinical decision making.
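The sketch below illustrates, under stated assumptions, the two fusion designs of items 1 and 2: early fusion projects and concatenates the inputs before a shared head, while intermediate fusion fuses encoder outputs in a shared latent space. Two modalities are shown for brevity; the tri-modal case of item 3 extends the concatenation to three branches. The encoder interfaces and $d = 256$ are assumptions rather than values fixed in the text.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Concatenate, Dense, Flatten

D = 256           # assumed common projection dimensionality d
NUM_CLASSES = 4   # COVID-19, pneumonia, lung cancer, healthy

def early_fusion_model(shape_a, shape_b) -> Model:
    """Early fusion: linearly project each raw input to d (h = Wx),
    concatenate into z = [h_A; h_B], then apply the shared
    256-unit ReLU layer and four-way softmax head."""
    in_a, in_b = Input(shape=shape_a), Input(shape=shape_b)
    h_a = Dense(D)(Flatten()(in_a))   # h_A = W_A x_A
    h_b = Dense(D)(Flatten()(in_b))   # h_B = W_B x_B
    z = Concatenate()([h_a, h_b])     # z = [h_A; h_B]
    z = Dense(256, activation="relu")(z)
    out = Dense(NUM_CLASSES, activation="softmax")(z)
    return Model([in_a, in_b], out)

def intermediate_fusion_model(encoder_a: Model, encoder_b: Model) -> Model:
    """Intermediate fusion: modality-specific encoders yield h_A and h_B,
    which are mapped into a shared latent space (phi_A, phi_B) and
    fused there before the shared classifier head."""
    h_a = Dense(D, activation="relu")(encoder_a.output)  # phi_A(h_A)
    h_b = Dense(D, activation="relu")(encoder_b.output)  # phi_B(h_B)
    z = Concatenate()([h_a, h_b])                        # fusion in latent space
    z = Dense(256, activation="relu")(z)
    out = Dense(NUM_CLASSES, activation="softmax")(z)
    return Model([encoder_a.input, encoder_b.input], out)
```

The structural difference is visible in where the trainable projection sits: before any modality-specific processing in the early variant, and after each modality's own encoder in the intermediate variant.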
5. Results and Discussion
5.1. Experimentation Setup
5.1.1. Training Settings
5.1.2. Evaluation Metrics
- Accuracy (Acc) “gives the percentage of correctly classified cases, including both positive and negative ones”. The accuracy Equation (2) is as follows: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
- Sensitivity (Recall) “gives the proportion of positives that were correctly identified (true positive)”. The sensitivity Equation (3) is as follows: $\text{Sensitivity} = \frac{TP}{TP + FN}$.
- Specificity “gives the proportion of negatives that were correctly identified (true negative)”. The specificity Equation (4) is as follows: $\text{Specificity} = \frac{TN}{TN + FP}$.
- Precision gives the “percentage of samples predicted to be positive that are actually positive”. The precision Equation (5) is as follows: $\text{Precision} = \frac{TP}{TP + FP}$.
- F1-score (F1) gives “the harmonic mean of precision and recall”. The F1-score Equation (6) is as follows: $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.
- Area Under the Receiver Operating Characteristic Curve (AUC-ROC) “summarizes the ROC curve of true positives against false positives”. The AUC-ROC Equation (7) is as follows: $\text{AUC-ROC} = \int_{0}^{1} \text{TPR} \, d(\text{FPR})$, where $\text{TPR} = \frac{TP}{TP + FN}$ and $\text{FPR} = \frac{FP}{FP + TN}$.
- Area Under Precision–Recall Curve (AUC-PR) gives “a summarization of the curve of precision against recall”. The AUC-PR Equation (8) is as follows: $\text{AUC-PR} = \int_{0}^{1} \text{Precision} \, d(\text{Recall})$.
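For reference, all seven metrics above can be computed with scikit-learn as sketched below; specificity is derived from the confusion matrix since scikit-learn provides no direct function for it, and macro averaging over the four classes is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """y_true, y_pred: class indices; y_score: (n_samples, n_classes) probabilities."""
    cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    y_onehot = np.eye(y_score.shape[1])[y_true]  # indicator matrix for AUC-PR
    return {
        "accuracy": accuracy_score(y_true, y_pred),                    # Eq. (2)
        "recall": recall_score(y_true, y_pred, average="macro"),       # Eq. (3)
        "specificity": float(np.mean(tn / (tn + fp))),                 # Eq. (4), macro
        "precision": precision_score(y_true, y_pred, average="macro"), # Eq. (5)
        "f1": f1_score(y_true, y_pred, average="macro"),               # Eq. (6)
        "auc_roc": roc_auc_score(y_true, y_score, multi_class="ovr"),  # Eq. (7)
        "auc_pr": average_precision_score(y_onehot, y_score, average="macro"),  # Eq. (8)
    }
```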
5.1.3. Single Models as Baselines
5.2. Results
5.2.1. Multimodal Approach Using Early Fusion
5.2.2. Multimodal Approach Using Intermediate Fusion
5.2.3. Comparison Between the Proposed Models and Single Models
- Comparing multimodal approaches using early and intermediate fusion: The comparison between early and intermediate fusion demonstrates the clear advantage of intermediate integration in multimodal learning (see Table 9). While both approaches performed well, the intermediate fusion model outperformed the early fusion model with an overall accuracy of 98% versus 97%. This improvement was consistent across different evaluation metrics, including the F1-score, specificity, and AUC-based measurements, demonstrating intermediate fusion’s robustness in processing complicated multimodal inputs. By allowing each modality to develop modality-specific representations before integration, intermediate fusion was able to preserve complementary features and minimize the noise that frequently results from early feature concatenation, thereby contributing to the higher performance. This finding is corroborated by the ROC-AUC and PR-AUC curves, shown in Figure 17 and Figure 18, which indicate that the intermediate fusion model has a better precision–recall balance and sharper discrimination ability.
- Comparing multimodal approaches and single models (baselines): Along with the fusion techniques, both multimodal approaches were evaluated against three baseline single-modality models: the chest X-ray (CXR), the CT scan, and the cough sound models. The findings clearly show that multimodal integration performs significantly better than any single model, attaining higher generalization and predictive accuracy (see Table 10). This improvement demonstrates the power of merging disparate medical data, as complementary information from imaging and acoustic modalities can give an improved understanding of disease patterns. Notably, although the CXR and CT scan models performed well as standalone models, their combination in a multimodal framework enhanced performance even further, particularly in difficult cases.
5.3. Discussion
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef]
- World Health Organization (WHO). Available online: https://www.who.int (accessed on 25 January 2025).
- Dhivya, N.; Sharmila, P. Multimodal Feature and Transfer Learning in Deep Ensemble Model for Lung Disease Prediction. J. Data Acquis. Process 2023, 38, 271. [Google Scholar]
- Kumar, S.; Ivanova, O.; Melyokhin, A.; Tiwari, P. Deep-learning-enabled multimodal data fusion for lung disease classification. Inform. Med. Unlocked 2023, 42, 101367. [Google Scholar] [CrossRef]
- Behrad, F.; Abadeh, M.S. An overview of deep learning methods for multimodal medical data mining. Expert Syst. Appl. 2022, 200, 117006. [Google Scholar] [CrossRef]
- Yao, D.; Xu, Z.; Lin, Y.; Zhan, Y. Accurate and intelligent diagnosis of pediatric pneumonia using X-ray images and blood testing data. Front. Bioeng. Biotechnol. 2023, 11, 1058888. [Google Scholar] [CrossRef] [PubMed]
- Hayat, N.; Geras, K.J.; Shamout, F.E. MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images. In Proceedings of the Machine Learning for Healthcare Conference, Durham, NC, USA, 5–6 August 2022; pp. 479–503. [Google Scholar]
- Tolle, L.B. Challenges in the diagnosis and management of patients with fibrosing interstitial lung disease. Case Rep. Pulmonol. 2022, 2022, 9942432. [Google Scholar] [CrossRef]
- Khader, F.; Müller-Franzes, G.; Wang, T.; Han, T.; Tayebi Arasteh, S.; Haarburger, C.; Stegmaier, J.; Bressem, K.; Kuhl, C.; Nebelung, S.; et al. Multimodal deep learning for integrating chest radiographs and clinical parameters: A case for transformers. Radiology 2023, 309, e230806. [Google Scholar] [CrossRef]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
- Xu, X.; Li, J.; Zhu, Z.; Zhao, L.; Wang, H.; Song, C.; Chen, Y.; Zhao, Q.; Yang, J.; Pei, Y. A Comprehensive Review on Synergy of Multi-Modal Data and AI Technologies in Medical Diagnosis. Bioengineering 2024, 11, 219. [Google Scholar] [CrossRef] [PubMed]
- Kooraki, S.; Hosseiny, M.; Myers, L.; Gholamrezanezhad, A. Coronavirus (COVID-19) outbreak: What the department of radiology should know. J. Am. Coll. Radiol. 2020, 17, 447–451. [Google Scholar] [CrossRef]
- Huang, C.; Wang, Y.; Li, X.; Ren, L.; Zhao, J.; Hu, Y.; Zhang, L.; Fan, G.; Xu, J.; Gu, X.; et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395, 497–506. [Google Scholar] [CrossRef]
- Rousan, L.A.; Elobeid, E.; Karrar, M.; Khader, Y. Chest x-ray findings and temporal lung changes in patients with COVID-19 pneumonia. BMC Pulm. Med. 2020, 20, 245. [Google Scholar] [CrossRef] [PubMed]
- Mackenzie, G. The definition and classification of pneumonia. Pneumonia 2016, 8, 14. [Google Scholar] [CrossRef]
- Melbye, H.; Straume, B.; Aasebøs, U.; Dale, K. Diagnosis of pneumonia in adults in general practice relative importance of typical symptoms and abnormal chest signs evaluated against a radiographic reference standard. Scand. J. Prim. Health Care 1992, 10, 226–233. [Google Scholar] [CrossRef] [PubMed]
- Bradley, S.H.; Abraham, S.; Callister, M.E.; Grice, A.; Hamilton, W.T.; Lopez, R.R.; Shinkins, B.; Neal, R.D. Sensitivity of chest X-ray for detecting lung cancer in people presenting with symptoms: A systematic review. Br. J. Gen. Pract. 2019, 69, e827–e835. [Google Scholar] [CrossRef]
- Bradley, S.H.; Bhartia, B.S.; Callister, M.E.; Hamilton, W.T.; Hatton, N.L.F.; Kennedy, M.P.; Mounce, L.T.; Shinkins, B.; Wheatstone, P.; Neal, R.D. Chest X-ray sensitivity and lung cancer outcomes: A retrospective observational study. Br. J. Gen. Pract. 2021, 71, e862–e868. [Google Scholar] [CrossRef] [PubMed]
- Gao, J.; Li, P.; Chen, Z.; Zhang, J. A survey on deep learning for multimodal data fusion. Neural Comput. 2020, 32, 829–864. [Google Scholar] [CrossRef]
- Alloqmani, A.; Abushark, Y.B.; Khan, A.I.; Alsolami, F. Deep learning based anomaly detection in images: Insights, challenges and recommendations. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 205–215. [Google Scholar] [CrossRef]
- Ashwini, S.; Arunkumar, J.; Prabu, R.T.; Singh, N.H.; Singh, N.P. Diagnosis and multi-classification of lung diseases in CXR images using optimized deep convolutional neural network. Soft Comput. 2024, 28, 6219–6233. [Google Scholar] [CrossRef]
- Ullah, N.; Marzougui, M.; Ahmad, I.; Chelloug, S.A. DeepLungNet: An effective DL-based approach for lung disease classification using CRIs. Electronics 2023, 12, 1860. [Google Scholar] [CrossRef]
- Mahmoudi, R.; Benameur, N.; Mabrouk, R.; Mohammed, M.A.; Garcia-Zapirain, B.; Bedoui, M.H. A deep learning-based diagnosis system for COVID-19 detection and pneumonia screening using CT imaging. Appl. Sci. 2022, 12, 4825. [Google Scholar] [CrossRef]
- Nafiz, M.F.; Kartini, D.; Faisal, M.R.; Indriani, F.; Hamonangan, T. Automated Detection of COVID-19 Cough Sound using Mel-Spectrogram Images and Convolutional Neural Network. J. Ilm. Tek. Elektro Komput. Dan Inform. (JITEKI) 2023, 9, 535–548. [Google Scholar] [CrossRef]
- Islam, R.; Abdel-Raheem, E.; Tarique, M. A study of using cough sounds and deep neural networks for the early detection of COVID-19. Biomed. Eng. Adv. 2022, 3, 100025. [Google Scholar] [CrossRef]
- Kumar, S.; Sharma, S. An improved deep learning framework for multimodal medical data analysis. Big Data Cogn. Comput. 2024, 8, 125. [Google Scholar] [CrossRef]
- Nalluri, S.; Sasikala, R. Detection and Difference of Pneumonia from other Chest/Lung Disease using Multi-model Data: A Hybrid Classification Model. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 328–344. [Google Scholar]
- Chowdhury, M.E.; Rahman, T.; Khandakar, A.; Mazhar, R.; Kadir, M.A.; Mahbub, Z.B.; Islam, K.R.; Khan, M.S.; Iqbal, A.; Al Emadi, N.; et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access 2020, 8, 132665–132676. [Google Scholar] [CrossRef]
- Malik, H.; Anees, T. Chest Diseases Using Different Medical Imaging and Cough Sounds; Version 1; Mendeley Data: Multan, Pakistan, 2023. [Google Scholar] [CrossRef]
- Soares, E.; Angelov, P. SARS-COV-2 Ct-Scan Dataset; Kaggle: San Francisco, CA, USA, 2020. [Google Scholar] [CrossRef]
- Maftouni, M.; Law, A.C.C.; Shen, B.; Grado, Z.J.K.; Zhou, Y.; Yazdi, N.A. A robust ensemble-deep learning model for COVID-19 diagnosis based on an integrated CT scan images database. In Proceedings of the IIE Annual Conference, Online, 22–25 May 2021; Proceedings. Institute of Industrial and Systems Engineers (IISE): Peachtree Corners, GA, USA, 2021; pp. 632–637. [Google Scholar]
- Yan, J. COVID-19 and Common Pneumonia Chest CT Dataset; Version 1; Mendeley Data: Multan, Pakistan, 2020. [Google Scholar] [CrossRef]
- Maleki, N. CT-Scan Images; Version 1; Mendeley Data: Multan, Pakistan, 2020. [Google Scholar] [CrossRef]
- Alyasriy, H.; AL-Huseiny, M. The IQ-OTHNCCD Lung Cancer Dataset; Version 2; Mendeley Data: Multan, Pakistan, 2021. [Google Scholar] [CrossRef]
- Pahar, M.; Klopper, M.; Warren, R.; Niesler, T. COVID-19 cough classification using machine learning and global smartphone recordings. Comput. Biol. Med. 2021, 135, 104572. [Google Scholar] [CrossRef]
- Liao, S.; Song, C.; Wang, X.; Wang, Y. A classification framework for identifying bronchitis and pneumonia in children based on a small-scale cough sounds dataset. PLoS ONE 2022, 17, e0275479. [Google Scholar] [CrossRef]
- Thornton, B. Audio Recognition Using Mel Spectrograms and Convolution Neural Networks; Noiselab University of California: San Diego, CA, USA, 2019. [Google Scholar]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Li, S.; Qu, H.; Dong, X.; Dang, B.; Zang, H.; Gong, Y. Leveraging deep learning and Xception architecture for high-accuracy MRI classification in Alzheimer diagnosis. arXiv 2024, arXiv:2403.16212. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 10 April 2025).
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Zhao, H.; Yan, L.; Hou, Z.; Lin, J.; Zhao, Y.; Ji, Z.; Wang, Y. Error Analysis Strategy for Long-term Correlated Network Systems: Generalized Nonlinear Stochastic Processes and Dual-Layer Filtering Architecture. IEEE Internet Things J. 2025, 12, 33731–33745. [Google Scholar] [CrossRef]
| Medical Data | [Ref.] (Year) | Methodology | Lung Disease Types | Dataset (#: Sample Size) | Results (%: Performance of the Model Used) | Limitation |
|---|---|---|---|---|---|---|
| CXR Image | [21] (2024) | Convolutional neural networks (CNNs) | COVID-19, Pneumonia (PNEU), Tuberculosis (TB), Lung Cancer (LC), Lung Opacity (LO) | Many online resources: (98,991) | 98.75% | The study’s generalizability to real-world medical data may be poor. |
| CXR Image | [22] (2023) | Deep learning framework with 20 layers and CNN | Tuberculosis (TB), Pneumonia (PNEU), COVID-19, Lung Opacity (LO) | Multiple publicly available datasets: (26,145) | 97.47% | Class imbalance. |
| CT Scan Image | [23] (2022) | A three-layered CNN architecture | Pneumonia and COVID-19 | COVID-19 CT Lung and Infection Segmentation Dataset: (3138) | 98% | Small dataset size. |
| Cough Sounds | [24] (2023) | CNNs: VGG-16, VGG-19, LeNet-5, AlexNet, ResNet-50, and ResNet-152 models | COVID-19 | Virufy dataset: (121) | VGG-16 = 70.3%; VGG-19 = 73%; LeNet-5 = 83.8%; AlexNet = 86.5%; ResNet-50 = 78.4%; ResNet-152 = 64.9% | Small dataset size. |
| Cough Sounds | [25] (2022) | Deep neural networks (DNN) | COVID-19 | Virufy dataset: (121) | 89.2% | Small dataset size. |
| [Ref.] (Year) | Medical Data: CXR | Medical Data: CT | Medical Data: Cough Sounds | Medical Data: Text | Fusion Strategy | Input Data Type | Lung Disease Types | Methodology | Dataset (#: Sample Size) | Pre-Processing Steps | Results (%: Performance of the Model Used) | Limitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [26] (2024) | ✓ | - | - | ✓ | NA | Image & Text | Tuberculosis | Autoencoder, CNN, and Cross-Modal Transformer | Government Medical College, Uttarakhand, India: CXR: (3558); Text: (3558) | 1. Removing noise. 2. Image resizing. 3. Data augmentation. | Accuracy = 95% | Limited availability of medical records. |
| [27] (2024) | ✓ | - | - | ✓ | Intermediate fusion | Image & Text | Pneumonia | CNN and Deep Maxout | The dataset was collected manually: CXR: (2258); Text: NA | 1. Normalization. 2. Image enhancement. | Accuracy = 95.28% | 1. Used a small dataset. 2. Did not implement single-modal baselines for CXR and text to compare with the proposed multimodal model. |
| [4] (2023) | ✓ | - | - | ✓ | Intermediate & Late fusion | Image & Text | Presence or absence of lung disease in general | DenseNet121, DenseNet169, and ResNet50 (CNN), Long short-term memory (LSTM), and Attention | MIMIC-IV dataset: CXR: (1156); Text: (1156) | 1. Image resizing. 2. Data augmentation. 3. Normalization. | Late fusion Accuracy = 89.99%; Intermediate fusion Accuracy = 93.15% | Used a small dataset. |
| [6] (2023) | ✓ | - | - | ✓ | Early fusion | Image & Text | Pneumonia | Se-ResNet50 and spatial attention modules (SAM) | Hainan Women’s and Children’s Medical Center dataset: CXR: (3100); Text: (799). Guangzhou Women and Children Medical Center dataset (GZCMC): CXR: (5856); Text: NA | 1. Cropped the lung area. 2. Normalization. | Accuracy = 77.81% | Imbalanced data between bacterial and viral pneumonia. |
| Modality | COVID-19 | Pneumonia | Lung Cancer | Healthy | Total |
|---|---|---|---|---|---|
| CXR | 2 | 2 | 1 | 2 | 7 |
| CT scan | 2 | 1 | 2 | 2 | 7 |
| Cough sound | 2 | 2 | 1 | 2 | 7 |
| Total | 6 | 5 | 4 | 6 | 21 |
| Modality | COVID-19 | Pneumonia | Lung Cancer | Healthy |
|---|---|---|---|---|
| CXR | 3633 | 1360 | 470 | 10,200 |
| CT scan | 8845 | 945 | 1506 | 8122 |
| Cough sound | 948 | 30 | 50 | 1416 |
| Modality | COVID-19 | Pneumonia | Lung Cancer | Healthy | Total |
|---|---|---|---|---|---|
| CXR | 500 | 500 | 500 | 500 | 2000 |
| CT scan | 500 | 500 | 500 | 500 | 2000 |
| Cough sound | 500 | 500 | 500 | 500 | 2000 |
| Total | 1500 | 1500 | 1500 | 1500 | 6000 |
| Parameter | Value | Description |
|---|---|---|
| num_patches | 196 | Number of image patches (H × W / patch_size²) |
| patch_size | 16 | Size of each square patch |
| num_channels | 3 | Number of input channels (RGB) |
| learning rate | 0.0001 | Learning rate |
| Epochs | 30 | Number of epochs |
| dropout_rate | 0.1 | Dropout probability |
| num_heads | 12 | Number of attention heads |
| num_classes | 4 | Number of output classes |
| Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR |
|---|---|---|---|---|---|---|---|
| Multimodal using early fusion | 97% | 97.5% | 97.6% | 97% | 99.2% | 98.3% | 95.8% |
| Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR |
|---|---|---|---|---|---|---|---|
| Multimodal using intermediate fusion | 98% | 97.7% | 97.5% | 98% | 99% | 99% | 97% |
| Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR |
|---|---|---|---|---|---|---|---|
| Multimodal using early fusion | 97% | 97.5% | 97.6% | 97% | 99.2% | 98.3% | 95.8% |
| Multimodal using intermediate fusion | 98% | 97.7% | 97.5% | 98% | 99% | 99% | 97% |
| Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR |
|---|---|---|---|---|---|---|---|
| Single model for CXR | 94% | 94% | 95% | 95% | 98% | 96% | 91% |
| Single model for CT scan | 94% | 94% | 94% | 94% | 98% | 96% | 90% |
| Single model for Cough sound | 79% | 78% | 78% | 79% | 93% | 86% | 69% |
| Multimodal using early fusion | 97% | 97.5% | 97.6% | 97% | 99.2% | 98.3% | 95.8% |
| Multimodal using intermediate fusion | 98% | 97.7% | 97.5% | 98% | 99% | 99% | 97% |