Article

Comparison Between Early and Intermediate Fusion of Multimodal Techniques: Lung Disease Diagnosis

Computer Science Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
*
Author to whom correspondence should be addressed.
Submission received: 3 December 2025 / Revised: 20 December 2025 / Accepted: 31 December 2025 / Published: 7 January 2026
(This article belongs to the Section Medical & Healthcare AI)

Abstract

Early and accurate diagnosis of lung diseases is essential for effective treatment and patient management. Conventional diagnostic models trained on a single data type often miss important clinical information. This study explored a multimodal deep learning framework that integrates cough sounds, chest radiographs (X-rays), and computed tomography (CT) scans to enhance disease classification performance. Two fusion strategies, early and intermediate fusion, were implemented and evaluated against three single-modality baselines. The datasets were collected from different sources, and each underwent preprocessing steps, including noise removal, grayscale conversion, image cropping, and class balancing, to ensure data quality. Convolutional neural network (CNN) and Extreme Inception (Xception) architectures were used for feature extraction and classification. The results show that multimodal learning achieves superior performance compared with single-modality models. The intermediate fusion model achieved 98% accuracy, while the early fusion model reached 97%. In contrast, the single CXR and CT models each achieved 94%, and the cough sound model achieved 79%. These results confirm that multimodal integration, particularly intermediate fusion, offers a more reliable framework for automated lung disease diagnosis.

1. Introduction

Litjens et al. [1] reported that the lungs play a vital role in the human body, expanding and relaxing to take in oxygen and release carbon dioxide. Lung diseases are among the most common and harmful health problems in the world. Every day, a large number of people die from lung diseases, particularly from COVID-19, lung cancer, and pneumonia. Lung diseases are dangerous if they are not detected early [1]. As of 10 February 2023, the World Health Organization (WHO) reported that 6,812,798 people had died from COVID-19 worldwide and that there had been 753,479,439 confirmed cases. In 2022, there were around 1.8 million lung cancer deaths and 2.5 million new cases. Additionally, the WHO estimates that 2.18 million people died from pneumonia and that there were 344 million new cases in 2021 [2]. When these diseases are diagnosed through clinical analysis alone, clinician fatigue, stress, and limited sensitivity can delay rapid treatment. As a result, healthcare professionals also rely on lung sounds, such as coughing and breathing, as well as diagnostic imaging modalities, such as computed tomography (CT), chest X-ray (CXR), positron emission tomography (PET), and magnetic resonance imaging (MRI) [3].
One of the most effective and promising techniques in computer vision is deep learning. Deep learning (DL) has also shown promise in automatic medical diagnosis in recent years [4]. Numerous techniques based on deep learning have been created to detect and diagnose a range of illnesses, including lung disease [4]. In fact, there is increasing interest in applying deep learning to diagnose and identify diseases such as COVID-19, pneumonia, and lung cancer using audio data and medical imaging [5].
A multimodal method for lung disease diagnosis is an approach that incorporates many forms of patient data, such as imaging, cough sounds, clinical notes, and lab test results, to improve the diagnostic accuracy and reliability [4]. Each type of data adds a unique set of insights to multimodal deep learning: imaging data shows how diseases manifest visually; cough sounds show respiratory conditions through unique acoustic features; clinical notes can show symptoms, previous conditions, and treatment history; and lab results add quantitative health measures [6]. To provide an accurate diagnosis in some situations, a combination of examinations and testing could be required [7]. Studies have indicated that incorporating several data sources can improve a model’s capacity to identify diseases and lower diagnostic mistakes [4,6,7].
There are some challenges in the diagnosis of lung disease: (1) Medical professionals may encounter challenges due to overlapping symptoms, such as fever, cough, and sore throat, which can result in misdiagnosis or delays in treatment [8]. (2) The shortcomings of traditional deep learning diagnostic techniques, which often rely on single-modality data such as X-ray images, limit the ability to find correlations or interactions between various data types, which affects accuracy and performance [9].
Multimodal approaches that incorporate many data sources, such as imaging (X-ray and CT scan images) and cough sounds, present a promising way to address these challenges in light of recent developments in deep learning and artificial intelligence [10]. Therefore, it is crucial to propose a robust multimodal approach that leverages several forms of medical data to increase the accuracy and efficacy of lung disease diagnosis.
This paper aims to apply and compare intermediate fusion and early fusion strategies across different input formats while achieving high, effective accuracy. Furthermore, it aims to develop and implement the deep learning architectures of Xception (Extreme Inception) and convolutional neural network (CNN) to integrate multimodal data for diagnosing lung diseases such as pneumonia, COVID-19, and lung cancer. Moreover, multimodal learning is performed using data collected from multiple public sources without patient-level pairing across modalities. Consequently, the proposed framework focuses on cross-modal label-level integration rather than patient-aligned multimodal fusion. In addition, we evaluate the performance of the proposed multimodal approach by comparing the results with single-modality models and other works. The results demonstrate that the proposed multimodal models outperform single-modality models. The intermediate fusion model achieved an accuracy of 98%, and the early fusion model reached 97%. The single-modality models scored lower: the CXR and CT models reached 94%, and the cough sound model reached 79%. We improved our results by combining multiple data sources, an approach that captures complementary information from images and audio and yields more accurate and reliable predictions.
This study does not propose a novel network architecture or fusion mechanism. Instead, its contribution lies in a systematic and controlled evaluation of tri-modal learning for lung disease diagnosis under realistic data constraints. Unlike many prior works that rely on a single dataset or paired modalities only, this study integrates three clinically distinct data types (sourced from multiple public repositories): chest X-ray images, CT scans, and cough sounds. This work contributes empirical insights into how early and intermediate fusion behave when modalities substantially differ in signal structure, acquisition process, and diagnostic strength. In particular, this study highlights how intermediate fusion mitigates modality dominance and noise propagation in heterogeneous settings. By implementing comparable single-modality baselines under identical preprocessing and training conditions, our analysis isolated the practical benefit of fusion rather than architectural novelty. This study is, therefore, positioned as an applied benchmarking and design validation contribution, and it is aimed at informing future multimodal clinical system design rather than proposing a new algorithmic theory.
The remainder of this paper is organized as follows: Section 2 provides a background of the topic. Section 3 describes how to select the literature and reviews relevant literature, highlighting existing approaches and gaps. Section 4 describes the methodology employed for data collection and analysis. Section 5 presents the results and discusses them. Finally, Section 6 concludes this paper by summarizing key findings, outlining limitations, and suggesting directions for future research.

2. Background

This section provides the background required to understand the concepts used in this research, organized into medical and technical aspects. The medical subsection explains the terms, methods, and concepts needed to understand the healthcare side, while the technical subsection explains the algorithms underlying the computer science side. These subsections are intended to provide clear terminology for both fields.

2.1. Medical Aspects

2.1.1. Medical Data Collection

The clinical diagnostic procedure is intrinsically complex, requiring the creation and analysis of several data types, including images, text, speech, genetic data, and physiological signals (see Figure 1). This intricacy results from the cooperative interaction of several data sources: images that capture anatomical structures, speech explaining the symptoms of a patient, textual data that describes medical history, genetic data that defines innate vulnerability, and physiological data obtained from electrocardiograms (ECGs) and electroencephalograms (EEGs). Each modality provides distinct and significant insights that lead to a more comprehensive understanding of patients’ physiological states [11].
  • Image: Medical imaging techniques, including computed tomography (CT), X-rays, magnetic resonance imaging (MRI), and digital pathology, provide visual depictions of internal structures and abnormalities. These images are essential components of a diagnosis, revealing the fine details required for detecting and describing various medical disorders [11].
  • Text: Textual data, including clinical notes, medical literature, and electronic health records, provide a narrative thread that intertwines the patient’s medical history, medical journey, and contextual information that are essential for an accurate diagnosis [11].
  • Speech: Speech recordings offer a special way to comprehend the symptoms and feelings of patients. This modality gives the diagnostic process a qualitative component by capturing subtleties such as tone, pace, and articulation [11].
  • Genetic data: A molecular layer is introduced by genetic data to clarify the innate predispositions, vulnerabilities, and genetic markers that may affect the manifestation of diseases [11].
  • Physiological signals: Real-time snapshots of cardiac and brain activity are provided by signal data. This dynamic modality provides important insights into the anomalies and patterns linked to neurological or cardiac disorders by efficiently capturing temporal variations [11].

2.1.2. Lung Disease Symptoms

  • COVID-19: Coronaviruses (such as COVID-19) are enveloped, positive-sense, nonsegmented, and single-strand ribonucleic acid viruses belonging to the Coronaviridae family. Under an electron microscope, the viruses have a distinctive shape as they have viral spike peplomers that protrude from the viral envelope that resemble crowns. Coronaviruses are widely spread among humans and mammals [12]. COVID-19’s most common symptoms include cough, fever, fatigue, dyspnea, and myalgia. Headaches, hemoptysis, sputum, and gastrointestinal symptoms are the less common symptoms [13]. In chest imaging, patients with COVID-19 exhibit the usual radiological features, such as bilateral and multifocal ground glass opacities and consolidations with peripheral and basal predominance. Less observed features include cavitation, pleural effusion, bronchiectasis, lymphadenopathy, and septal thickening [14].
  • Pneumonia: Pneumonia is an infection that causes inflammation in the alveoli, or air sacs, of one or both lungs, resulting in fluid or pus buildup that hinders oxygen exchange [15]. The symptoms involve fever, chills, a persistent cough (often mixed with phlegm), fatigue, weakness, and chest pain that worsens with coughing or deep breathing. Pneumonia usually appears in chest imaging as patches of lung consolidation that resemble dense or patchy opacities. These opacities might be localized (lobar pneumonia) or spread over both lungs (bronchopneumonia). Air bronchograms (visible air-filled bronchi against a backdrop of consolidated lung tissue), pleural effusion (fluid around the lungs), and blurring of normal lung and diaphragm boundaries are other possible findings [16].
  • Lung cancer: Lung cancer (LC) is a kind of cancer that starts in the cells of the lungs that exchange oxygen and carbon dioxide when breathing. It occurs when abnormal lung cells develop uncontrollably, resulting in tumors that can impair lung function [17]. The symptoms can include a persistent cough; chest pain; shortness of breath; hemoptysis (coughing up blood); and inexplicable weight loss, fatigue, and recurring infections. In chest imaging, it may also involve pleural effusion, enlarged lymph nodes, abnormal shadows, lung masses or nodules, and regions of lung collapse (atelectasis) [18].

2.2. Technical Aspects

2.2.1. Multimodal Definition

The term “multimodal” describes systems that process and combine multiple data types (modalities), including text, images, audio, and video, to improve interaction and decision making. Such systems can provide more information than single-modal approaches by utilizing modality-specific information [19].

2.2.2. Multimodal Data Fusion

Multimodal data fusion makes it possible to combine many modalities to obtain more accurate information and to use that information for decision making during disease diagnosis. The main challenge is how to integrate data from heterogeneous modalities to produce an accurate diagnosis. The ability of deep learning models to learn from hierarchical input structures makes them valuable for fusing data from heterogeneous input sources. There are three types of deep learning-based multimodal fusion strategies: early, intermediate, and late fusion [4]. An outline of fusion techniques is shown in Figure 2.
  • Early fusion: Early fusion involves concatenating input data or features from heterogeneous sources before a deep learning network processes them. Early fusion multimodal techniques tend to outperform single-modal deep learning methods. However, early fusion multimodal techniques are only appropriate for homogeneous modalities since they have homogeneous features across different modalities. The performance of early fusion will be unreliable if the modalities, such as text and images, are heterogeneous. The best choices for direct modeling in early fusion are convolutional neural network (CNN), recurrent neural network (RNN), and fully connected neural network (FCN) [4].
  • Intermediate fusion: Intermediate fusion uses various mapping techniques in a fusion or shared representation layer to convert input data from different sources into higher-level representations. Intermediate fusion enables the model to learn mappings between diverse data, which is beneficial for the decision-making process [4].
  • Late fusion: In late fusion, models are trained independently for every modality. The outcomes of each model are then combined using a softmax function to average the probabilities of each model. This strategy assumes that the separate models have contributed equally. Late fusion strategies can handle heterogeneous modalities since an imbalance in the number of features across several modalities has no impact on the decision. Some latent relationships may remain hidden because late fusion strategies may not learn the correlation between features of different modalities. Thus, late fusion can be helpful in situations where the correlation between modalities is lower [4].
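As a toy illustration of the three strategies, the sketch below shows early fusion as feature concatenation, intermediate fusion as fusing learned representations, and late fusion as averaging per-model class probabilities; the vectors, linear maps, and probability values are placeholders, not the architectures used later in this paper.

```python
# Toy numpy illustration of early, intermediate, and late fusion.
import numpy as np

rng = np.random.default_rng(0)
feat_a, feat_b = rng.normal(size=64), rng.normal(size=64)   # two modality feature vectors

# Early fusion: concatenate raw features, then feed a single model.
early_input = np.concatenate([feat_a, feat_b])               # shape (128,)

# Intermediate fusion: map each modality to a learned representation first,
# then fuse the representations (random linear maps stand in for encoders).
W_a, W_b = rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
intermediate_input = np.concatenate([W_a @ feat_a, W_b @ feat_b])  # shape (64,)

# Late fusion: each modality has its own classifier; average their softmax outputs.
probs_a = np.array([0.7, 0.1, 0.1, 0.1])   # placeholder per-model class probabilities
probs_b = np.array([0.5, 0.3, 0.1, 0.1])
late_prediction = (probs_a + probs_b) / 2
```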

3. Literature Review

In this section, we introduce the criteria for selecting the literature and literature taxonomy.

3.1. Literature Selection Methodology

The multimodal approach that combines different types of medical data for lung diseases, as well as the single-modal approach that uses a single type of data, were reviewed using an existing selection methodology that was modified from [20]. The procedure for selecting the literature is described in this section (see Figure 3).

3.1.1. Keyword Filtering Stage

At least one of the following keywords had to appear in the title of the articles obtained from the Google Scholar search engine, IEEE Xplore, and Scopus: (1) multimodal deep learning, (2) multimodal approach, (3) single-modal deep learning, or (4) single-modal approach. This stage yielded a total of 20 articles.

3.1.2. Publisher Filtering Stage

The following publishers were included in the literature selection methodology: (1) IEEE; (2) Elsevier; (3) Springer; (4) MDPI; (5) Frontiers; (6) PMLR; (7) Public Library of Science, San Francisco, CA, USA; (8) Nature Publishing Group UK, London; and (9) other. The percentage of articles for each publisher is shown in Figure 4. The number of articles dropped from 20 to 15 at this stage.

3.1.3. Year Filtering Stage

Only recent research articles published in the years 2019–2024 were taken into consideration in the literature selection methodology. The percentage of articles published in each year is shown in Figure 5. The number of articles dropped from 15 to 11 at this stage.

3.1.4. Abstract Filtering Stage

For the 11 papers remaining from the previous stage, the abstracts were read to narrow the selection to the most significant articles that particularly focused on single-modal and multimodal approaches to lung diseases. Thus, nine articles were finally selected from the literature.

3.2. Literature Selection Taxonomy

In this study, we classified lung disease diagnosis studies into two categories: single-modal and multi-modal. The studies classified into the single-modal category had applications of CXR images, CT scan images, and cough sounds. In contrast, the multi-modal studies had applications of all the medical data types.

3.2.1. Single-Modal

The single-modal approach uses a single type of data input to evaluate and generate insights for specific applications, resulting in efficient and focused processing. This section will focus on using individual data modalities (e.g., CXR images only, CT scan images only, and cough sounds only) to diagnose lung diseases.
1.
DL Model for the Diagnosis of Lung Diseases using CXR Images
The research of [21] provides a grid search-optimized deep learning framework for multi-class lung disease classification. The lung diseases included COVID-19, pneumonia (PNEU), tuberculosis (TB), lung cancer (LC), and lung opacity (LO), which were diagnosed using chest X-ray (CXR) images. It presents two models: one for lung disease detection and the other for disease classification. Using a dataset with about 98,991 images, the models scored 99.82% and 98.75% accuracies in the detection and classification tasks, respectively. However, the study’s dependence on hyperparameter tuning and dataset balance raises concerns regarding its generalizability to diverse clinical data from the real world.
The paper of [22] discusses the difficulty of correctly detecting and classifying lung diseases such as tuberculosis (TB), pneumonia (PNEU), COVID-19, and lung opacity (LO) using chest radiograph images (CRIs). It suggests using DeepLungNet, a 20-layer deep learning framework that uses convolutional neural network (CNN) with fire modules for feature extraction, batch normalization, and Leaky ReLU activation. Two integrated datasets were used to validate the model’s performance; it showed robustness through six-class classification and averaged 97.47% accuracy for five-class classification. Despite data augmentation, limitations include possible class imbalance problems and difficulties with generalizability.
2.
DL Model for the Diagnosis of Lung Diseases using CT Scan Images
The research of [23] uses a deep learning system to tackle the problem of detecting pneumonia and COVID-19 through CT imaging. The approach includes segmenting the lungs and infection areas using a U-Net model after preprocessing CT scans with Contrast Limited Adaptive Histogram Equalization to improve contrast. For classification, a three-layered CNN architecture was used, with fourfold cross-validation. The database contained 3,138 annotated images from 20 CT scan cases. According to the experimental results, lung segmentation had a score of 98%, infection segmentation had a score of 91%, and classification accuracy was 98%. The small dataset size was the limitation of the article.
3.
DL Model for the Diagnosis of Lung Diseases using Cough Sounds
The study of [24] explores the detection of COVID-19 using convolutional neural network (CNN), converting cough sounds into mel-spectrogram images. Using 121 cough audio recordings from the Virufy dataset, six CNN models (VGG-16, VGG-19, LeNet-5, AlexNet, ResNet-50, and ResNet-152) were tested with different input sizes. The best classification performance was demonstrated by the AlexNet model, which achieved the greatest accuracy of 86.5% at an input size of 227 × 227. One of the limitations was the dependence on a limited dataset.
In the paper of [25], a deep learning-based method for early COVID-19 detection with cough sound samples is presented. It tackled the issues of traditional tests like RT-PCR’s expense, accessibility, and invasiveness. A deep neural network (DNN) was used to identify the cough sounds after they had been preprocessed, and acoustic features across time, frequency, and mixed domains were extracted. There were 121 cough sound samples in the dataset, which came from the Virufy database. The findings show 89.2% testing accuracy. However, they also used a small dataset.

3.2.2. Multi-Modal

Multimodal approach research has made progress by investigating techniques to combine several data types to improve diagnostic accuracy and speed. This section will introduce recent studies of multimodal approach-based lung disease diagnosis.
1.
DL Model for the Diagnosis of Lung Diseases using Different Medical Data
A novel transformer-based deep learning model was proposed in the article of [26], which diagnosed tuberculosis (TB) by combining imaging and clinical data. The issue addressed was the necessity of employing multimodal data for early and precise TB identification since only depending on imaging or clinical data might not be enough. Convolutional neural network (CNN) was used to extract visual features and to analyze the clinical data using a denoising autoencoder to extend feature dimensions. A cross-modal transformer module was then used for fusing to create a single feature representation. The dataset included 3256 patients’ X-ray images and 3558 matched clinical records, which were gathered at a government hospital in Uttarakhand, India. According to experimental results, the model outperforms conventional fusion procedures with a classification accuracy of 95%. The comparatively limited number of clinical features is one of the study’s limitations.
The authors of [27] presented a multimodal approach that uses text data and X-ray images to identify pneumonia early. To distinguish pneumonia from other lung diseases, a hybrid classifier composed of Deep Maxout and convolutional neural network (CNN) was used. The text and image format of the dataset were manually collected. The suggested model outperforms traditional models with an improved accuracy of 95.28%. The limitations of the work were that it did not implement the model with each CXR and that it utilized text as individual datasets and compared it with the proposed multimodal approach. In addition, they used a small dataset.
In the paper of [4], a multimodal technique, which uses chest X-ray images and clinical data to identify lung diseases and is based on late and intermediate fusion, is proposed. It uses DenseNet121, DenseNet169, and ResNet50 architectures to process chest X-ray images, and it processes clinical data using long short-term memory (LSTM) and attention models. The dataset used is MIMIC-IV, which contains X-ray images and clinical data. The experimental results indicate that the intermediate fusion-based model achieves an accuracy of 93.15%, outperforming the late fusion model. However, the proposed approaches used a small number of patients.
The research of [6] introduced a two-phase training multimodal pneumonia classification approach that combines data from blood tests and X-ray images. The architectures used were the Se-ResNet50 network and spatial attention modules (SAMs). The Guangzhou Women and Children’s Medical Center (GZCMC) public dataset from China and the Hainan Women and Children’s Medical Center dataset were the two datasets used in the study. The proposed approach achieved a performance of 77.81%, and the experiments showed that the modality fusion approach performed better than any single-modality approach. However, the pneumonia data used were extremely imbalanced between bacterial pneumonia and viral pneumonia.

3.3. Discussion

Single-modal approaches have various drawbacks that reduce their efficacy in complex scenarios. Approaches that depend on a single data source are restricted in their ability to identify the correlations or interactions between different types of data, which decreases overall performance and accuracy [19]. Moreover, single-modal approaches are less useful for multidimensional applications since they cannot handle tasks that require the integration of several data types, such as merging text and images [9].
While multimodal learning has been increasingly explored in medical diagnosis, the majority of existing studies focus on dual-modality combinations, most commonly pairing medical imaging with either clinical data or audio signals. Several works integrate chest X-ray images with CT scans, imaging with electronic health records, or cough sounds with spectrogram-based features. Studies employing more than two modalities do exist; however, they remain relatively limited and often rely on highly constrained or institution-specific datasets. Moreover, many multi-modal studies with three or more inputs focus on structured clinical variables rather than heterogeneous signal types, such as audio and imaging. In contrast, the present study evaluated the joint integration of three diagnostically distinct and heterogeneous modalities—chest X-rays, CT scans, and cough sounds—under a unified experimental protocol. This positioning does not imply the absence of prior tri-modal work, but rather highlights the relative scarcity of systematic comparative evaluations using three heterogeneous diagnostic signals under consistent training and evaluation conditions.
Table 1 and Table 2 summarize the reviewed related works, and the following was observed regarding multimodal approaches:
  • Most studies use the intermediate fusion strategy since it achieves high accuracy in different input formats. Our challenge, and the contribution of this research, was to use—as well as to compare—both intermediate fusion and early fusion strategies to achieve high accuracy.
  • To our knowledge, most researchers use a multimodal approach with only two types of medical data. Our study will use three types of medical data.
  • To our knowledge, only a few studies have implemented and compared the single-modal approach for each medical data type. Our research will implement a single-modal approach for each data type and compare the results with the proposed approach to illustrate the impact of combining diverse data types. These results will also be compared with other works.
  • Some studies use a small dataset. Our work will solve this limitation by using multiple datasets for each data type.
  • Some studies have issues with data imbalance. This work will handle class imbalance using augmentation techniques.
  • Most researchers use convolutional neural network (CNN) architectures, which are the common and powerful classification architectures in deep learning. This study also used the CNN architecture in the classification of lung diseases.

4. Research Methodology

The main contribution of this research is in designing an effective multi-modal approach that detects and classifies lung diseases. This section describes the details of the dataset used and model development. Figure 6 illustrates the main stages of the proposed multimodal approach.

4.1. Dataset Collection

Multiple datasets were collected for each type of medical data: lung imaging (X-rays and CT scans) and cough sounds. The dataset used was complete and diverse across disease types (COVID-19, pneumonia, and lung cancer). The three subsections below describe the sources: the first covers CXR image databases, the second covers CT scan image databases, and the third covers cough sound databases. The numbers of images and sources used are shown in Table 3 and Table 4.
The datasets used in this study were obtained from multiple public repositories, with each modality independently collected from different sources. Due to the absence of shared patient identifiers across these datasets, it was not possible to establish patient-level correspondence between chest X-ray images, CT scans, and cough sound recordings. Consequently, samples from different modalities could not be paired based on individual patients. To address this limitation, multimodal instances were constructed using disease-label alignment, where samples from each modality corresponding to the same diagnostic class were grouped to form a tri-modal training instance. This strategy enabled the integration of heterogeneous modalities while maintaining consistent class semantics across inputs. Although this approach does not capture patient-specific correlations, it allows the model to learn the cross-modal relationships associated with disease characteristics. This design choice is explicitly acknowledged to ensure methodological clarity and to prevent misinterpretation of the multimodal framework as patient-aligned fusion.
To improve transparency regarding dataset composition, the contribution of each data source to each disease class was clarified. For chest X-ray data, COVID-19 samples were aggregated from multiple publicly available repositories, including [28,29], contributing a total of 3633 images before balancing. Pneumonia and lung cancer X-ray samples were sourced from [28,29], while normal cases were drawn from [28,29]. CT scan samples were similarly aggregated from distinct repositories, each contributing disease-specific volumes, as listed in Table 4. Cough sound samples were obtained from fewer publicly available datasets, which explains the smaller raw sample size for certain classes (particularly pneumonia). No dataset contributed samples to more than one split. The reported class counts represent raw totals before balancing and augmentation. This clarification ensures the traceability of data origins and improves reproducibility.

4.1.1. Lung Disease CXR Datasets

A total of seven publicly available datasets on different chest diseases (COVID-19, pneumonia, and lung cancer) were collected from a wide range of sources. The following datasets were used: COVID-19 [28,29], Pneumonia [28,29], Lung Cancer [29], and Healthy [28,29] (see Figure 7).

4.1.2. Lung Disease CT Scan Datasets

A total of seven publicly available datasets were collected. The following datasets were used: COVID-19 [30,31], Pneumonia [32], Lung Cancer [33,34], and Healthy [30,31] (see Figure 7).

4.1.3. Lung Disease Cough Sound Datasets

Seven public datasets were gathered. The following datasets were used: COVID-19 [29,35], Pneumonia [29,36], Lung Cancer [29], and Healthy [29,35] (see Figure 8).

4.2. Data Pre-Processing

This section presents the pre-processing of each data type to ensure compatibility. A comprehensive preprocessing pipeline was applied to all datasets to ensure data quality and consistency across modalities. The preprocessing steps included noise removal, normalization, resizing, grayscale conversion for image data, and feature standardization for audio inputs. Data augmentation techniques were employed to improve model generalization and to reduce overfitting. Importantly, all augmentation procedures were exclusively applied to the training set. No augmented samples, or their derived variants, were included in the validation or testing sets to prevent data leakage and inflated performance estimates. Augmentation techniques for image data included random rotation, horizontal flipping, scaling, and contrast adjustment, while audio data augmentation involved time stretching and noise injection. By strictly isolating augmentation to the training phase, the evaluation protocol preserved the integrity of the validation and test sets, ensuring that the reported results reflected genuine model generalization rather than memorization of augmented patterns.

4.2.1. Converting Cough Sounds into an Image Using a Mel-Spectrogram

A mel-spectrogram technique was used to convert cough sounds into images. A time–frequency representation of an audio signal that maps the spectrum onto the Mel scale—a pitch perceptual scale that more closely resembles how humans hear sound—is called a mel-spectrogram. It is produced by first calculating the signal’s Short-Time Fourier Transform (STFT), which records its frequency content over time. Next, a Mel filter bank is applied to transform the linear frequency axis into the Mel scale. The outcome is a 2D representation that emphasizes perceptually significant frequency information (time on one axis and Mel frequency bands on the other). As a result, it is frequently utilized in tasks related to speech, music, and audio processing [37].
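To make the conversion concrete, the short Python sketch below turns one cough recording into a mel-spectrogram image using librosa; the file paths, FFT size, hop length, colormap, and number of Mel bands are illustrative assumptions, not the exact settings used in this study.

```python
# Minimal sketch: cough audio -> mel-spectrogram image (assumed parameters).
import numpy as np
import librosa
import matplotlib.pyplot as plt

def cough_to_mel_image(wav_path: str, out_path: str, n_mels: int = 128) -> None:
    # Load the recording at its native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)
    # STFT followed by a Mel filter bank (wrapped in a single librosa call).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    # Convert power to decibels to compress the dynamic range.
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Save the 2D time-frequency representation as an image for the CNN input.
    plt.figure(figsize=(4, 4))
    plt.axis("off")
    plt.imshow(mel_db, aspect="auto", origin="lower", cmap="magma")
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()

# Hypothetical usage:
# cough_to_mel_image("cough_001.wav", "cough_001.png")
```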

4.2.2. Handling an Imbalanced Class Dataset

Class imbalance occurs when one class (the majority class) has substantially more samples than another class (the minority class), which is a problem in many real-world applications. As shown in Table 4, the majority of the lung disease classes in the CXR, CT scan, and cough sound datasets are imbalanced. To address this problem, we standardized the dataset to ensure balance. Because the number of images varied greatly across the lung disease classes (some classes contained nearly 8000 images, while others had as few as 30), each class was set to 500 images. Classes with more than 500 images were downsampled to this size, while classes with fewer than 500 images were upsampled to 500 using data augmentation techniques (see Table 5). A target of 500 images was chosen because it is the largest size to which the smallest classes could be augmented without excessive data replication and overfitting.
Moreover, the original datasets exhibited significant class imbalance, with certain disease categories being overrepresented relative to others. To mitigate this issue, controlled downsampling of majority classes and augmentation of minority classes were applied to achieve a balanced dataset. While this approach ensures uniform class representation during training, aggressive downsampling may discard valuable real samples from majority classes. To address this concern, alternative imbalance-handling strategies were considered. These included the use of class-weighted loss functions, focal loss to emphasize harder-to-classify samples, and balanced batch sampling to ensure equal class representation within each mini-batch. Such methods can reduce bias without removing large portions of original data. Although uniform class balancing was adopted in the current study for experimental consistency, future work will explore these alternative strategies to preserve more majority-class information while maintaining robust performance across underrepresented disease categories.
Class balancing was performed to mitigate the severe class imbalance across modalities and disease categories. For classes with more than 500 samples, downsampling was applied using random selection, which was stratified by dataset source to preserve diversity and to prevent the dominance of a single acquisition pipeline. For classes with fewer than 500 samples, data augmentation was exclusively used within the training set to synthetically increase the sample count. Image-based augmentation included horizontal flipping, rotation within ±15 degrees, random scaling (±10%), and contrast adjustment. For cough audio signals, augmentation was applied in the time–frequency domain using spectrogram-based transformations, including time shifting, frequency masking, and additive noise at low signal-to-noise ratios. No augmented samples were included in the validation or testing sets. While augmentation improves class balance, it may introduce bias; therefore, its effect is discussed as a limitation in the Discussion Section.
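The following sketch illustrates the balancing rule described above (500 samples per class via random downsampling or augmentation-based upsampling of the training set); the target count is taken from the text, while the specific transforms, jitter ranges, and helper names are illustrative assumptions.

```python
# Minimal sketch of per-class balancing to 500 images (training set only).
import random
from PIL import Image, ImageEnhance, ImageOps

TARGET_PER_CLASS = 500  # target size stated in the text

def augment_once(img: Image.Image) -> Image.Image:
    # Light, label-preserving transforms: flip, small rotation, contrast jitter.
    if random.random() < 0.5:
        img = ImageOps.mirror(img)                 # horizontal flip
    img = img.rotate(random.uniform(-15, 15))      # rotation within +/-15 degrees
    return ImageEnhance.Contrast(img).enhance(random.uniform(0.9, 1.1))

def balance_class(image_paths: list[str]) -> list[Image.Image]:
    # Majority classes: random downsampling to the target size.
    if len(image_paths) >= TARGET_PER_CLASS:
        chosen = random.sample(image_paths, TARGET_PER_CLASS)
        return [Image.open(p).convert("L") for p in chosen]
    # Minority classes: augment random copies until the target is reached.
    images = [Image.open(p).convert("L") for p in image_paths]
    while len(images) < TARGET_PER_CLASS:
        images.append(augment_once(random.choice(images)))
    return images
```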

4.2.3. Image Enhancement

The dataset consisted of images in the form of CXR images, CT scan images, and cough sound images (CSIs). The pre-processing steps for these images are as follows:
  • The CXR, CT scan, and CSI images were transformed into grayscale images.
  • The CXR, CT scan, and CSI images were cropped to remove any undesired items, text, and patient information.
  • Any CSI that was entirely black or white was removed, as such images contain no useful information.
  • Each image was resized to be compatible with the model input.
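A minimal sketch of these enhancement steps is given below; the crop margins, brightness thresholds for discarding uninformative CSIs, and target size are illustrative assumptions rather than the exact values used in the study.

```python
# Minimal sketch of the image-enhancement pipeline (grayscale, crop, filter, resize).
import numpy as np
from PIL import Image

def is_informative(img: Image.Image, low: float = 5.0, high: float = 250.0) -> bool:
    # Discard nearly all-black or all-white spectrogram images (assumed thresholds).
    mean_intensity = float(np.asarray(img).mean())
    return low < mean_intensity < high

def enhance_image(path: str, target_size=(224, 224)) -> Image.Image | None:
    img = Image.open(path).convert("L")            # grayscale conversion
    w, h = img.size
    # Crop a central region to remove borders, burned-in text, and patient
    # information (the 5% margins are assumed, not taken from the paper).
    img = img.crop((int(0.05 * w), int(0.05 * h), int(0.95 * w), int(0.95 * h)))
    if not is_informative(img):
        return None                                # skip uninformative CSIs
    return img.resize(target_size)                 # resize for the model input
```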

4.3. Proposed Model

In this section, the feature extraction and classification of the proposed model are described. Furthermore, it explains the proposed multimodal framework and illustrates the difference between multimodal approaches using early fusion and those using intermediate fusion.

4.3.1. Feature Extraction and Classification

1.
Feature extraction: Xception (Extreme Inception) was proposed by Francois Chollet in 2016. The model’s name is taken from “Extreme Inception” because it expands and improves the Inception model’s architecture. The Xception model improves performance and parameter efficiency by using depthwise separable convolutions instead of the standard convolutional operations used in the Inception model [38]. It was used in this study with fine-tuned hyperparameters, and we also modified some layers, adapting the Dense layers to the number of lung disease classes in this research and adding a Dropout layer between layers to avoid overfitting. The architecture of Xception is shown in Figure 9. Xception was chosen because it has the following advantages [39]:
  • Parameter efficiency—depthwise separable convolutions use far fewer parameters than traditional convolutions, making the model lighter and more efficient in parameter utilization.
  • Performance—in a variety of visual recognition tasks (e.g., difficult images), the Xception model outperforms many advanced models of its time, including the original Inception model.
  • Adaptability—the Xception model is more suited for deployment on devices with limited computational resources because it has fewer parameters.
2.
Classification: Convolutional neural networks (CNNs) are a family of deep learning models that were specifically designed to analyze data with a grid-like topology, such as audio spectrograms (2D time–frequency grids) or images (2D pixel grids) [40]. In many fields, such as image classification [41], object identification [42], and medical image analysis [1], CNNs have demonstrated state-of-the-art performance. In this study, a convolutional neural network (CNN) was employed as the classifier. The CNN’s convolutional and fully connected layers were used to train the model to directly assign the input data from each modality to the appropriate disease class. In particular, the network’s output features were passed through a fully connected layer with 256 units and ReLU activation. Then, a dense output layer with four neurons and a softmax activation function was used to predict the probability distribution over the four target classes.
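As an illustration of this stack, the Keras sketch below combines an Xception backbone with the 256-unit ReLU layer and four-way softmax head described above; the dropout rate, ImageNet weights, and global-average pooling are assumptions rather than confirmed settings from this study.

```python
# Minimal sketch: Xception feature extractor + dense classifier head.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_single_modality_model(num_classes: int = 4) -> tf.keras.Model:
    # Depthwise-separable convolution backbone (assumed: ImageNet weights,
    # global average pooling).
    backbone = tf.keras.applications.Xception(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg")

    inputs = layers.Input(shape=(224, 224, 3))
    x = backbone(inputs)
    x = layers.Dropout(0.3)(x)                     # dropout to limit overfitting
    x = layers.Dense(256, activation="relu")(x)    # fully connected layer (256 units)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```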

4.3.2. Multimodal Techniques

The proposed multimodal framework integrates three heterogeneous data modalities: chest X-ray images, CT scans, and cough sound recordings. The feature representations extracted from these modalities are denoted as $z_{\mathrm{CXR}}$, $z_{\mathrm{CT}}$, and $z_{\mathrm{CS}}$, respectively. In the early fusion strategy, these modality-specific feature vectors were projected to a common dimensional space and concatenated to form a unified representation that is directly passed to the classifier. In contrast, the intermediate fusion strategy processes each modality through a dedicated encoder before mapping the extracted features into a shared latent space. Fusion is then performed at the representation level, enabling the model to capture cross-modal interactions while preserving modality-specific characteristics. This explicit tri-modal formulation ensures clarity in the fusion process and aligns the mathematical description with the implemented architecture. A schematic diagram illustrating both fusion strategies is included to enhance interpretability.
1.
Multimodal approach using early fusion: We used an early fusion approach to combine several modalities within a single classifier. Raw inputs are preprocessed for each modality and then linearly projected to a common dimensionality $d$ if needed, producing modality-specific vectors $z_A \in \mathbb{R}^d$ and $z_B \in \mathbb{R}^d$. After that, they are concatenated to create a joint representation:
$h = [z_A; z_B] \in \mathbb{R}^{2d}$.
The shared classifier head receives this fused vector and is made up of a fully connected layer with 256 units and ReLU activation, followed by a four-way softmax output layer that generates the class posterior distribution. This design enables the classifier to directly learn decision boundaries across the joint input space while keeping the fusion process simple and computationally efficient (concatenation without cross-attention).
2.
Multimodal approach using intermediate fusion: To produce modality-specific representations, we used an intermediate fusion technique, where each modality is initially processed independently through its own encoder. Let $z_A \in \mathbb{R}^{d_A}$ and $z_B \in \mathbb{R}^{d_B}$ represent the intermediate features derived from Modalities A and B, respectively. These intermediate representations are further transformed into a shared latent space and then fused instead of directly concatenating the raw features. Compared with early fusion, this method preserves modality-specific processing while enabling the model to learn cross-modal interactions at a deeper level. A minimal code sketch contrasting both fusion strategies is provided after this list.
3.
Tri-modal data construction: In this study, a tri-modal instance was constructed by associating the chest X-ray images, CT scan images, and cough sound samples that shared the same disease label. Due to the absence of patient-level identifiers across the publicly available datasets, strict patient-level pairing across modalities was not feasible. Therefore, each tri-modal sample was formed through label-based alignment, where one sample from each modality corresponding to the same disease class was grouped to create a single training instance.
This strategy enables the model to learn complementary representations across heterogeneous modalities while maintaining consistent diagnostic labels. The resulting framework captures cross-modal correlations at the label level rather than at the individual patient level. This design choice reflects a practical scenario in which different diagnostic tests may be independently available but still jointly contribute to clinical decision making.
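The sketch below contrasts the two fusion strategies for the three modalities using the Keras functional API; the encoder widths, projection dimension, input feature size (2048, matching a pooled Xception output), and layer names are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch: early vs. intermediate fusion over CXR, CT, and cough features.
import tensorflow as tf
from tensorflow.keras import layers, models

MODALITIES = ("cxr", "ct", "cough")
FEAT_DIM = 2048  # assumed size of pooled backbone features per modality

def build_early_fusion(d: int = 256, num_classes: int = 4) -> tf.keras.Model:
    # Project each modality to a common dimension d, concatenate, then classify.
    inputs, projections = [], []
    for name in MODALITIES:
        x_in = layers.Input(shape=(FEAT_DIM,), name=f"{name}_features")
        inputs.append(x_in)
        projections.append(layers.Dense(d, name=f"{name}_proj")(x_in))
    h = layers.Concatenate()(projections)          # h = [z_cxr; z_ct; z_cs]
    h = layers.Dense(256, activation="relu")(h)
    out = layers.Dense(num_classes, activation="softmax")(h)
    return models.Model(inputs, out)

def build_intermediate_fusion(d: int = 256, num_classes: int = 4) -> tf.keras.Model:
    # Modality-specific encoders map each input into a shared latent space
    # before fusion, preserving modality-specific processing.
    inputs, latents = [], []
    for name in MODALITIES:
        x_in = layers.Input(shape=(FEAT_DIM,), name=f"{name}_features")
        inputs.append(x_in)
        z = layers.Dense(512, activation="relu", name=f"{name}_encoder")(x_in)
        latents.append(layers.Dense(d, activation="relu", name=f"{name}_latent")(z))
    h = layers.Concatenate()(latents)
    h = layers.Dense(256, activation="relu")(h)
    out = layers.Dense(num_classes, activation="softmax")(h)
    return models.Model(inputs, out)
```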

5. Results and Discussion

In this section, the experimental setup and results of the proposed model are shown and compared with the baseline approaches. Next, the strengths, limitations, and key insights are highlighted through analysis and discussion of findings.

5.1. Experimentation Setup

Details of the training parameters, data split strategy, and software environment used to assess the proposed model are described in this section.
The Kaggle cloud-based platform, which comes with pre-installed machine learning libraries, including TensorFlow 2.20.0, PyTorch 2.7.0, Keras 3.0, and scikit-learn 1.8.0, as well as Jupyter Notebook 1.0.0 interfaces, was used for all experiments. By providing free access to GPUs and TPUs, Kaggle makes it possible to effectively train and test models without requiring extra local hardware. The framework was run on NVIDIA Tesla P100 GPUs with a memory of 13 GB.

5.1.1. Training Settings

The dataset was split into training, validation, and testing subsets in an 80:10:10 ratio. The dataset distribution is shown in Figure 10. Given the use of multi-source datasets, particular care was taken to prevent data leakage during training and evaluation. Since patient-level identifiers were unavailable, source-aware splitting was adopted to ensure that samples originating from the same dataset source did not simultaneously appear across training, validation, and test sets. This strategy reduces the risk of the model learning dataset-specific artifacts rather than disease-relevant features. Training was performed using fixed splits, and hyperparameters were kept consistent across experiments to ensure fair comparison between single-modality and multimodal models. This splitting protocol provides a more conservative and realistic assessment of generalization performance, particularly in scenarios involving heterogeneous data sources. By enforcing strict separation across dataset sources, the reported results better reflect the robustness of the proposed framework.
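A minimal sketch of the source-aware split is shown below, assuming each sample carries an identifier for the public repository it came from; GroupShuffleSplit keeps all samples from one source in a single partition, so the 80:10:10 ratio is approximate rather than exact.

```python
# Minimal sketch: source-aware 80/10/10 split using dataset-source groups.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def source_aware_split(labels: np.ndarray, sources: np.ndarray, seed: int = 42):
    indices = np.arange(len(labels))
    # Hold out roughly 20% of the data by source for validation + test.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, hold_idx = next(outer.split(indices, labels, groups=sources))
    # Split the held-out sources roughly in half for validation and test.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(hold_idx, labels[hold_idx],
                                         groups=sources[hold_idx]))
    return train_idx, hold_idx[val_rel], hold_idx[test_rel]
```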
Consistent hyperparameters were applied throughout all experiments, and random seeds were fixed for every library to guarantee equity and reproducibility. The Adam optimization technique, with a learning rate of 0.0001, batch size of 16, and epochs of 30, was used to train the model. The input shape was (224, 224, 3), with the number of color channels being 3 and the input size being 224 × 224. The model was trained using categorical cross-entropy as the loss function. The same configuration was applied to all single-modality and multimodal models to ensure that the observed performance differences arose from fusion strategy rather than optimization bias. This clarification improves transparency and aligns reported configurations with the implemented architecture. While this approach allows controlled evaluation, it does not capture performance variability across different splits. The authors acknowledge that cross-validation or external dataset validation would provide a stronger assessment of generalization. However, the heterogeneous and multi-source nature of the data, combined with the absence of patient identifiers in some datasets, limited the feasibility of k-fold cross-validation without risking data leakage. As a result, results should be interpreted as indicative rather than definitive. Future work will focus on cross-dataset evaluation, leave-one-dataset-out testing, and prospective validation to better assess robustness under domain shift. Table 6 summarizes the hyperparameters of the proposed model.
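The configuration reported above translates into the short training sketch below (Adam with a learning rate of 0.0001, batch size 16, 30 epochs, categorical cross-entropy); the seed value and the in-memory data arrays are assumptions.

```python
# Minimal sketch of the shared training configuration (assumed seed and arrays).
import random
import numpy as np
import tensorflow as tf

SEED = 42  # fixed seeds for every library to keep runs reproducible
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

def compile_and_train(model, x_train, y_train, x_val, y_val):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # 30 epochs, batch size 16, identical across single-modality and fused models.
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=30, batch_size=16)
```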

5.1.2. Evaluation Metrics

Several common evaluation measures were used to evaluate the proposed model’s efficacy and reliability. This section will illustrate the evaluation metrics in detail.
Let $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively. The evaluation metrics are defined below.
  • Accuracy (Acc) “gives the percentage of correctly classified cases, including both positive and negative ones”. The accuracy Equation (2) is as follows:
    $\mathrm{ACC} = \dfrac{TP + TN}{TP + TN + FP + FN}$.
  • Sensitivity (Recall) “gives the proportion of positives that were correctly identified (true positive)”. The sensitivity Equation (3) is as follows:
    $\mathrm{Recall} = \dfrac{TP}{TP + FN}$.
  • Specificity “gives the proportion of negatives that were correctly identified (true negative)”. The specificity Equation (4) is as follows:
    $\mathrm{Specificity} = \dfrac{TN}{TN + FP}$.
  • Precision gives the “percentage of samples predicted to be positive that are actually positive”. The precision Equation (5) is as follows:
    $\mathrm{Precision} = \dfrac{TP}{TP + FP}$.
  • F1-score (F1) gives “the harmonic mean of precision and recall”. The F1-score Equation (6) is as follows:
    $F_1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
  • Area Under the Receiver Operating Characteristic Curve (AUC-ROC) “summarizes the ROC curve of true positives against false positives”. The AUC-ROC Equation (7) is as follows:
    $\mathrm{AUC\text{-}ROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$,
    where $\mathrm{TPR} = \dfrac{TP}{TP + FN}$ and $\mathrm{FPR} = \dfrac{FP}{FP + TN}$.
  • Area Under Precision–Recall Curve (AUC-PR) gives “a summarization of the curve of precision against recall”. The AUC-PR Equation (8) is as follows:
    $\mathrm{AUC\text{-}PR} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall})$.
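The metrics above can be computed with scikit-learn as in the sketch below, assuming y_true holds integer class labels and y_prob holds the model's softmax outputs; macro averaging and the one-vs-rest ROC setting are assumptions about the aggregation scheme.

```python
# Minimal sketch: evaluation metrics for a 4-class problem (assumed macro averaging).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score,
                             confusion_matrix)
from sklearn.preprocessing import label_binarize

def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    y_pred = y_prob.argmax(axis=1)
    classes = np.arange(y_prob.shape[1])
    y_bin = label_binarize(y_true, classes=classes)

    # Per-class specificity = TN / (TN + FP), averaged over classes.
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    specificities = []
    for k in classes:
        tn = cm.sum() - cm[k, :].sum() - cm[:, k].sum() + cm[k, k]
        fp = cm[:, k].sum() - cm[k, k]
        specificities.append(tn / (tn + fp))

    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "specificity": float(np.mean(specificities)),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "roc_auc": roc_auc_score(y_true, y_prob, average="macro", multi_class="ovr"),
        "pr_auc": average_precision_score(y_bin, y_prob, average="macro"),
    }
```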

5.1.3. Single Models as Baselines

To make a fair comparison, we implemented a single model for each type of data (CXR, CT scans, and cough sounds) to diagnose lung diseases (COVID-19, pneumonia, and lung cancer). Therefore, we had three single models that used the same dataset, the same pre-processing steps, and the same feature extraction and classification.

5.2. Results

This section presents and compares the results of the proposed model using several evaluation measures.

5.2.1. Multimodal Approach Using Early Fusion

The early fusion approach’s experimental results show excellent performance on all evaluation metrics, as shown in Table 7, with an overall accuracy of 97%. According to the confusion matrix (see Figure 11), very few samples were misclassified, and the majority of cases were correctly classified. In particular, the model maintained excellent recall (97%) and precision (97.6%), demonstrating its ability to accurately identify actual cases across several disease categories while reducing false positives. The 99.2% specificity further demonstrates the model’s accuracy in identifying negative cases, lowering the possibility of false alarms. The balanced F1-score of 97.5% indicates how well the model manages the trade-off between recall and precision. Furthermore, the model’s strong discriminative ability and consistent performance, even in the face of class imbalance, are highlighted by the ROC-AUC (98.3%) and PR-AUC (95.8%) values. These findings collectively imply that early fusion is a successful method for combining multimodal medical data. The proposed model’s training and validation accuracy and loss curves are depicted in Figure 12, which demonstrates little overfitting and steady convergence. The prediction examples for each of the four classes are shown in Figure 13.

5.2.2. Multimodal Approach Using Intermediate Fusion

The intermediate fusion strategy, which had an overall accuracy of 98.0%, produced better outcomes than the early fusion approach (see Table 8). This fact illustrates how efficiently each modality learned its own representations prior to integration. With only a few minor misclassifications, the confusion matrix, as shown in Figure 14, demonstrates that almost all samples across the four medical categories were accurately categorized. The model demonstrated an outstanding ability to accurately identify positive cases while preserving reliability in detecting real negatives, as seen by its high precision (97.5%) and recall (98%). While the balanced F1-score (97.7%) indicates the model’s robustness in managing class-level performance consistently, the specificity (99%) further demonstrates that the model successfully reduces false positive predictions. Furthermore, the ROC-AUC (99%) and PR-AUC (97%) values demonstrate the model’s robust discriminative ability and stability in situations with unbalanced data. All of these findings point to the superiority of intermediate fusion over early fusion as an integration method, which enhances the multimodal medical classification’s prediction accuracy and reliability. The accuracy and loss curves for training and validation of the suggested model are displayed in Figure 15, which demonstrates steady convergence and minimal overfitting. The four classes’ prediction examples are shown in Figure 16.

5.2.3. Comparison Between the Proposed Models and Single Models

  • Comparing between multimodal approaches using early and intermediate fusion:
    The comparison between early and intermediate fusion demonstrates the obvious advantage of using intermediate integration in multimodal learning (see Table 9). While both approaches performed well, the intermediate fusion model outperformed the early fusion model with an overall accuracy of 98% versus 97%. This improvement was constant across different evaluation metrics, including F1-score, specificity, and AUC-based measurements, demonstrating intermediate fusion’s robustness in processing complicated multimodal inputs. By allowing each modality to develop modality-specific representations before integration, intermediate fusion was able to preserve complementary features and minimize the noise that frequently results from early feature concatenation, thereby contributing to a higher performance. This finding is corroborated by the ROC-AUC and PR-AUC curves, as shown in Figure 17 and Figure 18, which show that the intermediate fusion model has better precision–recall balance and sharper discrimination ability.
  • Comparing between multimodal approaches and single models (baselines):
    Along with the fusion techniques, both multimodal approaches were evaluated against three baseline single-modality models: the chest X-ray (CXR), CT scan, and cough sound models. The findings clearly show that multimodal integration performs significantly better than any single model, attaining higher generalization and predictive accuracy (see Table 10). This improvement demonstrates the power of merging disparate medical data, as complementary information from imaging and acoustic modalities can give an improved understanding of disease patterns. Notably, even though the CXR and CT scan models worked well as standalone models, their performance was enhanced even further when they were combined in a multimodal framework, particularly in difficult situations.

5.3. Discussion

The experimental findings show that the suggested multimodal approach employing intermediate fusion consistently performs better than the single models and early fusion. Three single models were built, one for each type of medical data (CXR, CT scans, and cough sounds), and they were used as baselines against the multimodal approaches to ensure fairness in the comparison. The outcomes show that both multimodal approaches outperformed any single model, underscoring the significance of utilizing a variety of data modalities. However, the biggest benefit came from intermediate fusion, which enables each modality to learn and enhance its own feature representations before integrating them, capturing the complementary and non-redundant information between modalities. On the other hand, it must be noted that early fusion is still better than single models, but it is constrained by the direct concatenation of raw features, which might add noise and miss modality-specific discriminative signals. From a statistical perspective, intermediate fusion ensures a more balanced contribution from all modalities, reducing the possibility of feature dominance and increasing resilience. The results are, thus, compatible with previous research on multimodal learning, where it has been demonstrated that intermediate fusion provides better generalization and predictive performance than early fusion [10]. Overall, these findings clearly show that multimodal learning outperforms single-modality approaches by integrating complementary information from various medical data sources, allowing the model to achieve greater accuracy and robustness than when relying on any single model alone.
The analysis of error behavior and performance limitations is consistent with recent studies on correlated error patterns in complex networked systems. In particular, the framework proposed by [43] has provided valuable insights into how error propagation and correlation can influence performance evaluation, reinforcing the importance of per-class analysis, ROC assessment, and robustness evaluation in multimodal learning systems. The authors acknowledge that performance may decrease under cross-dataset or external validation scenarios due to domain shift across acquisition pipelines. The absence of external clinical validation is a limitation of this study, and the results should be viewed as an upper-bound estimate under curated public datasets. Future work will prioritize cross-dataset evaluation and prospective validation to assess robustness under real-world deployment conditions.

6. Conclusions and Future Work

This study examined the efficacy of multimodal deep learning for medical data classification by combining chest X-ray images, CT scan images, and cough sounds. Two multimodal fusion methods, early fusion and intermediate fusion, were tested against three single-modality baseline models. The experimental findings show that both multimodal approaches performed significantly better than the single models. The best performance was obtained by intermediate fusion, which recorded an overall accuracy of 98%, exceeding the CXR (94%), CT (94%), and cough sound (79%) baselines as well as the 97% achieved by early fusion. All evaluation criteria, including precision, recall, specificity, F1-score, and the AUC measures, consistently demonstrated the superiority of intermediate fusion. These findings emphasize the importance of merging diverse medical datasets to capture complementary information and to improve classification robustness. Accordingly, multimodal learning offers an effective framework for improving automated medical diagnosis.
Despite the promising results, several limitations should be acknowledged. The heterogeneous nature of the datasets introduces variability in acquisition protocols and signal quality, which may affect generalization to unseen clinical environments. In addition, cough sound data are inherently noisy and subject to environmental interference, potentially limiting their standalone diagnostic reliability. While multimodal fusion mitigates some modality-specific weaknesses, it does not eliminate noise-related uncertainty. Future research will focus on explainable artificial intelligence (XAI) techniques that make the model's decisions interpretable and transparent, improving clinical trust by highlighting influential features, modality contributions, and the decision rationale. Incorporating uncertainty estimation and clinician-in-the-loop evaluation will further enhance practical applicability and contribute to strengthening the model.

Author Contributions

Conceptualization, A.A.; methodology, A.A.; software, A.A.; validation, A.A. and Y.B.A.; formal analysis, A.A. and Y.B.A.; investigation, A.A. and Y.B.A.; resources, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A. and Y.B.A.; visualization, A.A.; supervision, Y.B.A.; project administration, Y.B.A.; funding acquisition, Y.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under grant no. (IPP: 662-611-2025). The authors, therefore, acknowledge, with thanks, the DSR for their technical and financial support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets referenced or analyzed in this article are publicly accessible through the sources cited in the References Section.

Acknowledgments

ChatGPT-5.2 was used solely to paraphrase the selected text during the preparation of this manuscript. The authors reviewed and verified all generated content.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef]
  2. World Health Organization (WHO). Available online: https://www.who.int (accessed on 25 January 2025).
  3. Dhivya, N.; Sharmila, P. Multimodal Feature and Transfer Learning in Deep Ensemble Model for Lung Disease Prediction. J. Data Acquis. Process 2023, 38, 271. [Google Scholar]
  4. Kumar, S.; Ivanova, O.; Melyokhin, A.; Tiwari, P. Deep-learning-enabled multimodal data fusion for lung disease classification. Inform. Med. Unlocked 2023, 42, 101367. [Google Scholar] [CrossRef]
  5. Behrad, F.; Abadeh, M.S. An overview of deep learning methods for multimodal medical data mining. Expert Syst. Appl. 2022, 200, 117006. [Google Scholar] [CrossRef]
  6. Yao, D.; Xu, Z.; Lin, Y.; Zhan, Y. Accurate and intelligent diagnosis of pediatric pneumonia using X-ray images and blood testing data. Front. Bioeng. Biotechnol. 2023, 11, 1058888. [Google Scholar] [CrossRef] [PubMed]
  7. Hayat, N.; Geras, K.J.; Shamout, F.E. MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images. In Proceedings of the Machine Learning for Healthcare Conference, Durham, NC, USA, 5–6 August 2022; pp. 479–503. [Google Scholar]
  8. Tolle, L.B. Challenges in the diagnosis and management of patients with fibrosing interstitial lung disease. Case Rep. Pulmonol. 2022, 2022, 9942432. [Google Scholar] [CrossRef]
  9. Khader, F.; Müller-Franzes, G.; Wang, T.; Han, T.; Tayebi Arasteh, S.; Haarburger, C.; Stegmaier, J.; Bressem, K.; Kuhl, C.; Nebelung, S.; et al. Multimodal deep learning for integrating chest radiographs and clinical parameters: A case for transformers. Radiology 2023, 309, e230806. [Google Scholar] [CrossRef]
  10. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
  11. Xu, X.; Li, J.; Zhu, Z.; Zhao, L.; Wang, H.; Song, C.; Chen, Y.; Zhao, Q.; Yang, J.; Pei, Y. A Comprehensive Review on Synergy of Multi-Modal Data and AI Technologies in Medical Diagnosis. Bioengineering 2024, 11, 219. [Google Scholar] [CrossRef] [PubMed]
  12. Kooraki, S.; Hosseiny, M.; Myers, L.; Gholamrezanezhad, A. Coronavirus (COVID-19) outbreak: What the department of radiology should know. J. Am. Coll. Radiol. 2020, 17, 447–451. [Google Scholar] [CrossRef]
  13. Huang, C.; Wang, Y.; Li, X.; Ren, L.; Zhao, J.; Hu, Y.; Zhang, L.; Fan, G.; Xu, J.; Gu, X.; et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395, 497–506. [Google Scholar] [CrossRef]
  14. Rousan, L.A.; Elobeid, E.; Karrar, M.; Khader, Y. Chest x-ray findings and temporal lung changes in patients with COVID-19 pneumonia. BMC Pulm. Med. 2020, 20, 245. [Google Scholar] [CrossRef] [PubMed]
  15. Mackenzie, G. The definition and classification of pneumonia. Pneumonia 2016, 8, 14. [Google Scholar] [CrossRef]
  16. Melbye, H.; Straume, B.; Aasebøs, U.; Dale, K. Diagnosis of pneumonia in adults in general practice relative importance of typical symptoms and abnormal chest signs evaluated against a radiographic reference standard. Scand. J. Prim. Health Care 1992, 10, 226–233. [Google Scholar] [CrossRef] [PubMed]
  17. Bradley, S.H.; Abraham, S.; Callister, M.E.; Grice, A.; Hamilton, W.T.; Lopez, R.R.; Shinkins, B.; Neal, R.D. Sensitivity of chest X-ray for detecting lung cancer in people presenting with symptoms: A systematic review. Br. J. Gen. Pract. 2019, 69, e827–e835. [Google Scholar] [CrossRef]
  18. Bradley, S.H.; Bhartia, B.S.; Callister, M.E.; Hamilton, W.T.; Hatton, N.L.F.; Kennedy, M.P.; Mounce, L.T.; Shinkins, B.; Wheatstone, P.; Neal, R.D. Chest X-ray sensitivity and lung cancer outcomes: A retrospective observational study. Br. J. Gen. Pract. 2021, 71, e862–e868. [Google Scholar] [CrossRef] [PubMed]
  19. Gao, J.; Li, P.; Chen, Z.; Zhang, J. A survey on deep learning for multimodal data fusion. Neural Comput. 2020, 32, 829–864. [Google Scholar] [CrossRef]
  20. Alloqmani, A.; Abushark, Y.B.; Khan, A.I.; Alsolami, F. Deep learning based anomaly detection in images: Insights, challenges and recommendations. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 205–215. [Google Scholar] [CrossRef]
  21. Ashwini, S.; Arunkumar, J.; Prabu, R.T.; Singh, N.H.; Singh, N.P. Diagnosis and multi-classification of lung diseases in CXR images using optimized deep convolutional neural network. Soft Comput. 2024, 28, 6219–6233. [Google Scholar] [CrossRef]
  22. Ullah, N.; Marzougui, M.; Ahmad, I.; Chelloug, S.A. DeepLungNet: An effective DL-based approach for lung disease classification using CRIs. Electronics 2023, 12, 1860. [Google Scholar] [CrossRef]
  23. Mahmoudi, R.; Benameur, N.; Mabrouk, R.; Mohammed, M.A.; Garcia-Zapirain, B.; Bedoui, M.H. A deep learning-based diagnosis system for COVID-19 detection and pneumonia screening using CT imaging. Appl. Sci. 2022, 12, 4825. [Google Scholar] [CrossRef]
  24. Nafiz, M.F.; Kartini, D.; Faisal, M.R.; Indriani, F.; Hamonangan, T. Automated Detection of COVID-19 Cough Sound using Mel-Spectrogram Images and Convolutional Neural Network. J. Ilm. Tek. Elektro Komput. Dan Inform. (JITEKI) 2023, 9, 535–548. [Google Scholar] [CrossRef]
  25. Islam, R.; Abdel-Raheem, E.; Tarique, M. A study of using cough sounds and deep neural networks for the early detection of COVID-19. Biomed. Eng. Adv. 2022, 3, 100025. [Google Scholar] [CrossRef]
  26. Kumar, S.; Sharma, S. An improved deep learning framework for multimodal medical data analysis. Big Data Cogn. Comput. 2024, 8, 125. [Google Scholar] [CrossRef]
  27. Nalluri, S.; Sasikala, R. Detection and Difference of Pneumonia from other Chest/Lung Disease using Multi-model Data: A Hybrid Classification Model. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 328–344. [Google Scholar]
  28. Chowdhury, M.E.; Rahman, T.; Khandakar, A.; Mazhar, R.; Kadir, M.A.; Mahbub, Z.B.; Islam, K.R.; Khan, M.S.; Iqbal, A.; Al Emadi, N.; et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access 2020, 8, 132665–132676. [Google Scholar] [CrossRef]
  29. Malik, H.; Anees, T. Chest Diseases Using Different Medical Imaging and Cough Sounds; Version 1; Mendeley Data: Multan, Pakistan, 2023. [Google Scholar] [CrossRef]
  30. Soares, E.; Angelov, P. SARS-COV-2 Ct-Scan Dataset; Kaggle: San Francisco, CA, USA, 2020. [Google Scholar] [CrossRef]
  31. Maftouni, M.; Law, A.C.C.; Shen, B.; Grado, Z.J.K.; Zhou, Y.; Yazdi, N.A. A robust ensemble-deep learning model for COVID-19 diagnosis based on an integrated CT scan images database. In Proceedings of the IIE Annual Conference, Online, 22–25 May 2021; Proceedings. Institute of Industrial and Systems Engineers (IISE): Peachtree Corners, GA, USA, 2021; pp. 632–637. [Google Scholar]
  32. Yan, J. COVID-19 and Common Pneumonia Chest CT Dataset; Version 1; Mendeley Data: Multan, Pakistan, 2020. [Google Scholar] [CrossRef]
  33. Maleki, N. CT-Scan Images; Version 1; Mendeley Data: Multan, Pakistan, 2020. [Google Scholar] [CrossRef]
  34. Alyasriy, H.; AL-Huseiny, M. The IQ-OTHNCCD Lung Cancer Dataset; Version 2; Mendeley Data: Multan, Pakistan, 2021. [Google Scholar] [CrossRef]
  35. Pahar, M.; Klopper, M.; Warren, R.; Niesler, T. COVID-19 cough classification using machine learning and global smartphone recordings. Comput. Biol. Med. 2021, 135, 104572. [Google Scholar] [CrossRef]
  36. Liao, S.; Song, C.; Wang, X.; Wang, Y. A classification framework for identifying bronchitis and pneumonia in children based on a small-scale cough sounds dataset. PLoS ONE 2022, 17, e0275479. [Google Scholar] [CrossRef]
  37. Thornton, B. Audio Recognition Using Mel Spectrograms and Convolution Neural Networks; Noiselab University of California: San Diego, CA, USA, 2019. [Google Scholar]
  38. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  39. Li, S.; Qu, H.; Dong, X.; Dang, B.; Zang, H.; Gong, Y. Leveraging deep learning and xception architecture for high-accuracy mri classification in alzheimer diagnosis. arXiv 2024, arXiv:2403.16212. [Google Scholar] [CrossRef]
  40. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 10 April 2025).
  41. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
  42. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  43. Zhao, H.; Yan, L.; Hou, Z.; Lin, J.; Zhao, Y.; Ji, Z.; Wang, Y. Error Analysis Strategy for Long-term Correlated Network Systems: Generalized Nonlinear Stochastic Processes and Dual-Layer Filtering Architecture. IEEE Internet Things J. 2025, 12, 33731–33745. [Google Scholar] [CrossRef]
Figure 1. Multi-modality data [11].
Figure 2. Overview of fusion techniques [4].
Figure 3. Methodology for selecting literature.
Figure 4. Each publisher’s percentage out of the 9 articles.
Figure 5. Each year’s percentage out of the 9 articles.
Figure 6. Main stages of the proposed multimodal approach.
Figure 7. Our dataset CXR and CT scan image samples of some lung diseases.
Figure 8. Our dataset sound image samples of some lung diseases that used the mel-spectrogram technique.
Figure 9. Xception architecture.
Figure 10. Dataset distribution by class.
Figure 11. Confusion matrix of the multimodal approach using early fusion.
Figure 12. The model accuracy and loss curves of the multimodal approach using early fusion.
Figure 13. Prediction examples of the multimodal approach using early fusion.
Figure 14. Confusion matrix of the multimodal approach using intermediate fusion.
Figure 15. The model accuracy and loss curves of the multimodal approach using intermediate fusion.
Figure 16. Prediction examples of the multimodal approach using intermediate fusion.
Figure 17. The AUC-ROC and AUC-PR values of the multimodal approach using early fusion.
Figure 18. The AUC-ROC and AUC-PR values of the multimodal approach using intermediate fusion.
Table 1. Summary of recent related works regarding the single-modal approach to lung diseases diagnosis.
Medical Data | [Ref.] (Year) | Methodology | Lung Disease Types | Dataset (#: Sample Size) | Results (%: Performance of the Model Used) | Limitation
CXR Image | [21] (2024) | Convolutional neural networks (CNNs) | COVID-19, Pneumonia (PNEU), Tuberculosis (TB), Lung Cancer (LC), Lung Opacity | Many online resources: (98,991) | 98.75% | The study’s generalizability to real-world medical data may be poor.
CXR Image | [22] (2023) | Deep learning framework with 20 layers and CNN | Tuberculosis (TB), Pneumonia (PNEU), COVID-19, Lung Opacity (LO) | Multiple publicly available datasets: (26,145) | 97.47% | Class imbalance.
CT Scan Image | [23] (2022) | A three-layered CNN architecture | Pneumonia and COVID-19 | COVID-19 CT Lung and Infection Segmentation Dataset: (3138) | 98% | Small dataset size.
Cough Sounds | [24] (2023) | CNNs: VGG-16, VGG-19, LeNet-5, AlexNet, ResNet-50, and ResNet-152 | COVID-19 | Virufy dataset: (121) | VGG-16 = 70.3%; VGG-19 = 73%; LeNet-5 = 83.8%; AlexNet = 86.5%; ResNet-50 = 78.4%; ResNet-152 = 64.9% | Small dataset size.
Cough Sounds | [25] (2022) | Deep neural networks (DNN) | COVID-19 | Virufy dataset: (121) | 89.2% | Small dataset size.
Table 2. Summary of recent related works regarding the multimodal approach to lung diseases diagnosis.
[Ref.] (Year) | Medical Data | Fusion Strategy | Input Data Type | Lung Disease Types | Methodology | Dataset (#: Sample Size) | Pre-Processing Steps | Results (%: Performance of the Model Used) | Limitation
[26] (2024) | CXR, Text | NA | Image & Text | Tuberculosis | Autoencoder, CNN, and Cross-Modal Transformer | Dataset obtained from Government Medical College in Uttarakhand, India: CXR: (3558), Text: (3558) | 1. Removing noise. 2. Image resizing. 3. Data augmentation. | Accuracy = 95% | Lower availability of medical records.
[27] (2024) | CXR, Text | Intermediate fusion | Image & Text | Pneumonia | CNN and Deep Maxout | Dataset collected manually: CXR: (2258), Text: NA | 1. Normalization. 2. Image enhancement. | Accuracy = 95.28% | 1. Used a small dataset. 2. Did not implement single-modal baselines for CXR and text to compare with the proposed multimodal model.
[4] (2023) | CXR, Text | Intermediate & Late fusion | Image & Text | Presence of lung diseases in general or not | DenseNet121, DenseNet169, and ResNet50 (CNN); long short-term memory (LSTM) and attention | MIMIC-IV dataset: CXR: (1156), Text: (1156) | 1. Image resizing. 2. Data augmentation. 3. Normalization. | Late fusion accuracy = 89.99%; intermediate fusion accuracy = 93.15% | Used a small dataset.
[6] (2023) | CXR, Text | Early fusion | Image & Text | Pneumonia | Se-ResNet50 and spatial attention modules (SAM) | Hainan Women’s and Children’s Medical Center dataset: CXR: (3100), Text: (799); Guangzhou Women and Children Medical Center dataset (GZCMC): CXR: (5856), Text: NA | 1. Cropped the lung area. 2. Normalization. | Accuracy = 77.81% | Imbalanced pneumonia data between bacterial and viral pneumonia.
Table 3. Summary of the dataset sources of CXR, CT scans, and Cough sound of lung diseases.
Modality | COVID-19 | Pneumonia | Lung Cancer | Healthy | Total
CXR | 2 | 2 | 1 | 2 | 7
CT scan | 2 | 1 | 2 | 2 | 7
Cough sound | 2 | 2 | 1 | 2 | 7
Total | 6 | 5 | 4 | 6 | 21
Table 4. Summary of the dataset images of CXR, CT scans, and Cough sound of lung diseases.
Modality | COVID-19 | Pneumonia | Lung Cancer | Healthy
CXR | 3633 | 1360 | 470 | 10,200
CT scan | 8845 | 945 | 1506 | 8122
Cough sound | 948 | 30 | 50 | 1416
Table 5. Summary of the dataset images of CXR, CT scans, and Cough sound of lung diseases after handling imbalanced class.
Modality | COVID-19 | Pneumonia | Lung Cancer | Healthy | Total
CXR | 500 | 500 | 500 | 500 | 2000
CT scan | 500 | 500 | 500 | 500 | 2000
Cough sound | 500 | 500 | 500 | 500 | 2000
Total | 1500 | 1500 | 1500 | 1500 | 6000
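Table 5 reflects the class-balancing step applied during preprocessing, with each class resampled to 500 items per modality. The snippet below is a minimal sketch of such a step; the paper does not specify the exact balancing routine here, so simple random over-/under-sampling of the per-class sample lists is assumed, and the function name is illustrative.

```python
# Illustrative class-balancing sketch: resample each class's sample list to a
# fixed size (500 per class, as in Table 5). Random over-/under-sampling is an
# assumption, not the study's documented procedure.
import random
from collections import defaultdict

def balance_classes(samples, labels, per_class=500, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)

    balanced_samples, balanced_labels = [], []
    for y, items in by_class.items():
        if len(items) >= per_class:
            chosen = rng.sample(items, per_class)                    # under-sample
        else:
            chosen = [rng.choice(items) for _ in range(per_class)]   # over-sample
        balanced_samples.extend(chosen)
        balanced_labels.extend([y] * per_class)
    return balanced_samples, balanced_labels
```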
Table 6. Hyperparameter values used to fine-tune the proposed models.
Parameter | Value | Description
num_patches | 196 | Number of image patches (H × W / patch_size^2)
patch_size | 16 | Size of each square patch
num_channels | 3 | Number of input channels (RGB)
learning rate | 0.0001 | Learning rate
Epochs | 30 | Number of epochs
dropout_rate | 0.1 | Dropout probability
num_heads | 12 | Number of attention heads
num_classes | 4 | Number of output classes
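The hyperparameters in Table 6 can be collected into a single configuration object, as sketched below. The values mirror the table; the dataclass and field names are illustrative, and image_size is inferred from num_patches and patch_size (14 × 14 patches of size 16 imply 224 × 224 inputs) rather than stated in the table.

```python
# Configuration sketch mirroring the hyperparameter values in Table 6 (illustrative names).
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    image_size: int = 224        # inferred: 196 patches of size 16 -> 14 x 14 grid -> 224 px
    patch_size: int = 16         # size of each square patch
    num_patches: int = 196       # (image_size // patch_size) ** 2
    num_channels: int = 3        # RGB input
    learning_rate: float = 1e-4
    epochs: int = 30
    dropout_rate: float = 0.1
    num_heads: int = 12          # attention heads
    num_classes: int = 4         # COVID-19, pneumonia, lung cancer, healthy

config = TrainingConfig()
assert config.num_patches == (config.image_size // config.patch_size) ** 2
```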
Table 7. Results of multimodal using early fusion.
Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR
Multimodal using early fusion | 97% | 97.5% | 97.6% | 97% | 99.2% | 98.3% | 95.8%
Table 8. Results of multimodal using intermediate fusion.
Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR
Multimodal using intermediate fusion | 98% | 97.7% | 97.5% | 98% | 99% | 99% | 97%
Table 9. Comparison of multimodal using early fusion and multimodal using intermediate fusion.
Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR
Multimodal using early fusion | 97% | 97.5% | 97.6% | 97% | 99.2% | 98.3% | 95.8%
Multimodal using intermediate fusion | 98% | 97.7% | 97.5% | 98% | 99% | 99% | 97%
Table 10. Comparison of the multimodal approaches and single models (baselines).
Model | Accuracy | F1-Score | Precision | Recall | Specificity | AUC-ROC | AUC-PR
Single model for CXR | 94% | 94% | 95% | 95% | 98% | 96% | 91%
Single model for CT scan | 94% | 94% | 94% | 94% | 98% | 96% | 90%
Single model for Cough sound | 79% | 78% | 78% | 79% | 93% | 86% | 69%
Multimodal using early fusion | 97% | 97.5% | 97.6% | 97% | 99.2% | 98.3% | 95.8%
Multimodal using intermediate fusion | 98% | 97.7% | 97.5% | 98% | 99% | 99% | 97%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
