Multitask Deep Learning-Based Pipeline for Gas Leakage Detection via E-Nose and Thermal Imaging Multimodal Fusion

Abstract: Innovative engineering solutions that are efficient, quick, and simple to use are crucial given the rapid industrialization and technology breakthroughs in Industry 5.0. One of the areas receiving attention is the rise in gas leakage accidents at coal mines, chemical companies, and home appliances. To prevent harm to both the environment and human lives, rapid and automated detection and identification of the gas type is necessary. Most previous studies used a single mode of data to perform the detection process. However, multimodal sensor fusion offers more accurate results than a single source/mode. Furthermore, the majority used individual feature extraction approaches that extract either spatial or temporal information. This paper proposes a deep learning-based (DL) pipeline to combine multimodal data acquired via infrared (IR) thermal imaging and an array of seven metal oxide semiconductor (MOX) sensors forming an electronic nose (E-nose). The proposed pipeline is based on three convolutional neural network (CNN) models for feature extraction and a bidirectional long short-term memory (Bi-LSTM) network for gas detection. Two multimodal data fusion approaches are used: intermediate and multitask fusion. Discrete wavelet transform (DWT) is utilized in the intermediate fusion to combine the spatial features extracted from each CNN, providing a spectral–temporal representation. In contrast, in multitask fusion, the discrete cosine transform (DCT) is used to merge all of the features obtained from the three CNNs trained with the multimodal data. The results show that the proposed fusion approach boosted the gas detection performance, reaching an accuracy of 98.47% and 99.25% for intermediate and multitask fusion, respectively. These results indicate that multitask fusion is superior to intermediate fusion. Therefore, the proposed system is capable of detecting gas leakage accurately and could be used in industrial applications.


Introduction
Technological breakthroughs are assisting humanity in resolving economic and social issues. Technological advances are solving several issues in the manufacturing sector, but there are still risks that could harm the surrounding environment. A frequent concern in several industries is the leakage of gases. The impacts of industrial disasters brought on by gas leakage include explosions, burns, accidents, and waste discharges [1]. A gas leak is an unintended break, crack, or porous area in a joint or piece of equipment that is meant to contain or exclude fluids and gases, allowing the enclosed medium to escape. Other sources of such leakages in daily life include careless waste disposal and residential cooking, which produce unneeded emissions. Certain dangerous gases, such as flammable and toxic gases, may cause incidents if handled carelessly. Industrial accidents have the potential to seriously harm the local population, the environment, and the connected, intelligent ecosystem [2]. Fire and smoke leaks necessitate the rapid evacuation of people with mobility disabilities, since smoke emitted during leakages impairs visibility. If these harmful vapours are not carefully controlled, breathing them in can cause fainting, unconsciousness, and perhaps a major catastrophe. Since gas leaks are dangerous, human intervention is not an option. Instead, machines must act quickly, accurately, and robustly to assist humans. Therefore, identifying gas leakages within a short period is of the highest importance.
When an instrument is installed in any plant or industrial setting, a quality control procedure called a gas leak test should be completed. The manual examination of gas pipelines and vessels is a typical step in the gas leakage detection process. These strategies cost a lot of money, time, and effort, yet are ineffective. As pipeline length and plant structure complexity rise, these manual procedures lose effectiveness [3]. Additionally, some of the earlier methods for detecting gases, such as colorimetric tape and gas chromatography, were limited in that they needed costly supplies and skilled workers [4].
On the other hand, thanks to the development of electronic sensors, metal oxide semiconductor (MOX) sensors have recently been utilized for detecting gases, arranged in an array of various sensors called the "Electronic Nose" (E-nose). An E-nose typically has three components: a gas sensor array, a signal processing block, and a pattern recognition system. The gas sensor array detects gas and converts it to an electrical signal. The E-nose improved detection precision and overcame the constraints of human intervention. Due to the rapid advancement of artificial intelligence (AI) approaches, they have been employed in a number of industries, including healthcare and medicine [5][6][7][8], education [9][10][11], finance [12], navigation [13], renewable energy [14], agriculture [15], and others [16][17][18]. Motivated by the promising results achieved in these fields, various AI methods have been adopted to detect gases using the E-nose. An effective E-nose depends on the right choice of feature extraction and AI methodology, including machine learning and deep learning (DL) methods. Furthermore, thermal imaging has also recently been used as a means of gas detection. Compared with normal conditions, the temperature of the immediate area rises when a gas leak occurs. Thermal imaging cameras can identify and evaluate this temperature increase. Utilizing this principle, gas leaks can be detected [19].
Numerous studies have investigated the use of machine/DL methods along with E-noses alone for gas detection [3,20–24]. However, detection systems based only on gas sensors have some limitations. They are unable to distinguish between gases when the gas concentration in the air is low, which can lead to false positives or false negatives. Due to their reduced sensitivity, some common sensors are unable to detect some gases, which affects the measurement's overall accuracy and resilience. Furthermore, in a mixed gas environment, the sensors are unable to detect gas accurately. Additionally, they are sensitive to, and constrained by, their operating parameters [25]. Less work has employed machine/DL methods and thermal infrared cameras alone for detecting and identifying gases [26][27][28]. Nevertheless, using thermal imaging for gas detection has drawbacks, such as decreased precision and accuracy. Thermal imaging may produce more false positives because it relies on temperature detection. Noise and distorted images can potentially weaken the robustness of vision-based systems [29]. Higher resolution thermal cameras are expensive, and such systems are often not practical from an economic standpoint. Single-modality sensing techniques may fall short of the accuracy and resilience a system needs since they are only capable of detecting certain sensor features. The temporal and spatial properties of a single sensor are one of its limitations [20]. While a system based on thermal imaging can detect the existence of gas, it cannot distinguish the type of gas. Consequently, the idea of multimodal/multisensor data fusion was developed. Data fusion integrates information from numerous sources to achieve better results than any single modality utilized alone. However, very few studies [25,30] have considered combining both E-nose and thermal cameras for detecting and identifying gases.
The detection process in the earlier methods that utilized thermal infrared cameras or gas detectors was handled by a single DL model. However, using multiple DL structures and integrating the features that these models generate may boost detection accuracy [31]. These models also used only the spatial features that could be extracted from the input thermal images or the temporal information of gas measurements. However, combining spatial, spectral, and temporal features can enhance detection performance [32]. Additionally, reducing the dimension of these features can enhance recognition accuracy even more. Furthermore, studies that employed multimodal data from both E-nose and thermal cameras used a single convolutional neural network (CNN) structure for thermal images. Nevertheless, employing CNNs of different architectures merges the benefits of all these architectures, leading to an enhancement in detection accuracy. Moreover, such studies fed the gas measurements directly to the long short-term memory (LSTM) DL model, making use of only temporal information. Nonetheless, converting these numerical measurements to heatmap images and feeding them to a CNN may exploit both spatial and temporal information, thus improving detection performance. In addition, previous multimodal methods relied only on features from a single domain; however, obtaining features from multiple domains could improve recognition performance.
In this paper, an affordable option based on the fusion of multiple DL models is proposed for gas leakage detection and identification. The proposed model combines the data from the gas sensors of an E-nose, after converting it to heatmap images, with the images taken by low-cost thermal cameras using AI-based multimodal fusion. The DL-based pipeline for gas leakage detection and identification consists of multiple DL models of different architectures, unlike previous studies, which commonly rely on a single DL model. The proposed model integrates features from several domains rather than one domain, unlike existing models for gas leakage detection. Several CNN models are utilized to extract spatial features from multimodal data. Furthermore, the LSTM DL model is used to obtain temporal information from multimodal data and perform the detection and identification tasks. The presented model investigates the impact of several fusion methodologies based on transformation approaches that represent data in multiple domains, which is not the case in current models for gas leakage detection. For gas leakage detection and identification, two fusion methodologies for CNN and LSTM are introduced and studied. The first is intermediate fusion, while the other is multitask fusion. In the intermediate fusion procedure, discrete wavelet transform (DWT) is used to merge features extracted from each CNN trained with each data modality in order to decrease the dimension of the features and obtain spectral-temporal information, instead of relying only on a spatial, spectral, or temporal representation as in previous studies. In the multitask fusion, the discrete cosine transform (DCT) is used to combine features extracted from all CNNs trained on multimodal input, reduce the dimension of the features, and obtain spectral information.

Previous Works
The classical methods for gas detection using the E-nose relied on traditional data analysis techniques such as principal component analysis (PCA) [33], multiple discriminant analysis [34], cluster analysis [35], computational fluid dynamics (CFD) [36], cyclic voltammetry (CV) curves [37], and a least squares development algorithm [38]. Furthermore, machine learning techniques (and early AI methods) have been used with electronic noses for over 30 years along with pattern recognition methods. A lot of research has been published on using such techniques to identify gases and detect gas leaks [39][40][41][42][43][44]. For example, using a variety of multivariate analysis approaches and PCA, the authors of [45] developed an E-nose relying on several sensors that could detect and identify three explosives. Furthermore, using both simulated and actual data, an artificial neural network (ANN) was utilized along with an E-nose to detect gas leaks at a testing location [46]. However, as is sometimes the case with machine learning, this model was heavily reliant on sensor data, and interference from unforeseen winds caused the model to significantly overestimate the leak rates. In a different study that examined the use of ANN and E-nose in pipe gas leak detection, leakages and their locations were detected [47]. However, the detection accuracy of the system depended on the pressurized flow and was largely reliant on the network setup. In another study [20], a hybrid approach was proposed based on the fusion of feature selection approaches and multiple classifiers to identify gases and their concentration levels, achieving accuracies of 99.73% and 97.54% for gas type and concentration level recognition, respectively. An E-nose based on six MOX sensors was created by Zhang et al. [48] to detect flammable and hazardous gases. The authors extracted time information as well as frequency features, which were fed into a support vector machine (SVM) classifier. An E-nose based on three MOX sensors was suggested by Manjula et al. [23] to recognize gases that are present in the air. The authors used time signals as features to feed five machine learning classifiers, where the Random Forest (RF) classifier achieved the highest accuracy of 97.7%. Similarly, in order to identify gases in the atmosphere, Ragila et al. [49] used six MOX sensors. The accuracy achieved by the ANN, which used time signals as input, was 93.33%. Despite the high accuracy achieved in the previous studies, the authors did not consider the interference of a mixture of gases. They also depend on conventional machine learning approaches. However, DL approaches are often superior as they do not require extra preprocessing techniques.
In [50], the concentration of gases is determined using an array of eight different gas sensors, and convolutional neural networks (CNNs) are used to perform gas classification. In [51], a framework was proposed for gas leakage detection from pipelines. In this study, the authors used several DL models, including CNN, LSTM, and autoencoders, to perform the detection process, attaining an accuracy of 92%. Similarly, to detect gas leakage, Pan et al. [52] adopted a DL method with a hybrid framework made of CNN and LSTM. Likewise, [3] employed CNN and LSTM to obtain spatial-temporal information to detect gas leakage using limited simulated data. It has been demonstrated that DL systems can classify data more accurately by learning features from gas sensor values. A fast gas identification technique based on a hybrid of a deep belief network and stacked autoencoders was presented in [53]; the features learned by these networks are then used to build Softmax classifiers. All of the previous studies relied directly on data from gas sensors using sequential procedures. However, as mentioned earlier, there are some problems with relying solely on a detection and identification technique based on gas sensors.
Thermal imaging has also been used as a means of gas leakage detection [26,54]. However, few studies have employed it for this purpose. Among them, in [27], machine learning models were applied to infrared (IR) thermal images for gas detection. Tensor-based leakage detection (TBLD) is proposed in [28] for identifying gas leaks in rural areas using thermal surveillance cameras. Various classification methods are investigated in the leakage classification stage. A residual network containing 50 layers (ResNet50) was applied to accurately identify gas leakage. Likewise, an IR thermal imaging system was created in [55] for the monitoring and detection of flammable gas leakages. Several machine learning algorithms for image processing and gas leakage detection were utilized.
Despite the promising results achieved using the earlier methods, using solely gas sensors or thermal imaging has limits that make it less accurate and precise, as explained in the previous section. Gas sensors and thermal images together, however, provide more details about the gas being studied [56] and increase accuracy through fusion [57]. Nevertheless, a limited number of studies have considered the multimodal fusion of thermal images and the multiple gas sensors of the E-nose. According to Narkhede et al. [25], the accuracy attained using the multimodal fusion of thermal images and gas sensor data was 96%, as opposed to the accuracy of the separate modalities of gas sensor data and thermal images, which were 93% and 82%, respectively. Likewise, the study [30] employed multimodal fusion of thermal imaging and sensor data for detecting gas leakages. The authors of that study compared multitask and intermediate fusion methods. The results indicated that multitask fusion is more reliable and accurate than intermediate fusion. Because the fusion model incorporates data from both modalities, its accuracy is superior to that of separate models. Additionally, compared to the individual models, false positives and false negatives are greatly reduced. Thus, in this study, a DL-based multimodal fusion pipeline is introduced. The proposed pipeline employs two fusion methods, intermediate fusion and multitask fusion, to detect gas leakage and identify different gases. In contrast to previous multimodal data fusion methods for gas detection, the proposed pipeline uses three CNNs with different architectures to benefit from the advantages of each model. Furthermore, instead of depending only on spatial or temporal information to construct the classification model, it combines multimodal features extracted from each CNN trained with each data modality using DWT to obtain a spectral-temporal representation as well as a spatial representation. DWT is also used to reduce the dimension of features after intermediate fusion. In addition, it utilizes the bidirectional LSTM (Bi-LSTM) DL model to perform the classification process, which usually performs better than the classical LSTM model [58]. Finally, it employs DCT to further reduce the dimension of the features used to build the classification model after multitask fusion, which consequently lowers the training complexity and time.

Fusion Methods of Multimodal Data
According to the multimodal machine learning paradigm, multimodal fusion techniques can be model-based or model-agnostic [59]. For multimodal applications, model-agnostic fusion approaches are more prevalent; in this kind of fusion, two or more modalities are combined to accomplish the target task. Thus, model-agnostic fusion for gas detection using data from thermal imaging and sensors is considered here. The three types of model-agnostic fusion are early, late, and intermediate fusion [60]. These techniques have been widely used with machine learning and DL models [61][62][63]. In early fusion, raw data or information from the early stages of data processing are concatenated, as seen in Figure 1a. Early fusion aids in capturing and processing interactions between modalities at the data level [31]. However, it is frequently not viable to combine diverse data, such as 2-D images, with 1-D tabulated or time-series data.
On the other hand, in late fusion, as depicted in Figure 1b, predictions from individual modalities are combined using statistical techniques such as the mode, mean, median, or majority voting. Because it combines decisions, it is sometimes referred to as decision-level fusion. This method is favored when there is a temporal link between the modalities. Late fusion can combine any form of data, but it only combines the model outputs; it does not mix data or features. As seen in Figure 1c, intermediate fusion fuses features obtained from different modalities and distinct feature extraction methods. The intermediate fusion type combines data representations at more abstract levels, enabling the fusion of data from diverse sources [64]. Combining features from several models usually improves performance [65]. Multitask-like fusion is a model-based fusion that adheres to the multitask learning concept, where models are simultaneously trained on a variety of tasks, as seen in Figure 1d. Because it utilizes a shared representation across several tasks, multitask learning offers improved efficiency and accuracy. In multimodal multitask models, representations are shared not only among tasks but also between modalities, improving generalization. In multitask fusion, several classifiers are used, including classifiers that categorize fused data from gas sensors as well as data from thermal cameras. Two classifiers can be thought of as two different tasks, even though they are both performing the same task (identifying gas), which explains why this approach is referred to as multitask fusion [66].
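The difference between decision-level (late) fusion and feature-level (intermediate) fusion can be illustrated with a minimal sketch. It assumes majority voting for late fusion and plain concatenation for intermediate fusion; the function names, feature values, and labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def late_fusion_majority(predictions):
    """Decision-level fusion: return the most common label among the
    per-modality predictions. Ties resolve to the label seen first."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


def intermediate_fusion_concat(feature_vectors):
    """Feature-level fusion: concatenate per-modality feature vectors
    into one vector that a single classifier can consume."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


# Hypothetical features from two modalities (values are illustrative only).
thermal_features = [0.12, 0.80, 0.33]
enose_features = [0.45, 0.07]
fused = intermediate_fusion_concat([thermal_features, enose_features])

# Hypothetical class decisions from three independent models.
label = late_fusion_majority(["Smoke", "Smoke", "Perfume"])
```

Late fusion only ever sees the labels, so it cannot exploit correlations between the modalities' features, whereas the concatenated vector exposes both feature sets to one downstream model.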

Deep Learning Models
The convolutional neural network (CNN) is a class of DL techniques that is frequently utilized to solve image classification problems and has recently been used in gas detection applications [50]. The structure of the CNN is based on perceptron models. In contrast to the traditional ANN, these networks automatically extract information from the image, and as a result, they have lately gained attention as a hot research area. The primary benefit of CNNs is that they can perform classification directly from images without the extra steps used in conventional machine learning techniques (such as preprocessing, segmentation, and feature extraction) [67]. Convolutional, pooling, and fully connected (FC) layers are the three primary layers of a CNN. In the earlier layers, parts of the image are convolved with a small filter. The spatial information of the original input image is then used to create a feature map. These feature maps have a high dimension; consequently, the main goal of the pooling layers is to reduce this dimension. The FC layers then combine the input from the preceding layers and generate class scores [68]. This study employs three CNNs of distinct architectures: ResNet-50 [69], Inception [70], and MobileNet [71].
Another DL approach is the recurrent neural network (RNN), which is commonly used for sequential or time-series data. To address the problem of error flow through time in RNNs, Hochreiter and Schmidhuber [72] developed the long short-term memory (LSTM) DL architecture [73]. An input gate, a forget gate, and an output gate make up the LSTM architecture. These gates allow long-range temporal dependencies to be captured. The basic idea behind the bidirectional LSTM (Bi-LSTM) is to introduce two different LSTMs, one processing the sequence in the forward direction and the other in the backward direction, both of which are connected to the same output layer throughout every training cycle.
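For reference, the three gates can be written out in a commonly used formulation of the LSTM cell. The notation below is an assumption chosen for illustration and does not appear in the paper: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W$, $U$, $b$ the learned weights and biases.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

In a Bi-LSTM, one such cell runs over the sequence from $t = 1$ to $T$ and a second runs from $t = T$ to $1$; a typical choice is to concatenate the two hidden states, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$, before the output layer.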

Multimodal Dataset for Gas Leakage Detection
This study makes use of the multimodal gas detection dataset in [74], which represents multimodal data produced by modern smart applications. To identify the different types of gases and determine their concentrations, a system with a thermal camera and gas sensors is used to collect the gas measurements. The unit used to acquire the dataset employed in this article is described in [74]. A thermal camera is a tool that uses IR radiation to measure temperature variations. The image sensor of the camera acts as an array of IR temperature sensors, and each pixel monitors the temperature of its corresponding spot simultaneously. The images are produced in a temperature-based format and displayed as RGB. Unlike traditional cameras, thermal cameras can work in dark areas and in any environment, regardless of the shape or texture of the scene [75]. The Seek Compact thermographic camera (model UW-AAA), which can be attached to several Android mobile phones, was utilized in this study. It has 206 × 156 thermal sensors (32,136 thermal pixels), a 36-degree field of view, a measuring range of −40 °C to 330 °C, and a sampling frequency of less than 9 Hz, which makes it simple to examine a thermal image.
Seven MOX sensors, namely MQ2, MQ135, MQ3, MQ8, MQ5, MQ7, and MQ6, are utilized to gather gas measurements. Such sensors are responsive to a number of gases, including carbon monoxide, methane, butane, LPG, alcohol, smoke, and air-quality pollutants, among others. Gas sensors work by turning chemical information into electrical signals to detect the presence of gas. MOX gas sensors are suitable because of their small size, quick response time, and long lifetime [76,77]. Each sensor produces an analogue output voltage that reflects the concentration of gas. Various sensor properties, such as sensitivity, selectivity, detection limit, and response time, affect the gas sensor's performance. The paper [74] contains information about the gas sensors utilized and their sensitivity to different gases. The gas sensing devices were separated from one another by 1 mm throughout the data collection procedure. Two gas sources, perfume and smoke, were considered during the dataset collection process. The first gas was produced by spraying a fragrance spray, whereas the other was produced by lighting incense sticks. Carbon monoxide, nitrogen dioxide, carbon dioxide, and sulfur dioxide, along with other gases in trace amounts, make up the majority of smoke. In addition, a gas mixture was generated by releasing the two gases simultaneously. Moreover, to guarantee a consistent output for fresh air (No Gas), the gas sensors were calibrated by warming up for an hour before releasing the gas to be detected. These form the four classes of gases that will be detected and identified in this study.
The gas sensors and the thermal camera are used in conjunction to collect data on the presence of the gas to be detected, creating a multimodal dataset. The readings were taken continually at regular intervals of two seconds for a period of ninety minutes. To ensure variation in concentration and discharge timings, the gas to be detected was released at intervals of 15 s for the first 30 min, 30 s for the following 30 min, and 45 s for the final 30 min. The sensor was brought to a steady state (no gas) after each discharge, and its output was used to confirm that it was calibrated. Each gas experiment lasts for a total of 1.5 h. The output of the gas sensors is numerical, and the thermal images have a resolution of 32 × 32. The heat patterns of the gases may differ depending on how they are released during the data collection. Therefore, the gases were released in the same way each time to avoid inconsistencies and preserve homogeneity. Additionally, precautions were taken to distribute the gas evenly at all times. Samples of the thermal images and matching gas sensor data for each class are displayed in Table 1. The numbers given in Table 1 indicate the gas sensor measurements in volts. These values were obtained via a 10-bit analogue-to-digital converter and are the digital equivalents of the analogue outputs of the gas sensors. A total of 6400 samples were gathered, with 1600 samples from each of the four classes: perfume, smoke, a mixture of perfume and smoke, and a neutral environment (No Gas).
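The relationship between a 10-bit converter reading and the analogue sensor voltage can be sketched as follows. This is only an illustration under stated assumptions: the reference voltage `v_ref = 5.0` is hypothetical (the text does not give it), the converter is assumed to be linear, and full scale is taken as 2^10 − 1 = 1023 counts (some conventions divide by 1024 instead).

```python
def adc_counts_to_volts(counts, v_ref=5.0, bits=10):
    """Convert a raw ADC reading to volts for a linear converter.

    v_ref is an assumed reference voltage, not a value from the dataset.
    """
    max_count = (1 << bits) - 1  # 1023 for a 10-bit converter
    if not 0 <= counts <= max_count:
        raise ValueError(f"counts must be in [0, {max_count}]")
    return counts * v_ref / max_count


def volts_to_adc_counts(volts, v_ref=5.0, bits=10):
    """Inverse mapping: quantize a voltage to the nearest ADC count."""
    max_count = (1 << bits) - 1
    clipped = min(max(volts, 0.0), v_ref)
    return round(clipped * max_count / v_ref)


full_scale = adc_counts_to_volts(1023)  # top of the range maps to v_ref
mid_scale = adc_counts_to_volts(512)    # roughly half of v_ref
```

With 10 bits, the smallest distinguishable voltage step under these assumptions is `v_ref / 1023`, which bounds the resolution of every sensor reading in the dataset.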
Table 1. Examples of data from measurements made by gas sensors and their related thermal images [74]. The measurements from the gas sensors are MQ2, MQ3, MQ5, MQ6, MQ7, MQ8, and MQ135, in that order.

Proposed Pipeline for Gas Leakage Detection
This study proposes a DL-based pipeline for gas detection from a multimodal dataset based on thermal images and an E-nose. The proposed pipeline consists of four stages: preprocessing of the multimodal data, CNN model retraining and feature extraction, multimodal data fusion, and detection and identification. Initially, gas sensor measurements are converted to heatmap images, and then these images, along with the thermal images, are preprocessed to fit the size of the input layers of each CNN. These images are then augmented to increase the number of images used to feed the CNNs. Next, in the CNN model retraining and feature extraction stage, three pre-trained CNNs (ResNet-50, Inception, and MobileNet) are retrained using each preprocessed data modality separately. In this stage, spatial features are also extracted from each of the three CNNs trained with either thermal images or heatmap images of the gas measurements. Afterward, two multimodal data fusion methods, intermediate and multitask fusion, are applied to these features. In the intermediate fusion, the discrete wavelet transform (DWT) is employed to fuse, for each CNN, the features obtained from the individual modalities (thermal images and heatmap images of gas measurements) and reduce their size, as well as to obtain spatial-spectral-temporal information instead of relying on spatial data alone. In multitask fusion, the discrete cosine transform (DCT) is used to merge the features of the CNNs trained with both infrared thermal images and heatmap images of gas measurements. DCT is also employed to reduce the huge feature space resulting from the multitask fusion. Finally, in the last stage, a Bi-LSTM is constructed to perform the detection and identification processes through different scenarios. Figure 2 shows the steps of the proposed pipeline.

Preprocessing of Multimodal Data
Initially, the measurements of the seven MOX gas sensors are converted to heatmap images. This means that each set of numerical gas measurements acquired at the regular 2 s intervals is converted to an RGB heatmap image, where each numerical value is mapped to a color intensity value on the RGB scale. After mapping the measurements, a colormap pattern (heatmap) is generated, forming an RGB image that is then saved in JPG format. Afterward, these images, as well as the IR thermal images, are resized to match the size of the input layers of the three CNNs. For ResNet-50 and MobileNet, the dimension of the images after resizing is 224 × 224 × 3, while for the Inception CNN, it is 299 × 299 × 3. Next, the data is split 70-30% for training and testing. Augmentation is essential to improve the training performance of the CNNs. It is a procedure that enlarges the number of images available in the training data, which consequently enhances training performance and reduces overfitting. Thus, several augmentation approaches are applied to the training data, including shearing (0, 45) in the x and y directions, translation (−30, 30), flipping in the x and y directions, and scaling (0.9, 1.1).
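The mapping from numerical sensor values to color intensities can be sketched as below. The paper does not specify the colormap or the normalization range, so both are assumptions here: a simple blue-to-red ramp stands in for the actual colormap, values are min-max normalized over a hypothetical 10-bit range of 0 to 1023, and the sample readings are made up for illustration.

```python
def value_to_rgb(value, v_min, v_max):
    """Map a scalar to an (R, G, B) tuple on a simple blue-to-red ramp.

    The ramp is a hypothetical stand-in for the colormap used in the paper.
    """
    if v_max == v_min:
        t = 0.0
    else:
        t = (value - v_min) / (v_max - v_min)
    t = min(1.0, max(0.0, t))  # clamp out-of-range readings
    red = round(255 * t)
    blue = round(255 * (1.0 - t))
    green = round(255 * (1.0 - abs(2.0 * t - 1.0)))  # peaks at mid-range
    return (red, green, blue)


def readings_to_heatmap_row(readings, v_min=0, v_max=1023):
    """Convert one time step of sensor readings to a row of RGB pixels."""
    return [value_to_rgb(v, v_min, v_max) for v in readings]


# Hypothetical readings for seven sensors at one 2 s time step.
sample = [555, 515, 377, 338, 666, 451, 416]
row = readings_to_heatmap_row(sample)
```

Each row holds one pixel per sensor; in practice such a small pattern would then be upscaled to the CNN input size (for example with an image library), which is outside the scope of this sketch.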

CNN Models Re-Training and Feature Extraction
Transfer learning (TL) [78] is used in conjunction with three deep pre-trained CNNs. A pre-trained CNN is one that was previously trained on a large dataset such as ImageNet. TL is the ability to find similarities among disparate data or information to speed up the learning process of a different classification problem with related features. This means that the pre-trained CNN can learn representations from large datasets like ImageNet and subsequently reuse them in different domains with a similar classification problem. TL is frequently utilized because it can be difficult to obtain large-scale datasets comparable to ImageNet. First, TL is used to change the size of each CNN's output layer to 4 (equal to the number of classes in the multimodal dataset). Following that, several of the CNNs' settings are modified; these will be discussed later. Then, preprocessed data of each modality are used to retrain these CNNs. After the CNNs have completed their retraining process, TL is once more employed to extract spatial deep features from the final average pooling layer of each CNN. The ResNet-50, Inception, and MobileNet deep features have lengths of 2048, 2048, and 1280, respectively.

Multimodal Data Fusion
In this stage, two fusion algorithms are applied: intermediate and multitask fusion. In the former, spatial deep features extracted from each CNN trained with each data modality are fused using DWT. By decomposing the data using a set of orthogonal basis functions, DWT provides a spectral-temporal representation of the data. Each of the transforms that make up DWT belongs to a unique class of wavelet basis functions. A 1-D DWT is used to analyze 1-D data by convolving the input data with low-pass and high-pass filters. The next step in DWT analysis is dyadic decimation, a down-sampling step that halves the number of samples. After applying the 1-D DWT to the 1-D input data, two sets of coefficients are generated: the approximation coefficients CA1 and the detail coefficients CD1. To reach the second level of decomposition, this process can be repeated for the approximation coefficients CA1, and once more, two sets of coefficients will be produced: the second-level approximation coefficients CA2 and detail coefficients CD2. This procedure can be continued to create a DWT with multiple decomposition levels. In this step, the spatial deep features extracted from each CNN trained with each data modality are concatenated, and then four levels of DWT are applied to these concatenated features. The wavelet basis function used is the "Haar" wavelet. DWT can also be used for feature reduction, since at each decomposition level the size of the input data is reduced by a factor of 2. Thus, in this study, the detail coefficients of the fourth DWT level (CD4) are used as input features to train the Bi-LSTM to detect and identify gases in the next stage of the proposed pipeline. The dimension of these reduced features is 256 for both ResNet-50 and Inception and 160 for MobileNet.
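The size reduction described above can be checked with a small pure-Python sketch of a multilevel Haar DWT. It is not the authors' implementation: the function names are hypothetical, the input values are placeholders, and the sign convention for the detail coefficients (first sample minus second) is one common choice that may differ from a particular library. Concatenating two 2048-length vectors gives 4096 values, and four halvings leave 4096 / 16 = 256; for MobileNet, 2 × 1280 = 2560 leaves 160.

```python
import math


def haar_dwt_step(signal):
    """One Haar DWT level: low-pass/high-pass filtering followed by
    dyadic decimation. Returns (approximation, detail), each half as long."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt_detail(signal, levels):
    """Return the detail coefficients of the last level (CD4 for levels=4).

    Each level re-decomposes the approximation coefficients of the previous one.
    """
    approx = list(signal)
    detail = []
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
    return detail


# Placeholder feature vectors standing in for the two modalities of one CNN.
thermal = [float(i % 7) for i in range(2048)]
heatmap = [float(i % 5) for i in range(2048)]
cd4 = haar_dwt_detail(thermal + heatmap, levels=4)
```

Here `len(cd4)` is 256, matching the dimension reported for ResNet-50 and Inception; the same call on a 2560-length vector yields 160 coefficients.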
On the other hand, in the multitask fusion, all CD4 features extracted from the three CNNs trained with the multimodal data are fused using DCT. DCT is frequently used to decompose data into basic frequency components. It represents information as a sum of cosine functions oscillating at various frequencies. Typically, the DCT transforms the data into DCT coefficients, which are divided into two categories: low-frequency coefficients, known as DC coefficients, and high-frequency coefficients, known as AC coefficients. High frequencies depict noise and minor fluctuations (details), whereas low frequencies are associated with brightness conditions. The dimensions of the DCT coefficient matrix match those of the input data [79]. The DCT does not perform a reduction step by itself. However, it compacts most of the significant information of the input into a small number of coefficients, so a subsequent reduction phase can select a reduced set of coefficients to construct the feature vectors. Accordingly, after all the CD4 features of the three CNNs trained with both modalities are fused using DCT, a reduced set of DCT coefficients is selected using zigzag scanning. These reduced features are then fed to the Bi-LSTM to accomplish gas detection and identification.
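A minimal sketch of the DCT-based fusion follows, written in pure Python and hedged accordingly. It assumes that the three CD4 vectors (256 + 256 + 160 = 672 values) are concatenated into one 1-D vector, that an orthonormal DCT-II is used, and that for a 1-D vector the zigzag scan amounts to keeping the leading (lowest-frequency) coefficients. None of these details is stated in the text, and the function names are hypothetical.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence, evaluated directly in O(N^2)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(signal)
        )
        # Orthonormal scaling keeps the energy of the signal unchanged.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def fuse_multitask(cd4_feature_sets, keep):
    """Concatenate the CD4 vectors, apply the DCT, keep the first `keep` coefficients."""
    fused = [value for features in cd4_feature_sets for value in features]
    if keep > len(fused):
        raise ValueError("cannot keep more coefficients than input values")
    # Output has the same length as the input; the slice is the reduction step.
    return dct_ii(fused)[:keep]


if __name__ == "__main__":
    # Placeholder CD4 vectors with the lengths reported for the three CNNs.
    cd4 = [[0.1] * 256, [0.2] * 256, [0.3] * 160]
    reduced = fuse_multitask(cd4, keep=500)
    print(len(reduced))
```

The number of retained coefficients is a free parameter here, which is what an ablation over feature counts would vary.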

Gas Leakage Detection and Identification
In this stage, the Bi-LSTM classifier is used to detect and identify gases through three different scenarios. In the first scenario, spatial deep features extracted from each CNN trained with either IR thermal images or heatmap images of gas sensor data are used individually to train the Bi-LSTM. In the second scenario, the fused spatial-spectral-temporal features obtained in the intermediate fusion (fusion using DWT) are fed to the Bi-LSTM. In other words, for each CNN, the spatial deep features extracted using each data modality are fused using DWT, resulting in a spatial-spectral-temporal representation that serves as the input to the Bi-LSTM. Finally, in the third scenario, the fused features attained through multitask fusion (fusion with DCT) are employed to train the Bi-LSTM. This means that the deep features extracted from the three CNNs trained with both modalities are fused using DCT, and the output of this process is used to train the Bi-LSTM. Figure 3 illustrates the three scenarios for gas leakage detection and identification.

Setting of the Parameters
The initial learning rate, mini-batch size, validation frequency, and number of epochs are among the parameters changed for the three CNNs. Thirty epochs and an initial learning rate of 1 × 10^-3 are used in this study. The mini-batch size and validation frequency are 10 and 448, respectively. The other CNN parameters remain unchanged. The optimization algorithm employed is stochastic gradient descent with momentum (SGDM). Five-fold cross-validation is used to assess the performance of the classification models. The Bi-LSTM network has a batch size of 100, a validation frequency of 10, and 20 epochs. The softmax activation function is used, and the sigmoid function is utilized as the gate activation function.
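For reference, the settings listed above can be gathered into a small configuration sketch. The class and field names are hypothetical and not tied to any particular framework. Reading the validation frequency as a number of iterations is an assumption, and the fold-size helper only illustrates how 6400 samples split under five-fold cross-validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnnTrainingConfig:
    """Settings reported for retraining the three CNNs."""
    optimizer: str = "sgdm"            # stochastic gradient descent with momentum
    initial_learning_rate: float = 1e-3
    mini_batch_size: int = 10
    validation_frequency: int = 448    # assumed to be counted in iterations
    epochs: int = 30


@dataclass(frozen=True)
class BiLstmTrainingConfig:
    """Settings reported for the Bi-LSTM classifier."""
    batch_size: int = 100
    validation_frequency: int = 10     # assumed to be counted in iterations
    epochs: int = 20
    output_activation: str = "softmax"
    gate_activation: str = "sigmoid"


def fold_sizes(n_samples, n_folds=5):
    """Sizes of the held-out folds when n_samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    return [base + (1 if i < extra else 0) for i in range(n_folds)]


if __name__ == "__main__":
    print(CnnTrainingConfig())
    print(BiLstmTrainingConfig())
    print(fold_sizes(6400))
```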

Performance Evaluation Measures
Several assessment measures are used to evaluate the performance of the proposed pipeline. These metrics include precision, sensitivity, specificity, accuracy, F1-score, and Matthews correlation coefficient (MCC). They are calculated using Equations (1)-(6), where the true positive (TP) is the number of instances correctly classified as positive, the false negative (FN) is the number of samples mistakenly classified as negative, the true negative (TN) is the number of instances correctly classified as negative, and the false positive (FP) is the number of samples mistakenly classified as positive.
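As an illustration, the measures can be computed from the four confusion-matrix counts as sketched below. The sketch assumes the standard textbook definitions of accuracy, sensitivity, specificity, precision, F1-score, and MCC; the function name is hypothetical. For a four-class problem, the counts would be obtained per class in a one-vs-rest manner, which is also an assumption here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics from confusion-matrix counts (no zero-division guards)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1_score = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1_score": f1_score,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, used only to exercise the formulas.
    for name, value in classification_metrics(tp=90, tn=85, fp=15, fn=10).items():
        print(f"{name}: {value:.4f}")
```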

Results
This section presents the results of the three scenarios for detecting and identifying gases using the Bi-LSTM. Scenario I uses each modality separately to train each CNN and extract features, which are then used individually to train the Bi-LSTM classifier. Scenario II represents the intermediate fusion process, where the features extracted from each CNN trained independently with the IR thermal images and the heatmap images of gas sensor measurements are fused and reduced using DWT. These fused and reduced features are then fed to the Bi-LSTM classifier. Lastly, in Scenario III, the features obtained from all CNNs trained with both modalities are fused and further reduced using DCT. These features are then used as inputs to the Bi-LSTM classifier.

Bi-LSTM Results of Scenario I
The Bi-LSTM results of Scenario I are shown in Table 2. This table shows that the features extracted from the IR thermal images are superior to the features obtained from gas sensor measurements. The Bi-LSTM accuracy attained using the spatial features of ResNet-50, Inception, and MobileNet trained with IR thermal images is 95.55%, 93.60%, and 94.22%, respectively. These accuracies are greater than those achieved by the Bi-LSTM fed with spatial features extracted from the three CNNs trained with gas sensor measurements (93.27%, 92.28%, and 93.27% for ResNet-50, Inception, and MobileNet, respectively).

Bi-LSTM Results of Scenario II
The results of the intermediate fusion approach of the proposed pipeline are presented in this section. Figure 4 displays the Bi-LSTM accuracy attained using the intermediate fusion procedure for each CNN of the proposed pipeline. Figure 4 indicates that intermediate fusion has improved the performance of the Bi-LSTM classifier for the fused features of each of the three CNNs individually trained with IR thermal images and gas sensor measurements. The Bi-LSTM accuracy attained with the intermediate fusion of the features of ResNet-50, Inception, and MobileNet is 98.33%, 98.47%, and 97.90%, respectively. The results in Figure 4 also indicate that the spatial-spectral-temporal representation of features obtained after the intermediate fusion with DWT is superior to using the spatial features only, as in Scenario I. Figure 5 shows the number of features used to train the Bi-LSTM before (Scenario I) and after the intermediate fusion (Scenario II) for the three CNNs. While Figure 4 shows that intermediate fusion using DWT has improved the performance of the Bi-LSTM, Figure 5 shows that DWT has also reduced the number of features used to build the Bi-LSTM after the intermediate fusion. The number of features after the intermediate fusion using DWT is 256, 256, and 160 for the spatial-spectral-temporal features obtained from ResNet-50, Inception, and MobileNet, respectively. The length of these features is much lower than that of the spatial features used in Scenario I, as shown in Figure 5. The confusion matrices for the Bi-LSTM after the intermediate fusion for the three CNNs are shown in Figure 6.

Bi-LSTM Results of Scenario III
The results of the multitask fusion procedure of the proposed pipeline are discussed in this section. In multitask fusion (Scenario III), DCT is utilized to fuse all the spatial-spectral-temporal features obtained in Scenario II from the three CNNs. An ablation study is conducted to select the number of features retained after DCT (multitask fusion). The results of this ablation study are shown in Figure 7. These results show that multitask fusion with DCT further enhances the performance of the proposed pipeline. Figure 7 indicates that the accuracy is 98.56% with 100 features and rises to 99.18% and 99.25% at 350 and 500 features, respectively. These accuracies are greater than those obtained in Scenario II (intermediate fusion), which confirms that multitask fusion is capable of boosting the model performance and is superior to intermediate fusion.

Discussion
A multimodal DL-based fusion pipeline is developed in this work for accurate gas detection and identification. Four categories were considered for data gathering using two types of sensors: an IR thermal camera to record the thermal signature of the gases and an array of seven gas sensors forming an E-nose to identify certain gases. The four classes were two independent gases (alcohol vapor from perfume and smoke from incense sticks), a blend of these two gases, and no gas. A total of 6400 samples of thermal images and gas sensor measurements were included in this unique data collection. Intermediate and multitask fusion techniques were used to combine these two data modalities and were compared to the use of a single modality for gas leakage detection and identification. The detection and identification phase of the proposed pipeline is conducted using the Bi-LSTM through three scenarios corresponding to using an individual modality (Scenario I), intermediate fusion (Scenario II), and multitask fusion (Scenario III). Figure 8 shows a comparison between the highest accuracy attained in each scenario. Figure 8 shows that the intermediate fusion (Scenario II) of the multimodal data (gas sensor + IR thermal imaging) is better than using a single modality (Scenario I) for gas detection. The accuracy reaches 98.47% after intermediate fusion, which is greater than the 93.27% and 95.55% obtained using the gas sensor measurements and IR thermal imaging, respectively. Furthermore, the multitask fusion yields an additional improvement in accuracy, reaching 99.25%. This indicates that multitask fusion is better than intermediate fusion.

Comparisons
To demonstrate the competitiveness of the proposed pipeline, its performance is compared with that of other recent studies on gas leakage detection based on the same dataset. The results of this comparison are shown in Table 4. The table shows that the proposed pipeline is competitive with related studies, as it achieved accuracies of 0.985 and 0.992 using intermediate and multitask fusion, respectively. These accuracies are greater than the 0.96 obtained in [25] based on early fusion and the 0.945 and 0.969 attained in [30] using intermediate and multitask fusion. There are several reasons for this. First, the proposed pipeline is based on multiple CNN models instead of one. Furthermore, the intermediate fusion of the proposed model is accomplished using DWT, which extracts the spectral-temporal representation, resulting in spatial-spectral-temporal information obtained from the multimodal dataset. This is not the case in the other studies, which use only the spatial information of the data. Finally, the detection step of the proposed pipeline uses a Bi-LSTM, which usually improves performance over the traditional LSTM used in other studies.

Complexity and Computational Analysis
A complexity analysis is conducted to determine the computational cost and the training time of the models used for the detection and identification of gases. The proposed pipeline has two modes: offline and online. In the offline mode, the deep learning models are trained to detect and identify gases using an already-collected dataset. This operation is performed offline because training the deep learning models takes a long time. In the online mode, on the other hand, the trained deep learning models built in the offline mode are used to instantly detect and identify new gas measurements, that is, unobserved samples acquired with the E-nose and IR thermal camera that were not utilized during the offline stage. Since this process takes a short time, continuous monitoring and detection of gas leakage are performed online. In the offline mode, three CNN models are trained independently using gas sensor heatmap images and IR thermal images. In the online mode, the deep features extracted from the three CNNs are fed to a Bi-LSTM, which performs gas detection and classification in real time in different scenarios (Scenarios II and III). Table 5 reports the computational cost and the training time of the deep learning models used for the detection and identification of gases and compares the complexity of the different scenarios of the proposed pipeline.
As indicated in Table 5, the complexity of the deep learning models in the offline mode is large because each CNN has a large number of layers and a huge number of parameters. The deep learning models also require whole images as input for training. In addition, the detection time is extremely long during the offline stage. However, in the online mode, deep features are used to train a Bi-LSTM model to immediately detect and identify gases in two scenarios (Scenarios II and III). In Scenario II, 256 or 160 features (depending on which CNN is used for feature extraction) are used to train the Bi-LSTM model. The complexity in this scenario is much lower since the Bi-LSTM consists of fewer parameters and only one layer. Therefore, the detection and identification time is much lower than that of the offline mode. Furthermore, in Scenario III, 500 features are fed to the Bi-LSTM model to directly detect and identify gases. Similarly, the detection and identification time is much lower than that of the offline mode, as the Bi-LSTM has lower complexity than the deep learning models used in the offline mode.
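To give a feel for why a single-layer Bi-LSTM is light, the sketch below counts its weights for the input sizes of Scenarios II and III. It assumes the standard LSTM formulation (four gate blocks, each with input weights, recurrent weights, and one bias vector) and treats the whole feature vector as the input at one time step. The hidden size of 100 is a hypothetical value chosen for illustration, since the number of hidden units is not given above, and the output classification layer is not counted.

```python
def lstm_weight_count(input_size, hidden_units):
    """Weights of one LSTM direction: 4 blocks of (input + recurrent + bias)."""
    per_block = hidden_units * (input_size + hidden_units) + hidden_units
    return 4 * per_block


def bilstm_weight_count(input_size, hidden_units):
    """A Bi-LSTM runs one LSTM forward and one backward, doubling the count."""
    return 2 * lstm_weight_count(input_size, hidden_units)


if __name__ == "__main__":
    hidden_units = 100  # hypothetical; not reported in the text
    for label, n_features in (
        ("Scenario II (ResNet-50 / Inception)", 256),
        ("Scenario II (MobileNet)", 160),
        ("Scenario III", 500),
    ):
        print(label, bilstm_weight_count(n_features, hidden_units))
```

Under these assumptions the counts stay in the hundreds of thousands, far below the parameter counts typical of deep CNNs.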
The proposed pipeline could be used in several practical applications, including environmental management tasks such as monitoring volatile organic compounds (VOCs), observing environmental pollution due to gases, evaluating indoor air quality in homes, and detecting combustible or hazardous gases in indoor and outdoor environments. However, the time taken for gas detection in the online mode of the proposed pipeline should be further reduced to produce an efficient model and speed up the gas detection procedure.
w: number of weights

Limitations and Upcoming Prospects
Despite the promising results achieved using the proposed pipeline, it has some limitations. Although the detection time of the online mode of the proposed pipeline is much lower than that of the offline mode, it still needs further reduction for the pipeline to be used more effectively in practical applications. Future work will consider using compact deep learning models, possibly combined with traditional machine learning classifiers of lower complexity, to reduce the time required for gas leakage detection. Furthermore, the present research did not consider determining the concentrations of blended gases. Prospective studies will concentrate on the semi-quantitative, immediate, and interference-free identification of blended gas concentrations. Future research will also explore the viability of simultaneously determining various gases and their concentration ranges. The current study also does not address the common task of identifying gases under changing environmental conditions. To identify change points in gas measurements, subsequent research will consider change detection in metal oxide gas sensor outputs for open sampling systems. Upcoming experiments will also vary the temperature conditions and examine how this affects the performance of the proposed pipeline.

Conclusions
This article described a method for evaluating the reliability of intelligent multimodal data for gas leakage detection and identification in the Industry 5.0 environment. For gas detection and identification, we evaluated intermediate and multitask fusion and compared them with the use of the individual data modalities of gas sensor measurements and IR thermal imaging. The proposed pipeline is based on three CNNs for feature extraction and a Bi-LSTM for gas detection. In the intermediate fusion, the spatial features extracted from each CNN were fused using DWT, which also reduced the dimension of the features after fusion. The results verified that intermediate fusion is capable of boosting gas detection performance. Furthermore, the spatial-spectral-temporal representation of the features obtained using DWT is superior to the spatial information alone. On the other hand, in the multitask fusion, all the spatial-spectral-temporal features attained from the three CNNs trained with the multimodal data were combined using DCT. DCT was also used to reduce the length of the feature vector obtained after the multitask fusion. The results showed that multitask fusion further enhances the detection performance of the proposed pipeline and that it is superior to intermediate fusion. A comparison with related studies showed that the proposed pipeline outperformed other recent methods and could be used reliably in industrial applications.

Figure 2. Stages of the proposed pipeline.

Figure 3. The three scenarios of the gas leakage detection and identification stage.

Figure 4. The Bi-LSTM accuracy attained using the intermediate fusion procedure for each CNN of the proposed pipeline.

Figure 5. The number of features used to train the Bi-LSTM before (Scenario I) and after the intermediate fusion (Scenario II).

Figure 6. Confusion matrices of the Bi-LSTM achieved after the intermediate fusion process of (a) ResNet-50 features, (b) Inception features, and (c) MobileNet features.

Figure 7. The number of features versus the accuracy attained after multitask fusion with DCT.

Figure 8. A comparison between the highest accuracy attained using each scenario.

Table 3. Performance metrics of the Bi-LSTM achieved after multitask fusion.

Table 4. Performance of the proposed pipeline compared to recent relevant studies based on the same multimodal dataset.

Table 5. The complexity and computation analysis of the proposed pipeline. Notation: d is the number of convolutional layers; the remaining symbols denote the total number of filters in the lth layer, the number of input channels of the lth layer, the spatial size of the filter's kernel, and the dimension of the output feature map.