Bearing Fault Diagnosis Using Multidomain Fusion-Based Vibration Imaging and Multitask Learning

Extracting statistical features from bearing fault signals requires substantial domain knowledge and expertise. Furthermore, existing feature extraction techniques are mostly confined to a single family of statistical parameters, namely time-domain, frequency-domain, or time-frequency-domain parameters. Bearing fault vibration signals are highly non-linear and non-stationary, which makes it cumbersome for existing methodologies to extract relevant information. The process becomes even more complicated when the bearing operates at variable speeds and load conditions. To address these challenges, this study develops an autonomous diagnostic system that combines signal-to-image transformation techniques for multi-domain information with convolutional neural network (CNN)-aided multitask learning (MTL). To address variable operating conditions, a composite color image is created by fusing information from multiple domains: the raw time-domain signal, the spectrum of the time-domain signal, and the envelope spectrum of the time-frequency analysis. This 2-D composite image, named multi-domain fusion-based vibration imaging (MDFVI), is highly effective in generating a unique pattern even with variable speeds and loads. These MDFVI images are then fed to the proposed MTL-based CNN architecture to identify faults under variable speed and health conditions concurrently. The proposed method is tested on two benchmark bearing datasets. The experimental results suggest that the proposed method outperforms state-of-the-art methods on both datasets.


Introduction
Rotating machinery has become faster and more intelligent in recent years due to rapid innovation, and plays an increasingly vital role in many industries [1,2]. With this growth in popularity, maintenance procedures are necessary due to the critical nature of several vulnerabilities [3,4]. Rolling element bearings are the most critical components of rotating machinery. Severe working environments, alternating load conditions, and several other factors contribute to the failure of rolling element bearings, which can result in massive economic losses and fatalities [5]. Therefore, during the past few decades, industries have acknowledged the significance of establishing practical and dependable condition monitoring systems to address these concerns [6]. However, the vibration signals acquired from these bearings are non-stationary and non-linear in nature due to differences in clearance, friction, loads, and speed. Therefore, directly extracting significant feature information from those signals, or employing time-domain and/or frequency-domain analysis, is difficult [7]. As a result, developing a novel and effective method for monitoring the condition of bearings has become a difficult and worthwhile challenge [8,9].
the same input at the same time [25]. The contributions of this study are summarized as follows:
(1) To address variable operating conditions, a composite color image is created by fusing information from multiple domains: the raw time-domain signal, the spectrum of the time-domain signal, and the envelope spectrum of the time-frequency analysis. This 2-D composite image, named multi-domain fusion-based vibration imaging (MDFVI), is highly effective in generating a unique pattern even with variable speeds and loads.
(2) The developed MDFVI images are applied as inputs to the CNN-aided MTL network for automatic feature extraction and classification. The proposed network is capable of extracting features in parallel from the time domain, the frequency domain, and the time-frequency domain. Additionally, it is capable of predicting two operating conditions simultaneously: (a) rotating speed and (b) fault type. As a result, multitasking capabilities are enabled for the bearing fault diagnosis architecture.
(3) The proposed method is tested on two benchmark bearing datasets. The experimental results suggest that the proposed method outperforms state-of-the-art methods on both datasets.
The rest of the manuscript is organized as follows: Section 2 discusses the technical basis of FFT, envelope analysis, CNN, and MTL networks while Section 3 presents the proposed methodology, Section 4 discusses the experimental analysis, and Section 5 provides the concluding remarks of the paper.

Technical Background
This section presents the technical background of signal processing techniques, convolutional neural networks, and the basics of multi-task learning.

Fast-Fourier Transform (FFT)
The signals of rolling element bearings are non-linear and non-stationary in nature [26]. As a result, there are hidden periodicities in the signal structure, which carry additional information. The FFT is an algorithm for computing the N-point discrete Fourier transform (DFT). The N-point DFT can be expressed as:

$$X(p) = \sum_{n=0}^{N-1} x(n)\, W_N^{pn} \quad (1)$$

Splitting the sum into even-numbered and odd-numbered samples gives:

$$X(p) = \sum_{g=0}^{N/2-1} x(2g)\, W_N^{2gp} + \sum_{h=0}^{N/2-1} x(2h+1)\, W_N^{(2h+1)p} = X_1(p) + W_N^{p}\, X_2(p) \quad (2)$$

where $p = 0, 1, 2, \ldots, N-1$ and $g, h = 0, 1, 2, \ldots, N/2-1$. In addition, $X_1(p)$ is the $N/2$-point DFT of the even-numbered samples of $x(n)$, and $X_2(p)$ is the $N/2$-point DFT of the odd-numbered samples. Moreover, both of these functions are periodic and discrete. Now, let us consider $W_N = e^{-j 2\pi / N}$. Then,

$$W_N^{p+N/2} = -W_N^{p} \quad (3)$$

Here, $W_N^{p}$ for $p = 0, 1, 2, \ldots, N-1$ are known as the Nth roots of unity. Therefore, from Equations (2) and (3), we get:

$$X(p) = X_1(p) + W_N^{p}\, X_2(p), \qquad X(p + N/2) = X_1(p) - W_N^{p}\, X_2(p) \quad (4)$$

Here, $p = 0, 1, 2, \ldots, N/2-1$. Thus, instead of N complex multiplications, the frequency-domain information can be derived from the signal with N/2 multiplications at each stage. So, the overall computational cost is reduced from $O(N^2)$ to $O(N \log_2 N)$. Therefore, by preserving the original amplitude and phase information, a fast Fourier transform (FFT) can process these vibration signals, separating them into their individual sinusoidal oscillations at specific frequencies [27].
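The even/odd split described above can be illustrated with a short, dependency-free sketch. It is only an illustration of the textbook radix-2 decimation-in-time recursion, not the implementation used in this study; the function names `fft_radix2` and `dft_direct` and the 16-sample test sine are assumptions made for the example.

```python
import cmath
import math


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT for a power-of-two length."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # X_1(p): DFT of the even-numbered samples
    odd = fft_radix2(x[1::2])   # X_2(p): DFT of the odd-numbered samples
    half = n // 2
    out = [0j] * n
    for p in range(half):
        t = cmath.exp(-2j * cmath.pi * p / n) * odd[p]  # W_N^p * X_2(p)
        out[p] = even[p] + t          # X(p)
        out[p + half] = even[p] - t   # X(p + N/2), using W_N^(p+N/2) = -W_N^p
    return out


def dft_direct(x):
    """O(N^2) evaluation of the DFT definition, used here only as a check."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * p * k / n) for k in range(n))
            for p in range(n)]


# Illustrative input (assumption): a unit sine with 3 cycles in 16 samples.
signal = [math.sin(2 * math.pi * 3 * k / 16) for k in range(16)]
spectrum = fft_radix2(signal)
```

For real measurements a library routine such as `numpy.fft.fft` would normally be preferred; the recursion is shown only to make the butterfly step explicit.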

Envelope Analysis
When a localized fault occurs in a rolling element bearing, it interacts with another surface in the bearing each time it is loaded [28], and vibrations are emitted as a result. The generated periodic impulses excite many resonances of the bearing as well as of the neighboring structure [29]. Consequently, extracting incipient fault information from the frequency domain of a signal alone can be quite challenging. Therefore, an amplitude demodulation technique called envelope analysis is considered for extracting useful feature information from the vibration signals. To perform this analysis, it is necessary to extract the diagnostic information from the sample signal. The Hilbert transform demodulation technique can construct the analytic signal from the given sample signal to extract that information. The analytic signal is a complex temporal signal whose imaginary part is the Hilbert transform of its real part. The envelope $e(t)$ of a signal $x(t)$ is defined mathematically as the magnitude of the analytic signal:

$$e(t) = \sqrt{x(t)^2 + \hat{x}(t)^2} \quad (5)$$

In Equation (5), $\hat{x}(t)$ refers to the Hilbert transform of the signal [28,29]. Because the bearing vibration signal is non-stationary and non-linear, Hilbert transform-based envelope analysis is used in this study to extract relevant information from the time-frequency domain.
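As a hedged illustration of Equation (5), the sketch below builds the analytic signal in the frequency domain (positive-frequency bins doubled, negative-frequency bins zeroed) and takes its magnitude. This is the standard discrete construction, also used by `scipy.signal.hilbert`, but it is not claimed to be the procedure used in this study. The helper names `dft` and `envelope`, the slow O(N^2) transform, and the amplitude-modulated test signal are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def dft(x, inverse=False):
    """Plain O(N^2) DFT or inverse DFT; adequate for a short illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * p * k / n) for k in range(n))
           for p in range(n)]
    return [v / n for v in out] if inverse else out


def envelope(x):
    """Return e(t) = |x(t) + j*x_hat(t)| for a real sequence of even length."""
    n = len(x)
    if n == 0 or n % 2:
        raise ValueError("this sketch expects a non-empty, even-length signal")
    spectrum = dft(x)
    # Keep DC and Nyquist, double positive frequencies, zero negative ones.
    gain = [1.0] + [2.0] * (n // 2 - 1) + [1.0] + [0.0] * (n // 2 - 1)
    analytic = dft([s * g for s, g in zip(spectrum, gain)], inverse=True)
    return [abs(z) for z in analytic]


# Illustrative input (assumption): a carrier at bin 20 modulated at bin 2.
n = 128
signal = [(1 + 0.5 * math.cos(2 * math.pi * 2 * k / n))
          * math.cos(2 * math.pi * 20 * k / n) for k in range(n)]
env = envelope(signal)
# Envelope spectrum: the modulation frequency appears as the dominant non-DC bin.
env_spectrum = [abs(v) for v in dft(env)]
```

In this toy case the recovered envelope matches the modulating term, and its spectrum peaks at the modulation bin, which is the property envelope analysis exploits to expose fault-related repetition rates.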

Convolutional Neural Network (CNN)
A convolutional neural network is a feedforward network consisting of an input layer, several convolution and pooling layers, multiple fully connected layers, and one output layer, with the benefits of automatic feature learning and the handling of overfitting problems [30,31]. Furthermore, several optimization techniques, such as global pooling, dropout, and batch normalization, are frequently incorporated into the fundamental architecture of a CNN to improve the diagnostic performance [32][33][34]. Deep architectures are often trained using two main principles, as shown in Figure 1, namely (1) forward propagation and (2) backward propagation. During the forward propagation step, the network seeks to extract spatial information from the input across its layers. During the backward propagation stage, the network adjusts its internal parameters based on the chosen objective function [35]. The main goal of these architectures is to minimize the objective function [36]. It is also worth mentioning that for deep learning-based designs, there is no hard and fast rule for establishing the optimal number of layers. The overall number of layers is determined using a train-test process that depends on the input data type.

Forward Propagation
The convolution layers try to learn abstract features from the input in this step. By learning input properties with varied sizes of convolution kernels, this layer maintains the association between pixels in the input data [37]. In addition to the added weights and bias factors, an activation function is generally used to improve these convolved features [35]. The following equation can be used to describe the entire procedure:

$$x_n^m = f\left(\sum_{i \in k_n} x_i^{m-1} * w_{in}^m + b_n^m\right) \quad (6)$$

In Equation (6), $x_n^m$ is the nth component of layer m, $k_n$ is the nth convolution region of the feature map of layer $m-1$, $w_{in}^m$ is the weight matrix, and $b_n^m$ is the added bias. After the sum in Equation (6) is calculated, a non-linear activation function $f$, here the Leaky ReLU, is applied to it.
A pooling layer is used directly after the convolution layer (a) to remove redundancy from the features retrieved by the previous layer and (b) to reduce the number of training parameters. In this study, max-pooling is used as the pooling layer [38], which takes the maximum value of the convolutional output as follows:

$$x_n^m = f\left(w_{in}^m \max\left(x_n^{m-1}\right) + b_n^m\right) \quad (7)$$

Here, the output of the convolution layer is downsampled, and $w_{in}^m$ and $b_n^m$ are the weight and bias matrices, respectively. In Equation (7), $\max\left(x_n^{m-1}\right)$ denotes the max-pooling function, which reduces the dimensions of the obtained convolved feature maps.
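The convolution, activation, and pooling steps described above can be sketched on a tiny single-channel input. The following is a minimal illustration under stated assumptions, not the network of this study: the 5 × 5 input, the 2 × 2 kernel, the Leaky ReLU slope of 0.01, and the function names are all hypothetical, and the pooling step applies a plain maximum without the additional weights and bias of Equation (7).

```python
def leaky_relu(value, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return value if value > 0 else alpha * value


def conv2d_valid(image, kernel, bias=0.0, alpha=0.01):
    """Valid (no padding) cross-correlation followed by Leaky ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [[leaky_relu(sum(image[r + i][c + j] * kernel[i][j]
                            for i in range(kh) for j in range(kw)) + bias, alpha)
             for c in range(cols)]
            for r in range(rows)]


def max_pool(fmap, size=2):
    """Non-overlapping max-pooling with a square window."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


# Hypothetical 5 x 5 input and 2 x 2 kernel, chosen only for illustration.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 3, 2, 1],
    [2, 0, 1, 0, 2],
    [1, 3, 2, 1, 0],
    [0, 1, 0, 2, 1],
]
kernel = [[1, 0], [0, -1]]

fmap = conv2d_valid(image, kernel)   # 4 x 4 feature map
pooled = max_pool(fmap)              # 2 x 2 map after pooling
```

The 4 × 4 feature map shrinks to 2 × 2 after pooling, which illustrates how the pooling layer reduces the number of values passed to later layers.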
Finally, numerous convolution and pooling layers are stacked together to increase the depth of the network. As a result, the final fully connected layer can derive the output category from the input. Typically, several fully connected layers are added one after another until the final one, which flattens the output matrix of the filters into a column or row [39]. The final fully connected layer can be expressed by the following Equation (8):

$$y = f(w x + b) \quad (8)$$

Here, $f$ is the activation function that produces the probabilistic output $y$ from the layer input $x$, and $w$ and $b$ denote the weights and bias, respectively. SoftMax is used as the final activation function in this study [39].
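A fully connected layer with a SoftMax output can be sketched in a few lines. This is a generic illustration rather than the layer configuration of this study; the weights, biases, and feature values are made-up numbers, and `dense_softmax` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable SoftMax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax(features, weights, biases):
    """Compute f(w x + b) with f = SoftMax; one weight row per output class."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Hypothetical 2-feature input mapped to 3 classes.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probabilities = dense_softmax(features, weights, biases)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
```

The outputs are non-negative and sum to one, so they can be read as class probabilities, and the predicted class is the index of the largest value.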

Backward Propagation
When the forward propagation is complete, the objective function is evaluated by comparing the network output with the input sample's target. This objective function is commonly referred to as a loss function. The main goal of the entire procedure is to reduce the loss between the target and the actual output. The cross-entropy loss function is used in this work [35] and can be expressed as follows:

$$L = -\sum_{z} y_z \log\left(\hat{y}_z\right) \quad (9)$$

Here, $y_z$ and $\hat{y}_z$ are the actual target and the predicted value of the zth sample, respectively. During the training procedure, the stochastic gradient descent approach is used to minimize the loss function. Because of the high computational cost, it is not possible to train the neural network on the entire dataset at the same time [40]. Therefore, the entire dataset is divided into several smaller chunks, which are known as batches. Feeding the complete dataset to the network once thus requires multiple batches, and one such complete pass is called an epoch. To minimize the loss function while avoiding overfitting and underfitting problems, the network is trained for several epochs to complete the total training process [31,40].
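The loss and the batch/epoch terminology can be made concrete with a small sketch. It assumes one-hot targets and sums the loss over the samples it receives (many frameworks average over the batch instead); the function names and the clipping constant `eps` are assumptions for the example.

```python
import math


def cross_entropy(targets, predictions, eps=1e-12):
    """Summed categorical cross-entropy for one-hot targets.

    Assumption: the loss is summed over samples; frameworks often average it.
    """
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        # Clip probabilities away from zero so log() is always defined.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))
    return total


def iterate_batches(samples, batch_size):
    """Yield consecutive batches; one full pass over them is one epoch."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# A two-class sample predicted with equal probabilities costs log(2).
loss = cross_entropy([[1, 0]], [[0.5, 0.5]])

# Ten samples with a batch size of four give batches of 4, 4, and 2.
batch_sizes = [len(b) for b in iterate_batches(list(range(10)), 4)]
```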

Multi-Task Learning with CNN
Multi-task learning (MTL) is a special case of transfer learning (TL) [25,41]. TL refers to the idea of transferable knowledge. The key idea behind TL is to share the knowledge learned from a specific task with a different but relevant task. According to this principle, the main tasks in TL are generally very similar in nature, enabling the performance of the targeted tasks to be improved by sharing the trained model architecture and parameters [31,42]. Inductive learning and fine-tuning-based learning are the most suitable examples of TL [37]. Instead of sharing the model architecture separately, an MTL network uses one single shared model for all the relevant tasks. Thus, MTL shares the model architecture and the trainable parameters among the relevant tasks and minimizes one joint objective function to generalize the model architecture [24]. Additionally, it helps to decrease the training time and reduce the storage space [43]. In this study, CNN-based MTL is used to develop the proposed diagnostic framework. This CNN-based framework handles multiple tasks by jointly learning transferable representations and task relationships [24]. The idea of MTL is expressed in Equation (10), in which $\{x_t, y_t\}_{t=1}^{T}$ refers to the pairs of training samples from the original task T, where $x_t$ refers to the individual training input and $y_t$ refers to the corresponding output, and p is the total number of samples present in the training dataset. The goal is to provide a CNN-based diagnostic framework for a variety of tasks $y_t^n$ that learns and exchanges transferable factors in order to connect the various tasks effectively. The essential principle of MTL is depicted in Figure 2. MTL-CNN is proposed in this paper for diagnostic purposes.
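The idea of a single shared representation feeding several task-specific outputs, trained with one joint objective, can be sketched as follows. This is a generic hard-parameter-sharing illustration and not the exact formulation of Equation (10): the equal task weights, the zero-initialized heads, the feature values, and all function names are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable SoftMax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_head(features, weights, biases):
    """One task branch: a linear layer followed by SoftMax."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def multitask_loss(shared_features, heads, targets, task_weights=None):
    """Joint loss: weighted sum of per-task cross-entropies on shared features.

    heads maps a task name to (weights, biases); targets maps a task name to
    the true class index. Equal weights of 1.0 are an assumption.
    """
    task_weights = task_weights or {}
    per_task = {}
    for name, (weights, biases) in heads.items():
        probs = task_head(shared_features, weights, biases)
        per_task[name] = -math.log(probs[targets[name]])
    total = sum(task_weights.get(name, 1.0) * loss
                for name, loss in per_task.items())
    return total, per_task


# Hypothetical shared feature vector and untrained (all-zero) heads:
# three speed classes and four health classes.
shared = [0.2, -0.1, 0.4]
heads = {
    "speed": ([[0.0] * 3 for _ in range(3)], [0.0] * 3),
    "health": ([[0.0] * 3 for _ in range(4)], [0.0] * 4),
}
total, per_task = multitask_loss(shared, heads, {"speed": 1, "health": 2})
```

With untrained heads each branch predicts a uniform distribution, so the per-task losses are log 3 and log 4; training would lower both terms at once because the gradient of the joint loss flows into the shared features.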

Proposed Methodology
The main purpose of this study is to determine the health status of rolling element bearings under changing speed settings. As depicted in Figure 3, the proposed framework has two main steps: (1) multi-domain fusion-based vibration imaging (MDFVI) as the preprocessing step, and (2) a multi-task-based neural architecture (MTL-CNN) for performing the diagnostic analysis.

Multi-Domain Fusion Based Vibration Imaging (MDFVI)
Data preprocessing is a significant stage in a neural network-based diagnostic framework [44,45]. This process is challenging mainly for the following reasons: (a) the large volume of samples in the considered dataset, and (b) multiple features associated with the data. As a result, a lot of time is spent creating training samples that are highly dependent on the various operating conditions.
In this study, an efficient and fast data preprocessing strategy for signal-to-image conversion is devised, based on enhancing the characteristics of vibration signals under variable speed conditions. The feature information is addressed in three domains in this approach: (a) time domain, (b) frequency domain, and (c) time-frequency domain. Because the signal is non-stationary, neither the time domain nor the frequency domain alone can capture the signal's changes [46]. Although the time-frequency domain can depict how frequencies change over time in non-stationary signals, it depends on window selection procedures to find an appropriate time and frequency resolution [47]. To handle these issues, the feature information in this framework is captured from all three domains to generalize the feature space of each individual health condition. Figure 4 illustrates the whole process. The raw vibration signals are first split into smaller segments of length 16,384 using an overlapping window technique, as seen in Figure 4. Following that, (a) the time-domain information is extracted directly from the vibration signal, (b) the frequency information is extracted by FFT, and (c) the time-frequency information is extracted via envelope analysis. Later, the information from each of the considered domains (time, frequency, and time-frequency) is converted into a 2D image with a size of 128 × 128, so that the 16,384 values of a segment fill the image exactly. These 2D images are then converted into grayscale images. Finally, the grayscale images are combined to create the final MDFVI image, which has dimensions of 128 × 128 × 3. If the 2D time-domain grayscale image is represented as $v(t)$, the 2D frequency-domain grayscale image as $v_f$, and the 2D grayscale envelope image capturing time-frequency information as $\hat{v}(t)$, the MDFVI image can be expressed as follows:

$$\mathrm{MDFVI} = \left[\, v(t),\; v_f,\; \hat{v}(t) \,\right] \quad (12)$$

where $v(t)$, $v_f$, and $\hat{v}(t)$ are used as the red, green, and blue channels, respectively. There is no particular reason for this RGB ordering. Since information from three domains is considered, each domain is assigned to one color channel to form the final MDFVI image and obtain distinguishable health patterns.
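The signal-to-image steps described above (overlapping segmentation, reshaping to a square grayscale image, and channel-wise fusion) can be sketched as follows. This is a minimal illustration under stated assumptions: min-max scaling to the range 0 to 255 and row-by-row reshaping are assumed, since the exact scaling and pixel ordering are not given here, and the function names and toy sizes are hypothetical. In the paper's setting the segment length would be 16,384 and the image side 128.

```python
def sliding_segments(signal, length, step):
    """Split a 1-D signal into overlapping segments (overlap = length - step)."""
    return [signal[start:start + length]
            for start in range(0, len(signal) - length + 1, step)]


def to_gray_image(values, side):
    """Min-max scale to 0..255 and reshape row by row into side x side pixels."""
    if len(values) != side * side:
        raise ValueError("number of values must equal side * side")
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for a constant input
    pixels = [round(255 * (v - low) / span) for v in values]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


def fuse_rgb(red, green, blue):
    """Stack three grayscale images into one image of (R, G, B) pixels."""
    return [[(r, g, b) for r, g, b in zip(red_row, green_row, blue_row)]
            for red_row, green_row, blue_row in zip(red, green, blue)]


# Toy sizes for illustration only.
segments = sliding_segments(list(range(10)), length=4, step=2)
gray = to_gray_image([0, 1, 2, 3], side=2)
# In the full pipeline the three channels would be the time-domain segment,
# its FFT magnitude, and its envelope information; here one image is reused.
rgb = fuse_rgb(gray, gray, gray)
```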

Multi-Task Learning-Based Diagnostic Framework
To evaluate the health states of rolling element bearings under variable speed settings, the suggested MTL mechanism is based on a CNN architecture. As depicted in Figure 5, the MTL-CNN architecture has two portions: (1) the common feature extractor and (2) the task branches.
In the first portion, after the input is fed to the network, the spatial feature attributes of the MDFVI are extracted by the subsequent layers. This portion is composed of two convolution layers and two max-pooling layers. Up to this point, the network learns the attributes that are common to both tasks from the provided input. After that, the task branches are introduced. The details of the layered architecture are depicted in Figure 5. Leaky ReLU is used as the activation function of the fully connected layers of both branches. On the layers before the output layers of both tasks, L2 regularization of 0.05 is applied to prevent overfitting. There are no universally accepted guidelines for determining the overall number of layers in a model architecture. As a result, for the considered dataset, a generalized model has been constructed based on train-test methodologies and existing literature surveys [31,48].
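To make the regularization term concrete, the sketch below computes an L2 penalty for one weight matrix with the coefficient 0.05 mentioned above. It assumes the common convention in which the penalty equals the coefficient times the sum of squared weights (some libraries include an extra factor of 1/2); the weight values and the name `l2_penalty` are hypothetical.

```python
def l2_penalty(weights, coefficient=0.05):
    """L2 penalty: coefficient times the sum of squared weights (assumed convention)."""
    return coefficient * sum(w * w for row in weights for w in row)


# Hypothetical weights of a small fully connected layer.
layer_weights = [[1.0, 2.0], [3.0, 0.0]]
penalty = l2_penalty(layer_weights)   # 0.05 * (1 + 4 + 9 + 0)

# During training the penalty is added to the data loss of that branch,
# which discourages large weights.
data_loss = 0.30                      # made-up value for illustration
regularized_loss = data_loss + penalty
```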

Performance Evaluation Metrics
Several evaluation metrics are examined for each task to evaluate the performance of the proposed framework: (1) F1 score (F1), (2) average F1 score (aF1), (3) confusion matrices [49], and (4) graphs of the loss functions. F1 and aF1 [50] can be obtained from the following equations:

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \quad (13)$$

$$aF1 = \frac{1}{\text{Total classes}} \sum_{k=1}^{\text{Total classes}} F1_k \quad (14)$$

TP, FP, and FN in these equations stand for true positive, false positive, and false negative, respectively. Total classes indicates the total number of health types present in the considered dataset. Furthermore, the total loss of the model is recorded up to the defined epoch to observe the network's bias-variance trade-off. In addition, the final feature space derived from each task branch is visualized using t-distributed stochastic neighbor embedding (t-SNE) [51] to show the class separation for each task. Finally, to remove bias from the evaluation metrics, four-fold cross-validation [52] is performed to obtain the results.
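Equations (13) and (14) can be evaluated directly from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the two-class matrix is a made-up example, not a result from this study, and the function names are hypothetical.

```python
def f1_scores(confusion):
    """Per-class F1 = 2TP / (2TP + FP + FN); rows are true, columns predicted."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # class k missed
        fp = sum(confusion[r][k] for r in range(n)) - tp  # others called k
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def average_f1(confusion):
    """aF1: unweighted mean of the per-class F1 scores."""
    scores = f1_scores(confusion)
    return sum(scores) / len(scores)


# Made-up two-class confusion matrix for illustration.
confusion = [[8, 2],
             [1, 9]]
per_class = f1_scores(confusion)   # [16/19, 18/21]
a_f1 = average_f1(confusion)
```

The unweighted mean used here corresponds to what is often called the macro-averaged F1 score.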

Experimental Setup and Performance Analysis
The proposed framework is tested on two bearing datasets: (1) a self-designed testbed and (2) a publicly accessible repository, the Case Western Reserve University (CWRU) bearing data center [53]. Variable shaft speed and load conditions are evaluated for both datasets to validate the superiority of the suggested technique.

Experimental Setup and Dataset Description
Testing is conducted on a self-designed test rig. This rig is run at 300, 400, and 500 RPM to obtain the vibration signal. The entire setup, as shown in Figures 6 and 7, is made up of two shafts: a drive-end shaft and a non-drive-end shaft. To connect these two shafts, a gearbox with a reduction ratio of 1.52:1 is used. A three-phase induction motor is installed on the drive-end shaft to collect data at three distinct motor speeds [54,55]. At both shaft ends of the experimental testbed, a cylindrical bearing (type FAG-NJ206-E-TVP2) is employed. A wide-band vibration sensor [56] with a sampling rate of 65536 Hz [54] is used to collect vibration signals from the non-drive-end shaft. Four types of health conditions are used for conducting the experiments: normal type (NT), inner raceway type (IRT), outer raceway type (ORT), and roller type (RT). The dataset's specifics are presented in Table 1.
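A quick calculation relates the stated sampling rate to the segment length of 16,384 samples used for MDFVI. The sketch below takes the listed RPM values at face value for the shaft of interest, which is an assumption because the effect of the 1.52:1 gearbox on the monitored shaft is not spelled out here; the constant and function names are hypothetical.

```python
SAMPLING_RATE_HZ = 65536   # stated sensor sampling rate
SEGMENT_LENGTH = 16384     # segment length used for one MDFVI image


def segment_duration_s(length=SEGMENT_LENGTH, rate=SAMPLING_RATE_HZ):
    """Duration of one segment in seconds."""
    return length / rate


def frequency_resolution_hz(length=SEGMENT_LENGTH, rate=SAMPLING_RATE_HZ):
    """Spacing between FFT bins for one segment."""
    return rate / length


def revolutions_per_segment(rpm, length=SEGMENT_LENGTH, rate=SAMPLING_RATE_HZ):
    """Shaft revolutions covered by one segment (RPM taken at face value)."""
    return rpm / 60 * segment_duration_s(length, rate)


duration = segment_duration_s()          # 0.25 s
resolution = frequency_resolution_hz()   # 4.0 Hz
revolutions = {rpm: revolutions_per_segment(rpm) for rpm in (300, 400, 500)}
```

Under these assumptions each segment spans a quarter of a second and a little more than one shaft revolution at 300 RPM, which gives a rough sense of how much periodic fault content a single image can contain.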

Results and Performance Comparison
The MDFVI images obtained from the four considered working conditions are shown in Figure 8. As can be seen in this figure, each health type has its own set of color differences. Thus, without the need for any noise reduction technique, this helps the proposed deep architecture to classify the health types. In these converted MDFVI images, the structural differences are very small and difficult to identify with the naked eye. However, the color differences produce visible distinctions. Moreover, owing to their powerful capability of capturing small changes in images, deep learning-based algorithms can help in these types of scenarios [31,57]. Additionally, Figure 8 shows that the color components remain consistent across different speed conditions, which helps to establish the invariant scenarios visually. As a result, the proposed MTL-CNN is fed these MDFVI images for the final multi-class classification. The MTL-CNN architecture's parameters are depicted in Figure 5. The considered datasets are separated in the following ways to train and test the network.
As discussed in the previous section, on each dataset, the total number of recorded signals is 800. Therefore, as listed in Table 2, a total of 1152 samples from all three datasets are used for training the network with 288 samples used for validation purposes. The remaining 960 samples are used for testing the diagnostic performance for two task branches. Furthermore, to eliminate bias, the above-mentioned data division is performed using an equal number of samples from each health class. The model is trained for 3000 epochs to validate the diagnostic performance. Besides, from Figure 9, the loss function graph can be observed for the whole model. Figure 9a highlights the loss function for speed detection, and Figure 9b shows the loss function for health type detection. Therefore, Figure 9c shows the total loss of the model. Besides, for evaluating the diagnostic performance, initially, the F1 and aF1 scores are considered from Equations (13) and (14). The diagnostic performance of the two considered work tasks are listed in Table 3. The proposed technique was 100% correct in almost every case, as shown in the table. Additionally, to make a better analysis of the obtained results, the confusion matrix ( Figure 10) and the last layer of the feature space of each task are visualized by t-SNE ( Figure 11). The diagnostic performance is represented in the form of actual vs. projected deviation in the confusion matrix. The proposed framework's diagnostic performance will indeed be improved as a result of these observations. As discussed in the previous section, on each dataset, the total number of recorded signals is 800. Therefore, as listed in Table 2, a total of 1152 samples from all three datasets are used for training the network with 288 samples used for validation purposes. The remaining 960 samples are used for testing the diagnostic performance for two task branches. 
Furthermore, to eliminate bias, the above-mentioned data division is performed using an equal number of samples from each health class. The model is trained for 3000 epochs to validate the diagnostic performance. Figure 9 shows the loss curves of the whole model: Figure 9a shows the loss function for speed detection, Figure 9b shows the loss function for health type detection, and Figure 9c shows the total loss of the model. To evaluate the diagnostic performance, the F1 and aF1 scores are first computed from Equations (13) and (14). The diagnostic performance for the two considered tasks is listed in Table 3. As shown in the table, the proposed technique was 100% correct in almost every case. Additionally, to analyze the obtained results further, the confusion matrix (Figure 10) and the last-layer feature space of each task, visualized by t-SNE (Figure 11), are presented. The confusion matrix presents the diagnostic performance in the form of actual versus predicted classes. These observations further support the diagnostic performance of the proposed framework.
The proposed MTL-CNN is compared with different deep learning-based methodologies to determine the robustness of the proposed MTL-CNN-based diagnostic framework. These approaches are drawn from several sources [37,58,59] and are adapted to an experimental setup similar to that of this case study. The aF1 score is used to compare the results of these methods. These techniques include the following:
(1) WC + MTL: Data are first converted into 2D matrices of wavelet coefficients. Thus, certain frequencies are captured in both the temporal and spatial domains [58]. These preprocessed signals are then fed into MTL-based deep architectures [59].
(2) TFI + CNN: To construct the multi-fusion input, the input is converted into several time-frequency images (TFI), which are then passed to the MTL-CNN architecture, which is based on the CNN model taken from [37].
(3) GI + CNN: The input is transformed into 2D greyscale images (GI), which are then fed into the MTL-CNN, which is based on the CNN from [60].
(4) VMD + MTL-CNN: To generate the multi-fusion input, each signal is decomposed into a sequence of intrinsic mode functions using variational mode decomposition [61]. These intrinsic mode functions are then joined channel-wise and classified using the proposed MTL-CNN architecture.
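The F1 and aF1 scores used in this comparison can be illustrated with a short sketch. Equations (13) and (14) are not reproduced in this section, so the code assumes the standard per-class F1 definition and takes aF1 to be the unweighted mean of the per-class F1 scores. The function names and the example confusion matrix are hypothetical and are not taken from the paper's tables.

```python
from typing import List


def per_class_f1(confusion: List[List[int]]) -> List[float]:
    """Per-class F1 from a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(confusion[k]) - tp                       # actually k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def average_f1(confusion: List[List[int]]) -> float:
    """Assumed aF1: unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


# Hypothetical 4-class confusion matrix (NT, IRT, ORT, RT); illustrative values only.
example = [
    [100, 0, 0, 0],
    [0, 98, 2, 0],
    [0, 1, 99, 0],
    [0, 0, 0, 100],
]

if __name__ == "__main__":
    print("per-class F1:", per_class_f1(example))
    print("aF1:", average_f1(example))
```

In a multitask setting, the same computation would be applied once per task, that is, once to the speed confusion matrix and once to the health type confusion matrix.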
The comparisons among these methods, together with the improvement details, are listed in Table 4. The results show that the suggested framework (MDFVI + MTL-CNN) outperformed three state-of-the-art approaches, with average performance improvements of 6.58-12.51% and 6.55-13.02% for Task 1 and Task 2, respectively. These results also indicate that, with multidomain information fusion, the model can automatically extract more meaningful information, which enables the simultaneous prediction of speed and health type with 99.99% accuracy.
The multi-domain fusion-based preprocessing approach examined in this work is confined to single-sensor data. However, several recent studies have effectively demonstrated multisensory data fusion. For instance, based on the belief divergence measure of evidence and the belief entropy, Xiao et al. [62] presented a fusion technique that is both practicable and effective in resolving conflicting evidence, increasing the target's belief value to 99.05%. Similarly, Shao et al. [63] developed a fault diagnosis technique based on multisensory fusion. For multisensory fusion, this approach proposes a stacked wavelet auto-encoder (SAE) with a Morlet wavelet, together with a variable weighted assignment technique for decision fusion. On the gearbox dataset, their approach displays state-of-the-art performance. These findings demonstrate the importance of multisensory fusion for condition-based monitoring. Therefore, we aim to use multisensory fusion technology in our next investigation in order to collect all relevant data from all sensor locations, so that the model becomes more resilient and dependable. Additionally, several studies have demonstrated effective attempts to enhance the pattern from multivariate time series. For example, Zhang et al. [64] demonstrated the use of a tri-partition state alphabet-based sequential pattern to generate a compact, understandable, and scalable pattern for multivariate time series. These findings will be beneficial for future research on improving the conciseness of the MDFVI. Furthermore, to extend the proposed MTL-CNN detection algorithm to an unsupervised one, k-means clustering techniques [65] can be useful for identifying the health clusters automatically.

Experimental Setup and Dataset Description
The vibration signals of the bearing are obtained from a publicly available repository provided by Case Western Reserve University (CWRU) [66]. The experimental testbed is shown in Figure 12. As shown in the figure, the experimental setup consists of a 2-horsepower induction motor, a dynamometer, and a transducer. The desired signals are acquired from the induction motor using an accelerometer mounted on the housing. In addition, the dynamometer is used to apply a variety of motor loads, which results in different motor shaft speeds. Electro-discharge machining is used to introduce the intentionally seeded defects on the drive-end bearing. The signals are collected at a sampling frequency of 12 kHz. As in the previous case study, four types of health conditions are used for conducting the experiments: NT, IRT, ORT, and RT. The dataset's details are listed in Table 5.

Verification and Performance Comparison
After the signal segmentation, to analyze the diagnostic performance for the four types of health conditions, a total of 1000 signals (250 from each health type) are considered at each shaft speed (1797, 1772, and 1750 RPM). Then, an MDFVI image is generated from every sample and fed to the proposed network. As in the previous case study, 60% of the dataset is used for training, and the remaining 40% is used for testing. Furthermore, the MTL-CNN architecture's parameters are kept the same as in the previous case study. Table 6 shows the details of the data split. As explained previously, the model is also trained for 3000 epochs with four-fold cross-validation. To assess the diagnostic performance, the F1 and aF1 scores are calculated from Equations (13) and (14). The diagnostic performance is given in Table 7. These results validate that the proposed approach provides state-of-the-art diagnostic performance. Furthermore, the 100% accuracy achieved in all the considered scenarios indicates the generalization ability of the proposed approach. As in the previous case study, to establish the generalization ability of this MTL-CNN-based diagnostic framework, the designed framework is compared with the previously mentioned approaches, i.e., (1) WC + MTL [59], (2) TFI + CNN [37], and (3) GI + CNN [60]. For these diagnostic frameworks, the preprocessing details and the parameters are kept similar to those used in the previous case study. The details of the comparisons are listed in Table 8.
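The class-balanced 60/40 division described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the random seed, and the use of a per-class shuffle are assumptions. Only the class labels, the 250 samples per health type at one shaft speed, and the 60/40 ratio come from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx) with the same train fraction in every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# 250 samples per health type at one shaft speed, as described in the text.
labels = [health for health in ("NT", "IRT", "ORT", "RT") for _ in range(250)]
train_idx, test_idx = stratified_split(labels)

if __name__ == "__main__":
    print(len(train_idx), "training samples,", len(test_idx), "testing samples")
```

Under these assumptions, each health type contributes 150 training and 100 testing samples per shaft speed. The exact counts reported in Table 6 should take precedence over this sketch.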
For the CWRU dataset, the suggested framework (MDFVI + MTL-CNN) outperformed three state-of-the-art approaches, delivering an average performance improvement of 1.21-6.59% and 1.87-6.45% for Task 1 and Task 2, respectively. Furthermore, the effect of noise on diagnostic performance is examined on this freely available dataset so that the experiment can be easily replicated. Gaussian white noise with a signal-to-noise ratio (SNR) of 6 dB is added to the testing samples to simulate data with additional background noise. All the compared techniques, including the proposed one, are trained on the original preprocessed input data before being tested on the simulated noisy data. Figure 13 shows the diagnostic results. According to this analysis, the diagnostic performance of all the evaluated approaches degrades on the noisy dataset. However, the proposed model outperforms the alternatives.
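The noise injection step can be sketched as below. This is a minimal sketch under stated assumptions: the noise variance is set from the measured power of each signal so that 10 log10(P_signal / P_noise) equals the target SNR, and the 100 Hz sine wave is a hypothetical stand-in for a vibration segment. Only the 6 dB SNR and the 12 kHz sampling frequency come from the text. The function names are illustrative.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=None):
    """Add zero-mean Gaussian white noise so that the result has the given SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))  # from SNR = 10*log10(Ps/Pn)
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    p_signal = sum(x * x for x in clean) / len(clean)
    p_noise = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_signal / p_noise)


# Hypothetical test signal: 1 s of a 100 Hz sine wave sampled at 12 kHz.
fs = 12_000
clean = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
noisy = add_white_gaussian_noise(clean, snr_db=6, seed=0)

if __name__ == "__main__":
    print("measured SNR (dB):", measured_snr_db(clean, noisy))
```

Because the noise is random, the measured SNR of a finite segment only approximates the 6 dB target; the approximation tightens as the segment length grows.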

Conclusions
This study demonstrated an autonomous diagnostic system that combines signal-to-image transformation techniques for multi-domain information with convolutional neural network-assisted multitask learning. One of the primary objectives of this study is to handle variable operating conditions such as varying loads and speeds. To accommodate changing operating conditions, a composite color image is created by fusing data from multiple domains, including the raw time-domain signal, the spectrum of the time-domain signal, and the envelope spectrum of the time-frequency analysis. This two-dimensional composite image, called multi-domain fusion-based vibration imaging (MDFVI), is particularly effective at creating a unique pattern independent of speed or load. These MDFVI images are then fed into the proposed MTL-based CNN architecture, which is capable of accurately identifying speed and health conditions concurrently under changing operating conditions. However, the proposed preprocessing method studied in this work is currently limited to data from a single sensor. Additionally, the proposed framework is currently constrained to a fixed MDFVI resolution. Therefore, we intend to conduct our next experiment using multisensory fusion technology in order to capture all essential data from all sensor locations. Furthermore, future work will incorporate adaptive time, frequency, and time-frequency resolutions when constructing a robust MDFVI as an input. These extensions are expected to make the model more robust and reliable.