Deep Learning Aided Data-Driven Fault Diagnosis of Rotatory Machine: A Comprehensive Review
Energies 2021, 14, 5150

Abstract: This paper presents a comprehensive review of the developments made during the past decade in the fault diagnosis of bearings, a crucial component of rotary machines. A data-driven fault diagnosis framework consists of data acquisition, feature extraction/feature learning, and decision making based on shallow/deep learning algorithms. This review discusses the signal processing techniques, classical machine learning approaches, and deep learning algorithms used for bearing fault diagnosis. It also highlights the public datasets that have been widely used in bearing fault diagnosis experiments, namely Case Western Reserve University (CWRU), Paderborn University Bearing, PRONOSTIA, and Intelligent Maintenance Systems (IMS). Finally, it compares the machine learning techniques (support vector machines, k-nearest neighbors, artificial neural networks, etc.) and the deep learning algorithms (deep convolutional neural network (CNN), auto-encoder-based deep neural network (AE-DNN), deep belief network (DBN), deep recurrent neural network (RNN), and others) that have been used to diagnose bearing faults in rotary machines.


Introduction
Motion is powered by electromechanical systems, which account for around 70% of the gross energy consumption in industrialized economies [1]. In 2017, the global electric motor market was valued at USD 96,967.9 million, and it is expected to reach USD 136,496.1 million by 2025 [2]. One of the basic components used in industry is the electric motor, which converts electrical energy into mechanical energy.
Specifically, based on motor types, the global market is divided into DC, or AC or hermetic motors, which in turn are further subdivided as:
The global market of electric motors can be further classified by operating industry, such as automotive vehicles, industrial machinery, aerospace, household, and commercial applications. Owing to increased demand for compressor systems in the manufacturing and automotive industries, the industrial segment contributed the largest share in 2017, and this share is estimated to increase further by 2025 [3]. Figure 1 provides the continent-wise market shares of global electric motor usage, and Figure 2 presents the application-wise usage of electric motors in 2017 and the forecast for 2025.
An electric motor consists of different parts such as a rotor, bearings, stator, air gap, commutator, and windings. Among these parts, the bearing is the core of the rotating motor: it supports and locates the rotor to keep the air gap small and consistent, and it transfers the load from the motor to the shaft. It is one of the most important mechanical components for reducing the friction between the rotating and stationary elements [4]. If the equipment fails during use, it affects system operation and can even cause serious economic losses and casualties. According to the literature, around 50-60% of the failures of rotating induction machines are caused by bearings [4]. Therefore, fault diagnosis of rotating machine bearings is essential to avoid unexpected breakdowns.
An effective fault diagnosis of the bearing can ensure the efficient operation of the system by detecting and identifying bearing faults while the motor is operating. Over the past few decades, researchers have carried out extensive research on bearing fault diagnosis, and new approaches continue to emerge with the advancement of technology and industrial techniques. This body of work comprises techniques that focus on different stages of the bearing fault diagnosis pipeline. For example, some researchers have focused on effective classification mechanisms based on machine learning and deep learning techniques, while others have dedicated themselves to signal processing techniques for handling the complex and nonlinear signals that are normally encountered during the fault diagnosis process. This paper reviews the machine learning and deep learning algorithms used for bearing fault diagnosis and discusses future directions in this field. The main contributions are as follows:

1. A detailed analysis of a standard bearing fault diagnosis pipeline is given;
2. An overview of shallow machine learning techniques used in the field of bearing fault diagnosis and their limitations;
3. A systematic review of the literature available on bearing fault diagnosis in the last decade, mainly focusing on the application of deep learning algorithms;
4. Discussion on the future directions in the field of bearing fault diagnosis.
The rest of the paper is organized as follows. Section 2 presents the common public datasets available for bearing fault diagnosis experiments. Section 3 covers the classical and deep learning algorithm-based research on bearing fault diagnosis. Finally, Section 4 presents deep-learning-based bearing fault diagnosis research and its comparison.

A Standard Pipeline of Bearing Fault Diagnosis
Bearings are essential elements in rotating machines which ensure smooth operation by reducing friction among different components of the machine. Bearings are the main contributor to the failure of rotary machines, accounting for around 50-60%, since they have to operate in a harsh working environment [5]. An unexpected failure of the bearing can cause sudden breakdown to the machine or result in the entire system collapse, which could lead to huge financial loss and time wastage. Therefore, this sector receives significant attention from researchers in an effort to find more efficient solutions. The general diagnosis methodology consists of four steps, i.e., data acquisition, feature extraction, feature selection, and fault classification, as shown in Figure 3.

Data Acquisition
Data acquisition is the process of sampling signals that measure real-world physical conditions and converting the resulting samples into digital numeric values that a computer can manipulate. In this first step of diagnosis, vibration signals, acoustic emission signals, or motor current signals that reflect the health status of the bearings are collected from the sensor systems.

Feature Extraction
Feature extraction begins with a set of measured data and creates derived values (features) that are intended to be informative and non-redundant, easing the learning and generalization phases and, in some situations, leading to better human interpretation. In this step, the raw signals are converted into statistical characteristics that convey information about the machine's status. The design of the feature extractor plays a vital part in obtaining high-accuracy recognition outcomes in the pattern recognition task. The bearing failure signals gathered from rotary machines are in the time domain, and features may be extracted from them in the time domain, frequency domain, and time-frequency domain. For the frequency and time-frequency domains, the signals are first transformed using the appropriate transformation tool for the respective domain.
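As an illustration of time-domain and frequency-domain features, the following minimal sketch computes a few commonly used statistics and the dominant spectral component of a signal segment. The function names, the chosen features, and the synthetic 50 Hz test signal are assumptions made for this example and are not taken from any of the reviewed works.

```python
import cmath
import math

def time_domain_features(x):
    """Return common statistical features of a 1-D signal segment."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    kurtosis = sum((v - mean) ** 4 for v in x) / (n * var ** 2)
    return {"mean": mean, "rms": rms, "peak": peak,
            "crest_factor": peak / rms, "kurtosis": kurtosis}

def dominant_frequency(x, fs):
    """Return the frequency (Hz) of the largest DFT bin, ignoring DC."""
    n = len(x)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        # Naive DFT coefficient for bin k (fine for short segments).
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * fs / n

# Synthetic stand-in for a vibration segment: a 50 Hz sine sampled at 1 kHz.
fs = 1000
signal = [math.sin(2 * math.pi * 50 * t / fs) for t in range(200)]
features = time_domain_features(signal)
peak_hz = dominant_frequency(signal, fs)
```

For real recordings a fast Fourier transform would normally replace the naive DFT loop; the loop is used here only to keep the sketch free of external dependencies.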

Feature Selection
Feature selection is the process of reducing the number of input variables when creating a predictive model, both to lower the computational cost of modeling and, in some situations, to increase the model's performance. High-dimensional feature sets tend to contain redundant and irrelevant features, which increase learning time, lower classifier performance, and require a lot of computation; feature selection keeps only the most discriminant features. This stage therefore improves the classification accuracy while also reducing the computation time. The two most frequent methods for feature selection are: (1) creating a new, lower-dimensional feature set from the extracted feature set, which can be accomplished using Independent Component Analysis (ICA) or Principal Component Analysis (PCA); and (2) deleting non-sensitive or unneeded features according to specific criteria, for which Sequential Selection (SS) is one of the most prominent approaches.
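The first of these two approaches can be sketched with PCA computed through a singular value decomposition. The example below is only illustrative: the helper name `pca_reduce` and the synthetic six-dimensional feature matrix are assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-wise feature vectors onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Synthetic feature matrix: 100 segments, 6 correlated features that are
# driven by only 2 underlying factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
feature_matrix = latent @ mixing + 0.01 * rng.normal(size=(100, 6))

reduced, explained_ratio = pca_reduce(feature_matrix, n_components=2)
```

Because the six synthetic features are generated from two factors, two principal components retain almost all of the variance, which is the situation in which this kind of dimensionality reduction is most useful.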

Bearing Fault Diagnosis/Classification
After the features have been selected, they are passed to a learning-based classifier such as the k-nearest neighbor (KNN), artificial neural network (ANN), support vector machine (SVM), one-dimensional convolutional neural network (1D-CNN), or strongly regularized deep convolutional neural network (SRDCNN) to detect the bearing defect.
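To make the classification step concrete, the sketch below implements a plain k-nearest-neighbor vote over selected feature vectors. The two-dimensional (RMS, kurtosis) feature pairs and the class labels are hypothetical values chosen for illustration, not measurements from any dataset.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label a query feature vector by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (rms, kurtosis) feature pairs; values are illustrative only.
train_x = [(0.10, 3.0), (0.12, 3.1), (0.11, 2.9),
           (0.40, 6.5), (0.45, 7.0), (0.42, 6.8)]
train_y = ["healthy", "healthy", "healthy", "faulty", "faulty", "faulty"]

prediction = knn_predict(train_x, train_y, (0.41, 6.6))
```

In practice the features would usually be scaled to comparable ranges before computing Euclidean distances, since otherwise the feature with the largest numeric range dominates the vote.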

Dataset for Fault Bearing Experiment
A reliable and accessible dataset is required to develop data-driven bearing fault diagnosis methodologies. However, collecting data from a naturally degraded bearing is a time-consuming process. Therefore, most researchers prefer datasets with artificially induced bearing faults. Some organizations and research centers have made efforts to create datasets and provide open access to researchers across the globe, which helps researchers to implement and evaluate bearing fault diagnosis algorithms. Some of the popular public bearing datasets are discussed as follows.

Case Western Reserve University Bearing Dataset
Case Western Reserve University (CWRU) bearing data is a public dataset collected from the test rig shown in Figure 4 [6]. The testbed contains (a) a 2 HP motor, (b) a torque transducer/encoder, and (c) a dynamometer and control electronics. According to the description available for the data, single-point faults with diameters of 7, 14, 21, 28, and 40 mils were introduced by electro-discharge machining on the rolling elements and on the inner and outer raceways of both bearings, i.e., the drive-end and fan-end bearings. The dataset consists of vibration signals collected at a sampling frequency of 12 kHz by accelerometers installed on the fan-end and drive-end bearings of the motor, with the motor speed varying from 1720 to 1797 revolutions per minute (RPM) and the load varying from 0 to 3 HP.
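The figures quoted above can be turned into quantities that are convenient when segmenting the recordings. The short calculation below converts the fault diameters from mils to millimetres and estimates how many samples are recorded per shaft revolution at 12 kHz; the variable and function names are chosen for this example only.

```python
MIL_TO_MM = 0.0254  # 1 mil = 0.001 inch = 0.0254 mm

fault_diameters_mils = [7, 14, 21, 28, 40]
fault_diameters_mm = [round(d * MIL_TO_MM, 4) for d in fault_diameters_mils]

def samples_per_revolution(fs_hz, speed_rpm):
    """Number of samples recorded during one shaft revolution."""
    return fs_hz / (speed_rpm / 60.0)

# Slowest and fastest motor speeds stated for the dataset.
spr_slow = samples_per_revolution(12_000, 1720)  # about 418.6 samples
spr_fast = samples_per_revolution(12_000, 1797)  # about 400.7 samples
```

A segment length of a few hundred samples therefore covers roughly one revolution, which is one way to motivate the window sizes used when cutting the signals into training examples.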

Paderborn University Bearing Dataset
The Paderborn University dataset is another public archive of bearing data [7]. The testbed used for data collection consists of (a) a test motor, (b) a measuring shaft, (c) a bearing module, (d) a flywheel, and (e) a load motor, as depicted in Figure 5. The collected dataset contains vibration measurements recorded synchronously with motor current measurements. The sensors used in the equipment are one accelerometer, two current sensors, and one thermocouple. The vibration signals are recorded at high resolution with a 64 kHz sampling frequency. Experiments were carried out on 6 healthy bearings and 26 damaged bearings, of which 12 were artificially damaged and the rest contain real damage produced by accelerated tests.

PRONOSTIA Dataset
PRONOSTIA is a useful dataset that gives a realistic portrayal of the real-time degradation of bearings under different conditions [8]. Its test equipment consists of the following components: (a) NI DAQ cards, (b) a pressure regulator, (c) cylinder pressure, (d) a force sensor, (e) the bearing tested, (f) accelerometers, (g) a platinum RTD, (h) a coupling, (i) a torquemeter, (j) a speed reducer, (k) a speed sensor, and (l) an AC motor. Two uni-axial accelerometers with a 25.6 kHz sampling frequency are installed in the horizontal and vertical positions, as can be seen in Figure 6. The equipment also includes a rotating speed sensor and a force sensor.

IMS Dataset
This dataset is provided by the Center for Intelligent Maintenance Systems (IMS) of the University of Cincinnati [9]. It captures natural bearing defect evolution and contains a complete set of vibration signals, from the initial state to failure, with explicit time stamps. The bearings were kept running for 30 consecutive days at a fixed speed of 2000 rpm, which corresponds to 86.4 million cycles before the defect was confirmed [10]. The equipment contains (a) two accelerometers, (b) four bearings labeled b1, b2, b3, and b4, (c) a radial load, and (d) four thermocouples attached to the outer races of the bearings. Vibration data are recorded for 1 s at a 20 kHz sampling rate, at intervals of 5 or 10 min. The structure is illustrated in Figure 7. Table 1 provides a summary of the bearing datasets mentioned above.
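The cycle count quoted above follows directly from the running time and the shaft speed, as the short check below shows. The snapshot length is computed under the assumption that each recording lasts exactly 1 s at exactly 20 kHz, as stated in the text; the variable names are illustrative.

```python
speed_rpm = 2000          # fixed shaft speed
days = 30                 # continuous running time

# Revolutions = rev/min * min/hour * hours/day * days.
total_revolutions = speed_rpm * 60 * 24 * days

# Samples in one snapshot, assuming exactly 1 s at exactly 20 kHz.
samples_per_snapshot = 20_000 * 1
```

The result, 86,400,000 revolutions, matches the 86.4 million cycles reported for the run.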

Highlights of the Datasets
Section 3 of this article, titled "Dataset for Fault Bearing Experiment", described four public datasets for bearing fault diagnosis. The Case Western Reserve University (CWRU) dataset is one of the most widely used datasets for bearing fault diagnosis as well as detection. The Paderborn University dataset is regarded as the most efficient one, as it contains a realistic portrayal of the real-time degradation of bearings under different conditions; it has four sensors and a sampling frequency of 64 kHz. Moreover, the Intelligent Maintenance Systems dataset contains natural bearing defect evolution; both of these datasets can be used for fault detection and for the prediction of remaining useful life (RUL).

Effects of the Datasets
There are various conditions that affect the datasets; different datasets give different results with the same training models.
A bearing fault dataset is composed of signals, and a signal is characterized by three components: (1) frequency, (2) amplitude, and (3) phase. These properties vary from one fault to another, and variations can also be observed among signals of the same health type if they are collected under different working conditions. Hence, the type of data used in the bearing fault diagnosis process is significant, as it affects the performance of the developed model. Figure 8 [11] illustrates vibration signals for a healthy bearing (HB), a bearing with an outer race crack (BORC), a bearing with a rough inner surface (BRIRC), a ball with corrosion pitting (BCP), and combined bearing component defects (CBD) at 2000 rpm with no load [10]. It is evident from the figure that all the signals possess different waveforms; furthermore, these waveforms can undergo significant variations if the working conditions of the machinery change during the data collection process. Table 2 lists the parameters/conditions that affect the signals, including phase angle, change in amplitude, and change in sampling frequency.

Shallow Learning for Bearing Fault Diagnosis
In a standard bearing fault diagnosis framework, fault classification is normally performed by traditional machine learning (ML) algorithms. The classical machine learning algorithms are considered shallow, as they do not follow the concept of deep networks. As a result, their learning capability is limited, and they fail to extract salient information from complex, nonlinear, and high-dimensional data. Therefore, to apply shallow learning to bearing fault diagnosis, researchers mostly rely on the standard bearing fault diagnosis pipeline, which includes feature extraction and selection steps. Table 3 presents the classical machine learning algorithms used in bearing fault diagnosis research. These include the K-means singular value decomposition (K-SVD) dictionary algorithm for feature extraction [12], which can extract the fault frequency of every band; a back-propagation neural network (BP NN) is then applied to detect the failure type and obtain an accurate bearing fault diagnosis. Similarly, [13] proposed an ANN method for bearing fault diagnosis using a Local Binary Pattern (LBP) histogram, based on the micro-texture analysis of vibration images with local binary patterns. In [14], the authors proposed the use of infrared thermography (IRT) for bearing fault diagnosis. A two-dimensional discrete wavelet transform (2D-DWT) was used to decompose the thermal image, the dimensionality of the extracted data was reduced using principal component analysis (PCA), and the most important characteristics were then determined. The support vector machine (SVM), linear discriminant analysis (LDA), and k-nearest neighbor (KNN) were evaluated as classifiers for fault classification and performance evaluation, and the results show that the SVM outperformed both the LDA and the KNN.
Furthermore, the authors in [15] proposed a method based on sensing theory that can collect and compress raw data effectively at the same time. In [16], the authors proposed the Energy Fluctuated Multiscale Feature (EFMF) mining method with a deep ConvNet model for spindle bearing fault diagnosis. W. Zhang et al. [17] proposed a novel DL method using the residual learning algorithm for bearing fault diagnosis. P. Luo et al. proposed long short-term memory (LSTM) for fault recognition and a neural network for fault detection in bearing fault diagnosis [18]. C. Wu et al. [19] proposed KMCSVC, based on the kernel matrix, to find fault locations and identify severities. In [20], the authors proposed a method that identifies the bearing condition using the statistical central moments of the time-domain vibration signal, five maximum peaks, and the power spectral density, with ANN/SVM classifiers used to obtain high bearing fault diagnosis accuracy. An automatic method for bearing fault diagnosis based on pattern recognition and signal processing technology, combined with v-SVM for fault detection, was proposed in [21], whereas in [22], the authors proposed a technique based on the voltage, speed, and stator current of a machine for the diagnosis of bearing faults; this technique also detects lubricant problems and is reported to be well suited for classification. An improved Ant Colony Optimization (ACO) algorithm based on adaptive control parameters, combined with the support vector machine (SVM) model, was proposed for accurate bearing fault diagnosis [23]. An overview of more classical ML-based research is presented in Table 3.

Convolutional Neural Network (CNN)-Based Bearing Fault Diagnosis
The CNN is inspired by the animal visual cortex and was introduced to detect patterns in an input image and form complex feature maps in a hierarchical way (Fukushima, 1980). It has an advantage over other learning algorithms when dealing with two-dimensional data, as it can autonomously learn a representation of the input data through its layered architecture. Therefore, it is considered an efficient, end-to-end learning system in which only a single objective function of the model has to be optimized. The basic architecture of the CNN is given in Figure 9.
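The layered structure of a CNN can be illustrated with a forward pass through one convolutional layer, one pooling layer, and a dense SoftMax layer. The sketch below uses a one-dimensional input, as in the 1D-CNN variants applied to raw vibration signals, rather than an image; the filter count, kernel width, number of classes, and random untrained weights are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """Valid cross-correlation of x with each kernel, followed by ReLU."""
    maps = np.stack([np.correlate(x, k, mode="valid") for k in kernels])
    return np.maximum(maps + bias[:, None], 0.0)

def max_pool1d(maps, size):
    """Non-overlapping max pooling along the time axis."""
    usable = (maps.shape[1] // size) * size
    return maps[:, :usable].reshape(maps.shape[0], -1, size).max(axis=2)

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=128)                      # stand-in for a raw vibration segment
kernels = rng.normal(size=(4, 9)) * 0.1       # 4 untrained filters of width 9
bias = np.zeros(4)
dense_w = rng.normal(size=(3, 4 * 30)) * 0.1  # 3 hypothetical fault classes

feature_maps = conv1d_relu(x, kernels, bias)     # shape (4, 120)
pooled = max_pool1d(feature_maps, size=4)        # shape (4, 30)
class_probs = softmax(dense_w @ pooled.ravel())  # shape (3,)
```

In a trained network the kernels and dense weights would be learned jointly by minimizing a single objective, which is what makes the approach end-to-end; here they are random, so the output probabilities carry no diagnostic meaning.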

An algorithm based on the CNN was proposed in [24]. This work aimed to automate feature extraction from the bearing signals using a CNN so that the overhead of feature extraction and selection from bearing data could be avoided. In [25], an adaptive hierarchical CNN equipped with a SoftMax classifier, which can automatically learn salient information from the vibration acceleration signals, was used for bearing fault diagnosis. The developed hierarchical adaptive network was composed of two layers: the first layer identified the type of bearing fault, and the second layer predicted the severity of the bearing fault.
This model could adaptively vary its learning rate during the training phase, which significantly enhances the learning capability of the network. Hence, the proposed model delivered high classification accuracy when tested with unseen data in the testing phase and was able to predict the fault severity effectively. Chen Lu et al. [26] proposed an intelligent bearing fault diagnosis method based on health state classification, which uses a hierarchical CNN to extract features automatically from vibration signals. Meanwhile, in [27], researchers proposed extracting features from bearing data acquired through multiple sensors. The results suggest an enhancement in classification accuracy, since data from multiple sensors are believed to be richer in salient information about the health status of the bearing than single-sensor data. Thus, the CNN-based methods discussed above indicate that high and more reliable diagnosis performance can be achieved. In [28], a multi-scale deep convolutional neural network (MS-DCNN) was proposed, and the researchers showed that a multi-scale convolutional layer can expand and deepen the neural network for better learning and robust feature representation while reducing the number of network parameters, the training time, and the processing time. Furthermore, in [29], researchers proposed a method for bearing fault diagnosis that replaces manual feature extraction with a deep CNN for automatic feature extraction and uses a swarm optimization method to adapt to the signal characteristics. Ince et al. proposed a monitoring system implemented with a CNN [30]. This method achieves high-level generalization and avoids the need for manual parameter tuning and hand-crafted feature extraction. They claimed that their proposed method does not need any form of transformation, feature extraction, or preprocessing.
Their proposed method can directly access the raw data to perform bearing fault diagnosis effectively. In [31], a method for monitoring bearing health was proposed. The proposed system fuses the feature extraction and classification blocks of the common fault detection approach into a single body: the one-dimensional convolutional neural network (1D-CNN) learns optimized features from the raw data through BP training, while classification is performed by MLP layers. Wen et al. (2018) proposed a method that joins signal analysis and a DNN for bearing fault diagnosis; they applied the S transform to obtain the time-frequency representation of the signals and developed a modified CNN [32]. In 2019, the authors of [33] proposed a deep CNN method for bearing fault diagnosis that combines the detailed convolution, the input gate structure of LSTM, and the residual network, and that shows higher denoising ability. In [34], a scheme based on the CNN and the bi-spectrum analysis of the vibration signals was proposed; it is suggested that this method can be used for bearing diagnosis under variable speed conditions. The method proposed in [35] works on raw signals without any time-consuming hand-crafted feature extraction process, and it performs well when the working load changes and in noisy environments. In [36], a new bearing fault diagnosis method based on a signal-to-image conversion method was proposed, using a well-known motor bearing dataset, a self-priming centrifugal pump dataset, and an axial piston hydraulic pump dataset. In [37], a method using a CNN structure with 2D images for bearing fault diagnosis was proposed. In [38], an approach for bearing fault diagnosis using the 1D-CNN technique was proposed, which added a preprocessing step to the diagnosis pipeline that calculates the frequency spectrum of the vibration signals. Hao et al. proposed an end-to-end solution for bearing fault diagnosis with one-dimensional convolutional long short-term memory (1D-CLSTM) networks [39].
A further comparison of these articles is given in Table 4. The conditions targeted in these articles are outer raceway, inner raceway, ball/roller element fault, B fault, normal, damaged gear bearing, and damaged bearing output shaft, diagnosed from motor current signals and vibration signals. The proposed methods report different levels of efficiency and improved utilization of deep learning.
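To make the idea of learning features directly from raw vibration data more concrete, the following minimal sketch shows the three operations that a single 1D-CNN layer applies to a signal segment: convolution, ReLU activation, and max pooling. It is an illustration only and does not reproduce any of the cited models; the synthetic signal, the hand-picked kernel, and the function names `conv1d`, `relu`, and `max_pool1d` are assumptions made for this example, whereas in a real CNN the kernel weights are learned by backpropagation.

```python
import math

def conv1d(signal, kernel, stride=1):
    """Valid 1D cross-correlation, the operation computed by a CNN layer."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Hypothetical raw segment: a slow sinusoid plus an impulse every 16 samples,
# loosely imitating the periodic impacts produced by a localized bearing defect.
signal = [math.sin(2 * math.pi * 0.05 * n) for n in range(64)]
for n in range(8, 64, 16):
    signal[n] += 3.0

# Hand-picked kernel that responds to sharp peaks (assumption: a trained
# network would learn its own kernels instead).
kernel = [-1.0, 2.0, -1.0]

features = max_pool1d(relu(conv1d(signal, kernel)), size=4)
impulse_windows = [i for i, f in enumerate(features) if f > 1.0]
```

In this toy run, the pooled feature map is large only in the windows that contain an impulse, which is the kind of localized response a classifier placed on top of the convolutional layers can exploit.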

Auto-Encoders-Based Bearing Fault Diagnosis
The unsupervised auto-encoder method was first proposed in the year 1980 for the pre-training of an ANN [40,41], and it is a broadly implemented greedy layer-wise neural network pre-training method. It is a distinctive neural network in that its target output is the same as its input, so the network learns to reproduce its own input. An auto-encoder consists of an encoder, a bottleneck, a decoder, and a reconstruction loss. The encoder produces a new feature representation from the original one. The bottleneck is the layer that holds the compressed, lowest-dimensional representation of the input data. The decoder reverses the encoding process, taking the output of the encoder as its input, and the reconstruction loss measures the performance of the decoder, i.e., how close the output is to the original input. To reproduce the input at the output, the network takes the mean squared error between the original input and the output as the loss function. After training, the decoder is discarded while the encoder is retained, and classifiers can employ the output of the encoder in the feature representation stage. The general architecture of the auto-encoder is illustrated in Figure 10.
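The roles of the encoder, bottleneck, decoder, and reconstruction loss described above can be sketched with a very small linear auto-encoder trained by full-batch gradient descent. This is a simplified illustration under stated assumptions rather than any of the surveyed architectures: the data are synthetic, the layers are linear with no biases, and the names `encode`, `decode`, `reconstruction_loss`, and `train_epoch` as well as the sizes `D` and `H` are hypothetical.

```python
import random

random.seed(0)
D, H = 4, 2  # input dimension and bottleneck size (illustrative values)

# Hypothetical data lying on a 2-D subspace of R^4, so a 2-unit bottleneck suffices.
data = []
for _ in range(50):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append([a, b, a, b])

W_enc = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(H)]  # H x D
W_dec = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(D)]  # D x H

def encode(x):
    """Encoder: map the input to its compressed bottleneck representation."""
    return [sum(W_enc[k][j] * x[j] for j in range(D)) for k in range(H)]

def decode(h):
    """Decoder: map the bottleneck code back to the input space."""
    return [sum(W_dec[i][k] * h[k] for k in range(H)) for i in range(D)]

def reconstruction_loss(samples):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in samples:
        x_hat = decode(encode(x))
        total += sum((p - q) ** 2 for p, q in zip(x_hat, x)) / D
    return total / len(samples)

def train_epoch(samples, lr=0.1):
    """One full-batch gradient descent step on the MSE reconstruction loss."""
    g_enc = [[0.0] * D for _ in range(H)]
    g_dec = [[0.0] * H for _ in range(D)]
    for x in samples:
        h = encode(x)
        err = [p - q for p, q in zip(decode(h), x)]  # x_hat - x
        back = [sum(W_dec[i][k] * err[i] for i in range(D)) for k in range(H)]
        for i in range(D):
            for k in range(H):
                g_dec[i][k] += err[i] * h[k]
        for k in range(H):
            for j in range(D):
                g_enc[k][j] += back[k] * x[j]
    scale = 2.0 * lr / (D * len(samples))
    for i in range(D):
        for k in range(H):
            W_dec[i][k] -= scale * g_dec[i][k]
    for k in range(H):
        for j in range(D):
            W_enc[k][j] -= scale * g_enc[k][j]

loss_before = reconstruction_loss(data)
for _ in range(500):
    train_epoch(data)
loss_after = reconstruction_loss(data)

# After training, only the encoder is kept; its output serves as the feature vector.
features = [encode(x) for x in data]
```

Practical auto-encoders add nonlinear activations, bias terms, and several stacked layers, but the training signal is the same reconstruction error shown here.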
An AE thus comprises two trainable parts, i.e., the encoder and the decoder. Diverse research has been carried out using AEs. The first article proposed a tool for diagnosing bearing faults from massive data, which used a five-layer auto-encoder on the frequency spectrum and effectively classified the health of the machines with an accuracy of 99.6% [42]. Furthermore, the authors of [43] combined a Gaussian kernel function with a deep auto-encoder network, resulting in effective bearing fault diagnosis. In [44], a two-layered bearing fault diagnosis scheme was proposed: the first layer identifies the fault pattern in the rotary machine bearing, and the second identifies the crack size for certain faults.
Article [45] states that bearing fault diagnosis combining the capability of AEs with the high training speed of an Extreme Learning Machine (ELM) provided better classification performance without explicit feature extraction. In [46], the feasibility of Stacked Denoising Auto-encoder (SDAE)-based bearing fault diagnosis was examined using health state classification datasets from rolling bearings. In [47], a fault recognizer was studied that stacks several denoising auto-encoders into an SDAE to denoise and extract features from raw vibration signals. In [48], a novel deep AE feature learning method for rotating machinery fault diagnosis was developed. In [49], a multi-sensor feature fusion method for bearing fault diagnosis that combines SAE and DBN methods was proposed. In [50], a bearing fault diagnosis solution was compared with multiple techniques, and the authors concluded that sparse auto-encoders perform better and can be deployed in health, motor, and air compressor applications. In [51], a new method was proposed that works on temporal vibration signals: winner-take-all (WTA) is used during the training stage to learn sparse features suitable for bearing fault diagnosis, and a soft voting method is applied to improve the accuracy of the diagnosis result. In [52], a method that uses acoustic emission (AE) sensors with big data was proposed; it applies the STFT to transform raw signals from the time domain to the frequency domain and generate a spectrum matrix. Sub-patterns are generated from this spectrum to obtain the optimized DL structure, and the Large Memory Storage and Retrieval (LAMSTAR) network then diagnoses the bearing fault. This also demonstrates an effective use of deep auto-encoder networks in classification and feature extraction for bearing fault diagnosis, which minimizes time consumption and maximizes accuracy. 
The average highest accuracy is achieved in [46,48] of 100%, whereas other proposed solutions are also accurate and the best as per their strategies. The targeted faults of the bearing are the inner raceway, outer raceway, roller fault, normal, cage fault, vibration signals, eccentric fault, spalling fault, misalignment fault, and abrasion fault. A survey of the results achieved through deep auto-encoders used in fault bearing diagnosis of previous research is presented in Table 5.
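The soft voting step mentioned for [51] can be illustrated generically: each classifier outputs a probability for every fault class, the probabilities are averaged, and the class with the highest average is selected. The sketch below assumes three hypothetical classifiers and three classes; the function name `soft_vote`, the class labels, and the probability values are invented for illustration and are not taken from the cited work.

```python
def soft_vote(prob_rows, weights=None):
    """Average class probabilities across classifiers and return (label, averages).

    prob_rows: one list of class probabilities per classifier.
    weights:   optional per-classifier weights; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    n_classes = len(prob_rows[0])
    avg = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    return avg.index(max(avg)), avg

classes = ["normal", "inner raceway", "outer raceway"]  # hypothetical label set
predictions = [
    [0.5, 0.4, 0.1],  # classifier 1 leans towards "normal"
    [0.2, 0.6, 0.2],  # classifier 2 leans towards "inner raceway"
    [0.3, 0.5, 0.2],  # classifier 3 leans towards "inner raceway"
]
label, avg = soft_vote(predictions)
diagnosis = classes[label]
```

Unlike hard (majority) voting, soft voting takes the confidence of each classifier into account, so a single very confident model can outweigh several uncertain ones.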

Deep Belief Network (DBN)-Based Methods for Bearing Fault Diagnosis
The DBN is a deep neural network that is constructed from several layers of restricted Boltzmann machines (RBMs) [53]. Every RBM has a layer of visible units and a layer of hidden units, with connections between the visible and hidden layers but none within a layer. The generic structure of the DBN is illustrated in Figure 11, where (h1, h2, h3) are hidden layers, y is the visible layer, and x is a hidden layer; there are multiple independent neurons in every layer. The process of DBN learning begins from the lowest visible layer and comprises two stages. In the first stage, the RBM layers are pre-trained one by one in a greedy manner: the approximation of the input data learned through the unsupervised training of the first RBM is used as input to train the next RBM, and this continues until the last RBM has been trained. In the second stage, the complete network is fine-tuned to adjust its parameters so that better performance can be achieved.
Much research has been carried out in previous years using the DBN in the field of bearing fault diagnosis [54]. In that work, a new hierarchical bearing fault diagnosis method was presented in which NM is adapted to the training process of the DBN to directly extract deep features from frequency-domain signals. In [55], an adaptive DBN with a dual-tree complex wavelet packet, which refines the measured vibration signals to design the original set of features, was proposed; this method can also recognize different bearing faults. In [56], a bearing fault and severity diagnosis framework was proposed that combines several techniques to obtain more accurate and capable bearing fault diagnostic algorithms. In [57], a method for rolling bearing fault diagnosis was proposed that has three steps: DBNs are constructed according to different hyperparameters, then IWV is used to determine the weight matrix of every DBN, and finally the DBNs vote together according to their respective matrices to obtain the final diagnosis result. In [58], a hierarchical diagnosis network for rolling bearing fault diagnosis was proposed, in which the researchers employed a wavelet packet transform representation of the fault features and a DBN to detect and classify the type of fault. In [59], an adaptive DBN (ADBN) was proposed that identifies the different bearing conditions using the dual-tree complex wavelet packet transform (DTCWPT), which refines the measured vibration signals to design the original feature set.
The comparison of articles containing DBN-based methods is shown in Table 6. The targeted conditions in these approaches were: healthy, inner raceway, outer raceway, normal, ball raceway, gear teeth breakage, broken bar, bowed rotor, stator winding defect, unbalanced motor, defective bearing, and roller fault.
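The greedy layer-wise pre-training stage of a DBN can be sketched as follows. Each RBM is trained here with one step of contrastive divergence (CD-1), which is a common choice but is an assumption of this example and is not necessarily the procedure used in the cited works. The hidden activations of a trained RBM become the training data of the next one. The binary toy data, layer sizes, learning rate, and the names `RBM`, `cd1_update`, and `pretrain_stack` are hypothetical, and the supervised fine-tuning stage is omitted.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.nv, self.nh = n_visible, n_hidden
        self.W = [[random.gauss(0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(self.nv)))
                for j in range(self.nh)]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(self.nh)))
                for i in range(self.nv)]

    def cd1_update(self, v0, lr=0.1):
        """One contrastive-divergence (CD-1) parameter update for a single sample."""
        ph0 = self.hidden_probs(v0)
        h0 = [1.0 if random.random() < p else 0.0 for p in ph0]  # sample hidden units
        v1 = self.visible_probs(h0)                               # reconstruction
        ph1 = self.hidden_probs(v1)
        for i in range(self.nv):
            for j in range(self.nh):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(self.nh):
            self.c[j] += lr * (ph0[j] - ph1[j])

    def reconstruction_error(self, data):
        err = 0.0
        for v in data:
            r = self.visible_probs(self.hidden_probs(v))
            err += sum((a - b) ** 2 for a, b in zip(v, r)) / self.nv
        return err / len(data)

def pretrain_stack(data, layer_sizes, epochs=200):
    """Greedy layer-wise pre-training: train one RBM, then feed its output upward."""
    rbms, errors, layer_input = [], [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden)
        before = rbm.reconstruction_error(layer_input)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_update(v)
        errors.append((before, rbm.reconstruction_error(layer_input)))
        rbms.append(rbm)
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
    return rbms, errors, layer_input

# Hypothetical binary patterns standing in for two machine conditions.
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 5
rbms, errors, top_features = pretrain_stack(data, layer_sizes=[3, 2])
```

In a complete DBN, the stacked weights obtained in this way would initialize a deep network that is subsequently fine-tuned with labeled data.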

Recurrent Neural Network (RNN)-Based Methodologies
An RNN processes input data in a recurrent manner. The architecture of an RNN is illustrated in Figure 12. The recurrent model can capture and model sequential or time-series data as information flows from its hidden layer to the output layer. It is a generalized form of a Feed Forward Neural Network (FNN) that has internal memory. An RNN is recurrent in that it performs the same function for every input, while the output for the current input depends on the previous computation. It is trained with backpropagation through time (BPTT), and a notorious vanishing-gradient issue stems from its recurrent nature. To tackle this issue, the LSTM was developed, which is augmented with recurrent forget gates. The LSTM is capable of modeling long-term dependencies in data; it has therefore taken a dominant role in time-series and text analysis and has achieved success in natural language processing, video analysis, speech recognition, etc. In [60], a model based on the LSTM neural network was proposed. In [61], a data-driven method was proposed that handles long-term time dependencies, so that spatial and temporal dependencies in the available sensor measurement signals can be utilized for bearing fault detection. In [62], a technique for BLDC fault detection and diagnosis was presented, together with its application to detecting and accurately classifying faults under non-stationary operating conditions. Some of the work using this method is compared in Table 7.
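The gating mechanism that lets an LSTM retain information over long sequences can be made concrete with a single forward step of a standard LSTM cell. The sketch below follows the commonly used formulation with forget, input, and output gates; it is not the specific model of any cited article, and the dimensions, random weights, and the names `init_params`, `affine`, and `lstm_step` are assumptions made for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, z):
    """Compute W z + b for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]

def init_params(n_in, n_hidden, forget_bias=1.0):
    """Small random weights; the forget-gate bias value is an illustrative choice."""
    def mat():
        return [[random.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
                for _ in range(n_hidden)]
    return {
        "Wf": mat(), "bf": [forget_bias] * n_hidden,  # forget gate
        "Wi": mat(), "bi": [0.0] * n_hidden,          # input gate
        "Wo": mat(), "bo": [0.0] * n_hidden,          # output gate
        "Wg": mat(), "bg": [0.0] * n_hidden,          # candidate cell update
    }

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step: returns the new hidden state and cell state."""
    z = list(x) + list(h_prev)  # concatenate input and previous hidden state
    f = [sigmoid(a) for a in affine(p["Wf"], p["bf"], z)]
    i = [sigmoid(a) for a in affine(p["Wi"], p["bi"], z)]
    o = [sigmoid(a) for a in affine(p["Wo"], p["bo"], z)]
    g = [math.tanh(a) for a in affine(p["Wg"], p["bg"], z)]
    # The forget gate scales the old memory; the input gate admits new content.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Run the cell over a hypothetical 20-sample vibration-like sequence.
params = init_params(n_in=1, n_hidden=3)
h, c = [0.0] * 3, [0.0] * 3
for t in range(20):
    h, c = lstm_step([math.sin(0.3 * t)], h, c, params)
```

Because the cell state is updated additively and is scaled only by the forget gate, it can be carried across many time steps almost unchanged when the forget gate is close to one, which is what mitigates the vanishing-gradient issue of a plain RNN.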

Other Methods
There are various other deep learning methods for bearing fault diagnosis. Some of them are new approaches, and some are a mixture of the previously discussed methods. In [63], an ensemble stacked sparse auto-encoder approach for bearing fault diagnosis was proposed. In [64], multiple wavelets are fused in a deep residual network using two techniques, i.e., concatenation and maximization, to effectively capture useful information for bearing fault diagnosis. In [65], the authors proposed a method that first maps the original sound signals into the time-frequency domain by STFT; an SAE then extracts the intrinsic fault features automatically, and SoftMax regression is finally used to recognize the fault modes from the feature vectors. In [66], a bearing fault diagnosis method was proposed that uses all of the above-mentioned methods with four different preprocessing schemes. Similarly, in [67], a method based on Dilated Residual Networks and DWWC was proposed to find a good set of features for fault diagnosis, using the Planetary Gearbox dataset. Table 8 compares these methods.
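The first and last stages of a pipeline such as the one described for [65], i.e., the STFT mapping and the SoftMax output, can be sketched with the standard library alone. The intermediate SAE stage is omitted, the test tone and class scores are invented, and the names `stft_magnitude`, `dft_magnitude`, `hann`, and `softmax` are hypothetical; a naive DFT is used for clarity where a practical implementation would use an FFT routine.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]

def dft_magnitude(frame):
    """Magnitude of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def stft_magnitude(signal, frame_len, hop):
    """Short-time Fourier transform: one magnitude spectrum per windowed frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        frames.append(dft_magnitude(segment))
    return frames

def softmax(scores):
    """Numerically stable softmax turning class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical signal: a pure tone that falls exactly on DFT bin 4 of a 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spec = stft_magnitude(signal, frame_len=32, hop=16)
peak_bins = [max(range(len(fr)), key=fr.__getitem__) for fr in spec]

# Hypothetical class scores produced by a feature extractor and a linear layer.
probs = softmax([2.0, 1.0, 0.1])
```

Each row of `spec` is the spectrum of one time frame, so the matrix as a whole is the time-frequency representation from which a learned feature extractor would derive its fault features.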

Discussion
The studies considered for the deep-learning-based fault diagnosis framework were surveyed mainly with respect to the proposed methodology and the reliability of the diagnostic performance. Almost all of the considered methods were developed on publicly available datasets, which allows reproducibility and leaves scope for further analysis and research in this field. It can be inferred that during the early stages of research in this field, researchers relied mostly upon engineered features and classical machine learning algorithms. However, as the research progressed, more realistic settings have been considered for bearing fault diagnosis. One realistic assumption is the use of data collected under working conditions similar to the real-time environment of the industry. Practical factors that can be considered during data collection include variable motor speed, variable motor load, the presence of compound faults, and the presence of multiple fault severities. These variations constitute erratic working conditions of the machinery under examination, which make fault diagnosis a challenging task. Based on the literature review, bearing fault diagnosis models developed using classical machine learning algorithms suffer a deterioration in diagnosis performance under such erratic working conditions. Under these circumstances, rather than the classical domain-dependent frameworks based on statistical feature analysis, deep-learning-based approaches establish diagnosis as a more general framework and improve accuracy. Among the deep-learning-based approaches, the most popular techniques, such as the CNN, AE, DBN, RNN, DNN, and SAE, have been efficiently utilized in rotatory machine fault diagnosis and achieve higher accuracy than classical methods. According to our survey, CWRU is the most frequently used dataset in these experiments. However, for real-world scenarios, where the data are not acquired under ideal conditions, there is still great opportunity to explore these established methods to build more generalized and robust diagnosis models.

Limitations of Classical Machine Learning
Despite the fact that machine learning algorithms have been widely used in the construction of a predictive maintenance mechanism, there are certain drawbacks. The purpose of developing predictive maintenance algorithms is to automatically detect and diagnose any issue in the equipment under observation. It is also necessary to detect faults in order to adopt an efficient equipment prognosis approach. The following are some of the limits of machine learning in the context of predictive maintenance [68][69][70].

Generalizability
Machine learning has a domain-specific implementation methodology. This means that the algorithm must be trained and fine-tuned separately for each type of application.

Domain-Related Knowledge
Expert knowledge of the problem domain is necessary when utilizing machine learning algorithms in predictive maintenance activities. In machine-learning-based fault detection, diagnosis, and prognosis, a feature engineering step is required. Feature engineering is a challenging process that requires a great deal of experience to develop handcrafted features that can represent the structure of the dataset and can also capture the growth of a fault.

Learning Ability, Reliability and Performance
Because classical machine learning methods rely on a simple network topology, such networks have limited learning capability; they are referred to as shallow networks. In practice, the data used in data-driven predictive maintenance are noisy, nonlinear, and complicated. Classical machine learning algorithms struggle to manage data with abnormalities, non-stationarity, or non-linearity, which are common in data from industrial equipment. As a result, shallow networks are limited in their ability to provide data abstraction in the form of failure prediction features, and the overall performance of machine learning algorithms degrades when real-time datasets are used for predictive maintenance.

Cross-Domain Analysis
In cross-domain applications, there is a lack of performance. Satisfactory performance is not guaranteed if the nature of the application becomes complex. The failure prediction data are used to guide maintenance operations.

Advantages and Disadvantages of Deep Learning

Advantages

1. The automated learning of structure from new data is the main benefit of using a deep learning system. The hierarchy of nonlinear transformations makes it simple to extract information from raw data without the requirement for feature extraction and selection.
2. Because the overhead of feature engineering and selection is not required, developing condition monitoring, fault detection and diagnosis, and prognosis strategies for predictive maintenance is quite simple.
3. Deep learning algorithms are better suited to transfer learning, which paves the way for cross-domain data-driven predictive maintenance solutions to be developed.
4. When compared to machine-learning-based predictive maintenance strategies, deep-learning-based predictive maintenance strategies have a higher generalization potential.
5. The larger the number of layers and neurons in a deep learning network, the more complicated the problems it can handle, resulting in a performance improvement.
6. The most appealing aspect of using deep learning in predictive maintenance is that these networks can automatically extract the relevant features from data, obviating the need for manual feature engineering.
7. When a deep learning model is kept up to date, it can predict failures and cover new events or behaviors.

Disadvantages

1. To perform better than other strategies, deep learning requires a large volume of data.
2. Because of the complicated data models, training is exceedingly costly. Deep learning can also require expensive GPUs and hundreds of workstations, which raises the costs for users.
3. There is no standard theory to guide the choice of the correct deep learning tools, because this choice requires knowledge of the topology, the training method, and other characteristics. As a result, it is difficult for less skilled people to adopt.
4. It is difficult to interpret the output based on learning alone, and therefore classifiers are required. Such tasks are carried out using algorithms based on convolutional neural networks.

Comparison of Deep Learning Models
Table 9 presents the detailed comparison of deep-learning-based models for fault diagnosis.

Deep Neural Network (DNN)
Description: More than two layers are present, which allows sophisticated non-linear relationships to be modeled. It is utilized for both classification and regression.
Advantages: It is frequently utilized and has a high level of accuracy.
Disadvantages: Because the error is propagated back through the previous layers, the training process is not straightforward. The learning process of the model is also very slow.

Convolutional Neural Network (CNN)
Description: This network performs well with two-dimensional data. It is made up of convolutional filters that turn two-dimensional data into three-dimensional data.
Advantages: Very good performance, and the model learns quickly.
Disadvantages: For classification, it requires a large amount of labeled data.

Recurrent Neural Network (RNN)
Description: It has the ability to learn and remember sequences. All of the weights are shared across all of the steps and neurons.
Advantages: LSTM, BLSTM, MDLSTM, and HLSTM are some of the variants that can learn sequential events and model time dependencies. These provide state-of-the-art accuracy in speech recognition, character recognition, and a number of other natural language processing applications.
Disadvantages: Due to gradient vanishing and the need for large datasets, there are numerous difficulties.

Deep Belief Network (DBN)
Description: DBNs are probabilistic generative models that provide a joint probability distribution over observable data and labels.
Advantages: It addresses the problem of parameter selection, which can lead to poor local optima in some circumstances, and ensures that the network is properly initialized. Because the procedure is unsupervised, no labeled data are required.
Disadvantages: DBNs have a number of flaws, such as the high computational cost of training a DBN and the lack of clarity surrounding the steps for further network optimization based on the maximum likelihood training approximation. They also do not account for the two-dimensional structure of an input image, which may significantly affect their performance and applicability in computer vision and multimedia analysis problems.

Auto-Encoders
Description: Auto-encoders are a type of unsupervised learning technique in which neural networks are used to learn representations. The network architecture is designed with a bottleneck that forces a compressed knowledge representation of the original input. Instead of capturing as much information as possible, an auto-encoder learns to capture as much relevant information as feasible.
Advantages: They are particularly useful in feature extraction, since they can represent data as nonlinear representations.
Disadvantages: An auto-encoder must be trained first, which requires a lot of data, processing time, hyperparameter tuning, and model validation before the actual model is even built.

Deep Boltzmann Machine (DBM)
Description: The DBM has entirely undirected connections, whereas in a DBN the top two layers constitute an undirected graphical model and the lower layers form a directed generative model. In DBMs with several layers of hidden units, the units in odd-numbered layers are conditionally independent given the units in even-numbered layers, and vice versa.
Advantages: They can capture multiple layers of complicated input data representations and are suitable for unsupervised learning, since they can be trained on unlabeled data, but they can also be fine-tuned for a specific task in a supervised manner.
Disadvantages: One of the most significant drawbacks is the high computational cost of inference, which makes joint optimization on large datasets nearly impossible. Several strategies for improving the effectiveness of DBMs have been presented. These include employing distinct models to initialize the values of the hidden units in all layers to speed up inference.

Future Perspectives of Deep Learning
Deep-learning-based predictive maintenance still has room for improvement. In the next subsections, some of the limitations of deep learning algorithms in terms of predictive maintenance are discussed.

Enhanced Generalization
Although advanced deep learning approaches such as fine-tuning-based transfer learning and multitask learning have added a degree of generality to data-driven predictive maintenance strategies, these concepts still need to be investigated further. Domain-independent data-driven predictive maintenance can be implemented using notions such as these.

Explainability
Deep learning's data processing and exploration capabilities are generally superior to those of classical machine learning. Its application in predictive maintenance has eliminated much of the overhead and difficulty associated with traditional machine learning techniques.
To name a few advantages, it can readily handle large amounts of data and can learn important information from inputs without the need for a domain-specific feature engineering process. On the other hand, for all their expanded capability, deep learning algorithms behave like a black box. There is currently no comprehensive explanation of how deep learning algorithms correctly model complicated, nonlinear, and nonstationary data in an abstract manner. Furthermore, it is not known how the estimated codes, also known as features, perform better than their predecessors in terms of predictive maintenance. There is a need for explainable deep-learning-based predictive maintenance strategies.

Multimodal and Multisensor Data Fusion
Data fusion from numerous sensors and modalities is an intriguing and viable extension of data-driven predictive maintenance based on deep learning. Data fusion can offer detailed information about bearing faults, which can help improve bearing fault detection models. Data fusion from many sensors is also a practical aspect, as multiple sensors are typically mounted on the concerned component to collect data for better performance.
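One simple form of multisensor fusion is feature-level fusion, in which features are computed for each sensor channel separately and then concatenated into a single vector for the diagnosis model. The sketch below is a minimal illustration under stated assumptions: the two short channels are synthetic, the three statistical features (RMS, peak, kurtosis) are a common but arbitrary choice, and the names `channel_features` and `fuse_features` are hypothetical.

```python
import math

def channel_features(x):
    """Return [RMS, peak absolute value, kurtosis] of one sensor channel."""
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    kurtosis = m4 / (m2 * m2) if m2 > 0 else 0.0  # guard against constant signals
    return [rms, peak, kurtosis]

def fuse_features(channels):
    """Feature-level fusion: concatenate the per-channel feature vectors."""
    fused = []
    for x in channels:
        fused.extend(channel_features(x))
    return fused

# Hypothetical readings from two sensors mounted on the same bearing housing.
sensor_a = [1.0, -1.0, 1.0, -1.0]
sensor_b = [0.0, 2.0, 0.0, -2.0]
fused = fuse_features([sensor_a, sensor_b])
```

Deep-learning-based fusion typically replaces the hand-picked statistics with learned features, for example by feeding each channel to its own network branch, but the concatenation step that combines the channels is analogous.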

Conclusions
In this paper, we investigated the applications of deep learning algorithms for bearing fault diagnosis. In most of the studies, researchers tend to rely on publicly available datasets because of their easy availability and ideal working conditions. From the performance analysis of the considered studies, we saw that deep learning algorithms are highly capable of learning the health characteristics automatically, and that the diagnostic performance has been significantly improved. Furthermore, the analysis indicates that the accuracy of many improved deep-learning-based methods can be increased further through more training, which suggests directions for exploration and new work on intelligent bearing fault diagnosis. However, it should be considered that the success of deep-learning-based diagnosis models still relies on some kind of domain-based analysis and is subject to the availability of sufficient labeled samples. Therefore, this review is anticipated to systematically present the development and progress of the deep-learning-based bearing fault diagnosis framework and to deliver valuable guidelines for future research.
Informed Consent Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.