Deep Learning with LPC and Wavelet Algorithms for Driving Fault Diagnosis

Vehicle fault detection and diagnosis (VFDD) and predictive maintenance (PdM) are indispensable for early diagnosis, which helps prevent severe accidents caused by mechanical malfunction in urban environments. This paper proposes an early voiceprint-based driving fault identification system that uses machine learning algorithms for classification. Previous studies have examined driving fault identification, but less attention has been paid to using voiceprint features to locate the corresponding faults. This research uses voiceprint signals from 43 common vehicle mechanical malfunction conditions to construct the dataset. The signals were filtered by linear predictive coefficient (LPC) and wavelet transform (WT) methods. After the original fault sounds were filtered to extract the main fault characteristics, deep neural network (DNN), convolutional neural network (CNN), and long short-term memory (LSTM) architectures were used for identification. The experimental results show that the CNN achieves the best accuracy on the LPC dataset, whereas for the wavelet dataset the DNN performs best in terms of both identification accuracy and training time. Cross-comparison of the experimental results shows that the wavelet algorithm combined with DNN improves identification accuracy by up to 16.57% and reduces model training time by up to 21.5% compared with the other deep learning algorithms. By cross-comparing recognition results across several machine learning methods, the vehicle can proactively alert the driver in real time to potential hazards arising from mechanical failure.


Introduction
Topics such as the "Metaverse", "Big Data", "Artificial Intelligence (AI)", and "Digital Transformation" are in full swing [1,2], and the critical point is the use of data acquisition (DAQ), data analysis, and machine learning to integrate digitalization with smart manufacturing. With the popularization of 5G communication and the rapid development of Industry 4.0, the integration of technologies such as AI, the Internet of Things (IoT), and cloud computing has become a key direction of development in the field [3-5]. Recently, the science and technology community has been looking forward to integrating the virtual and the real (Digital Twin), leading technology toward the metaverse as a new digital world. These ever-changing cross-domain integrations show that the application of AI is leading the development of future technology and has penetrated into every corner.
The development of Industry 4.0 has attracted increased attention to fault diagnosis in recent years. For equipment automation, effective fault diagnosis can save time by allowing [...] A MEMS microphone array is adopted because the acoustic microphone has high resolution in the middle and high frequency bands. In the early stage of a fault, its characteristics appear in the mid-to-high frequency signal [15,16]; consequently, acoustic sensing technology is well suited to early anomaly detection in rotating mechanical systems.

Instead of using the MEMS microphone array, addressing these problems requires the use of appropriate filtering methods. Methods such as the Mel-scale frequency cepstral coefficient (MFCC) [17,18], fast Fourier transform (FFT) [19,20], order-tracking technology [16], and wavelet transform [21] have been applied to machinery fault diagnosis. Linear prediction coefficients (LPC) have been applied in many modern speech processing systems for coding, synthesis, analysis, and recognition [22]; the initial model is constructed from historical data, and new data can then be used for testing and verification to predict the associated outcomes of audio signal data. Previous studies have used LPC to achieve perfect fault diagnosis performance [23], and compared with the abovementioned filtering algorithms, LPC uses fewer resources to achieve high-resolution spectra.
The development of artificial intelligence (AI) and the Internet of things (IoT) [4,24] has raised new possibilities. For example, failure detection of vehicle suspension systems [25] uses AI to achieve early prediction, thus reducing the occurrence of vehicle accidents.
Machine learning algorithms can be used to perform large-scale data analysis. Support vector machines (SVM) were first used for fault diagnosis in the late 1990s [26]. Artificial neural networks (ANN) are among the most widely used methods for fault diagnosis and have been used for mechanical fault prediction [27] and in the later development of multilayer neural networks [28]. Improvements in hardware capability further drove the application of neural networks. Increasing the number of neurons and hidden layers deepens the network, which can improve the recognition rate. Deep neural networks (DNN) are still at the development stage for fault identification applications, and many challenges remain. For example, very large amounts of data can significantly impair processing efficiency [29]. Even with strong hardware support, data processing presents a major challenge and is difficult to apply in practice. This article also discusses the challenge of finding a suitable activation function to accelerate neuron convergence. Previously, CNNs were mainly used for image and facial recognition [30,31] and have rarely been used for classification in speech processing. In this experiment, the spectrum dataset is classified using a CNN model for comparison with other types of classifiers, such as DNNs.

The remainder of this article is organized as follows. Section 2 discusses the theoretical background, including the LPC, wavelet, DNN, CNN, and LSTM algorithms. Section 3 introduces the experimental framework, in which sound signals are collected and converted into voiceprint characteristic spectra to build a dataset. Section 4 presents the test results of the ML methods on the dataset. Section 5 draws conclusions and presents directions for future work.

Linear Predictive Coding Method
Linear prediction coefficient (LPC) analysis is one of the most effective speech analysis techniques and is widely used in speech recognition and audio compression [32], as illustrated in Figure 2, where G is the gain value. LPC can provide accurate speech parameter prediction, making it well suited for modeling the transfer characteristics of sound sources. LPC also shows good analytical performance when extracting the noise characteristics of a mechanical transmission system. The underlying idea is that the current sample x(k) of a linear discrete-time system can be approximated by a linear weighted combination of previous samples:

$$\hat{x}(k) = \sum_{i=1}^{p} \alpha_i \, x(k-i)$$

where the integer $k$ is the time index, $\alpha_i$ is the $i$-th linear prediction coefficient, and $p$ is the number of past samples used. Given the predicted signal $\hat{x}(k)$, the prediction error $e(k)$ is given by:

$$e(k) = x(k) - \hat{x}(k) = x(k) - \sum_{i=1}^{p} \alpha_i \, x(k-i)$$

Here $\hat{x}(k)$ is the predicted sample, and the coefficients $\alpha_i$ are chosen to minimize the mean square error (MSE) of $e(k)$:

$$E\left[e^{2}(k)\right] = E\left[\left(x(k) - \sum_{i=1}^{p} \alpha_i \, x(k-i)\right)^{2}\right]$$

Figure 2. The spectral features filtered by LPC method.
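The minimization described above is commonly solved with the autocorrelation method and the Levinson-Durbin recursion. The following is a minimal sketch under that assumption; the paper computes LPC in MATLAB and does not state which solver it uses. The function names and the synthetic test signal are hypothetical and serve only as illustration.

```python
import random


def autocorrelation(signal, max_lag):
    """Biased autocorrelation r[0..max_lag] of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[k] * signal[k - lag] for k in range(lag, n)) / n
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Estimate alpha_1..alpha_p with the Levinson-Durbin recursion.

    Returns (alphas, error), where x_hat(k) = sum_i alphas[i-1] * x(k-i)
    and error is the remaining mean square prediction error.
    Assumes the signal is not identically zero.
    """
    r = autocorrelation(signal, order)
    alphas = []
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient of the order-i predictor.
        acc = r[i] - sum(alphas[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / error
        # Update the lower-order coefficients, then append the new one.
        alphas = [alphas[j] - k * alphas[i - 2 - j] for j in range(i - 1)] + [k]
        error *= 1.0 - k * k
    return alphas, error


def synthetic_ar2(n_samples, seed=0):
    """Illustrative signal (an assumption, not data from the paper):
    x(k) = 0.75 x(k-1) - 0.5 x(k-2) + white noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n_samples):
        x.append(0.75 * x[-1] - 0.5 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


if __name__ == "__main__":
    alphas, mse = lpc_coefficients(synthetic_ar2(20000), order=2)
    print("estimated coefficients:", alphas)
    print("residual MSE:", mse)
```

On a signal generated by a known second-order recursion, the estimated coefficients should land close to the generating values, which is a quick sanity check before applying the routine to recorded audio frames.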

Wavelet Transform (WT)
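The Shannon recovery formula can assist in restoring the original analog function $y(t)$ from its samples:

$$y(t) = \sum_{n=-\infty}^{\infty} y(nT)\,\frac{\sin\left[\pi (t - nT)/T\right]}{\pi (t - nT)/T}$$

where $T$ is the sampling period. The continuous wavelet transform (CWT) of the time-domain signal $y(t)$ can be expressed by the following transformation formula:

$$\mathrm{CWT}(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} y(t)\, \phi^{*}\!\left(\frac{t - b}{a}\right) dt$$

where $a$ is the scale, $b$ is the translation, and $\phi(t)$ is the wavelet function. If $a = 1/2^{\sigma}$ is used as the scale and $b = k/2^{\sigma}$ as the translation, with both $\sigma$ and $k$ integers, then the CWT of $y(t)$ at the point $(k/2^{\sigma}, 1/2^{\sigma})$ of the time-scale plane represents the relationship between $y(t)$ and $\phi(t)$ at that time-scale point, and is called the discrete wavelet transform (DWT). This method generates a sparse set of values on the time-scale plane. The expression is as follows:

$$\mathrm{DWT}(k, \sigma) = 2^{\sigma/2} \int_{-\infty}^{\infty} y(t)\, \phi^{*}\!\left(2^{\sigma} t - k\right) dt \tag{6}$$

With this expression, the wavelet coefficients are evaluated at $b = k/2^{\sigma}$, $a = 1/2^{\sigma}$, which maps the time-domain signal $y(t)$ onto the discrete time-scale grid [33].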
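To make the dyadic scaling concrete, the sketch below implements a multi-level discrete wavelet decomposition for sampled data. It assumes the Haar wavelet, which is chosen here only because it is the simplest case; the paper does not state which mother wavelet it uses, and all function names are hypothetical.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one Haar level, restoring the original samples."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


def haar_decompose(signal, levels):
    """Repeat the transform on the approximation; each level halves the
    time resolution, mirroring the dyadic scale a = 1 / 2**sigma."""
    details = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_dwt(current)
        details.append(detail)
    return current, details


if __name__ == "__main__":
    approx, details = haar_decompose([4.0, 4.0, 2.0, 6.0, 1.0, 3.0, 5.0, 5.0], 2)
    print("approximation:", approx)
    print("details per level:", details)
```

Because the Haar transform is orthonormal, it preserves signal energy and is exactly invertible, so the detail coefficients can be thresholded or kept as features without losing track of the original signal.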

Deep Neural Network (DNN)
A DNN is an artificial neural network used for supervised learning. Neural networks make "judgments" by simulating the operation of neurons in the human brain. Such networks contain several computational layers: an input layer and an output layer, with one or more hidden layers between them. Networks with multiple hidden layers are called deep neural networks (DNNs); the fully connected form is the multilayer perceptron (MLP). Training the hidden layers on a large amount of data helps refine the weights and thus improve classification accuracy. The activation function plays an important role in neural networks: it applies a nonlinear transformation to each neuron, which affects how well gradient descent performs. Different activation functions can be used to improve the MLP performance [34]. Figure 3 shows an MLP structure with two hidden layers. $W^{(m)}$ is the weight matrix connecting the hidden layers, and $b^{(m)}$ is the bias of the $m$-th layer ($m > 0$). $a^{(m)}(x)$ is obtained by multiplying the output of the previous layer $h^{(m-1)}$ by $W^{(m)}$ and adding $b^{(m)}$. The value of $a^{(m)}(x)$ is passed through the activation function $g(\cdot)$, and the result at the final layer is the output $y_n$ [17].

For the $i$-th neuron in the $m$-th hidden layer, the computation is:

$$a_i^{(m)}(x) = b_i^{(m)} + \sum_{j} W_{ij}^{(m)} \, h_j^{(m-1)}(x), \qquad h_i^{(m)}(x) = g\left(a_i^{(m)}(x)\right)$$

with $h^{(0)}(x) = x$. The equation of the desired output layer is formulated as:

$$y_n = g\left(a_n^{(m)}(x)\right)$$

evaluated at the last layer.
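The layer-by-layer computation described above can be sketched in a few lines. This is an illustrative forward pass only, with hypothetical function names and tiny hand-picked weights; it applies the same activation at every layer, which is an assumption, and it is not the 15-layer PyTorch model used in the experiments.

```python
import math


def dense(h_prev, weights, biases):
    """Pre-activation a = W h_prev + b for one layer.

    weights is a list of rows, one row per neuron in the layer."""
    return [
        sum(w * h for w, h in zip(row, h_prev)) + b
        for row, b in zip(weights, biases)
    ]


def forward(x, layers, activation=math.tanh):
    """Propagate x through a list of (weights, biases) pairs.

    Each layer computes h = g(W h_prev + b); the last h is the output."""
    h = list(x)
    for weights, biases in layers:
        h = [activation(a) for a in dense(h, weights, biases)]
    return h


if __name__ == "__main__":
    # Two inputs -> two hidden neurons -> one output (weights are made up).
    layers = [
        ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),
        ([[1.0, -1.0]], [0.2]),
    ]
    print(forward([1.0, 2.0], layers))
```

Swapping the `activation` argument is enough to compare different choices of $g$, which is the design decision the section highlights.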

Convolutional Neural Network (CNN)
The basic CNN architecture was first proposed in 1980 by Kunihiko Fukushima [35]. Its structure was inspired by the simple and complex cells of the brain's visual cortex [36], and it extends the ANN architecture. A CNN is composed of convolutional layers, pooling layers, and fully connected layers. A convolutional layer produces a feature map by sliding a kernel over the input and summing the element-wise products of the kernel weights and the input pixels at each position. A pooling layer reduces the dimensionality of the feature map, which helps limit overfitting. Finally, the features are flattened into a one-dimensional vector and passed to the fully connected layer for classification. Well-known CNN models such as AlexNet [30], GoogLeNet [37], VGGNet [38], and LeNet-5 [39] have been widely used in image recognition, and CNNs have also been successfully applied to image-based fault diagnosis [40]. Figure 4 shows the CNN architecture, which is applied here to identify audio signal features in vehicles.
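The three building blocks described above can be illustrated on a small single-channel input. The sketch below is a plain-Python illustration with hypothetical function names; it uses a "valid" sliding window with stride 1 and non-overlapping max pooling, which are assumptions made for simplicity rather than the settings of the paper's model.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products.

    As in most deep learning frameworks, the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def flatten(feature_map):
    """Turn a 2-D feature map into the 1-D vector fed to the dense layer."""
    return [value for row in feature_map for value in row]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    fmap = conv2d_valid(image, [[1, 1], [1, 1]])
    print(fmap)
    print(flatten(max_pool2d(image)))
```

A 2 × 2 kernel on a 4 × 4 input yields a 3 × 3 feature map, and 2 × 2 pooling halves each spatial dimension, which is how stacked layers shrink the feature size before classification.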


Long Short-Term Memory (LSTM)
In a general recurrent neural network, there is only one hidden state unit $h_t$, and the parameters of the hidden state unit are the same at all time steps, as shown in Figure 5a. This gives the recurrent neural network a long-term dependency problem: it is sensitive only to short-term input. LSTM adds a cell state unit $c_t$ to the ordinary recurrent neural network; its connection weights vary over time, which addresses the vanishing and exploding gradient problems of an ordinary recurrent neural network, as shown in Figure 5b. In Figure 5, $h_t$ is the hidden state unit (short-term state unit), and $c_t$ is the cell state unit (long-term state unit), which together constitute the LSTM architecture [41].
Unlike general recurrent neural networks, LSTMs introduce gating units. A gate is a unit learned by the neural network to control the storage, use, and discarding of signals. At each time step $t$, the LSTM has three gating units: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The input of each gating unit consists of the sequence information $x_t$ at the current moment and the hidden state unit $h_{t-1}$ from the previous moment. The calculation formula is:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

Here $W$ and $U$ are the weight matrices, $b$ is the bias vector, and $\sigma(\cdot)$ is the activation function. The three gating units are computed in the same way (each is equivalent to a fully connected layer); only the weight matrices and bias vectors differ. The output range of $\sigma(\cdot)$ is generally [0, 1], and the commonly used activation function is the sigmoid function.
By multiplying the gating unit and the signal data element by element, the amount of information retained after the signal passes through the gate can be controlled. For example, when the state of the gate unit is 0, the signal is completely discarded; when the state is 1, the signal is fully retained; and when the state is between 0 and 1, the signal is partially retained.
LSTM operates by using the three gating units together with the cell state unit. Figure 6 is a schematic diagram of the gating units and state units in an LSTM. The transition of the cell state unit from $c_{t-1}$ at the previous moment to $c_t$ at the current moment is jointly controlled by the input gate and the forget gate. The input gate determines how much of the candidate input information $\tilde{c}_t$ is absorbed at the current moment. The forget gate determines how much of the previous cell state unit $c_{t-1}$ is retained, and the final cell state unit $c_t$ is the sum of the two gated signals. The formula is:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

where $\odot$ is the element-wise product. The hidden state unit $h_t$ of the LSTM is determined by the output gate and $c_t$:

$$h_t = o_t \odot \tanh(c_t)$$

In an LSTM, not only are the hidden state units $h_{t-1}$ and $h_t$ linked by a relatively complex recurrent connection, but the internal cell state units $c_{t-1}$ and $c_t$ are also linked by a linear self-loop.
The linear self-loop between cell state units can be viewed as a sliding mechanism for processing information at different times. When a gate is open, past information is remembered; when it is closed, past information is discarded. Overall, the linear self-loop formed by the gating units and the cell state unit gives gradients a path for long-distance propagation, which changes how information and gradients flow compared with the ordinary recurrent neural network and solves the long-term dependency problem. The complete LSTM architecture is shown in Figure 7.
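The gate and state updates described in this section can be sketched as a single time step. The code below follows the standard LSTM formulation; the candidate state computed through a tanh layer with its own weights is an assumption, since that equation does not appear in the text, and the function names and parameter layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _affine(W, U, b, x, h):
    """Compute W x + U h + b row by row."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps "i", "f", "o" (gates) and "c" (candidate, assumed tanh
    layer) to a (W, U, b) tuple. Returns (h_t, c_t)."""
    gates = {}
    for name in ("i", "f", "o"):
        W, U, b = params[name]
        gates[name] = [sigmoid(z) for z in _affine(W, U, b, x_t, h_prev)]
    W, U, b = params["c"]
    candidate = [math.tanh(z) for z in _affine(W, U, b, x_t, h_prev)]
    # Forget gate scales the old cell state; input gate scales the candidate.
    c_t = [f * c + i * g
           for f, c, i, g in zip(gates["f"], c_prev, gates["i"], candidate)]
    # Output gate decides how much of the cell state reaches the hidden state.
    h_t = [o * math.tanh(c) for o, c in zip(gates["o"], c_t)]
    return h_t, c_t


if __name__ == "__main__":
    zero = ([[0.0]], [[0.0]], [0.0])
    params = {"i": zero, "f": zero, "o": zero, "c": zero}
    print(lstm_step([1.0], [0.0], [2.0], params))
```

With the forget gate driven close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is the linear self-loop that lets information and gradients persist over long sequences.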
Figure 6. Schematic of LSTM cell.

Experimental Structure
This research is divided into three parts. Figure 8 shows the structure of the vehicular audio signal diagnosis experiment. The first part focuses on signal characteristic filtering. We use acoustic sensors to collect 43 vehicle fault signals. Table 1 lists the 43 fault conditions: 18 types for the tires, 6 for the belt, 16 for the chassis, and 3 for the engine.

To collect a sufficiently large number of fault signal samples, the IoT device architecture plays an important role. The combined equipment and network characteristics enable us to obtain a dynamic time signal and convert it to an energy spectrum on the experimental equipment. The Hamming window function is used to obtain the spectrum signal. Figure 9 shows the settings of the spectrum signal.
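For reference, the Hamming window mentioned above tapers each frame toward its edges before the spectrum is computed, which reduces spectral leakage. The sketch below uses the common symmetric definition with coefficients 0.54 and 0.46; the frame length and function names are hypothetical, as the paper's window settings appear only in Figure 9.

```python
import math


def hamming(n_points):
    """Symmetric Hamming window: w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    if n_points == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def apply_window(frame):
    """Multiply a frame of samples by the window, sample by sample."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]


if __name__ == "__main__":
    print(hamming(5))
    print(apply_window([1.0] * 5))
```

The window equals 1 at the center of the frame and 0.08 at both ends, so samples near the frame boundaries contribute much less to the resulting spectrum.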
The sampling frequency is 44,100 Hz, and the acquisition time of each data sample is 40 s. Before the frequency spectrum is computed, signal preprocessing is performed to eliminate background noise. According to the Nyquist-Shannon sampling theorem, the sampling frequency must be greater than twice the maximum frequency to be reproduced [42]. Since the human hearing range is approximately 20-20,000 Hz, the sampling frequency must be greater than 40 kHz. The audio codec of our smartphone has a standard sampling frequency of 44.1 kHz, which covers the 20 Hz to 20 kHz audible range of the human ear [43]. This analysis uses the LPC and wavelet algorithms for filtering, and the sound signal is converted from a continuous time-domain signal to a frequency-domain sound spectrum. Sound features are filtered in MATLAB to create a training dataset. Figure 10 shows the audio signal spectrum characteristics of the 43 normal and fault conditions for LPC and wavelet on the MATLAB platform.

Table 1 (excerpt).

| No. | Condition | Description |
|---|---|---|
| | Toe-in V30 20 psi | Chassis toe-in and low pressure at 30 km/h. |
| 28 | Toe-in V20 32 psi | Chassis toe-in and normal pressure at 20 km/h. |
| 29 | Toe-in V20 50 psi | Chassis toe-in and high pressure at 20 km/h. |
| 30 | Toe-in V20 20 psi | Chassis toe-in and low pressure at 20 km/h. |
| 31 | Toe-out V30 32 psi | Chassis toe-out and normal pressure at 30 km/h. |
| 32 | Toe-out V30 50 psi | Chassis toe-out and high pressure at 30 km/h. |
| 33 | Toe-out V30 20 psi | Chassis toe-out and low pressure at 30 km/h. |
| 34 | Toe-out V20 32 psi | Chassis toe-out and normal pressure at 20 km/h. |
| 35 | Toe-out V20 50 psi | Chassis toe-out and high pressure at 20 km/h. |
| 36 | Toe-out V20 20 psi | Chassis toe-out and low pressure at 20 km/h. |
| 37 | Drive V10 | |

The second part establishes the spectral characteristic signal dataset. To effectively identify the characteristics of the various fault types, the original spectral characteristics are reduced to a length of 10,000 feature points as the filtered spectral characteristics.
Then, 30 and 40 sets of characteristic coefficients were generated for each of the 43 fault conditions, giving a total of 1290 and 1720 voiceprint characteristic records, respectively. The horizontal axis of each voiceprint feature is set to 10,000 points, that is, 10,000 feature values per record for training the model. The dimensions of the two datasets are therefore 1290 × 10,000 and 1720 × 10,000, respectively. Figure 11 shows the settings diagram of the dataset.

In the third part, after the dataset has been constructed and labeled, the PyTorch framework in Python is used to feed the dataset to the three algorithms (DNN, CNN, and LSTM) for classification. The PyTorch models in this study use adaptive moment estimation (Adam) as the optimizer. Adam is an adaptive learning rate algorithm that is essentially RMSprop with a momentum term, and it is currently among the most widely used optimizers for model training [44]. The experiments cross-compare the datasets constructed by the two filtering algorithms, LPC and wavelet, and the differences in recognition results and learning speed across the three deep learning algorithms. The learning rate and batch size are tuned to achieve the best recognition rate with the fewest epochs (training iterations).
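The description of Adam as RMSprop with a momentum term can be made concrete with the update rule itself. The sketch below is a plain-Python illustration on a toy quadratic, using the commonly cited default decay rates; the function name, the toy objective, and the step size are hypothetical and unrelated to the paper's training runs, which rely on PyTorch's built-in optimizer.

```python
import math


def adam_minimize(grad, theta, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Minimize a function given its gradient, using the Adam update."""
    m = [0.0] * len(theta)  # momentum term: running mean of gradients
    v = [0.0] * len(theta)  # RMSprop term: running mean of squared gradients
    theta = list(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def toy_gradient(theta):
    """Gradient of the toy objective (x - 3)^2 + (y + 1)^2."""
    x, y = theta
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


if __name__ == "__main__":
    print(adam_minimize(toy_gradient, [0.0, 0.0]))
```

Dividing by the running root mean square of the gradient gives each parameter its own effective step size, which is why a single global learning rate can be shared across very different layers.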
The learning rate controls the step size of each model update. The larger the learning rate, the faster the convergence and the shorter the training time; however, once the learning rate exceeds a critical value, the loss function stops decreasing and oscillates around a certain position. The smaller the learning rate, the slower the convergence, the longer the training time, and the more easily the network becomes trapped in a local minimum, so the loss function converges poorly. An appropriate learning rate can therefore be found by observing how the model loss changes. In this study, we set the learning rate of all three algorithms to 0.00001 after many experiments, which not only achieves the best overall model convergence but also facilitates cross-comparison of the different deep learning algorithms.

For the hidden layer setting of the DNN, we first tried an architecture with 12 hidden layers, but the training results with so few hidden layers were not ideal, as shown in Figure 12a. We then tried 20 hidden layers and found an overfitting problem; the training and identification results were also extremely poor, as shown in Figure 12b. We therefore adopted a DNN architecture with 15 hidden layers in this study.
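The trade-off between large and small learning rates can be reproduced on a one-dimensional toy problem. The sketch below runs plain gradient descent on the loss w², an illustrative objective chosen here for simplicity; the function name and the three step sizes are hypothetical and are not the values tuned in the paper.

```python
def gradient_descent(lr, steps, w0=1.0):
    """Minimize loss(w) = w**2 and return the loss before each update."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        w -= lr * 2.0 * w  # gradient of w**2 is 2w
    return losses


if __name__ == "__main__":
    for lr in (0.001, 0.1, 1.0):
        print(lr, gradient_descent(lr, steps=50)[-1])
```

With a step size of 1.0 the iterate jumps between +1 and -1, so the loss never decreases; with 0.1 it converges quickly; with 0.001 it is still far from the minimum after 50 steps. This matches the qualitative behavior described above.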
Taking the LPC 30 feature dataset as an example, with epoch = 200 the number of training iterations is insufficient and underfitting occurs: the training time is the shortest, but the recognition accuracy is low (as shown in Figure 13a). With epoch = 800, training stability is extremely low and overfitting occurs (as shown in Figure 13b). It is therefore more appropriate to use epoch = 500, which avoids both overfitting and underfitting. In addition, to compare the learning behavior of the three deep learning algorithms (DNN, CNN, and LSTM) on the same dataset, we use the same number of epochs for all of them, which also allows the training times to be cross-compared.
The DNN parameters in this experiment are: learning rate = 0.00001, number of epochs = 500, batch size = 128, and test size = 0.3. We adopted a 15-layer deep neural network architecture, and the number of neurons in each hidden layer is shown in Figure 14. The hidden layers are fully connected. To address the vanishing gradient problem at saturation, Xavier Glorot et al. proposed the rectified linear unit (ReLU) [45]. Its disadvantage is that when the weights are updated too quickly, before the optimum has been found, the input to a neuron can become less than 0, so its output stays at zero and the neuron dies.
Therefore, the activation function selected in this experiment is the scaled exponential linear unit (SELU), a variant of ReLU [46], as shown in Figure 15. Its function is:

$$\mathrm{selu}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$

where λ is a fixed value of 1.05070098736, and α is 1.67326324235.
Furthermore, the CNN model in this experiment uses three convolutional layers and three pooling layers to reduce the size of the audio signal features. The sizes of the three convolutional layers are (5 × 5 × 3), (5 × 5 × 5) and (5 × 5 × 5) in sequence, and each pooling layer uses a 2 × 2 filter, as shown in Figure 16. The network parameter settings include a learning rate of 0.00001, a batch size of 128, and a test set proportion of 0.3.
The third neural network approach used in this experiment is the long short-term memory (LSTM) algorithm. The LSTM architecture used in the experiments is based on two hidden layers, and 300 hidden neurons are used for each fault condition with 30 and 40 features; the LSTM network label 2×300 indicates that there are two hidden layers, each containing 300 hidden neurons, as shown in Figure 17. Here, we compare the training performance on the LPC dataset and the wavelet dataset. Both spectral datasets use a batch size of 128, a learning rate of 0.00001, and 500 iterations.
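As a minimal illustration of the SELU activation chosen for the DNN, the following pure-Python sketch evaluates SELU next to ReLU using the λ and α constants quoted in the text. The function and constant names are illustrative assumptions, and the experiments themselves would normally rely on the framework's built-in activation rather than a hand-written one.

```python
import math

# Constants for SELU as quoted in the text.
SELU_LAMBDA = 1.05070098736
SELU_ALPHA = 1.67326324235


def relu(x: float) -> float:
    """Rectified linear unit: zero for every non-positive input."""
    return max(0.0, x)


def selu(x: float) -> float:
    """Scaled exponential linear unit (hypothetical helper for illustration)."""
    if x > 0:
        return SELU_LAMBDA * x
    # Negative inputs decay smoothly towards -lambda * alpha instead of being cut to zero.
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(x) - 1.0)


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  selu={selu(x):.4f}")
```

For negative inputs ReLU returns exactly zero, whereas SELU returns a small negative value with a non-zero slope, which is the property that helps avoid dead neurons.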

Results and Discussion
In this part, the DNN models are trained on the dataset using the Python PyTorch framework. We compare the results for DNNs with more than 10 hidden layers. Generally, in deep learning, networks with more hidden layers are more accurate than shallow ones.
As shown in Table 2, for the LPC dataset the CNN algorithm achieves the best identification results, followed by DNN, while the LSTM algorithm performs extremely poorly. For the wavelet dataset, LSTM and DNN both identify the faults well, while the CNN algorithm is slightly worse. On the wavelet dataset, the identification accuracy of the DNN algorithm is as high as 1.00, which is 16.57% higher than the 0.86 of the CNN algorithm and 13.82% higher than the 0.88 of the LSTM algorithm. On the LPC dataset, the accuracy of the CNN algorithm reaches 1.00, which is 72.77% higher than that of the DNN algorithm. In terms of model training time, for both datasets the CNN algorithm takes the longest to train, followed by DNN, and LSTM takes the shortest. Compared with the LPC dataset, the wavelet dataset takes longer to train with all three machine learning algorithms; the difference is largest for the CNN algorithm, whose training time is 3.13% longer than on the LPC dataset.
Figure 18 compares the loss functions for the LPC feature dataset (with 30 features); Figure 18a-c show the loss functions of the DNN, CNN and LSTM algorithms, respectively. DNN and CNN converge faster. Compared with the wavelet dataset, the CNN algorithm converges more stably, whereas LSTM converges extremely poorly here. Figure 19 compares the loss functions for the wavelet dataset with 30 features; Figure 19a-c show the loss functions of the DNN, CNN and LSTM algorithms, respectively. DNN and CNN again converge faster, but DNN is more stable overall. Furthermore, the gradient convergence of LSTM is slower, but its stability is higher than that of the other two. In addition, Figures 20-22 are the confusion matrices of the LPC dataset with 30 features imported into the DNN, CNN and LSTM algorithms, respectively.
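The percentage gains quoted for Table 2 are relative improvements of one accuracy over another. The sketch below assumes the usual definition, (new − baseline) / baseline, and uses a hypothetical helper name. It takes the two-decimal accuracies quoted in the text, whereas the reported percentages were presumably computed from unrounded values, so its output (about 16.3% and 13.6%) only approximates the reported 16.57% and 13.82%.

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Rounded accuracies for the 30-feature wavelet dataset, as quoted in the text.
    dnn, cnn, lstm = 1.00, 0.86, 0.88
    print(f"DNN vs CNN:  {relative_gain_percent(dnn, cnn):.2f}%")
    print(f"DNN vs LSTM: {relative_gain_percent(dnn, lstm):.2f}%")
```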
Table 3 presents the results of the three deep learning algorithms on the 40-feature datasets. Similar to the case of taking 40 sets of Cdelta coefficients and 30 sets of P coefficients for each fault condition, for the LPC dataset the CNN algorithm achieves the best identification results, followed by DNN, while the LSTM algorithm performs extremely poorly. For the wavelet dataset, LSTM and DNN both identify the faults well, while the CNN algorithm is slightly worse. On the wavelet dataset, the identification accuracy of the DNN and LSTM algorithms is as high as 1.00, which is 9.55% higher than the 0.86 of the CNN algorithm. On the LPC dataset, the accuracy of the CNN algorithm reaches 1.00, which is 20.84% higher than that of the DNN algorithm. In terms of model training time, for both datasets the CNN algorithm takes the longest to train, followed by LSTM, and DNN takes the shortest. Unlike the 30-feature case, DNN is here the algorithm with the shorter training time of the two (DNN and LSTM). Compared with the LPC dataset, the wavelet dataset has a shorter model training time for all three machine learning algorithms.
Figure 26 compares the loss functions for the LPC feature dataset with 40 features; Figure 26a-c show the loss functions of the DNN, CNN and LSTM algorithms, respectively. As with the 30-feature datasets, DNN and CNN converge faster. Moreover, compared with the wavelet dataset, the CNN algorithm converges more stably, whereas LSTM converges extremely poorly here. Figure 27 compares the loss functions for the wavelet dataset with 40 features; Figure 27a-c show the loss functions of the DNN, CNN and LSTM algorithms, respectively. The CNN loss function converges more smoothly, and the DNN and CNN loss functions are more stable than in the experiments with 30 features. Furthermore, LSTM has faster gradient convergence than in the experiments with 30 features per fault condition, but is less stable when the number of iterations is small.
In addition, Figures 28-30 are the confusion matrices of the LPC dataset with 40 features imported into the DNN, CNN and LSTM algorithms, respectively, for classification and identification. Figures 31-33 are the confusion matrices of the 40-feature wavelet dataset imported into the DNN, CNN and LSTM algorithms, respectively. The confusion matrices confirm the experimental results described above.
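To show how an identification accuracy can be read off a confusion matrix, the following sketch computes overall accuracy and per-class recall. It assumes that rows are true fault classes and columns are predicted classes; the helper names and the 3-class matrix are hypothetical and are not data from the experiments.

```python
from typing import List

Matrix = List[List[int]]


def accuracy(cm: Matrix) -> float:
    """Overall accuracy: correctly classified samples (diagonal) over all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_recall(cm: Matrix) -> List[float]:
    """Recall per true class, assuming rows are true labels and columns predictions."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


if __name__ == "__main__":
    # Hypothetical 3-class confusion matrix, for illustration only.
    cm = [
        [9, 1, 0],
        [0, 10, 0],
        [2, 0, 8],
    ]
    print(f"accuracy = {accuracy(cm):.3f}")
    print("recall   =", [round(r, 2) for r in per_class_recall(cm)])
```

A matrix whose off-diagonal entries are all zero corresponds to an accuracy of 1.00, while large off-diagonal counts in a row indicate a fault class that is frequently confused with another.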

Conclusions
An early vehicle fault signal classification method is proposed based on voiceprint filtering combined with deep learning algorithms. We collected 43 different vehicle breakdown signals. LPC and wavelet were used to filter the original signal to obtain important signal spectral characteristics used to define fault type. In addition, three machine learning algorithms, DNN, CNN and LSTM, were used to develop automatic diagnosis methods to classify complex fault features. Looking at the whole experiment, in terms of the LPC dataset, CNN has the best performance, followed by DNN, and finally LSTM. However, in terms of model training time, the order of the three is reversed.
The LPC dataset combined with the CNN algorithm obtains the best identification results, but its training process is also the most time-consuming. LPC + LSTM has the shortest training time, but it is almost unable to identify and classify the faults; that is, its accuracy is extremely low. For the wavelet dataset, DNN has the best performance in terms of both identification and training time. For datasets with large dimensions, the wavelet algorithm combined with LSTM also achieves good identification performance. Based on our experimental results, we can infer that, in this experiment, the wavelet algorithm combined with DNN achieves not only the best identification performance but also the shortest model training time when the dataset dimension is large.
All deep learning models are implemented on the Python PyTorch platform using NVIDIA GeForce GTX. Early failure prediction in vehicles, as studied in this research, is of great significance to the emerging Internet of Vehicles and can help increase the production capacity of Internet of Things and Industry 4.0 applications. Future work will seek to combine two filtering methods, such as MFCC + LPC or MFCC + wavelet, and to apply machine learning methods suitable for natural language processing (NLP), such as long short-term memory (LSTM) networks, to produce an application that achieves faster identification. Voice recognition methods are an area worthy of attention and have extensive applications in daily life; thus, combining artificial intelligence and voiceprint recognition can potentially produce significant and widespread benefits.