Combination of VMD Mapping MFCC and LSTM: A New Acoustic Fault Diagnosis Method of Diesel Engine

Diesel engines have a wide range of functions in the industrial and military fields. An urgent problem to be solved is how to diagnose and identify their faults effectively and timely. In this paper, a diesel engine acoustic fault diagnosis method based on variational modal decomposition mapping Mel frequency cepstral coefficients (MFCC) and long-short-term memory network is proposed. Variational mode decomposition (VMD) is used to remove noise from the original signal and differentiate the signal into multiple modes. The sound pressure signals of different modes are mapped to the Mel filter bank in the frequency domain, and then the Mel frequency cepstral coefficients of the respective mode signals are calculated in the mapping range of frequency domain, and the optimized Mel frequency cepstral coefficients are used as the input of long and short time memory network (LSTM) which is trained and verified, and the fault diagnosis model of the diesel engine is obtained. The experimental part compares the fault diagnosis effects of different feature extraction methods, different modal decomposition methods and different classifiers, finally verifying the feasibility and effectiveness of the method proposed in this paper, and providing solutions to the problem of how to realise fault diagnosis using acoustic signals.


Introduction
Diesel engines, as the main power source in the current industrial, military and other fields, are prone to various types of failures due to their harsh working environment in many practical applications. Most of the current research on diesel engine fault identification is realized by analyzing vibration signals [1][2][3]. In practical engineering applications, diesel engines will generate high temperature or even explode during operation. Directly measuring the vibration state of the diesel engine in space is very likely to damage the sensor and transmission line, resulting in measurement distortion or even inability to measure. Acoustic measurement [4] can effectively avoid such problems and provide convenience for data analysis and collection.
At present, fault diagnosis and state identification based on acoustic signals has become an emerging research hotspot in the research of fault prediction and health management (PHM) of machinery such as diesel engines. Some experts and scholars have carried out certain research and exploration on diesel engine fault diagnosis based on acoustic signals. Mathew et al. [5] conducted combustion fault diagnosis for engine acoustic signals, established and compared three fault classification models, and weighed the diagnostic accuracy and diagnostic time of different models, which can be applied to different scenarios. Figlus et al. [6] used discrete wavelet transform to denoise the engine acoustic signal and extract entropy features, which can effectively diagnose diesel engine valve faults based on instantaneous entropy. Ning et al. [7] proposed a new dislocation superposition method, which improves the extraction accuracy of fault components by increasing the number of superpositions to the acoustic signal, and uses the similarity coefficient to automatically determine the fault components. Zhang et al. [8] proposed a parallel sparse filtering method to extract the fault features of bearings, and verified the effectiveness of the method through simulation and experiments. The above studies show that it is practical to diagnose diesel engine faults by means of acoustic signals, but in the process of these actual studies, some deficiencies and problems of non-contact data acquisition methods and traditional feature extraction methods are also exposed: (1) The sound field environment is complex, mixed with noise signals from other sound sources, resulting in a low signal-to-noise ratio in the acoustic signal, and it is difficult to separate and extract the fault information in the acoustic signal. The transmission path of sound is complex and the signal is weak, so it is difficult to locate the fault source for complex equipment. (2) Using conventional feature extraction methods to extract fault features from sound signals is still relatively one-sided and easy to ignore acoustic characteristics.
Variational mode decomposition (VMD) is widely used in signal noise reduction in the field of mechanical vibration, and some current researches also use it in acoustic signals. Facing the problem of abnormal sound recognition during vehicle operation, Kwon et al. [9] used VMD to effectively distinguish the sound into background noise and abnormal sound, and proved the good decomposition performance of VMD for sound; Chen et al. [10] proposed a method combining VMD, fast independent component analysis, and Hilbert transform to optimize the decomposition effect for the problem of diesel engine sound source decomposition, and verified the effectiveness of the method; Zhang et al. [11] optimized the decomposition performance of wind turbine aeroacoustic signals collected by a single acoustic sensor by improving the number of decomposition layers of VMD, and achieved a better blind source separation effect. These studies have shown that VMD can reduce noise, decompose effective information, and improve the signal-to-noise ratio, but it has disadvantage of not being able to directly extract the required fault features and needing to combine more specific feature extraction methods. It will lead to a substantial increase in the dimension of the features extracted later.
Mel Frequency Cepstrum Coefficient (MFCC) is a typical voiceprint recognition feature and has strong ability to extract acoustic signal features. At present, some scholars have begun to extend MFCC to the acoustic signal analysis of mechanical equipment. Suman et al. [12] proposed an algorithm combining adaptive Kalman filter and MFCC for mechanical fault detection of vehicle acoustic signals. Gong et al. [13] extracted MFCC features from the dynamic acoustic signals of vehicles, and then implemented fault feature classification using various machine learning methods. Márquez-Molina et al. [14] extracted features from the sound signal of aircraft taking off through 1/24 octave analysis and MFCC, which can effectively realize the classification of aircraft engines. It can be seen that extracting MFCC from the sound signal as a fault feature can better reflect the status information of the equipment, and it is more suitable for the characteristics of the acoustic signal, but these studies did not preprocess the original acoustic signal, which will lead to noise during the feature extraction process. Greater impact.
The combined use of VMD and MFCC can not only improve the signal-to-noise ratio of the signal, but also extract fault features that conform to the acoustic characteristics, but it will lead to a high feature dimension and introduce some useless features, which will introduce interference for later diagnosis and identification. and increase the computational burden. Therefore, this paper proposes a diesel engine fault feature extraction method based on VMD mapping MFCC. This method uses VMD to decompose the mode of the original one-dimensional sound pressure signal, removes useless noise components according to the change of the energy ratio of each intrinsic mode function (IMF), and maps the reserved IMFs to the frequency distribution of the Mel filter bank in the frequency domain. The Mel filter corresponding to each IMF is used to calculate the MFCC as the fault feature of the diesel engine, which effectively solves the problem of fault information extraction. In view of the memory function of the LSTM network, it has a strong learning ability in time series data [15,16]. Therefore, the double-layer LSTM network is selected as the fault diagnosis classifier, which effectively solves the problem of fault classification and localization. Combined with the diesel engine preset fault experiment, compared with the traditional feature extraction method and the unimproved MFCC extraction method, the method proposed in this paper can effectively characterize the diesel engine state and reduce the feature dimension. Then the diagnostic effects of different classification networks are compared to test the effectiveness, superiority and robustness of the proposed model.
The main contributions and innovations of this paper are as follows: (1) A feature extraction method of VMD mapping MFCC is proposed. According to the corresponding relationship between the IMF component and the frequency distribution of the Mel filter bank, the corresponding MFCC is calculated to form a fault feature. The feature has a low dimension and a strong ability to represent the state, which can effectively improve diagnostic accuracy. (2) The deep learning method is introduced to train the classification model, and the double-layer LSTM network is selected as the output classifier for network training and parameter fine-tuning, which effectively reduces the network training time and improves the accuracy of fault identification. (3) Taking the acoustic signal as the research object, combined with VMD mapping MFCC and LSTM network to build a diesel engine fault diagnosis model, the model shows more efficient and accurate diagnosis effect in the experiment.
The main contents of the other sections of this paper are as follows: Section 2 introduces the basic theory of VMD, MFCC and LSTM in detail; Section 3 describes the implementation process of the fault diagnosis model based on VMD mapping MFCC and LSTM networks; Section 4 describes the setup and result analysis of the diesel engine preset fault test; the conclusion is showed in Section 5.

Variational Mode Decomposition
VMD is a signal mode decomposition method proposed by Zosso's team in 2014 [17]. It has the characteristics of self-adaptation and non-recursion. As a new non-stationary, nonlinear signal processing method, replacing the previous local mean decomposition (LMD) [18], empirical mode decomposition (EMD) [19], ensemble empirical mode decomposition (EEMD) [20] and other recursive decomposition modes, which improve the mode mixing and end-effect problems of traditional decomposition methods by solving constrained variational problems. Literature [21,22] verified that VMD has better complex data decomposition accuracy and better anti-noise interference, etc.
The function of the VMD algorithm is to construct and solve the variational problem, decompose the original signal f (t) into K IMF components u k (t). Under the condition of the sum of each component is equal to the input signal, the center frequency and bandwidth are continuously updated through the iterative process, and finally the IMF component that minimizes the sum of the IMF bandwidths is obtained. The specific steps are as follows: (1) Through the Hilbert transform, the one-sided spectrum of the analytical signal of each IMF component u k (t) (k= 1, 2, · ··, K) is obtained: Adjust the position of the center frequency of each intrinsic mode to the respective baseband: where δ(t) is the impulse function, * is the convolution calculation, u k (t) is the kth IMF component and ω k is the center frequency of each IMF component. (2) The L2-norm of the above demodulated signal gradient is calculated, and the bandwidth of each IMF component is estimated. Get the constrained variational model expression: (3) The quadratic penalty factor α and the Lagrange multiplication operator λ are introduced into the variational solution problem to transform the constrained variational problem into an unconstrained variational problem. The augmented Lagrange function is as follows: where α is usually selected to be a large enough positive number to improve the reconstruction accuracy of the signal; λ(t) ensures the strict restriction of the constraints; 2 2 represents the operation of L 2 -norm and represents the operation of inner product. Alternate direction method of multipliers (ADMM) is used to update the value of each u k ,ω k .and calculate the saddle point of the augmented Lagrange function. It is the way to find the optimal solution of the constrained variational model, so as to realize modal decomposition. The iterative process of u k ,ω k is as follows: whereû k is the IMF function in the frequency domain state,λ is the Lagrange multiplication operator in the frequency domain state. In the VMD process, the difference of the number of decomposition layers K will affect the effect of modal decomposition, which is a limitation of VMD. The determination of the number of decomposition layers K needs to be set manually, which has a certain degree of randomness and subjectivity. Therefore, this paper will determine the appropriate K value by calculating the spectral centroid of each IMF.
where ω k G is the spectral centroid of each kth IMF.

Mel Frequency Cepstrum Coefficient
MFCC was first proposed by Davis and Mermelstein in the 1980s [23], and this study effectively proved that the coefficient has a better identification effect than other parameters. So far, MFCC has been widely used in various fields of speech recognition, including instruction recognition, emotion recognition, and person recognition. At the same time, MFCC is also gradually introduced into the status and fault identification of some mechanical equipment, such as transformers [24], UAVs [25], etc., and good results have also been achieved by means of MFCC. The physical meaning of MFCC, in simple terms, is a cepstral coefficient calculated in the Mel-scale frequency domain, which can be understood as the parameter corresponding to the energy of the sound signal in different frequency ranges in the cepstrum, which can reflect low frequency envelope and high frequency detail information. The Mel-scale frequency domain is a frequency domain that simulates the human ear's perception of sound frequency. Its definition formula is as follows: where M is the Mel frequency; f is the frequency. The MFCC extraction process of the sound signal is shown in Figure 1, in which the pre-emphasis is to compensate the high-frequency components of the signal and prevent the high-frequency components from being attenuated or even lost in the subsequent calculation process. When framing the sound pressure signal, considering the continuity between the two frames, there should be an overlapping area between the two adjacent frames. In speech recognition, the frame length is generally set to 25 ms, and the frame shift is set to 10 ms [26]. The acoustic signal of the diesel engine is relatively more stable, so this paper also uses this setting. The framed signal is windowed with a Hamming window to enhance the continuity of the signal and reduce the distortion caused by the subsequent FFT. The Hamming window formula is expressed as follows.
where A is the length of the Hamming window.
MFCC was first proposed by Davis and Mermelstein in the 1980s [23], and this study effectively proved that the coefficient has a better identification effect than other parameters. So far, MFCC has been widely used in various fields of speech recognition, including instruction recognition, emotion recognition, and person recognition. At the same time, MFCC is also gradually introduced into the status and fault identification of some mechanical equipment, such as transformers [24], UAVs [25], etc., and good results have also been achieved by means of MFCC.
The physical meaning of MFCC, in simple terms, is a cepstral coefficient calculated in the Mel-scale frequency domain, which can be understood as the parameter corresponding to the energy of the sound signal in different frequency ranges in the cepstrum, which can reflect low frequency envelope and high frequency detail information. The Mel-scale frequency domain is a frequency domain that simulates the human ear's perception of sound frequency. Its definition formula is as follows: where M is the Mel frequency; f is the frequency. The MFCC extraction process of the sound signal is shown in Figure 1, in which the pre-emphasis is to compensate the high-frequency components of the signal and prevent the high-frequency components from being attenuated or even lost in the subsequent calculation process. When framing the sound pressure signal, considering the continuity between the two frames, there should be an overlapping area between the two adjacent frames. In speech recognition, the frame length is generally set to 25 ms, and the frame shift is set to 10 ms [26]. The acoustic signal of the diesel engine is relatively more stable, so this paper also uses this setting. The framed signal is windowed with a Hamming window to enhance the continuity of the signal and reduce the distortion caused by the subsequent FFT. The Hamming window formula is expressed as follows.
( ) 2 0.54 0.46 cos , 0 where A is the length of the Hamming window.
where H n (k) is the function of the nth Mel filter, and k is the independent variable of the function, that is, the frequency.
Hk is the function of the nth Mel filter, and k is the independent variable of the function, that is, the frequency. After the preprocessed signal is filtered by the Mel filter bank, N parameters n m (n = 1, 2,…, N) can be obtained: where F is the number of FFT points; () Xk is the frequency domain function of the preprocessing framed signal after FFT. Take the logarithm of n m calculated by Equation (11), and perform DCT on it. The transformation process is as follows: where () cn is the MFCC of each framed signal. However, the traditional MFCC extraction method realizes frequency domain conversion through FFT. In this process, it is assumed that the signal is approximately unchanged in a short time, and the nonlinearity and non-stationarity of the signal cannot be reflected, resulting in the loss of some information. Decomposing the signal by VMD After the preprocessed signal is filtered by the Mel filter bank, N parameters m n (n = 1, 2, . . . , N) can be obtained: where F is the number of FFT points; X(k) is the frequency domain function of the preprocessing framed signal after FFT. Take the logarithm of m n calculated by Equation (11), and perform DCT on it. The transformation process is as follows: where c(n) is the MFCC of each framed signal. However, the traditional MFCC extraction method realizes frequency domain conversion through FFT. In this process, it is assumed that the signal is approximately unchanged in a short time, and the nonlinearity and non-stationarity of the signal cannot be reflected, resulting in the loss of some information. Decomposing the signal by VMD and then extracting the MFCC will reflect the local characteristics more accurately, so as to obtain more comprehensive diesel engine fault characteristics.

VMD Mapping MFCC Features
Due to the low signal-to-noise ratio of the diesel engine sound signal, MFCC, as features, can effectively reflect the envelope and detail information of the diesel engine acoustic signal. but the conventional MFCC feature extracted from the original signal directly without noise reduction may result in more interference in the features. Therefore, multiple IMF components are obtained after VMD of the original signal, and the noise components are removed from them, which can achieve effective noise reduction of the original signal. Zou et al. [27] extracted MFCC and GFCC from each IMF component decomposed by the VMD of the UAV noise signal, and formed a mixed feature together with the energy ratio of each IMF component of the VMD. Drone audio can be identified effectively in different noise environments. However, this method of directly extracting features from the original signal may still contain a large amount of noise components in the features. At the same time, the method of extracting MFCC features for all IMF components will lead to higher dimensions of the extracted features and increase the computational burden. And even some cepstral coefficient features that are not conducive to classification are introduced. Considering that the distribution of each IMF component is different in the frequency domain, it is usually concentrated in the frequency range of 4 to 5 triangular Mel filters, as shown in Figure 3. The MFCC of each frame signal is also calculated separately by each triangular Mel filter.
decomposed by the VMD of the UAV noise signal, and formed a mixed feature together with the energy ratio of each IMF component of the VMD. Drone audio can be identified effectively in different noise environments. However, this method of directly extracting features from the original signal may still contain a large amount of noise components in the features. At the same time, the method of extracting MFCC features for all IMF components will lead to higher dimensions of the extracted features and increase the computational burden. And even some cepstral coefficient features that are not conducive to classification are introduced. Considering that the distribution of each IMF component is different in the frequency domain, it is usually concentrated in the frequency range of 4 to 5 triangular Mel filters, as shown in Figure 3. The MFCC of each frame signal is also calculated separately by each triangular Mel filter. If only the cepstral coefficients corresponding to the main frequency distribution filters of each IMF component are calculated during the MFCC calculation process of the IMF components, some invalid information can be eliminated, the feature dimension can be reduced, and the operation speed can be improved. Therefore, this paper proposes a diesel engine fault feature extraction method based on VMD Mapping MFCC (VMMFCC). The extraction process of VMMFCC is as follows: The IMF component is FFT, and its frequency domain distribution is mapped to the frequency distribution of the Mel filter bank, so that each IMFi1 mainly corresponds to 2~4 Mel triangular filters in frequency, for example, the first IMFi1 reserved corresponds to the x1th to y1th filters. Likewise, the nth IMFin retained corresponds to the xnth to ynth filters. Use the triangular filters corresponding to the IMF components to calculate the MFCC of the IMF, and combine them to form a new cepstral coefficient feature: [c(x1), c(x1 + 1),…, c(y1), c(x2), c(x2 + 1),…, c(y2),…, c(xn), c(xn + 1),…, c(yn)], namely VMMFCC. If only the cepstral coefficients corresponding to the main frequency distribution filters of each IMF component are calculated during the MFCC calculation process of the IMF components, some invalid information can be eliminated, the feature dimension can be reduced, and the operation speed can be improved. Therefore, this paper proposes a diesel engine fault feature extraction method based on VMD Mapping MFCC (VMMFCC). The extraction process of VMMFCC is as follows: The IMF component is FFT, and its frequency domain distribution is mapped to the frequency distribution of the Mel filter bank, so that each IMF i1 mainly corresponds to 2~4 Mel triangular filters in frequency, for example, the first IMF i1 reserved corresponds to the x 1 th to y 1 th filters. Likewise, the nth IMF in retained corresponds to the x n th to y n th filters. Use the triangular filters corresponding to the IMF components to calculate the MFCC of the IMF, and combine them to form a new cepstral coefficient feature: [c(x 1 ), c(x 1 + 1), . . . , c(y1), c(x 2 ), c(x 2 + 1), . . . , c(y 2 ), . . . , c(x n ), c(x n + 1), . . . , c(y n )], namely VMMFCC.

Long Short-Term Memory Network
Recurrent neural network (RNN) is the most traditional method of deep learning for time series signal classification, but RNN often has the problem of gradient disappearance for long sequence data, so it is difficult to learn and save long-term information. LSTM [28] proposed by Hochreiter and Schmidhuber in 1997 solves this problem well, and its core is a special implicit neural unit, as shown in Figure 4. LSTM can effectively prevent the occurrence of gradient disappearance and gradient explosion, and is widely used in long sequence recognition. It means that LSTM is also applicable to acoustic signals belonging to long time series.

Long Short-Term Memory Network
Recurrent neural network (RNN) is the most traditional method of deep learning for time series signal classification, but RNN often has the problem of gradient disappearance for long sequence data, so it is difficult to learn and save long-term information. LSTM [28] proposed by Hochreiter and Schmidhuber in 1997 solves this problem well, and its core is a special implicit neural unit, as shown in Figure 4. LSTM can effectively prevent the occurrence of gradient disappearance and gradient explosion, and is widely used in long sequence recognition. It means that LSTM is also applicable to acoustic signals belonging to long time series. Compared with RNN, LSTM is more complex. On the basis of retaining the original hidden state ht, it also adds a new hidden state of the cell state, denoted as Ct. The longterm memory and short-term memory functions of LSTM are realized through the gated structure, including input gate, forget gate and output gate. The specific calculation process is as follows: (1) The forget gate is a gated structure that controls the cellular state at the previous moment to be forgotten according to the probability. The gate will read the hidden state term memory and short-term memory functions of LSTM are realized through the gated structure, including input gate, forget gate and output gate. The specific calculation process is as follows: (1) The forget gate is a gated structure that controls the cellular state at the previous moment to be forgotten according to the probability. The gate will read the hidden state of the previous moment and the input of the current moment, and output a value between 0 and 1 to represent the probability that the data is retained, the formula is as follows: where h t−1 represents the hidden state at the previous moment, x t is the input at the current moment, W is the weight vector, b is the bias vector, σ represents the sigmoid function, and f t is the retention probability of the cell state.
(2) The input gate is a gated structure that determines the degree to which new information enters the cellular state. The gate consists of two parts. The calculation formulas are as follows: Equation (14) uses the sigmoid layer to decide which information needs to be updated, that is, the update probability: i t ; Equation (15) uses the tanh layer to generate the update content of the current moment: C t . The results of the two equations are multiplied to update the cellular state.
(3) Combine the forget gate and the input gate to calculate the current cell state C t .
(4) The output gate is a gated structure that determines the final output of the cell state to the hidden state h t . It is also composed of two parts. The calculation is as follows: Equation (17) uses the sigmoid layer to determine which parts of C t . will be output. Equation (18) is to multiply the output of the sigmoid layer with the C t . processed by tanh to obtain the hidden state output h t at the current moment.

Fault Diagnosis Process Based on VMMFCC and LSTM
When calculating the MFCC for the IMF components, only the cepstral coefficients of the Mel filter corresponding to the frequency distribution of each IMF component are calculated. It can eliminate some invalid information, reduce the feature dimension, and improve the operation speed. Therefore, this paper proposes a diesel engine fault diagnosis method based on VMMFCC-LSTM. The flow chart is shown in Figure 5. The specific implementation steps are as follows: Step 1: Optimization of VMD decomposition layers. Under different decomposition layers K, perform VMD on the original sound pressure signal of the diesel engine, and calculate the spectral centroid of each IMF component. When the boundary of the spectral centroid tends to be stable for the first time, VMD can not only avoid under-decomposition, but also better avoid modal aliasing. Therefore, the K value at this time is determined as the number of decomposition layers of VMD.
Step 2: IMF component optimized selection. the capacity ratio and energy variation range of each IMF under different operating states of the diesel engine are calculated to determine which IMF components contain large amounts of information and are more sensitive to fault conditions, and retain these n IMF components for subsequent feature extraction.
Step 3: Map the IMF frequency distribution to the Mel filter bank and calculate the new feature vectors. FFT is performed on the reserved IMF components, and the distribution in the frequency domain is mapped to the frequency distribution of the Mel filter bank, so as to calculate VMMFCC.
Step 4: Feature set partitioning. The feature sets of the diesel engine sound pressure signals in different operating states are extracted, and the training set and the validation set are divided in each state.
Step 5: Train the LSTM classification network. Build the LSTM network and adjust the network parameters, input the training set, and train the LSTM network classifier.
Step 6: Fault identification. The trained LSTM network structure and parameters are transferred to the classification network of the validation set to verify the effect of the proposed method.

Experimental Setup
The diesel engine preset experiment takes a high pressure common rail diesel engine with the model of CA6DF3-20E3 as the research object. This test bench is a customized

Experimental Setup
The diesel engine preset experiment takes a high pressure common rail diesel engine with the model of CA6DF3-20E3 as the research object. This test bench is a customized derating test bench for experiment and teaching. In the artificial preset fault experiment, the sound pressure signals of the diesel engine under five working conditions, namely, normal, blocked air inlet, the first cylinder misfire, the third cylinder misfire and the sixth cylinder misfire, are tested respectively. The installation position of the test bench and the sensor is shown in Figure 6. derating test bench for experiment and teaching. In the artificial preset fault experiment, the sound pressure signals of the diesel engine under five working conditions, namely, normal, blocked air inlet, the first cylinder misfire, the third cylinder misfire and the sixth cylinder misfire, are tested respectively. The installation position of the test bench and the sensor is shown in Figure 6. According to the standard on the position of the microphone in GB/T 17248.3-2018, the distance between the two sensors does not exceed 2 m and the distance between the sensor and the ground shall not be less than 1.2 m. Therefore, the two sound pressure sensors are arranged at the center of 30 cm on each side of the diesel engine, and the height is basically the same as the upper plane of the cylinder head. The experiment adopts YSV5001 high-precision ICP-type sound pressure sensor and DH5902N data acquisition system. The sensor sensitivity is 50 mV/Pa and the sampling frequency is 20 kHz. In the abnormal state setting, the blockage of the intake port is realized by installing the intake cover, and the misfire is realized by disconnecting the ignition power cord of the corresponding cylinder. According to the sampling settings in Table 1, the sound pressure signals of the diesel engine running in five states were collected.

Experimental Data Preprocessing
The sound pressure signals of different states in the time domain are shown in Figure  7a, and it is found that the periodicity and regularity of the sound pressure signals of diesel engines are poor. Through the fast Fourier transform, the frequency domain signal diagram is obtained as shown in Figure 7b. It can be seen that the spectral distribution difference under different states is not too large, so it is difficult to find the corresponding fault characteristic frequency. It can be found that the spectral energy is mainly distributed within 9 kHz, which does not exceed the hearing range of the human ear of 20Hz~20 kHz, so it basically conforms to the applicable range of the Mel filter bank. According to the standard on the position of the microphone in GB/T 17248.3-2018, the distance between the two sensors does not exceed 2 m and the distance between the sensor and the ground shall not be less than 1.2 m. Therefore, the two sound pressure sensors are arranged at the center of 30 cm on each side of the diesel engine, and the height is basically the same as the upper plane of the cylinder head. The experiment adopts YSV5001 high-precision ICP-type sound pressure sensor and DH5902N data acquisition system. The sensor sensitivity is 50 mV/Pa and the sampling frequency is 20 kHz. In the abnormal state setting, the blockage of the intake port is realized by installing the intake cover, and the misfire is realized by disconnecting the ignition power cord of the corresponding cylinder. According to the sampling settings in Table 1, the sound pressure signals of the diesel engine running in five states were collected.

Experimental Data Preprocessing
The sound pressure signals of different states in the time domain are shown in Figure 7a, and it is found that the periodicity and regularity of the sound pressure signals of diesel engines are poor. Through the fast Fourier transform, the frequency domain signal diagram is obtained as shown in Figure 7b. It can be seen that the spectral distribution difference under different states is not too large, so it is difficult to find the corresponding fault characteristic frequency. It can be found that the spectral energy is mainly distributed within 9 kHz, which does not exceed the hearing range of the human ear of 20Hz~20 kHz, so it basically conforms to the applicable range of the Mel filter bank.

Data Set Partitioning
In order to facilitate the analysis and verification of the scientificity and validity of the classification model constructed in this paper, it is necessary to divide the collected data. In this experiment, the 0.5 s sound pressure data collected by the sound pressure sensor 1 is used as a sample, that is, 10,000 sampling points are used as a sample. Therefore, the number of samples measured in each state is 120, of which 100 samples are selected as the training set, and the remaining 20 samples are used as the verification set. The data set division is shown in Table 2. Table 2. The division of data set sample. S1  120  100  20  S2  120  100  20  S3  120  100  20  S4  120  100  20  S5  120  100  20  Total  600 500 100

The Extraction of VMMFCC
Compared with the traditional EMD, the VMD can effectively improve the modal aliasing phenomenon, but the determination of the decomposition level K of the VMD needs to be set manually. For complex signals, the value of K will directly affect the decomposition effect. If its value is too large, it will cause over-decomposition and increase the degree of modal aliasing; if its value is too small, it will lead to underdecomposition and no useful signal will be obtained. In this paper, the optimal number of decomposition layers is determined by analyzing the distribution of the spectral centroid of each IMF component. The formula for calculating the spectral centroid is as follows: where N is the number of FFT points, k is the frequency subscript, and S(k) is the amplitude value of the frequency domain signal at k.

Data Set Partitioning
In order to facilitate the analysis and verification of the scientificity and validity of the classification model constructed in this paper, it is necessary to divide the collected data. In this experiment, the 0.5 s sound pressure data collected by the sound pressure sensor 1 is used as a sample, that is, 10,000 sampling points are used as a sample. Therefore, the number of samples measured in each state is 120, of which 100 samples are selected as the training set, and the remaining 20 samples are used as the verification set. The data set division is shown in Table 2. Table 2. The division of data set sample. S1  120  100  20  S2  120  100  20  S3  120  100  20  S4  120  100  20  S5  120  100  20  Total  600  500  100

The Extraction of VMMFCC
Compared with the traditional EMD, the VMD can effectively improve the modal aliasing phenomenon, but the determination of the decomposition level K of the VMD needs to be set manually. For complex signals, the value of K will directly affect the decomposition effect. If its value is too large, it will cause over-decomposition and increase the degree of modal aliasing; if its value is too small, it will lead to under-decomposition and no useful signal will be obtained. In this paper, the optimal number of decomposition layers is determined by analyzing the distribution of the spectral centroid of each IMF component. The formula for calculating the spectral centroid is as follows: where N is the number of FFT points, k is the frequency subscript, and S(k) is the amplitude value of the frequency domain signal at k. VMD is performed on the sound pressure signal according to different decomposition layers K, and the respective spectral centroids of the IMFs are calculated respectively, and they are arranged in order from low to high. The results of normal status are shown in Table 3. As the value of K increases, the range boundary of the spectral centroid will gradually stabilize. When the boundary values of the spectral centroid are stable for the first time, it can be determined that the K value at this time is the optimal number of decomposition layers. When K = 8, the boundary value of the spectral centroid begins to stabilize. Therefore, for the collected sound pressure signal, the optimal value of K is determined to be 8. At this time, the decomposition effect of VMD is shown in Figure 8. The spectral distribution of each IMF has no obvious modal aliasing and is relatively concentrated, so the decomposition effect is ideal. VMD is performed on the sound pressure signal according to different decomposition layers K, and the respective spectral centroids of the IMFs are calculated respectively, and they are arranged in order from low to high. The results of normal status are shown in Table 3. As the value of K increases, the range boundary of the spectral centroid will gradually stabilize. When the boundary values of the spectral centroid are stable for the first time, it can be determined that the K value at this time is the optimal number of decomposition layers. When K = 8, the boundary value of the spectral centroid begins to stabilize. Therefore, for the collected sound pressure signal, the optimal value of K is determined to be 8. At this time, the decomposition effect of VMD is shown in Figure  8. The spectral distribution of each IMF has no obvious modal aliasing and is relatively concentrated, so the decomposition effect is ideal.   In order to verify whether the optimal decomposition level K = 8 is also applicable to other data of fault state, acoustic signal is processed by VMD under different states with In order to verify whether the optimal decomposition level K = 8 is also applicable to other data of fault state, acoustic signal is processed by VMD under different states with different K-values, according to the same method, and calculate the spectral centroids of each IMF respectively. Since the change of boundary value of spectrum centroid is mainly considered when determining the optimal number of layers, the maximum and minimum values of the IMF spectrum centroid of each group of signals are calculated, as shown in Table 4. By observing the same state data, with the constant increase of K, the maximum value of spectral centroid tends to be stable after K = 8, and when K = 8, the minimum value of spectral centroid is also in the stable range. Therefore, K = 8 is the optimal decomposition level for VMD processing of diesel engine operation data in different states. VMD of the original sound pressure signal only decomposes the signal into multiple modes, but it cannot play a role in noise reduction. Therefore, it is also necessary to filter and eliminate the IMF components obtained, so as to achieve signal noise reduction and retain the useful part of the signal. The noise mentioned here is generalized. Any component that is not conducive to fault feature extraction can be regarded as noise. When the energy proportion of an IMF component changes significantly under different operating conditions, it can be considered that the IMF component is more sensitive to fault conditions. At the same time, it is also necessary to take into account the amount of information contained in the IMF. If the energy of the IMF component is relatively small, it means that its amount of information is small. Therefore, IMFs with a large proportion of energy and obvious changes among different states should be retained.
In this paper, Average proportion of energy (APE), variance of contribution of energy (VPE) and average Pearson correlation coefficient (APCC) are selected as the selection indicators of IMF components, in which APE is obtained by averaging the energy proportion of the same IMF in all groups of data; The VPE is obtained by first calculating average energy proportion of the same IMF in 120 groups of data under each state, and then calculating the variance between the five average values. Therefore, VPE can reflect the sensitivity of the IMF to changes in operating conditions to a certain extent; The APCC is obtained by averaging the Pearson correlation coefficients of the same IMF and the original signal in each group of data, which can reflect the correlation between IMF components and the original signal. Table 5 shows APE, VPE and APCC of IMF 1~I MF 8 . In this paper, five IMF components that are larger in each index value are selected as candidates. Among them, the APE of IMF 1~I MF 5 is relatively high, indicating that these five components contain relatively large amount of information; The VPE of IMF 1 , IMF 2 , IMF 4 , IMF 5 and IMF 6 is large, which means that these five components are more sensitive to fault conditions; The APCC of IMF 1~I MF 5 is high, indicating that the IMF component contains high original information content. After comprehensive consideration, four IMF components, IMF 1 , IMF 2 , IMF 4 and IMF 5 , which take into account the three indicators, are retained in this paper to further extract fault features. The spectra of the four reserved IMF components will be corresponding to the constructed Mel filter bank, as shown in Figure 9. The frequency distribution of each IMF component of the energy will be mainly concentrated in some filters. Take IMF 1 as an example, its spectrum is mainly distributed in the range of the first to fourth triangular filters. Therefore, the first to fourth triangular filters are called the mapped part. And the distribution in the 5th to 13th filters is very small, which is called the unmapped part. The IMF 1 component itself has very little information in the unmapped part, and filtering through the filter will only result in less obvious features, or even cause interference. So only the 4 MFCC values of the mapped part are calculated for IMF 1 . In the same way, the mapping relationship between different IMF components and Mel filters is obtained, as shown in Table 6. The spectra of the four reserved IMF components will be corresponding to the constructed Mel filter bank, as shown in Figure 9. The frequency distribution of each IMF component of the energy will be mainly concentrated in some filters. Take IMF1 as an example, its spectrum is mainly distributed in the range of the first to fourth triangular filters. Therefore, the first to fourth triangular filters are called the mapped part. And the distribution in the 5th to 13th filters is very small, which is called the unmapped part. The IMF1 component itself has very little information in the unmapped part, and filtering through the filter will only result in less obvious features, or even cause interference. So only the 4 MFCC values of the mapped part are calculated for IMF1. In the same way, the mapping relationship between different IMF components and Mel filters is obtained, as shown in Table 6.    This experiment takes 0.5 s of data as a sample, that is, 10,000 sampling points as a sample. Generally speaking, the frame length is set to 25 ms, and the frame shift is set to 15 ms. This article uses this setting to calculate the frame length as 20,000 × 0.025 = 500 sampling points, and the frame shift as 20,000 × 0.015 = 300 sampling points. Therefore, a sample will be divided into 32 frames in feature extraction, and the sum of the number of MFCCs extracted in each frame is the dimension of the frame's feature vector, so that a set of 32 × 17 features vector is extracted from each set of data.

Comparison of Feature Effect
The VMMFCC feature is different from the traditional MFCC feature. Figure 10a-e shows the MFCC feature vector map in five states. From the shape of the graph, under the same feature dimension, the feature coefficients corresponding to different frames fluctuate greatly under the same feature dimension, indicating that the feature fluctuates greatly over time, so the robustness of the feature is poor. Therefore, the robustness of this feature is poor. When the MFCC is directly extracted from the original signal, the feature will also be extracted from the noise component. Therefore, the feature vectors are more cluttered and less regular, which will also bring a burden to the subsequent training of the LSTM network.
MFCCs extracted in each frame is the dimension of the frame's feature vector, so that a set of 32 × 17 features vector is extracted from each set of data.

Comparison of Feature Effect
The VMMFCC feature is different from the traditional MFCC feature. Figure 10a-e shows the MFCC feature vector map in five states. From the shape of the graph, under the same feature dimension, the feature coefficients corresponding to different frames fluctuate greatly under the same feature dimension, indicating that the feature fluctuates greatly over time, so the robustness of the feature is poor. Therefore, the robustness of this feature is poor. When the MFCC is directly extracted from the original signal, the feature will also be extracted from the noise component. Therefore, the feature vectors are more cluttered and less regular, which will also bring a burden to the subsequent training of the LSTM network.  In order to verify the effectiveness of VMD for MFCC coefficient mapping selection and outperform other modal decomposition methods. Use LMD, EMD and EEMD to decompose the original signals, and then get the mapping MFCC under different mode decomposition methods through the process in Figure 5, namely LMMFCC, EMMFCC and EEMMFCC. The MFCC of the original signal and the four mapping MFCCs are trained and verified using LSTM network. The final fault diagnosis accuracy is shown in Table 7, and the fault diagnosis confusion matrix is shown in Figure 12.  In order to verify the effectiveness of VMD for MFCC coefficient mapping selection and outperform other modal decomposition methods. Use LMD, EMD and EEMD to decompose the original signals, and then get the mapping MFCC under different mode decomposition methods through the process in Figure 5, namely LMMFCC, EMMFCC and EEMMFCC. The MFCC of the original signal and the four mapping MFCCs are trained and verified using LSTM network. The final fault diagnosis accuracy is shown in Table 7, and the fault diagnosis confusion matrix is shown in Figure 12.  It can be seen that the accuracy of fault diagnosis can reach 87% when taking MFCC directly extracted from the original signal as the feature, but the diagnosis ability for S state is poor. Although feature LMMCC and EMMFCC use modal decomposition an mapping, due to the limited decomposition ability of LMD and EMD, it may lead to moda aliasing, which will bring more interference to feature extraction later, so the diagnosti effect is not as good as that of MFCC. With EEMMFCC as the feature input, the diagnosti accuracy has been improved to a certain extent, reaching 90%, but each state cannot b completely accurately identified, and the feature extraction time is long. VMMFCC is th feature proposed in this paper. Its feature extraction time is slightly higher than that o other features with lower accuracy, but the overall fault diagnosis accuracy can reach 97% and individual confusion only exists among the three similar fault states of S3, S4 and S5 Therefore, VMMFCC can better reflect the running state of diesel engine and obtain highe accuracy in the diagnosis process.

Comparison of Classifier Effects
After feature extraction, proper fault classification diagnostic device is also the key to fault diagnosis. For one-dimensional time-domain signals, LSTM is the most commonl used neural network recognition classification signal. The construction of LSTM networ should take into account such structural factors as data format, number of layers, numbe of hidden units, and set appropriate learning rate, batch size, discard rate, maximum number of iterations and other network learning super parameters [29]. Since the origina signal has been feature extracted in the early stage, there is no need to use the overly complex deep network for feature learning, so the simple double-layer LSTM networ structure is used in this paper. Specific parameters of network model training are shown It can be seen that the accuracy of fault diagnosis can reach 87% when taking MFCC directly extracted from the original signal as the feature, but the diagnosis ability for S3 state is poor. Although feature LMMCC and EMMFCC use modal decomposition and mapping, due to the limited decomposition ability of LMD and EMD, it may lead to modal aliasing, which will bring more interference to feature extraction later, so the diagnostic effect is not as good as that of MFCC. With EEMMFCC as the feature input, the diagnostic accuracy has been improved to a certain extent, reaching 90%, but each state cannot be completely accurately identified, and the feature extraction time is long. VMMFCC is the feature proposed in this paper. Its feature extraction time is slightly higher than that of other features with lower accuracy, but the overall fault diagnosis accuracy can reach 97%, and individual confusion only exists among the three similar fault states of S3, S4 and S5. Therefore, VMMFCC can better reflect the running state of diesel engine and obtain higher accuracy in the diagnosis process.

Comparison of Classifier Effects
After feature extraction, proper fault classification diagnostic device is also the key to fault diagnosis. For one-dimensional time-domain signals, LSTM is the most commonly used neural network recognition classification signal. The construction of LSTM network should take into account such structural factors as data format, number of layers, number of hidden units, and set appropriate learning rate, batch size, discard rate, maximum number of iterations and other network learning super parameters [29]. Since the original signal has been feature extracted in the early stage, there is no need to use the overly complex deep network for feature learning, so the simple double-layer LSTM network structure is used in this paper. Specific parameters of network model training are shown in Table 8. Other learning super parameter settings include: the decay period of learning rate decay period is 30, the decay factor of learning rate is 0.1, the batch size is 32, and the maximum number of iterations is 35. To verify the rationality of parameter settings of LSTM network, we also compared the impact of different number of hidden cells and dropout rates on network training and final diagnosis results, as shown in Table 9. As the number of hidden layer units increases, the number of network training parameters increases. At the same time, by comparing the diagnostic accuracy of different combinations of number of hidden cells of LSTM and dropout rate, it can be seen that number of hidden cells = 100 and dropout rate = 0.2 are the best combination of parameter settings. The t-SNE visualization process is performed on the features processed by different layers of the double-layer LSTM network, and the intuitive classification effect is shown in Figure 13. It can be seen that the distribution of fault characteristics of each state extracted by LSTM1 is generally scattered, only the S1 state can be basically distinguished, and the features of the other four states are seriously scattered, making it difficult to achieve fault separation. The role of the Dropout layer is to prevent overfitting, so the extracted features can not improve the classification effect. After the feature vector is processed by LSTM2, five states can be effectively identified. Among them, the S1 and S2 states can be completely distinguished, and there is a very small amount of confusion between the S3, S4, and S5 states, but these three states can also be effectively identified. The classification effect of the fullyConnected layer is more obvious, but there is still a very small amount of confusion between S3, S4, and S5. Considering that S3, S4, and S5 are all in the misfire state of a certain cylinder in the diesel engine, their fault states are relatively close, so there may be misjudgments in the identification process. In order to verify the effectiveness of LSTM network model for sound pressure signals and extracted features, common classification models used in machine learning and deep learning are selected for training. The accuracy of fault diagnosis obtained by inputting verification sets into different classification models is shown in Table 10. Among them, as a traditional machine learning model, SVM has a relatively simple structure, so the sum of trainable parameters is less, but it is difficult to effectively distinguish the types of fault states; For LSTM, RNN and 1D-CNN neural networks, the total training parameters of LSTM network classifier are relatively small and can achieve better classification effect, which proves that LSTM network is more suitable for the classification of sound signals under different operating conditions of diesel engines. The diagnosis effect of VMMFCC in different classification models is showed in Figure 14. When using SVM for diagnosis, various states cannot achieve a good fault identification effect. When using the traditional RNN model for classification, the accuracy rate can reach 91%, but the effect is not ideal. It may be that the network cannot be accurately classified due to the disappearance of the gradient during the training process. The diagnostic accuracy rate of 1D-CNN is 86%. Although the convolutional In order to verify the effectiveness of LSTM network model for sound pressure signals and extracted features, common classification models used in machine learning and deep learning are selected for training. The accuracy of fault diagnosis obtained by inputting verification sets into different classification models is shown in Table 10. Among them, as a traditional machine learning model, SVM has a relatively simple structure, so the sum of trainable parameters is less, but it is difficult to effectively distinguish the types of fault states; For LSTM, RNN and 1D-CNN neural networks, the total training parameters of LSTM network classifier are relatively small and can achieve better classification effect, which proves that LSTM network is more suitable for the classification of sound signals under different operating conditions of diesel engines. The diagnosis effect of VMMFCC in different classification models is showed in Figure 14. When using SVM for diagnosis, various states cannot achieve a good fault identification effect. When using the traditional RNN model for classification, the accuracy rate can reach 91%, but the effect is not ideal. It may be that the network cannot be accurately classified due to the disappearance of the gradient during the training process. The diagnostic accuracy rate of 1D-CNN is 86%. Although the convolutional network has strong self-learning ability, it is difficult to learn the features of time series signals with strong correlation. When using LSTM, the overall accuracy rate can reach 97%, which is better than other classifiers and can achieve relatively satisfactory results. network has strong self-learning ability, it is difficult to learn the features of time series signals with strong correlation. When using LSTM, the overall accuracy rate can reach 97%, which is better than other classifiers and can achieve relatively satisfactory results.
To sum up, the VMMFCC-LSTM fault diagnosis method proposed in this paper has more representative feature extraction and better diagnosis effect, and provides a more effective method and idea for diesel engine acoustic fault diagnosis.

Conclusions
In this paper, a fault diagnosis method for diesel engine acoustic signals based on VMD mapping MFCC and LSTM is proposed. A feature that can better represent the current operating state is extracted from the sound pressure signal obtained by noncontact measurement, and then the LSTM network is trained for diesel engine fault diagnosis. The main contributions of this paper are as follows: (1) VMD on the diesel engine sound pressure signal is performed, and calculate the spectral centroid of each IMF component under different decomposition layers K is calculated. According to the change of the boundary of spectral centroid, the optimal number of decomposition layers is determined to realize the optimization of VMD. (2) The MFCC method is introduced into the analysis of diesel engine sound signals.
According to the corresponding relationship between the frequency domain distribution of the IMF component and the Mel filter, a feature extraction method of VMD mapping MFCC is proposed for the first time, which effectively improve the ability of the sound signal of the diesel engine to reflect the fault state. To sum up, the VMMFCC-LSTM fault diagnosis method proposed in this paper has more representative feature extraction and better diagnosis effect, and provides a more effective method and idea for diesel engine acoustic fault diagnosis.

Conclusions
In this paper, a fault diagnosis method for diesel engine acoustic signals based on VMD mapping MFCC and LSTM is proposed. A feature that can better represent the current operating state is extracted from the sound pressure signal obtained by non-contact measurement, and then the LSTM network is trained for diesel engine fault diagnosis. The main contributions of this paper are as follows: (1) VMD on the diesel engine sound pressure signal is performed, and calculate the spectral centroid of each IMF component under different decomposition layers K is calculated. According to the change of the boundary of spectral centroid, the optimal number of decomposition layers is determined to realize the optimization of VMD. (2) The MFCC method is introduced into the analysis of diesel engine sound signals.
According to the corresponding relationship between the frequency domain distribution of the IMF component and the Mel filter, a feature extraction method of VMD mapping MFCC is proposed for the first time, which effectively improve the ability of the sound signal of the diesel engine to reflect the fault state. (3) A diagnosis method based on two-layer LSTM network is proposed. VMMFCC is input into LSTM network for parameter pre-training, then the verification set is used to verify the LSTM network. Thus, a fault diagnosis classifier suitable for VMMFCC is constructed, which achieves good fault diagnosis effect.
The experimental results show that the proposed method is superior to other diagnostic models in diesel engine acoustic fault diagnosis. In the engineering application of pursuing high timeliness, this research can quickly judge the fault type and locate the fault location by non-contact measurement means. In addition, the results of research will help to expand new knowledge in this field, with good reference value, and provide a new idea for relevant scholars.