Application of Generalized Composite Multiscale Lempel–Ziv Complexity in Identifying Wind Turbine Gearbox Faults

Wind turbine gearboxes operate in harsh environments; therefore, the resulting gear vibration signal has characteristics of strong nonlinearity, is non-stationary, and has a low signal-to-noise ratio, which indicates that it is difficult to identify wind turbine gearbox faults effectively by the traditional methods. To solve this problem, this paper proposes a new fault diagnosis method for wind turbine gearboxes based on generalized composite multiscale Lempel–Ziv complexity (GCMLZC). Within the proposed method, an effective technique named multiscale morphological-hat convolution operator (MHCO) is firstly presented to remove the noise interference information of the original gear vibration signal. Then, the GCMLZC of the filtered signal was calculated to extract gear fault features. Finally, the extracted fault features were input into softmax classifier for automatically identifying different health conditions of wind turbine gearboxes. The effectiveness of the proposed method was validated by the experimental and engineering data analysis. The results of the analysis indicate that the proposed method can identify accurately different gear health conditions. Moreover, the identification accuracy of the proposed method is higher than that of traditional multiscale Lempel–Ziv complexity (MLZC) and several representative multiscale entropies (e.g., multiscale dispersion entropy (MDE), multiscale permutation entropy (MPE) and multiscale sample entropy (MSE)).


Introduction
Wind turbines are widely used in the power system field, and are mainly composed of an impeller, rotor, gearbox, generator, bearing and coupling. The gearbox is one of the parts of the wind turbine most vulnerable to damage. The safe and steady operation of wind turbine gearboxes is directly related to the health condition of the whole mechanical system [1]. When a wind turbine gearbox exhibits local faults, the generated vibration signal is nonlinear and non-stationary. Furthermore, the generated vibration signal contains complicated and coupled vibration characteristic information (e.g., the periodic impulse, multiple harmonic interference and environment noise of signal transmission path), which indicates that it is difficult to identify wind turbine gearbox faults by common methods [2]. Therefore, to ensure the safe and steady operation of wind turbines, it is of great significance to develop a new and effective method to identify the local faults of wind turbine gearboxes.

Morphological Convolution Filtering
Considering the advantages of convolution operation in signal noise reduction, morphological convolution filtering, termed as a morphological gradient convolution operator (MGCO), was proposed by Li et al. [25] in 2018, which is defined as follows: MGCO(x(n)) = GCO(x(n)) * GCOOC(x(n)) (13) where the asterisk, *, denotes the convolution operation. Inspired by the concept of the MGCO, an alternative morphological convolution filtering method hailed as the morphological-hat convolution operator (MHCO) is formulated as follows: MHCO(x(n)) = AHCO(x(n)) * AHCOOC(x(n)) Due to fault feature information of practical wind turbine gearboxes, vibration signals are distributed over a wide frequency band; thus, excavating gear fault features by only using single-scale morphological filtering is inadequate. Therefore, the multiscale morphological-hat convolution operator is further defined as follows: MHCO(x(n) λg ) = AHCO(x(n) λg ) * AHCOOC(x(n) λg ) (15) where g is the single-scale SE, λg is the multiscale SE at scale λ, which can be obtained by dilation operation of λ − 1 times of g. Specifically, the multiscale SE λg can be expressed as follows [26]: λg = g ⊕ g ⊕ · · · ⊕ g } λ−1 times = (( g ⊕ · · · ⊕ g) ⊕ g) ⊕ g Through the introduction and application of multiscale SE λg, the proposed morphological convolution filtering (MHCO) process can successfully realize multi-resolution signal analysis and more accurately extract the fault feature information of wind turbine gearbox vibration signals than traditional morphological filtering. Theoretically, if fault feature frequency can be extracted by AHCO and AHCOOC, MHCO can also extract the same fault feature frequency, and the amplitude extracted by MHCO is larger than that AHCO and AHCOOC. Nevertheless, the existing related studies have shown that selection of the SE scale has a certain impact on the noise reduction performance of morphological filtering, which indicates that an effective selection strategy needs to be introduced in morphological convolution filtering.
At present, the research on selection method of SE scale of morphological filtering mainly focuses on two aspects. Firstly, many intelligent optimization algorithms (e.g., particle swarm optimization, genetic algorithm and differential evolution algorithm) are adopted to automatically select the SE scale of morphological filtering, but they will consume a lot of computational time under the iteration process. Secondly, some sensitive indexes (e.g., kurtosis, signal-to-noise ratio (SNR) [27] and fault feature ratio (FFR) [28]) are presented for the auxiliary selection of the SE scale of morphological filtering. Although these indexes had been proven effective in selecting the SE scale of morphological filtering, they also suffer from some disadvantages. For instance, kurtosis can detect periodic impulse characteristics of the vibration signal, but it is susceptible to random impulses with a large amplitude. Moreover, due to the influence of a complex environment, the fault signatures in the real gear vibration signal are indistinct and dispersive, which are difficult to detect through the kurtosis index. SNR can efficaciously evaluate stochastic noise interference of the vibration signal, but its robustness is weak for the impact property of the vibration signal. FFR can effectively characterize repetitive transients of the signal, but it Entropy 2021, 23, 1372 5 of 26 is insensitive to stochastic noise of the vibration signal. Therefore, considering the merits and demerits of SNR and FFR, this paper introduces a synthetic sensitive indicator called the signal characteristic frequency-to-noise ratio (SCFNR) to automatically determine the optimal SE scale of MHCO, which is defined as follows [29]: where f ci is the ith fault characteristic frequency of Hilbert envelope spectrum of the original signal x(n), S( f ci ), i = 1, 2, · · · , M is the amplitude of Hilbert envelope spectrum of the original signal x(n) at the ith fault characteristic frequency, S( f j ), j = 1, 2, · · · , N is the amplitude of Hilbert envelope spectrum of the original signal x(n) at the jth frequency f, and N and M are the number of all frequencies and fault characteristic frequencies of the Hilbert envelope spectrum of the original signal x(n), respectively. The SCFNR indicator is derived from the theoretical ideas of SNR and FFR; therefore, it inherits the advantages of SNR and FFR. The greater the SCFNR indicator, the better the noise reduction performance of MHCO, i.e., MHCO has better fault feature extraction performance. Therefore, the optimal SE scale of MHCO can be determined based on the largest SCFNR value.

Simulation Analysis
To verify the validity of MHCO in extracting periodic impulse features of vibration signals, a gear fault simulation signal y(t) is formulated as: where t 0 = mod(k/ f s , 1/ f o ), k = 0, 1, 2, · · · , 2047. The simulation signal y(t) consists of three parts (s 1 (t), s 2 (t) and r(t)). s 1 (t) is a periodic impulse sequence with the amplitude of 2, carrier frequency of f 1 = 200 Hz, modulation frequency (i.e., gear fault characteristic frequency) of f o = 16 Hz and attenuation coefficient of a = −100, which is used to simulate the impact signal generated by gear faults. Thus, the signal period of s 1 (t) is equal to 0.0625 s. s 2 (t) is a sinusoidal superimposed signal with the frequency of f 2 = 20 Hz and f 3 = 30 Hz, which is used to simulate the harmonic interference signals in gear vibration signals. r(t) is the Gaussian white noise with a signal-to-noise ratio (SNR) of 3 dB, which is used to simulate the background noise in gear vibration signal. The sampling frequency f s and sampling length of the simulation signal y(t) are set as 2048 Hz and 2048 points, respectively. Figure 1 shows the time domain waveform, amplitude spectrum and envelope spectrum of simulation signal y(t). Seen from the spectrum of Figure 1, due to the harmonic interference and background noise, the fault feature frequency f o = 16 Hz cannot be extracted by using directly the amplitude spectrum and envelope spectrum.
The proposed MHCO method and seven representative morphological filtering (i.e., GDE, GCO, GCOOC, AHDE, AHCO, AHCOOC and MGCO) are adopted to process the simulation signal y(t), respectively. In the comparison, all morphological filtering selected the widely used flat SE and the optimal SE scale was determined by the largest SCFNR. According to the literature [30], the maximal length of the flat SE is recommended as f s / f o , which represents the number of sampling points in one fault repetition period, i.e., the value of f s / f o can completely cover one fault repetition period. In addition, the length L and scale λ of the flat SE satisfy the relationship L = λ + 2. Therefore, the search range of the flat SE scale λ is set as 1 to f s / f o − 2, where f s is the sampling frequency, f o is the fault feature frequency and · denotes the down round operation. In other words, in the simulation signal, the flat SE scale λ = 1, 2, · · · , 126. According to the SE scale selection criterion, the SCFNR of different morphological filter methods are first calculated to determine their optimal SE scale. Figure 2 plots the SCFNR value obtained by all methods at different SE scales, and Table 1 lists the largest SCFNR obtained with different methods and their corresponding optimal SE scale. Then, different methods with the optimal SE scale are used to analyze the simulation signal y(t). Figure 3 shows the filtered signal obtained by different methods and their corresponding envelope spectrum. Seen from the envelope spectrum of Figure 3, the MHCO, AHDE and AHCO can effectively extract the fault characteristic frequency f o = 16 Hz and its harmonics. In addition, the fault characteristic frequency f o can also be found in the GCO, GCOOC and MGCO, but the fault characteristic frequency f o is invisible in the GDE and AHCOOC. Overall, compared with the seven other methods (i.e., GDE, GCO, GCOOC, AHDE, AHCO, AHCOOC and MGCO), the amplitude of the fault characteristic frequency obtained by the proposed MHCO is the largest, which indicates that the proposed MHCO has better noise reduction performance. Therefore, it can be seen from the simulation analysis results that the proposed MHCO is effective in eliminating the noise interference of gear vibration signals. The proposed MHCO method and seven representative morphological filtering (i.e. GDE, GCO, GCOOC, AHDE, AHCO, AHCOOC and MGCO) are adopted to process the simulation signal y(t), respectively. In the comparison, all morphological filtering selected the widely used flat SE and the optimal SE scale was determined by the largest SCFNR According to the literature [30], the maximal length of the flat SE is recommended as  Figure 2 plots the SCFNR value obtained by all methods at different SE scales, and Table 1 lists the largest SCFNR obtained with dif ferent methods and their corresponding optimal SE scale. Then, different methods with the optimal SE scale are used to analyze the simulation signal y(t). Figure 3 shows the filtered signal obtained by different methods and their corresponding envelope spectrum Seen from the envelope spectrum of Figure 3, the MHCO, AHDE and AHCO can effec         To study the influence of the added noises on the MHCO method, we calculated the results of the MHCO method under different noise levels (i.e., SNR = 0 dB, SNR = −2 dB and SNR = −4 dB), as shown in Figure 4. From Figure 4, when the SNR of Gaussian white noise of gear fault simulation signal y(t) is set as 0 to −4 dB, the proposed MHCO is still effective in extracting the gear fault characteristic frequency f o . However, when the SNR of Gaussian white noise is −5 dB, the proposed MHCO and other morphological filtering cannot effectively extract the gear fault characteristic frequency f o (see Figure 5). Hence, in the first example, the limit of SNR of simulation signal is empirically considered as −5 dB. results of the MHCO method under different noise levels (i.e., SNR = 0 dB, SNR = −2 dB and SNR = −4 dB), as shown in Figure 4. From Figure 4, when the SNR of Gaussian white noise of gear fault simulation signal y(t) is set as 0 to −4 dB, the proposed MHCO is still effective in extracting the gear fault characteristic frequency fo. However, when the SNR of Gaussian white noise is −5 dB, the proposed MHCO and other morphological filtering cannot effectively extract the gear fault characteristic frequency fo (see Figure 5). Hence, in the first example, the limit of SNR of simulation signal is empirically considered as −5 dB.   results of the MHCO method under different noise levels (i.e., SNR = 0 dB, SNR = −2 dB and SNR = −4 dB), as shown in Figure 4. From Figure 4, when the SNR of Gaussian white noise of gear fault simulation signal y(t) is set as 0 to −4 dB, the proposed MHCO is still effective in extracting the gear fault characteristic frequency fo. However, when the SNR of Gaussian white noise is −5 dB, the proposed MHCO and other morphological filtering cannot effectively extract the gear fault characteristic frequency fo (see Figure 5). Hence, in the first example, the limit of SNR of simulation signal is empirically considered as −5 dB.

Generalized Composite Multiscale Lempel-Ziv Complexity
To overcome the shortcoming of information loss of coarse-grained process existing in the traditional MLZC method, a new complexity index named generalized composite multiscale Lempel-Ziv complexity (GCMLZC) is presented in this section.

Lempel-Ziv Complexity
Lempel-Ziv complexity (LZC) is a widely used tool which can effectively describe the randomness and uncertainty of time series. Figure 6 shows the flowchart of the LZC method. For a given time series {x(i), i = 1, 2, · · · , N}, the specific steps of LZC are described as follows: To overcome the shortcoming of information loss of coarse-grained process existing in the traditional MLZC method, a new complexity index named generalized composite multiscale Lempel-Ziv complexity (GCMLZC) is presented in this section.

Lempel-Ziv Complexity
Lempel-Ziv complexity (LZC) is a widely used tool which can effectively describe the randomness and uncertainty of time series. Figure 6 shows the flowchart of the LZC method. For a given time series { } ( ), 1, 2, , , the specific steps of LZC are described as follows: Figure 6. Flowchart of the LZC method.
(1) According to Equation (19), the binary coarse-grained analysis is used to process the original time series, i.e., according to the 0-1 binary encoding, the original time series is reconstructed to obtain the symbol sequence Specifically, the mean value, Td, of the original time series is first calculated; then, the points in the original time series that are greater than the mean value are assigned a value of 1, and the points that are less than the mean value are assigned a value of 0.
(2) Initialize 0 P and 0 Q as the empty matrices and set the initial value 0 i = . At this time, the complexity ( ) 0 C i = . (1) According to Equation (19), the binary coarse-grained analysis is used to process the original time series, i.e., according to the 0-1 binary encoding, the original time series is reconstructed to obtain the symbol sequence S = {s 1 , s 2 , · · · , s N }. Specifically, the mean value, T d , of the original time series is first calculated; then, the points in the original time series that are greater than the mean value are assigned a value of 1, and the points that are less than the mean value are assigned a value of 0.
(2) Initialize P 0 and Q 0 as the empty matrices and set the initial value i = 0. At this time, the complexity C(i) = 0.
(3) Perform a loop operation. Set P i = {P i−1 s i }, Q i = {Q i−1 s i }, and then judge whether P i−1 contains Q i . If the judgment result is "Yes", the complexity C(i) will remain as the same value, i.e., Notably, in this step, it will loop N times until the symbol sequence S is traversed, and the last complexity C(N) can be obtained.
(4) Normalize the complexity C N to obtain the final Lempel-Ziv complexity value. Specifically, for the binary symbol sequence S, the normalized Lempel-Ziv complexity is calculated by Entropy 2021, 23, 1372 10 of 26

Multiscale Lempel-Ziv Complexity
For a given time series {x(i), i = 1, 2, · · · , N}, the specific calculation process of MLZC can be described as follows: (1) Using Equation (21) obtains a coarse-grained time series y (τ) j with the length of N/τ.
(2) According to Equation (22), the LZC value of each coarse-grained time series y can be calculated to obtain the final multiscale Lempel-Ziv complexity.
where τ is the scale factor and LZC(•) is the operator of Lempel-Ziv complexity. Figure 7 shows the flowchart of the MLZC method.
Specifically, for the binary symbol sequence S, the normalized Lempel-Ziv complexity is calculated by

Multiscale Lempel-Ziv Complexity
For a given time series { } ( ), 1,2, , , the specific calculation process of MLZC can be described as follows: (1) Using Equation (21) obtains a coarse-grained time series represents the scale factor. Apparently, when the scale factor amounts to the original time series.
(2) According to Equation (22), the LZC value of each coarse-grained time series ) (τ j y can be calculated to obtain the final multiscale Lempel-Ziv complexity.
where τ is the scale factor and

Generalized Composite Multiscale Lempel-Ziv Complexity
For a given time series {x(i), i = 1, 2, · · · , N}, the specific calculation process of GCMLZC is given as follows: (1) Using Equation (23) to obtain the generalized composite coarse-grained time series k,j τ . Concretely, in the GCMLZC method, when the scale factor τ = 1, one generalized composite coarse-grained time series y (1) k can be obtained, which is equivalent to the original time series. When the scale factor τ = 2, two generalized composite coarse-grained time series y (1) k and y (2) k can be obtained. Nevertheless, in the MLZC method, if the scale factor τ = 2, we can only obtain one coarse-grained time series y where  (3) Equation (24) can be used to calculate the mean LZC value as the LZC result of the scale factor τ of the original time series.
where τ is the scale factor and LZC(•) is the operator of Lempel-Ziv complexity.
(4) Judging whether the scale factor τ reaches its maximum value τ m . If the scale factor τ < τ m , set τ = τ + 1, return to steps (2) and (3) and continue to run the procedure until τ = τ m . Otherwise, stop the circulation process and output the final results of the GCMLZC method. In other words, after performing the GCMLZC method, for the scale factor τ, a series of LZC value can be obtained. It is worth mentioning that the scale factor τ of GCMLZC method is between 2 and τ m . In addition, without a loss of generality, the largest scale factor τ m is selected as 20, which is regarded as an empirical value. Figure 8 shows the flowchart of the GCMLZC method. (2) For the scale factor τ , calculating the LZC value of each generalized composite coarse-grained time series (3) Equation (24) can be used to calculate the mean LZC value as the LZC result of the scale factor τ of the original time series.
where τ is the scale factor and ( ) LZC  is the operator of Lempel-Ziv complexity.
(4) Judging whether the scale factor τ reaches its maximum value m τ . If the scale factor m τ τ < , set τ = τ + 1, return to steps (2) and (3) and continue to run the procedure until m τ τ = . Otherwise, stop the circulation process and output the final results of the GCMLZC method. In other words, after performing the GCMLZC method, for the scale factor τ , a series of LZC value can be obtained. It is worth mentioning that the scale factor τ of GCMLZC method is between 2 and m τ . In addition, without a loss of generality, the largest scale factor m τ is selected as 20, which is regarded as an empirical value. Figure 8 shows the flowchart of the GCMLZC method.

Comparison among LZC, MLZC and GCMLZC
To show the feature extraction performance of the proposed GCMLZC method, here, one intermittent multi-component amplitude-modulated and one frequency-modulated signal x(t) are established as follows: where n(t) is the Gaussian white noise with an SNR of 32 dB. x 1 (t), x 2 (t) and x 3 (t) are used to simulate gear fault signals with different frequencies (i.e., 1500 Hz, 750 Hz and 300 Hz), respectively. The sampling frequency and sampling length of simulation signal x(t) are 5000 Hz and 35,000 points. Figure 9 shows the time domain waveform of simulation signal x(t) and its components.
where n(t) is the Gaussian white noise with an SNR of 32 dB. x1(t), x2(t) and to simulate gear fault signals with different frequencies (i.e., 1500 Hz, 750 H respectively. The sampling frequency and sampling length of simulation 5000 Hz and 35,000 points. Figure 9 shows the time domain waveform of sim x(t) and its components. For convenient comparison, we firstly used a sliding window with 80% the number of the overlapping data points is 800) to conduct the data inter the simulation signal x(t), where the window width of the sliding window points, i.e., there were 171 sliding windows in total. Then, Euclidean dis three complexity indexes (i.e., LZC, MLZC and GCMLZC) of the intercep calculated to describe the complexity and uncertainty of the simulated s largest scale factors m τ of the GCMLZC and MLZC were set as 20. Figure   ED calculation results obtained by different complexity methods for the sim x(t). It can clearly be seen from Figure 10 that LZC can only depict the cha the component x1(t) and x2(t) of the simulation signal x(t). In addition, the c of the component x2(t) and x3(t) of the simulation signal x(t) can be detec whereas the proposed GCMLZC can track the changing of all components tion signal x(t). Furthermore, compared with LZC and MLZC, the Euclidi the proposed GCMLZC has smaller fluctuation and higher accuracy at th For convenient comparison, we firstly used a sliding window with 80% overlap (i.e., the number of the overlapping data points is 800) to conduct the data interception along the simulation signal x(t), where the window width of the sliding window was 1000 data points, i.e., there were 171 sliding windows in total. Then, Euclidean distances (ED) of three complexity indexes (i.e., LZC, MLZC and GCMLZC) of the intercepted data were calculated to describe the complexity and uncertainty of the simulated signal x(t). The largest scale factors τ m of the GCMLZC and MLZC were set as 20. Figure 10 shows the ED calculation results obtained by different complexity methods for the simulation signal x(t). It can clearly be seen from Figure 10 that LZC can only depict the changing state of the component x 1 (t) and x 2 (t) of the simulation signal x(t). In addition, the changing state of the component x 2 (t) and x 3 (t) of the simulation signal x(t) can be detected in MLZC, whereas the proposed GCMLZC can track the changing of all components of the simulation signal x(t). Furthermore, compared with LZC and MLZC, the Euclidian distance of the proposed GCMLZC has smaller fluctuation and higher accuracy at the component detection position of simulation signal x(t), which indicates that the GCMLZC method has better complexity assessment and fault feature extraction performance.
To investigate the influence of the added signal noises on the performance of the GCMLZC method, we calculated the analysis results of the GCMLZC method under different noise levels (i.e., SNR = 20 dB, SNR = 15 dB, SNR = 11 dB and SNR = 10 dB), as shown in Figure 11. It can clearly be seen from Figure 11 that the complexity assessment ability of the GCMLZC method will be decreased with the decrease in SNR. When the SNR of the added signal noises is 10 dB, the GCMLZC method cannot accurately depict the complexity of simulation signal x(t). Therefore, the SNRs of the added signal noises are usually set to more than 10 dB, which is regarded as an empirical value. ferent noise levels (i.e., SNR = 20 dB, SNR = 15 dB, SNR = 11 dB and SNR = 10 dB), as shown in Figure 11. It can clearly be seen from Figure 11 that the complexity assessmen ability of the GCMLZC method will be decreased with the decrease in SNR. When the SNR of the added signal noises is 10 dB, the GCMLZC method cannot accurately depic the complexity of simulation signal x(t). Therefore, the SNRs of the added signal noises are usually set to more than 10 dB, which is regarded as an empirical value.  ferent noise levels (i.e., SNR = 20 dB, SNR = 15 dB, SNR = 11 dB and SNR = 10 dB), as shown in Figure 11. It can clearly be seen from Figure 11 that the complexity assessment ability of the GCMLZC method will be decreased with the decrease in SNR. When the SNR of the added signal noises is 10 dB, the GCMLZC method cannot accurately depict the complexity of simulation signal x(t). Therefore, the SNRs of the added signal noises are usually set to more than 10 dB, which is regarded as an empirical value.

The Proposed Fault Diagnosis Scheme
To effectively identify wind turbine gearbox faults, this paper proposes a new intelligent fault diagnosis scheme based on MHCO and GCMLZC, which is mainly composed of four stages (i.e., data sample collection, signal preprocessing, fault feature extraction and fault pattern identification). Figure 12 shows the overall flowchart of the proposed fault diagnosis method, and its detailed procedure is expressed as follows:

The Proposed Fault Diagnosis Scheme
To effectively identify wind turbine gearbox faults, this paper proposes a new intelligent fault diagnosis scheme based on MHCO and GCMLZC, which is mainly composed of four stages (i.e., data sample collection, signal preprocessing, fault feature extraction and fault pattern identification). Figure 12 shows the overall flowchart of the proposed fault diagnosis method, and its detailed procedure is expressed as follows: Step 1: Data sample collection. Using the accelerometer to collect wind turbine gearbox vibration signals ( ) { (1), (2), , ( )} x i x x x N =  .
Step 2: Signal preprocessing. Morphological convolution filtering (i.e., MHCO) is adopted to preprocess the originally collected wind turbine gearbox vibration signal, which is aimed at weakening noise interference and highlighting fault features. Meanwhile, the SCFNR indicator is employed to select the optimal SE scale of MHCO.
Step 3: Fault feature extraction. According to the calculation process of GCMLZC, GCMLZC of the filtered signal is calculated to extract fault features of the wind turbine gearbox under different health conditions.
Step 4: Fault pattern identification. The extracted fault features are randomly divided into training samples and testing samples, where the training samples are used to train the softmax model and the testing samples are input into the well-trained softmax model to automatically identify wind turbine gearbox faults. In this step, the output of the softmax model is defined by Step 1: Data sample collection. Using the accelerometer to collect wind turbine gearbox vibration signals x(i) = {x(1), x(2), · · · , x(N)}.
Step 2: Signal preprocessing. Morphological convolution filtering (i.e., MHCO) is adopted to preprocess the originally collected wind turbine gearbox vibration signal, which is aimed at weakening noise interference and highlighting fault features. Meanwhile, the SCFNR indicator is employed to select the optimal SE scale of MHCO.
Step 3: Fault feature extraction. According to the calculation process of GCMLZC, GCMLZC of the filtered signal is calculated to extract fault features of the wind turbine gearbox under different health conditions.
Step 4: Fault pattern identification. The extracted fault features are randomly divided into training samples and testing samples, where the training samples are used to train the softmax model and the testing samples are input into the well-trained softmax model to automatically identify wind turbine gearbox faults. In this step, the output of the softmax model is defined by where p(θ) j represents the probability corresponding to the jth fault type, K is the number of fault types and θ denotes the parameters learned from the input samples.

Case 1: Experimental Gearbox Data Analysis
The proposed method was adopted to analyze gear vibration data collected from the laboratory of testing technology and fault diagnosis, North China Electric Power University (NCEPU). The experimental gear fault device was mainly composed of a driving motor, bearing, gearbox, shaft, turntable and governor. Figure 13a,b show photos of the experimental gear fault device and a gearbox structure drawing, respectively. In the experiment, gear vibration data were collected by using the accelerometer installed on the housing of the reduction gearbox with a sampling frequency of 5120 Hz. The experimental data collection system was mainly composed of an accelerometer, cable conductor, amphenol connector, signal conditioner, acquisition card and acquisition software, where the type of the data acquisition card was ADA16-8/2 (LPCI) with a single-terminal 8-channel input and 2-channel output. The motor speed could be adjusted by looking at the tachometer and turning the speed control knob. In addition, gear loading could be adjusted by switching on the brake and setting the level of braking torque. The specific steps can be found in the operating instructions of vibration analysis and the fault diagnosis test platform system for rotating machinery of QPZZ-II. The experimental gearbox was made up of two parts (i.e., the pinion and big gear). The pinion had 55 teeth, whereas the big gear had 75 teeth. In this experiment, the gearbox operated under five health conditions, including normal (condition 1), big gear pitting fault (condition 2), big gear fracture fault (condition 3), big gear pitting and pinion wear compound fault (condition 4), big gear fracture and pinion wear compound fault (condition 5). In addition, in this experiment, through speed adjustments, the motor operated at a rotating speed of about 800 rpm, but the actual speed and environmental interference under different health conditions differed somewhat, which indicates that the amplitude of the healthy condition may be greater than that of the unhealthy condition at a certain time point: the rotating frequencies of the small gear and big gear can be approximatively inferred as f r1 = 13.3 Hz and f r2 = 9.8 Hz, respectively. To verify the proposed method, 50 sets of gear vibration data under each health conditions were collected and each gear vibration signal consisted of 4096 data points. The training:testing data proportion was 1:1, i.e., the number of training samples and testing samples was the same, which was 125. Table 2 details the gear health conditions and sample selection. The time domain waveforms and amplitude spectra of gear vibration signals under different health conditions are shown in Figure 14. Notably, the plotted gear vibration signal belongs to the standardized results. The standardized formula is expressed as x = (x − mean(x))/std(x), where x is the collected original gear vibration signal, mean(x) is the mean value of x and std(x) is the standard deviation of x. As can be seen from Figure 14, the waveforms and spectra in different gear health conditions have certain similarities, especially for condition 3, condition 4 and condition 5, which implies that an effective method should be adopted to identify them. In order to facilitate the understanding, the identification performance of the proposed method was compared and analyzed from the following several aspects:    Gear fracture and wear 25 25 5 (1) The proposed method was utilized to analyze the collected gear vibration data. According to the flowchart of the proposed method, MHCO was first used to process different gear fault signals, where the optimal SE scale of MHCO was determined as 8 by  Gear fracture and wear 25 25 5 (1) The proposed method was utilized to analyze the collected gear vibration data. According to the flowchart of the proposed method, MHCO was first used to process different gear fault signals, where the optimal SE scale of MHCO was determined as 8 by using SCFNR. Notably, in this experimental data analysis, the search range of the flat SE scale λ was set as 1 to f s / f r − 2, where f s is the sampling frequency, f r is the rotating frequency of the input or output shaft and · denotes the down round operation. Due to the rotating frequency of the output shaft, f r2 = 9.8 Hz is smaller than that of the input shaft, i.e., when the maximum SE scale λ = f s / f r2 − 2, fault signatures of different gear health condition can all be covered. Hence, in experimental case 1, the flat SE scale λ = 1, 2, · · · , 520. Figure 15 plots the filtered results obtained by MHCO for different gear fault signals. Subsequently, the GCMLZC of all data samples was calculated for fault feature extraction. For analysis, Figure 16a,b show the GCMLZC of one data sample for different gear health conditions before and after morphological convolution filtering, respectively. In the GCMLZC method, without a loss of generality, the largest scale factor τ m is set as 20. As can be seen from Figure 16, GCMLZC with morphological convolution filtering has a better differentiation than GCMLZC without noise reduction. This proves the necessity of morphological convolution filtering in fault identification. Finally, the extracted GCMLZC was fed into the softmax classification model for automatically identifying different gear health conditions. Figure 17 shows the identification results of the proposed method for the first trial. Seen from Figure 17, the identification accuracy rate of the proposed method reached 98.4%, which indicates that only two data samples were misidentified; therefore, the proposed method is preliminarily proven to be effective in identifying gear fault types.
tively. In the GCMLZC method, without a loss of generality, the largest scale factor m τ is set as 20. As can be seen from Figure 16, GCMLZC with morphological convolution filtering has a better differentiation than GCMLZC without noise reduction. This proves the necessity of morphological convolution filtering in fault identification. Finally, the extracted GCMLZC was fed into the softmax classification model for automatically identifying different gear health conditions. Figure 17 shows the identification results of the proposed method for the first trial. Seen from Figure 17, the identification accuracy rate of the proposed method reached 98.4%, which indicates that only two data samples were misidentified; therefore, the proposed method is preliminarily proven to be effective in identifying gear fault types. tively. In the GCMLZC method, without a loss of generality, the largest scale factor m τ is set as 20. As can be seen from Figure 16, GCMLZC with morphological convolution filtering has a better differentiation than GCMLZC without noise reduction. This proves the necessity of morphological convolution filtering in fault identification. Finally, the extracted GCMLZC was fed into the softmax classification model for automatically identifying different gear health conditions. Figure 17 shows the identification results of the proposed method for the first trial. Seen from Figure 17, the identification accuracy rate of the proposed method reached 98.4%, which indicates that only two data samples were misidentified; therefore, the proposed method is preliminarily proven to be effective in identifying gear fault types. (2) To further verify the validity of the proposed method, comparisons among the proposed method and four representative complexity indexes (i.e., MLZC, multiscale dispersion entropy (MDE) [31], multiscale permutation entropy (MPE) [32] and multiscale sample entropy (MSE) [ (2) To further verify the validity of the proposed method, comparisons among the proposed method and four representative complexity indexes (i.e., MLZC, multiscale dispersion entropy (MDE) [31], multiscale permutation entropy (MPE) [32] and multiscale sample entropy (MSE) [33]) were performed. To avoid randomness in the identification results of different methods and to ensure a fair comparison, all methods were preprocessed by the same morphological filtering (MHCO), the largest scale factor τ m of all methods (i.e., GCMLZC, MLZC, MDE, MPE and MSE) were set as 20 and 10 trials were conducted. In addition, in the MDE method, the embedded dimension m = 3, time delay d = 1, the number of classes c = 5. In the MPE method, the embedded dimension m = 3 and time delay d = 1. In the MSE method, the embedded dimension m = 3 and the similarity tolerance r = 0.15 × SD, where SD is the standard deviation of the analyzed signal. Figure 18 shows the identification accuracies of different methods in 10 trials. In addition, Table 3 gives the detailed identification results of different methods, including the maximum, minimum and mean identification accuracy. Seen from Figure 18 and Table 3, the average identification accuracy (98.24%) of the proposed method was bigger than that of other methods (i.e., MLZC, MDE, MPE and MSE), whereas the standard deviation (0.3373) of the proposed method was lower than that of other methods, which means that the identification ability and stability of the proposed method are superior to other methods mentioned in this paper. Therefore, the effectiveness of the proposed method in gear fault identification is further validated by the above comparison.
(3) To consolidate the fault identification results, the fivefold cross-validation method was also applied to analyze the same gear vibration signal. Concretely, the data sample was first divided into five parts (each part had 50 samples), where four parts (i.e., 200 samples) were alternately regarded as the training samples and the remaining part (i.e., 50 samples) served as the testing sample. Next, five trials of different methods were performed, and the average identification accuracy values of five results were regarded as the ultimate identification accuracy of different methods. Table 4 gives the detailed diagnosis results obtained by different methods. As shown in Table 4, the proposed method achieved an average identification accuracy of 98.80%, whereas other complexity methods (i.e., MLZC, MDE, MPE and MSE) obtained 96.40%, 97.60%, 94.40% and 86.40% accuracy, respectively. The identification accuracy of the proposed method is clearly higher than that of other comparison methods. Consequently, the effectiveness and superiority of the proposed method is demonstrated once again.

Case 2: Engineering Data Analysis for Wind Turbine Gearbox
In this section, the proposed method was adopted to analyze the practical vibration data from a 1.5 MW wind turbine gearbox, which is located on a wind farm in northern China. Figure 19 shows a structural diagram of the wind turbine transmission system, which mainly consisted of a vane, spindle, rotor, gearbox and generator. The analyzed wind turbine gearbox adopted three-stage transmission (i.e., planetary stage, middle stage and high-speed stage), and was an FD1660 type. The rated power of the wind turbine gearbox was 1660 KW, and the weight of the gearbox was approximately 16,800 kg. In addition, the generator speed could be adjusted by using the electrical control system of the wind turbine. Table 5 lists the teeth numbers of each stage gear of the wind turbine gearbox, where Z 0 denotes the teeth number of the planet gear, Z 1 denotes the teeth number of the sun gear, Z 2 represents the teeth number of the inner ring gear, Z 3 and Z 5 are the teeth numbers of the big gear and small gear in the middle stage, respectively, and Z 4 and Z 6 are the teeth numbers of the big gear and small gear in the high-speed stage, respectively. In engineering data analysis, gear vibration data were collected by an accelerometer (see Figure 19) glued onto the casing of the gearbox with a sampling frequency of 32,768 Hz. The wind turbine gearbox operated under four gear health conditions (i.e., normal, pitting fault of small gear in middle stage, spalling fault of big gear in high-speed stage, fracture and wear compound fault small gear in high-speed stage). When gear vibration data collection was conducted for each health condition, the wind speed was stable at about 12 m/s (corresponding to an input shaft speed of about 17 rpm and a power of about 1500 kW), and the speed of the high-speed shaft was stable at about 1400 rpm. Thus, the rotating frequencies of the high-speed shaft and middle shaft can be approximatively calculated as f h = 23.33 Hz and f m = 6.27 Hz, respectively. Figure 20 shows photographs of three gearbox faults. In the process of method validation, we obtained 100 data samples of each health condition. For each health condition, 50 samples were randomly selected as the training data, and the remainder was regarded as testing data. A total of 200 training and 200 testing samples were obtained, and each sample had 16,384 points. Apparently, it is a four-classification issue to be solved in essence. Table 6 presents detailed information of wind turbine gearbox data. Figure 21 shows the time domain waveform and amplitude spectrum of the gear vibration signal under different health conditions. Seen from Figure 21, gear fault conditions are difficult to be identified directly through observing the time domain waveform and amplitude spectrum, because different gear vibration data have certain self-similarity. Therefore, it is necessary to adopt an effective method to process the practical gearbox data.
Apparently, it is a four-classification issue to be solved in essence. Table 6 presents detailed information of wind turbine gearbox data. Figure 21 shows the time domain waveform and amplitude spectrum of the gear vibration signal under different health conditions. Seen from Figure 21, gear fault conditions are difficult to be identified directly through observing the time domain waveform and amplitude spectrum, because different gear vibration data have certain self-similarity. Therefore, it is necessary to adopt an effective method to process the practical gearbox data.   form and amplitude spectrum of the gear vibration signal under different health conditions. Seen from Figure 21, gear fault conditions are difficult to be identified directly through observing the time domain waveform and amplitude spectrum, because different gear vibration data have certain self-similarity. Therefore, it is necessary to adopt an effective method to process the practical gearbox data.      According to the flowchart in Figure 12, the proposed method was adopted to analyze the practical gearbox data. In the proposed method, based on the SCFNR indicator, the optimal SE scale of MHCO was selected as 10. Similar to case 1, the search range of the flat SE scale λ was set as 1 to f s / f r − 2, where f s is the sampling frequency, f r is the rotating frequency of the high-speed shaft or middle shaft and · denotes the down round operation. The rotating frequency of the middle shaft f m = 6.27 Hz was smaller than that of the high-speed shaft, i.e., when the maximum SE scale λ = f s / f m − 2, fault signatures of different gear health condition can all be covered. Hence, in experimental case 2, the flat SE scale λ = 1, 2, · · · , 5224. Due to space limitations, the corresponding parameter optimization diagram is not included here. Figure 22 shows the filtered signals of three gear faults. For fault feature extraction, we calculated the GCMLZC of the filtered signal of all samples. Similar to case 1, in the GCMLZC method, the largest scale factor τ m was selected as 20. Figure 23a,b show the GCMLZC of gear vibration signals before and after applying the MHCO method, respectively. As shown in Figure 23, after morphological convolution filtering, the degree of distinction of four gear health conditions is greater than that without filtering processing. This verifies the importance of morphological convolution filtering for signal preprocessing. Finally, the extracted GCMLZC was input into the softmax model for fault pattern identification. Figure 24 shows the identification results of the proposed method in the first trial. As shown in Figure 24, only one sample was misclassified, which indicates that the proposed method can obtain an identification accuracy of 99.5% (199/200). Thus, the proposed method exhibits good recognition performance for wind turbine gearbox faults. greater than that without filtering processing. This verifies the importance of morpholog ical convolution filtering for signal preprocessing. Finally, the extracted GCMLZC wa input into the softmax model for fault pattern identification. Figure 24 shows the identifi cation results of the proposed method in the first trial. As shown in Figure 24, only on sample was misclassified, which indicates that the proposed method can obtain an iden tification accuracy of 99.5% (199/200). Thus, the proposed method exhibits good recogni tion performance for wind turbine gearbox faults.     Similarly to case 1, to further prove the effectiveness of the proposed method, the identification abilities of five methods (i.e., GCMLZC, MLZC, MDE, MPE and MSE) were compared. Similarly, 10 trials of different methods were conducted to ensure the fairness of the comparison results. In addition, in all comparison methods (i.e., GCMLZC, MLZC, MDE, MPE and MSE), without a loss of generality, the largest scale factor τ m was set as 20. In the MDE method, the embedded dimension m = 3, time delay d = 1 and the number of classes c = 5. In the MPE method, the embedded dimension m = 3 and time delay d = 1. In the MSE method, the embedded dimension m = 3 and the similarity tolerance r = 0.15 × SD, where SD is the standard deviation of the analyzed gear vibration signal. Figure 25 shows the fault identification accuracy of different methods in 10 trials, and Table 7 gives the detailed comparison results of different methods. Seen from Figure 25 and Table 7, the average identification accuracy (99.35%) of the proposed method was higher than that of other four methods (i.e., MLZC, MDE, MPE and MSE). In addition, standard deviation (0.2415) of the proposed method was less than that of other methods. This again indicates that the superiority of the proposed method in identifying wind turbine gearbox faults is verified.
To further consolidate the fault diagnosis results of wind turbine gearbox, we also used the fivefold cross-validation method to analyze the practical wind turbine gearbox data. Table 8 shows the detailed fault identification results obtained by different methods in the fivefold cross-validation. As can be seen from Table 8, the proposed method obtained an average identification accuracy of 99.50%, which is greater than that of the other comparison methods (i.e., MLZC, MDE, MPE and MSE). In other words, the proposed method has a stronger fault discriminant ability. This further proves that the proposed method is effective in extracting fault features from wind turbine gearboxes and identifying different gear fault categories.
To further consolidate the fault diagnosis results of wind turbine gearbox, we also used the fivefold cross-validation method to analyze the practical wind turbine gearbox data. Table 8 shows the detailed fault identification results obtained by different methods in the fivefold cross-validation. As can be seen from Table 8, the proposed method obtained an average identification accuracy of 99.50%, which is greater than that of the other comparison methods (i.e., MLZC, MDE, MPE and MSE). In other words, the proposed method has a stronger fault discriminant ability. This further proves that the proposed method is effective in extracting fault features from wind turbine gearboxes and identifying different gear fault categories.

Further Discussion
Although the proposed fault diagnosis scheme has been proven effective in identifying wind turbine gearbox faults, further research needs to be suggested. Firstly, in the signal preprocessing step of the proposed method, morphological convolution filtering can be replaced by other advanced techniques (e.g., local iterative filtering [34], total variation denoising [35] and sparse coding shrinkage), which is viewed as our future work. Secondly, in order to simultaneously obtain fault feature information of the vibration signal at different levels and scales, the idea of hierarchical decomposition can be integrated into the GCMLZC to further propose generalized hierarchical multiscale Lempel-Ziv complexity (GHMLZC), which is regarded as a future research direction. Thirdly, the softmax classification model was adopted in the fault pattern identification step of the proposed method; some other valuable classification models (e.g., weighted k-nearest neighbor, kernel extreme learning machine [36] and deep learning [37][38][39]) could also be adopted to replace the softmax model to automatically recognize wind turbine gearbox faults. Finally, the proposed method was only applied in gear fault diagnoses of wind turbines; therefore, our future work will focus on extending the proposed method to analyze other units (e.g., the bearing, rotor and blade) of mechanical systems. To avoid the dependence on knowing equipment information in advance, in future, we will integrate the proposed method into the recently popularized transfer learning model based on digital-analog drive to identify the unknown faults of different devices. In addition, it is worth mentioning that the proposed method was implemented on the MATLAB R2010a platform and operated on a computer with an Intel Core i7-9750H CPU @ 2.60 GHz/8.00 GB RAM processor. To implement and extend the proposed method repeatedly to other fields, the morphological filtering and Lempel-Ziv complexity software package will need to be downloaded, or the related code can be obtained directly from our research group.

Conclusions
In this paper, a new wind turbine gearbox fault identification method based on morphological convolution filtering and generalized composite multiscale Lempel-Ziv complexity has been presented. The main advantages of the proposed method are that fault feature extraction capability and identification accuracy can be improved by the combination of two methods (i.e., multiscale morphological-hat convolution operator and generalized composite multiscale Lempel-Ziv complexity). The effectiveness of the proposed method was also verified by experimental and engineering data analysis. Compared with traditional multiscale Lempel-Ziv complexity and several representative complexity indexes (i.e., multiscale dispersion complexity, multiscale permutation entropy and multiscale sample complexity), the proposed method could achieve a higher identification accuracy. Concretely, the contributions of this paper are summarized as follows: (1) An effective noise reduction process, named a multiscale morphological-hat convolution operator, has been developed, which can solve the problem of the empirical selection of structuring elements with the aid of signal characteristic frequency-to-noise ratio; (2) A complexity evaluation index, entitled generalized composite multiscale Lempel-Ziv complexity, has been proposed, which can avoid the problem of data length shortening appearing in multiscale Lempel-Ziv complexity; (3) A new fault diagnosis scheme for wind turbine gearbox faults is proposed via the integration of a multiscale morphological-hat convolution operator and generalized composite multiscale Lempel-Ziv complexity; (4) The experimental and engineering data analysis demonstrated the effectiveness of the proposed method in identifying wind turbine gearbox faults.
It should be pointed out that the influence of friction slip were not taken into account in the simulation model used in the paper. Hence, more accurate and comprehensive simulation signal analysis will be investigated in future research.

Data Availability Statement:
The data used in this study are all owned by the research group and will not be communicated.