Deep Learning Method for Fault Detection of Wind Turbine Converter

Abstract: The converter is an important component in wind turbine power drive-train systems, and it usually has a relatively high failure rate. Therefore, detecting potential faults in order to predict its failure has become indispensable for condition-based maintenance and operation of wind turbines. This paper presents an approach to wind turbine converter fault detection using convolutional neural network models developed from wind turbine Supervisory Control and Data Acquisition (SCADA) system data. The approach starts with the selection of fault indicator variables, whose data are then extracted from a wind turbine SCADA system. Using the data, radar charts are generated, and the convolutional neural network models are applied to extract features from the radar charts and to analyze the characteristics of those features for fault detection. Based on an analysis of the Octave Convolution (OctConv) network structure, an improved AOctConv (Attention Octave Convolution) structure is proposed in this paper and applied to the ResNet50 backbone network (named AOC–ResNet50). It is found that the algorithm based on AOC–ResNet50 overcomes the information asymmetry caused by the asymmetric sampling methods of the OctConv structure, as well as the damage that structure does to the original features in the high- and low-frequency domains. Finally, the AOC–ResNet50 network is employed for fault detection of the wind turbine converter using 10 min SCADA system data. It is verified that the fault detection accuracy using the AOC–ResNet50 network is up to 98.0%, which is higher than the fault detection accuracy using the ResNet50 and Oct–ResNet50 networks. Therefore, the effectiveness of the AOC–ResNet50 network model in wind turbine converter fault detection is confirmed. The novelty of this paper lies in the proposed AOC–ResNet50 network and its effectiveness in wind turbine fault detection, which was verified through a comparative study on wind turbine power converter fault detection with other competitive convolutional neural network models for deep learning.


Introduction
In recent years, the penetration of wind energy into the energy market has been growing steadily. The installed capacity worldwide reached 650.8 GW by the end of 2019, with a yearly average increase rate of more than 10% over the past 10 years [1]. Wind energy is captured using wind turbines. However, most wind turbines operate in a very harsh environment. As a result, wind turbine health condition assessment has become increasingly important for realizing condition-based maintenance and operation.
In a wind turbine, the power drive-train system usually has a higher failure rate than others. As a critical component in the power drive-train system, the converter plays a role

Model-Based Methods
In [33], a hybrid model covering the inverter and permanent magnet synchronous generators is established to diagnose inverter open-circuit faults by analyzing the residual signal of the generator stator current through the model. In [34], the use of heat flux sensors is proposed to monitor the condition of a wind power converter, mainly targeting the failure and aging of power electronic devices, i.e., insulated-gate bipolar transistors (IGBTs) in the converter, with models developed to implement the condition monitoring. In [35], a generic mathematical model of a two-level converter with open-switch faults is described, and the impact of an open-switch fault on the current control system of an isotropic permanent-magnet synchronous generator (PMSG) is investigated.
The method based on the analytical model has the advantages of being unaffected by the load, requiring no new hardware such as sensors, and providing fast diagnosis, but its accuracy depends heavily on the accuracy of the system model and of the estimated parameter values.
The wind turbine power converter is a multivariable system involving strong coupling and nonlinearity, and its physical parameters change dynamically in operation. Therefore, the fault detection and diagnosis method based on analytical models is limited in practical applications.

Methods Based on Signal Processing
This kind of method extracts fault features by analyzing the mean value, frequency and harmonics of sensor signals to realize fault detection and diagnosis. In this category, the methods are mainly divided into current signal-based and voltage signal-based methods.
Fault diagnosis by analysis of current signal: The average current Park's vector method is proposed in [36], where the average value of the three-phase current is transformed to the average current Park's vector by the Clarke transform. Under normal and fault conditions of the system, the faulty power device is located by analyzing the replication and phase difference of the average current Park's vector. In [37], a method is introduced for the diagnosis of DFIG back-to-back converter open-circuit faults, which can realize both fault detection and fault location. For fault detection, it gives an absolutely normalized Park's vector method. When the wind turbine generator is running at a synchronous speed, this method can detect multiple open-circuit switch faults and ensure that they are free from false alarms. For fault location, this method uses the normalized current average value to identify single-circuit and double-circuit open-circuit faults. In [38], converter fault occurrence is identified by detecting the change in the current Park's vector phase angle. The fault is located based on the Park's vector phase angle interval of the average current on the machine-side converter and the polarity of each phase current on the grid-side converter. In view of the fluctuation and high noise of wind speed, a diagnosis algorithm is proposed in [39] based on the Park current amplitude normalization method by reference to wind speed. Combined with the current trajectory, the fault diagnosis of a permanent magnet synchronous wind turbine power converter can be achieved under variable wind speed.
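As an illustration of the average current Park's vector idea summarized above, the following Python sketch averages each phase current over one fundamental period and maps the averages to a vector through the Clarke transform. It is a simplified sketch and not the implementation of [36]: the amplitude-invariant scaling, the function names, the unit-amplitude waveforms and, in particular, the toy fault model (the positive half-cycle of phase a is simply clipped while the other phases are left unchanged) are all assumptions made for illustration.

```python
import math

def clarke(ia, ib, ic):
    """Amplitude-invariant Clarke transform of three phase currents."""
    i_alpha = (2.0 / 3.0) * (ia - 0.5 * ib - 0.5 * ic)
    i_beta = (ib - ic) / math.sqrt(3.0)
    return i_alpha, i_beta

def average_park_vector(samples):
    """Average each phase over the window, then return (magnitude, angle)."""
    n = len(samples)
    ia_avg = sum(s[0] for s in samples) / n
    ib_avg = sum(s[1] for s in samples) / n
    ic_avg = sum(s[2] for s in samples) / n
    i_alpha, i_beta = clarke(ia_avg, ib_avg, ic_avg)
    return math.hypot(i_alpha, i_beta), math.atan2(i_beta, i_alpha)

def three_phase(n=360, clip_phase_a=False):
    """One period of unit-amplitude currents (hypothetical test signal).

    With clip_phase_a=True the positive half-cycle of phase a is removed,
    a toy stand-in for an open-switch fault (assumption, not a device model).
    """
    out = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        ia = math.sin(theta)
        ib = math.sin(theta - 2.0 * math.pi / 3.0)
        ic = math.sin(theta + 2.0 * math.pi / 3.0)
        if clip_phase_a:
            ia = min(ia, 0.0)
        out.append((ia, ib, ic))
    return out

healthy_mag, _ = average_park_vector(three_phase())
faulty_mag, faulty_angle = average_park_vector(three_phase(clip_phase_a=True))
```

In this toy setting the healthy average vector is essentially zero, whereas the clipped case yields a clearly nonzero vector whose angle points along the negative phase-a axis, which is the kind of magnitude and phase cue the reviewed methods exploit.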
Fault diagnosis by analysis of voltage signal: In [40], a method is presented for the diagnosis of open-circuit faults of an inverter by measuring and comparing the three-phase output voltage of the inverter and the midpoint voltage at the DC side. For the three-level topology, however, this method can only identify the faulty phase but cannot locate the faulty element. In [41], an inverter phase voltage observer is established. It is used to diagnose open-circuit faults of a converter by comparing the observed voltage with a reference voltage, but the accuracy of the observer is greatly affected by generator parameters. In [42], the operating characteristics of various voltages when the permanent magnet synchronous wind turbine power converter has an open-circuit fault are revealed through detailed theoretical derivation. In [43], the deviation of the line voltage before and after the converter fault of the permanent magnet synchronous wind turbine is used to diagnose and locate the fault. The fault diagnosis reliability is improved by requiring both the voltage amplitude and time width criteria to be satisfied. In [44], a new switch fault diagnosis method that treats output inductor voltages as diagnostic criteria is developed. Based on the features of the diagnostic signals in such a system, a low-cost diagnostic circuit is designed. A module that has failed due to an open-circuit or short-circuit fault can be detected within one switching period, allowing for an immediate fault-tolerant action.
Fault diagnosis methods based only on current or voltage signals have certain limitations in application. First, the electronic control system is relatively independent and compact, so it is not easy to add additional sensors or data acquisition units unless it is redesigned. The monitoring and diagnosis system also must not interfere with the normal operation of the electronic control system. Second, relying on a single signal for diagnosis increases the missed-detection and false-detection rates because of signal loss and signal interference.

Data-Driven Methods
The data-driven method does not require the precise mathematical model of the diagnostic object, nor does it need to add extra sensors or hardware circuits. It is widely applied to the fault detection/diagnosis of the converter. The typically applied techniques include neural networks, expert systems, support vector machine (SVM), fuzzy logic and cluster analysis [45][46][47][48][49].
In [45], the wavelet transform is used to obtain three-phase current fault characteristics, and an artificial neural network algorithm or fuzzy expert system is employed to classify the failure modes, from which the open-circuit fault of the converter is diagnosed. In [41], a variety of diagnostic methods for open-circuit, short-circuit and drive-signal-loss failures of converter IGBT modules are summarized, and their effectiveness and anti-interference abilities are compared. In [46], a data-driven fault diagnostic method using a long short-term memory (LSTM) network to detect multiple open-circuit switch faults of the back-to-back converters used in DFIG wind turbines is presented. In [47], an advanced Fault Detection and Diagnosis (FDD) approach aiming to increase the availability, reliability and required safety of wave energy converters (WEC) under different conditions is described. The developed approach exploits the benefits of the Machine Learning (ML)-based Hidden Markov Model (HMM) and the PCA model. To improve the accuracy of fault diagnosis for wind turbine converters, a fault feature extraction method combining the wavelet transform with compressed sensing theory is proposed in [48], in which an improved AdaBoost-SVM is developed and used for fault diagnosis. The three-phase output current signal is selected as the research object and is processed by the wavelet transform to reduce the signal noise. In [49], fault detection was conducted for three wind turbine subsystems, namely the pitch system, generator and converter, by developing SVR, SVM and convolutional neural network (CNN) models using SCADA system data. It is verified that the CNN model's performance is superior to that of the SVR and SVM models.
In view of the literature review above, the following conclusions can be drawn:

1. The model-based fault diagnosis method has higher diagnosis speed and lower cost, but it is sensitive to system parameters so that its practical application is limited.

2. The current signal-based fault diagnosis method does not require new hardware to be added, but its diagnosis speed is lower, and it is easily affected by noise disturbance. The voltage signal-based method has fast diagnosis and high accuracy, but it requires additional hardware circuits, such as voltage sensors, which increases the system cost and complexity. Usually, it is not suitable to add new sensors to a wind turbine control system unless it is redesigned. The fault diagnosis performance is not sufficient, and it is not stable if only one signal is used.

3. The data-based method does not require the establishment of an accurate mathematical model and has certain advantages for complex systems, but the algorithm is relatively complicated.

Nowadays, with the fast improvement in computational capacities and large volumes of high frequency and multi-dimensional parametric data recorded, new opportunities are created that make it possible to take advantage of deep learning techniques in fault detection and diagnosis for wind turbines. Based on a quick literature review, it was found that the future techniques applied to fault detection and diagnosis of wind turbines will be driven by deep learning approaches and algorithms. Motivated by this finding, this paper focuses on fault detection of wind turbine power converters using convolutional neural network models which are one kind of important model in deep learning.
Considering that the operation of wind turbines is affected by many factors, such as wind load and environment conditions, the condition monitoring signals involve a certain degree of randomness, and hence the signals are related to the faulty converter status. These signals are generally weaker than mechanical signals. Certain correlations exist between the converter fault signals. They interact and influence each other. In order to explore the relationship between the fault signals and improve the accuracy of fault detection, this research work starts with the selection of fault indicators and construction of radar charts based on the fault indicator variables data. The second step is to convert the radar charts into images which are then utilized for further processing in the fault detection. From the perspective of image processing, different features embedded in the images are extracted and analyzed to identify the normal and faulty operations of the wind turbines. This method helps to discover the correlation between signals, and can determine more characteristics from some weak signals involved in the process.
As one kind of important network model of deep learning, the convolutional neural network can take the original image as data input, analyze the relationship and characteristics between pixels in the image and reduce the information loss of the image in the processing and hence reduce errors. With the continuous increase in the number of image samples, the network depth is required to be increasingly deeper in order to seek better processing results. However, with increase in the network depth, it is difficult to optimize the neural network, and the accuracy of the network is obviously "degraded" so that the satisfactory learning effects cannot be obtained. Residual Network (ResNet), which introduces residual learning into the convolutional neural network, can help effectively solve this problem of rapid degradation of the accuracy of the network as the depth increases. It can greatly increase the depth of the network, but does not cause a sharp increase in the number of parameters. In 2019, Facebook AI, the National University of Singapore and Qihoo 360 AI Institute, jointly proposed OctConv (Octave Convolution) based on the mixed characteristics of information at different frequencies [50]. Using it to replace the traditional convolution can greatly save the computing resources while improving the learning effect. Therefore, this present paper applies OctConv to the ResNet network to detect faults in the wind turbine power converter. At the same time, in order to verify the effectiveness of the improved convolutional neural network for fault detection, the fault detection effects are compared with the ResNet50 convolutional neural network and the OctConv (Oct-ResNet50) network based on the ResNet50 backbone.
However, the OctConv structure has two shortcomings. First, the asymmetry of the upsampling and downsampling methods causes asymmetry of the information in modeling. Second, OctConv directly adds the high- and low-frequency domain features to realize the information interaction between different frequency domains, which greatly damages the original features in the high- and low-frequency domains. To overcome these shortcomings, an improved network structure based on OctConv, named the Attention Octave Convolution (AOctConv) structure, is proposed in this paper. First, the existing sampling method of OctConv is modified: the original downsampling method is replaced with max pooling and the original upsampling method with max unpooling, so that the downsampling and upsampling processes are symmetric. Second, a self-attention mechanism is introduced to adaptively control the interaction of information in the OctConv module and to self-supervise the output of the two branches in the high- and low-frequency domains. This paper then applies the improved AOctConv structure to the ResNet50 backbone network and hence proposes the AOC-ResNet50 network to realize fault detection for the wind turbine power converter.
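To make the symmetry argument concrete, the sketch below pairs max pooling with max unpooling on a one-dimensional signal: pooling records where each maximum came from, and unpooling writes each value back to exactly that position. This only illustrates the generic pooling/unpooling pair and is not the AOctConv implementation; the 1-D setting, the window size of 2, the function names and the sample values are assumptions.

```python
def max_pool_1d(x, k=2):
    """Non-overlapping max pooling; also returns the index of each maximum."""
    pooled, indices = [], []
    for start in range(0, len(x) - k + 1, k):
        window = x[start:start + k]
        offset = max(range(k), key=lambda j: window[j])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices

def max_unpool_1d(pooled, indices, length):
    """Write each pooled value back at its recorded index; zeros elsewhere."""
    out = [0.0] * length
    for value, idx in zip(pooled, indices):
        out[idx] = value
    return out

# Hypothetical feature values, chosen only for illustration.
signal = [1.0, 3.0, 2.0, 0.5, 4.0, 4.5, 0.0, 2.5]
low, where = max_pool_1d(signal)
restored = max_unpool_1d(low, where, len(signal))
```

Because the unpooling step reuses the indices saved during pooling, the retained maxima return to their original locations, which is the sense in which the two operations mirror each other.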
In summary, the overall workflow of this paper is illustrated in Figure 1. The first step is to determine the fault indicator variables that provide some indications or reflection of the converter health status. Seven fault indicator variables were selected based on analyses of existing failure cases and understanding of the requirements for converter functions.
Step 2 is to determine an appropriate approach for fault detection. In this paper, the data-driven approach was selected based on the literature review and discussions as well as the available data.
Step 3 is to select the techniques for modelling and analysis for which CNN models were applied to feature extraction of radar charts generated using SCADA system data.
Step 4 is to give an introduction to the typical CNN structures for the purpose of the development of a new improved CNN architecture.
Step 5 is to propose a new improved CNN architecture. With this new development, Step 6 is to apply the AOC-ResNet50 network to converter fault detection, and Step 7 provides a brief discussion based on the research findings and conclusion of the paper.
In view of the overall process as discussed above, the remainder of this paper is organized as follows: Section 2 gives a brief introduction to the generation of radar charts; Section 3 presents deep learning network principles; Section 4 describes the improved convolutional neural network; Section 5 discusses the fault detection of the wind turbine power converter using the convolutional neural network models with a comparison of the model performance; Section 6 provides a brief discussion, and Section 7 concludes the paper and indicates future research directions.

Generation of Radar Chart
The radar chart analysis method is a multivariate comparative analysis based on graphics that resemble a navigation radar screen.

Radar Chart Introduction
A radar chart is a graph that shows the values of multiple quantitative parameters along different axes starting from the same origin. It can be used to describe multivariate data. The relative position and angle of the radar chart axes usually carry no information. It is often called a network chart, spider chart, star chart, Kiviat chart or irregular polygon. It is a typical evaluation method based on comprehensive graphical analysis and is, therefore, suitable for the comprehensive analysis and comparison of multiple factors. It can vividly and intuitively reflect the comprehensive attributes of the evaluation target. Its advantages are that it is intuitive, vivid and easy to operate [51].
The radar chart shows the visual expression of numerical data from the perspective of "face thinking", and converts the information in the high-dimensional invisible space into intuitive planar information. Radar charts are widely used in scientific research and industrial fields for data visualization and graphical representation for effectively displaying multiple variables [52][53][54][55].
Before drawing a graph, the data need to be standardized, usually to the interval [0,1]. If there are m numerical data samples with n-dimensional features, each axis in the figure represents one of the n variables, and all n axes intersect at the center of the circle. The points corresponding to each variable value are connected in order to form a closed n-sided polygon. Finally, the m data samples can be represented as m two-dimensional n-sided polygons so that the original numerical data are represented by an image.
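The construction described above can be sketched in a few lines of Python: each feature is scaled to [0,1] across the m samples, and each sample is then mapped to the n vertices of its polygon. This is a minimal sketch under stated assumptions: the axes are taken to be equally spaced around the center, min-max scaling is used as the standardization, and the function names and the small data set are hypothetical.

```python
import math

def min_max_normalize(column):
    """Scale a list of values to [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def radar_vertices(sample):
    """Map one normalized n-dimensional sample to its n polygon vertices."""
    n = len(sample)
    vertices = []
    for i, r in enumerate(sample):
        angle = 2.0 * math.pi * i / n  # axis i, equally spaced (assumption)
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices

# Hypothetical data: m = 3 samples with n = 4 features each.
data = [
    [10.0, 200.0, 0.5, 7.0],
    [20.0, 100.0, 1.5, 7.0],
    [15.0, 150.0, 1.0, 7.0],
]
columns = [min_max_normalize(list(col)) for col in zip(*data)]
normalized = [list(row) for row in zip(*columns)]
polygons = [radar_vertices(row) for row in normalized]
```

Connecting the vertices of each entry in `polygons` in order, and closing the path back to the first vertex, gives the closed n-sided polygon for that sample.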

Radar Chart Drawing
The SCADA system data of wind turbines were collected from two years of operation data of 27 wind turbines in a wind farm in Hebei Province, China, which was put into operation in 2012. Before proceeding to draw radar charts, the fault indicator variables must be determined. The function of the converter in the operation control of a wind turbine is illustrated in Figure 2. The converter ensures that the output power complies with the requirements of the grid: it keeps the output voltage within a threshold range and the current frequency in accordance with the grid. The output power of the generator is controlled by adjusting the blade angle through a pitch control system. Based on an investigation and analysis, the grid-side converter voltage, generator torque set-point, active and reactive power, wind speed, turbine rotor position and generator rotor speed are selected as the fault indicators of the converter. Wind speed is measured by a wind sensor mounted on top of the wind turbine nacelle at its rear. Turbine rotor position and generator rotor speed are measured, and the other parameter values are recorded during wind turbine operation.
Appl. Sci. 2021, 11, 1280 7 of 22
With the SCADA system data, one can extract the data 30 min, 3 h, 24 h, 7 days or 15 days ahead of the occurrence of failures and put the data into a designed Excel table. Table 1 shows an example of the fault indicator variable data 24 h ahead of a converter failure. Each table, like Table 1, has 181 rows of data, and a total of 100 sets of data were extracted for study in this paper; the extracted normal operation data set has the same sample size as the failure data set.
The specific steps for drawing a radar chart in this paper are as follows:
Step 1 is to select the wind speed index as the drawing reference.
Step 2 is to normalize the fault indicator variable values representing the converter operation condition status.
Step 3 is to collect all fault indicator variable values at a determined wind speed, and draw the 7 indicator variable values on a radar chart.
Step 4 is to draw a closed heptagon through the 7 points on the 7 axes representing the seven fault indicator variables. It is noted that the closed area defines the size of the image converted later.
The radar charts generated are shown in Figure 3. The numbers 1-7 in the figure represent the 7 fault indicator variables, respectively. Through drawing the graphs, the changes of the fault indicator variable values and the correlation between the fault indicator variables can be intuitively reflected.
In Figure 3a, the charts were plotted for wind speeds of 3.1 and 6.4 m/s, respectively. In the case of normal operation of the converter, the fault indicator variables data do not change much and are relatively stable. Although the amount of data is large, the number of curves shown in the graph is relatively small because of data overlap, and the graph is relatively regular. In Figure 3b, the charts were plotted for wind speeds of 4.3 and 6.5 m/s, respectively. In the case of faulty status of the converter in operation, the fault indicator variables data change remarkably, resulting in poor regularity of the graph and irregular lines. When the system fails, some fault indicator variable data, such as the power data, appear to be 0, so that incomplete graphics are observed in the radar chart. It can be seen from the figure that although there is a certain overlap in the radar chart of the fault indicator variables data, there is still a big difference from the radar charts drawn using the normal operation data.
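The data preparation behind the drawing steps in this section (wind speed as the reference, normalization, and collecting the seven indicator values per wind speed) can be sketched as follows. This is an illustrative sketch only: the field names, the rounding of wind speed to 0.1 m/s as the grouping key, the min-max scaling and the three records are assumptions and do not come from the wind farm data set.

```python
from collections import defaultdict

# Hypothetical field names for the seven fault indicator variables.
INDICATORS = [
    "grid_voltage", "torque_setpoint", "active_power", "reactive_power",
    "wind_speed", "rotor_position", "generator_speed",
]

def normalize_records(records):
    """Min-max scale every indicator to [0, 1] across all records."""
    bounds = {k: (min(r[k] for r in records), max(r[k] for r in records))
              for k in INDICATORS}
    scaled = []
    for r in records:
        row = {}
        for k in INDICATORS:
            lo, hi = bounds[k]
            row[k] = 0.0 if hi == lo else (r[k] - lo) / (hi - lo)
        scaled.append(row)
    return scaled

def group_by_wind_speed(records, decimals=1):
    """Collect the normalized 7-value rows that share a (rounded) wind speed."""
    scaled = normalize_records(records)
    charts = defaultdict(list)
    for raw, row in zip(records, scaled):
        key = round(raw["wind_speed"], decimals)
        charts[key].append([row[k] for k in INDICATORS])
    return dict(charts)

def make_record(values):
    return dict(zip(INDICATORS, values))

# Made-up records, for illustration only.
records = [
    make_record([690.0, 10.0, 100.0, 5.0, 3.1, 90.0, 1000.0]),
    make_record([700.0, 12.0, 120.0, 6.0, 3.1, 180.0, 1050.0]),
    make_record([695.0, 30.0, 600.0, 8.0, 6.4, 270.0, 1500.0]),
]
charts = group_by_wind_speed(records)
```

Each key of `charts` corresponds to one radar chart, and each row under that key is one heptagon to be overlaid on it, so overlapping rows produce the small number of visible curves noted for normal operation.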

ResNet50 Convolutional Neural Network
Deep learning is an important research direction in the field of machine learning. Its introduction to machine learning makes it closer to artificial intelligence. The convolutional neural network is a deep neural network model that includes convolution. Its core process is to train and learn from a large amount of sample data through multiple iterations to extract the deep feature expression of the sample data, and finally predict the sample data according to different required tasks [56].
As the number of layers in a deep learning network increases, features at different levels can be extracted; the more abstract the feature expression, the richer the semantic information. However, simply stacking more layers causes the gradients to vanish or become extremely large. The traditional remedy is reasonable weight initialization and regularization, but this introduces a new problem of network performance degradation [57]. ResNet is a residual learning framework that improves network performance as the depth increases. The basic residual unit of ResNet is shown in Figure 4 [58]. If the later layers of a deep network are identity mappings, the model degenerates into a shallow network. However, it is difficult to directly fit a potential identity mapping function, such as H(x) = x, with a stack of layers. Therefore, the network is designed as H(x) = F(x) + x, and the problem is transformed into learning a residual function F(x) = H(x) − x. When F(x) = 0, the unit constitutes an identity mapping H(x) = x, so fitting the residual is easier than fitting the mapping directly.
In summary, the residual network structure was chosen for this paper to increase the network depth and thereby improve the performance and accuracy of the network, while the residual structure addresses the vanishing gradient problem caused by the increase in network depth.
The common residual network models are ResNet18, ResNet50 and ResNet101. As the number of layers increases, the amount of network computation also increases [15,59]. Considering both the computation speed and the fault detection accuracy, the ResNet50 structure is selected as the backbone network, and the residual unit adopts a three-layer bottleneck design as shown in Figure 5.
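The residual formulation H(x) = F(x) + x described above can be illustrated with a minimal sketch. It is not the ResNet50 bottleneck itself: the vectors are plain Python lists, and the function names (`residual_unit`, `zero_residual`, `small_residual`) are hypothetical names chosen for this illustration. It only shows that a zero residual branch turns the unit into an identity mapping.

```python
from typing import Callable, List

Vector = List[float]


def residual_unit(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the shortcut addition")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x: Vector) -> Vector:
    """F(x) = 0: the branch contributes nothing, so the unit is an identity mapping."""
    return [0.0 for _ in x]


def small_residual(x: Vector) -> Vector:
    """A hypothetical learned residual that slightly perturbs the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_unit(x, zero_residual)    # equals x
perturbed_out = residual_unit(x, small_residual)  # x plus a small correction
```

Because the shortcut carries x unchanged, the stacked layers only need to learn the correction F(x), which is the argument made in the text for why deeper residual networks remain trainable.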

Oct-ResNet50 Convolutional Neural Network
Chen et al. [60] proposed a novel Octave Convolution (OctConv) operation for convolutional neural networks (CNNs). OctConv is designed as a single, generic, plug-and-play convolution unit that reduces the spatial redundancy in CNNs and can directly replace the ordinary convolution without adjusting the backbone network architecture [60]. OctConv has been shown to be superior to ordinary convolution in improving the efficiency and performance of CNN models [60].

Idea for OctConv
The idea behind OctConv is to understand the image from the perspective of the frequency domain. From the perspective of the spatial domain, an image can generally be represented by a c × H × W matrix, where H and W denote the spatial dimensions and c is the number of feature maps or channels. Each position in the matrix takes a value in the range [0, 255]. From the perspective of the frequency domain, the image can be decomposed into low spatial frequency components (low frequency domain) that describe smoothly changing structures and high spatial frequency components (high frequency domain) that describe the fast-changing fine details. OctConv processes the low-frequency and high-frequency components separately and also enables effective communication between the two frequencies.
OctConv defines the feature map after the "downsampling" operation as the "low frequency domain", while the original size feature map without downsampling is defined as the "high frequency domain". After the above operations, the size of the feature map is reduced due to downsampling, thereby reducing the computation amount of OctConv. In addition, because the network has different scales of information (two frequency domains) and the two scales of the information are aggregated after the convolution is completed, the performance of OctConv is improved.

OctConv Principle
OctConv is a combination of downsampling and upsampling operations, as shown in Figure 6, where the green arrows represent information updates within a frequency; the red arrows represent information exchange between the two frequencies; and X and Y are the input and output tensors, respectively. The convolution process takes place between the input and the output. The output, Y, is expressed by:
$$Y^{H} = Y^{H \to H} + Y^{L \to H}, \qquad Y^{L} = Y^{L \to L} + Y^{H \to L}$$
where $Y^{A \to B}$ represents the convolutional update from the feature map group A to B, and $Y^{H}$ and $Y^{L}$ denote the high- and low-frequency components of Y, respectively. Specifically, $Y^{H \to H}$ and $Y^{L \to L}$ indicate intrafrequency updates, and $Y^{H \to L}$ and $Y^{L \to H}$ indicate interfrequency communication.
In order to obtain the above terms, the convolution kernel W is divided into $W = [W^{H}, W^{L}]$ for the convolution with the input feature maps $X^{H}$ and $X^{L}$, respectively. Each component can be further divided into two parts (a within-frequency part and a between-frequencies part), namely, $W^{H} = [W^{H \to H}, W^{L \to H}]$ and $W^{L} = [W^{L \to L}, W^{H \to L}]$.
The main work of OctConv is to split the original convolution operation into four operations, and the input processed by three of these four operations has half the height and width of the original feature map. Therefore, the amount of computation is reduced.
Using average pooling for downsampling, the output $Y = \{Y^{H}, Y^{L}\}$ is:
$$Y^{H} = f(X^{H}; W^{H \to H}) + \mathrm{upsample}\big(f(X^{L}; W^{L \to H}), 2\big)$$
$$Y^{L} = f(X^{L}; W^{L \to L}) + f\big(\mathrm{pool}(X^{H}, 2); W^{H \to L}\big)$$
where $f(X; W)$ represents the convolution with the parameter W; $\mathrm{pool}(X, k)$ represents the average pooling operation with kernel size k × k and step size k; and $\mathrm{upsample}(X, k)$ is the upsampling operation by a factor of k using nearest-neighbor interpolation. The four parallel paths in Figure 6 correspond to the four terms in Equation (1). The two green paths, namely, the first and the fourth, correspond to the information update of the high-frequency and low-frequency feature maps, respectively; the two red paths exchange information between the two frequency domains.

OctConv Operation
The number of input feature map channels $c_{in}$ is divided into a high-frequency part with $(1 - \alpha_{in}) c_{in}$ channels and a low-frequency part with $\alpha_{in} c_{in}$ channels according to the preset coefficient $\alpha_{in}$. The width and height of the low-frequency part are reduced to half of the original. OctConv performs the following operations:
(1) The high-frequency part is directly convolved through $f(X^{H}; W^{H \to H})$, that is, the high-frequency to high-frequency convolution; the number of output channels is $(1 - \alpha_{out}) c_{out}$.
(2) The high-frequency part is first downsampled by $\mathrm{pool}(X^{H}, 2)$ and then convolved through $f(\mathrm{pool}(X^{H}, 2); W^{H \to L})$, that is, the high-frequency to low-frequency convolution; the number of output channels is $\alpha_{out} c_{out}$.
(3) The low-frequency part is first convolved through $f(X^{L}; W^{L \to H})$ and then upsampled, that is, the low-frequency to high-frequency convolution; the number of output channels is $(1 - \alpha_{out}) c_{out}$.
(4) The low-frequency part is directly convolved through $f(X^{L}; W^{L \to L})$, that is, the low-frequency to low-frequency convolution; the number of output channels is $\alpha_{out} c_{out}$.
The above operations are followed by the information aggregation process, i.e., the results from the first and third paths are added element-wise, and so are the results from the second and fourth paths; see Figure 6.
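The four paths and the aggregation step above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the paper: each frequency branch holds a single channel stored as a nested list, the convolution f is reduced to a 1 × 1 convolution with one scalar weight per path, and all function names and weights are hypothetical.

```python
from typing import List, Tuple

Map2D = List[List[float]]


def avg_pool(x: Map2D, k: int = 2) -> Map2D:
    """Average pooling with a k x k kernel and stride k."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w, k)]
            for i in range(0, h, k)]


def upsample_nearest(x: Map2D, k: int = 2) -> Map2D:
    """Nearest-neighbor upsampling: every value is copied into a k x k block."""
    return [[x[i // k][j // k] for j in range(len(x[0]) * k)]
            for i in range(len(x) * k)]


def conv1x1(x: Map2D, weight: float) -> Map2D:
    """Stand-in for f(X; W): a single-channel 1 x 1 convolution (assumption)."""
    return [[weight * v for v in row] for row in x]


def add_maps(a: Map2D, b: Map2D) -> Map2D:
    """Element-wise addition used in the aggregation step."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def oct_conv(x_h: Map2D, x_l: Map2D, w_hh: float, w_hl: float,
             w_lh: float, w_ll: float) -> Tuple[Map2D, Map2D]:
    y_hh = conv1x1(x_h, w_hh)                        # path (1): high -> high
    y_hl = conv1x1(avg_pool(x_h, 2), w_hl)           # path (2): pool, then convolve
    y_lh = upsample_nearest(conv1x1(x_l, w_lh), 2)   # path (3): convolve, then upsample
    y_ll = conv1x1(x_l, w_ll)                        # path (4): low -> low
    return add_maps(y_hh, y_lh), add_maps(y_hl, y_ll)


def split_channels(c: int, alpha: float) -> Tuple[int, int]:
    """Return (high-frequency channels, low-frequency channels) for ratio alpha."""
    low = int(alpha * c)
    return c - low, low


x_h = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0],
       [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]
x_l = [[1.0, 2.0], [3.0, 4.0]]
y_h, y_l = oct_conv(x_h, x_l, w_hh=1.0, w_hl=1.0, w_lh=1.0, w_ll=1.0)
high_channels, low_channels = split_channels(64, 0.25)
```

Three of the four paths (2, 3 and 4) convolve a map with half the height and width, which is where the reduction in computation comes from.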

Analysis of Principles of Improved Convolutional Neural Network Based on Frequency Domain Features
OctConv is suitable for the majority of existing backbone network structures, such as ResNet and MobileNet. It improves performance while reducing the amount of computation of existing models, and it has been validated on ImageNet classification tasks.
However, this paper finds two deficiencies in the OctConv structure. The first is the information asymmetry caused by the asymmetry between the upsampling and downsampling methods. The second is that OctConv directly adds the features of the two frequency domains together to ensure information interaction between the high- and low-frequency domains, which is likely to destroy the original characteristics of the features in each frequency domain.
Therefore, this paper improves the OctConv structure to overcome the above two deficiencies and proposes the Attention Octave Convolution (AOctConv) structure.

Improvement in Sampling Methods
OctConv's downsampling uses average pooling. Its upsampling copies each pixel value across the upsampling window. The step size is 2 for both the downsampling and the upsampling, as shown in Figure 7.
The OctConv sampling method will cause information asymmetry. In view of this problem, this paper proposes to change OctConv's existing sampling method as shown below.
The original OctConv downsampling is replaced with max pooling, and the original upsampling is replaced with max unpooling, as shown in Figure 8.
The improved downsampling process outputs the sampling index in addition to the sampled feature map. During upsampling, the feature map can be restored according to the corresponding index, and all other positions are filled with 0.
Compared to average pooling, the max pooling operation better preserves the local maximum points in the feature map, which generally characterize the edges or corners of the image. In addition, the max pooling operation produces a sampling index that provides a basis for the subsequent upsampling operation, thereby achieving symmetry between the downsampling and upsampling processes.
In summary, max pooling and max unpooling are better suited for the sampling tasks in OctConv. Because of the number of channels in the index, this paper adjusts the operation from the low frequency domain to the high frequency domain in OctConv: the max unpooling operation is carried out first, followed by the convolution operation, in order to restore the feature map during upsampling.
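The index-based sampling described above can be sketched as follows. This is a minimal single-channel illustration in plain Python with hypothetical function names; it shows that max pooling records where each maximum came from and that max unpooling writes the value back to that position, filling the rest with 0.

```python
from typing import List, Tuple

Map2D = List[List[float]]
IndexMap = List[List[Tuple[int, int]]]


def max_pool_with_indices(x: Map2D, k: int = 2) -> Tuple[Map2D, IndexMap]:
    """Max pooling (k x k kernel, stride k) that also returns the argmax positions."""
    pooled, indices = [], []
    for i in range(0, len(x), k):
        pooled_row, index_row = [], []
        for j in range(0, len(x[0]), k):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(k) for dj in range(k)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled: Map2D, indices: IndexMap, out_h: int, out_w: int) -> Map2D:
    """Place each pooled value back at its recorded position; other positions are 0."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


x = [[1.0, 9.0, 2.0, 0.0],
     [3.0, 4.0, 8.0, 1.0],
     [0.0, 2.0, 5.0, 6.0],
     [7.0, 1.0, 3.0, 4.0]]
pooled, indices = max_pool_with_indices(x, 2)
restored = max_unpool(pooled, indices, 4, 4)
```

Each local maximum reappears at its original location in `restored`, whereas average pooling followed by nearest-neighbor copying would spread a blurred value over the whole 2 × 2 block.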

Add a Branch of the Self-Attention Mechanism
OctConv transforms an existing backbone network into a two-stream format, i.e., high-frequency and low-frequency domains. At the same time, in order to ensure information interaction between the two streams, the information from the two frequency domains is added element-wise at corresponding positions and then returned to the two-stream form at the end of each OctConv module, as follows:
$$Y^{H} = X^{H \to H} + X^{L \to H}, \qquad Y^{L} = X^{L \to L} + X^{H \to L}$$
where $X^{H \to H}$ and $X^{L \to L}$ are the information after the convolution transforms within their respective frequency domains, and $X^{L \to H}$ and $X^{H \to L}$ are the information after the convolution transforms across the frequency domains. Adding the corresponding positions of the feature maps ensures information interaction between the two frequency domains, but the information in $X^{L \to H}$ and $X^{H \to L}$ may destroy the information of $X^{H \to H}$ and $X^{L \to L}$ in the original frequency domain. Therefore, this paper introduces an Attention mechanism that adaptively controls the information interaction process in the OctConv module, that is, it controls how much of the $X^{L \to H}$ and $X^{H \to L}$ feature map information is added to the feature maps of $X^{H \to H}$ and $X^{L \to L}$.
The specific operation is to multiply each channel in $X^{L \to H}$ and $X^{H \to L}$ by its corresponding coefficient to scale the feature map, as shown in Figure 9. The size of the input feature map X in Figure 9 is w × h × c, and the layers in the right branch are responsible for the Attention task. The specific implementation process is as follows: First, the input feature map X is subjected to global average pooling, and the output size becomes 1 × 1 × c. Second, feature extraction is carried out by two fully connected layers, $f_{c1}$ and $f_{c2}$, in succession. The output of the $f_{c1}$ layer is a feature vector with 1 × 1 × c/16 dimensions, and it is activated by ReLU. Reducing the number of channels to c/16 lowers the number of parameters and the amount of computation. The output of the $f_{c2}$ layer is a feature vector with 1 × 1 × c dimensions; this layer further extracts features and restores the number of channels from c/16 to c, keeping it consistent with the original input. In order to prevent the output features from being destroyed, the output of the $f_{c2}$ layer is not activated by ReLU. Finally, the sigmoid function is used to compress the feature of each channel in the output vector to the range [0,1]. After the original input X passes through the attention branch, the output feature vector thus has 1 × 1 × c dimensions. This vector is multiplied by the original input X channel by channel, and the result is used as the new output Y.
The above structure is introduced into the $X^{L \to H}$ and $X^{H \to L}$ branches, and the output of the two branches is controlled by self-supervision. When $X^{L \to H}$ and $X^{H \to L}$ are not suitable for fusion with the original $X^{H \to H}$ and $X^{L \to L}$, the scale coefficient approaches 0; otherwise, it approaches 1.
The self-attention mechanism proposed in this paper has the following characteristics:
(1) All layers in the Attention branch are differentiable, that is, they can directly participate in the end-to-end training of the network.
(2) After the output of the Attention branch passes through the sigmoid function, the output range is limited to [0,1], which ensures that the values in the original feature will not be excessively scaled.
(3) When the model performs a prediction task, the output of the Attention branch is not a fixed value but is determined by the original input, that is, it responds differently to different inputs.
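The Attention branch described above (global average pooling, two fully connected layers with a c/16 bottleneck, sigmoid, channel-wise scaling) can be sketched in plain Python. This is an illustration under assumptions rather than the trained branch from the paper: the weights are random, bias terms are omitted, the feature map is a nested list in channel-first order, and all names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Tensor3D = List[List[List[float]]]  # channels x height x width


def global_avg_pool(x: Tensor3D) -> List[float]:
    """Reduce each channel to its mean, giving a vector of length c."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def linear(v: List[float], weights: List[List[float]]) -> List[float]:
    """Fully connected layer without bias (assumption); weights is out x in."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in weights]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def sigmoid(v: List[float]) -> List[float]:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def attention_scale(x: Tensor3D, w1: List[List[float]],
                    w2: List[List[float]]) -> Tuple[Tensor3D, List[float]]:
    squeezed = global_avg_pool(x)           # 1 x 1 x c
    hidden = relu(linear(squeezed, w1))     # 1 x 1 x c/16, ReLU-activated
    scale = sigmoid(linear(hidden, w2))     # 1 x 1 x c, each value in [0, 1]
    scaled = [[[scale[c] * v for v in row] for row in ch]
              for c, ch in enumerate(x)]    # channel-by-channel multiplication
    return scaled, scale


rng = random.Random(0)
c, h, w, reduction = 32, 2, 2, 16
x = [[[rng.uniform(-1.0, 1.0) for _ in range(w)] for _ in range(h)] for _ in range(c)]
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(c // reduction)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(c // reduction)] for _ in range(c)]
y, scale = attention_scale(x, w1, w2)
```

In the AOctConv setting, `x` would stand for the cross-frequency feature map ($X^{L \to H}$ or $X^{H \to L}$), and the scaled result would then be added to the within-frequency feature map. Because `scale` depends on `x`, the gating responds differently to different inputs, as stated in characteristic (3).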
The structure of AOctConv proposed in this paper is shown in Figure 10, where $f(X^{H}; W^{H \to H})$ and $f(X^{L}; W^{L \to L})$ represent the high-frequency to high-frequency and low-frequency to low-frequency convolutions, respectively; $\mathrm{maxpool}(X^{H}, 2)$ denotes the max pooling operation; $\mathrm{maxunpool}(X^{L}, 2)$ is a function which applies the max unpooling operation and upsamples the spatial dimensions of the input data; ψ is an upsampling operation based on the index; and $f_{a}(\cdot)$ is a function of the information represented by $X^{H \to L}$ or $X^{L \to H}$. When $\mathrm{maxunpool}(X^{L}, 2)$ performs upsampling, the index generated by downsampling is used for max unpooling, and the Attention branch added to the $X^{L \to H}$ and $X^{H \to L}$ branches is represented by $f_{a}(\cdot)$. The branches represented by $f(X^{H}; W^{H \to H})$ and $f(X^{L}; W^{L \to L})$ keep the original operation without change.
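Putting the pieces together, one possible way to write the AOctConv output in the notation used above is given below. This is a reading of the textual description, not a formula taken from the paper: it assumes that the low-to-high path applies max unpooling before the convolution, that the high-to-low path applies max pooling before the convolution, and that $f_{a}(\cdot)$ gates only the two cross-frequency terms.

```latex
\begin{aligned}
Y^{H} &= f\!\left(X^{H}; W^{H \to H}\right)
       + f_{a}\!\left(f\!\left(\mathrm{maxunpool}(X^{L}, 2); W^{L \to H}\right)\right) \\
Y^{L} &= f\!\left(X^{L}; W^{L \to L}\right)
       + f_{a}\!\left(f\!\left(\mathrm{maxpool}(X^{H}, 2); W^{H \to L}\right)\right)
\end{aligned}
```

Under this reading, setting $f_{a}$ to the identity and swapping the max operations for average pooling and nearest-neighbor upsampling (with the convolution applied before the upsampling) recovers the original OctConv output.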

Comparison of Three Network Structures
The overall structures of the ResNet50, Oct-ResNet50 and AOC-ResNet50 networks are shown in Table 2 below. The image input size is 224 × 224, and the number of output categories is 2. Note that the computation count includes the addition operations in the fully connected and convolutional layers as well as the cost of other layers, such as the pooling, BN and ReLU layers.
In one forward pass, ResNet50 needs 4.11 GFLOPs of computation with 23.51 M parameters; Oct-ResNet50 needs 2.38 GFLOPs with 23.51 M parameters; and AOC-ResNet50 needs 2.42 GFLOPs with 23.53 M parameters. Compared with ResNet50, Oct-ResNet50 reduces the amount of computation significantly, by 42.09%, while the number of parameters does not change. Compared with Oct-ResNet50, the improved AOC-ResNet50 has a computation increase of 0.04 GFLOPs, accounting for 0.84% of the original computation, and a parameter increase of 0.02 M, accounting for 0.08% of the original parameters; both increases are slight. Compared with the ResNet50 network, AOC-ResNet50 still has a significant decrease in the amount of computation.
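The relative changes quoted above can be recomputed from the rounded GFLOPs and parameter figures with a short script. The figures are taken from the text; the helper name `percent_change` is hypothetical.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0


# Rounded figures quoted in the text (GFLOPs and millions of parameters).
gflops = {"ResNet50": 4.11, "Oct-ResNet50": 2.38, "AOC-ResNet50": 2.42}
params_m = {"ResNet50": 23.51, "Oct-ResNet50": 23.51, "AOC-ResNet50": 23.53}

oct_vs_resnet_flops = percent_change(gflops["Oct-ResNet50"], gflops["ResNet50"])
aoc_vs_resnet_flops = percent_change(gflops["AOC-ResNet50"], gflops["ResNet50"])
aoc_vs_oct_flops = percent_change(gflops["AOC-ResNet50"], gflops["Oct-ResNet50"])
aoc_vs_oct_params = percent_change(params_m["AOC-ResNet50"], params_m["Oct-ResNet50"])

for name, value in [("Oct vs ResNet50 FLOPs", oct_vs_resnet_flops),
                    ("AOC vs ResNet50 FLOPs", aoc_vs_resnet_flops),
                    ("AOC vs Oct FLOPs", aoc_vs_oct_flops),
                    ("AOC vs Oct params", aoc_vs_oct_params)]:
    print(f"{name}: {value:+.2f}%")
```

With these rounded inputs, the Oct-ResNet50 reduction comes out at about 42.09% and the parameter increase at about 0.085%, in line with the text, while AOC-ResNet50 remains about 41% cheaper than ResNet50. The FLOPs increase of AOC-ResNet50 over Oct-ResNet50 evaluates to roughly 1.7% from the rounded figures, so the quoted 0.84% may stem from unrounded values that are not given here.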
The improved algorithm of AOC-ResNet50 solves the problem of information asymmetry caused by the asymmetric sampling method in the original OctConv structure and the damage to the original features in the high and low frequency domains by the OctConv structure, and improves the accuracy of fault detection as demonstrated in the following section.

Wind Turbine Power Converter Fault Detection
This paper applies the improved AOctConv structure to the ResNet50 backbone network and proposes the AOC-ResNet50 network. Based on the radar charts of the converter fault indicator variables data recorded 24 h, 3 days, 7 days and 15 days before a failure, the ResNet50, Oct-ResNet50 and AOC-ResNet50 network structures were used to train and validate the fault detection models. After validation, each model's structure was determined, including the weight of each link. Then, the models were applied to fault detection using SCADA system data. The fault detection performance indices of these models were compared using the test sample data, and the effectiveness of the network models was verified.
(1) Sample data
As a purely electrical component, the wind turbine converter has a high failure frequency due to frequent starting and braking. For the faulty operation data, the data recorded 24 h, 3 days, 7 days and 15 days before the occurrence of a converter failure were extracted; the sampling interval was 10 min, and 100 sets of data were collected for each fault time period (each fault time period refers to 1, 3, 7 and 15 days, respectively).
The normal operation data are the data recorded when the system has no fault, and the extracted quantity is the same as that of the faulty operation data. In each group of the data collected, the normal operation data dimensions were 14,400 × 7; and corresponding to each fault time period, the faulty operation data dimensions were 14,400 × 7 as well. Taking the wind speed as the reference index, the values of each fault indicator variable were normalized, and the radar charts were drawn. Corresponding to each fault time period, 17,001 radar charts representing the normal operation of the converter and 17,001 radar charts representing the faulty operation of the converter were generated, totaling 34,002. Among them, 11,900 normal and 11,900 faulty status radar charts were selected for training, and 5101 normal and 5101 faulty status radar charts were selected for testing.
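How one record of seven fault indicator variables maps onto a radar chart can be sketched as follows. The normalization method is not specified in the text, so min-max scaling is assumed here; the variable values and ranges are hypothetical, the role of wind speed as the reference index is not modeled, and the function names are illustrative only.

```python
import math
from typing import List, Tuple


def min_max_normalize(values: List[float], lows: List[float],
                      highs: List[float]) -> List[float]:
    """Scale each variable to [0, 1] using its own range (assumed normalization)."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(values, lows, highs)]


def radar_vertices(normalized: List[float]) -> List[Tuple[float, float]]:
    """Place variable k on the axis at angle 2*pi*k/n, at radius equal to its value."""
    n = len(normalized)
    return [(r * math.cos(2.0 * math.pi * k / n), r * math.sin(2.0 * math.pi * k / n))
            for k, r in enumerate(normalized)]


# One hypothetical 10 min SCADA record with seven fault indicator variables.
record = [8.0, 35.0, 40.0, 1200.0, 690.0, 50.0, 0.9]
lows = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
highs = [25.0, 100.0, 100.0, 2000.0, 800.0, 60.0, 1.0]

normalized = min_max_normalize(record, lows, highs)
vertices = radar_vertices(normalized)  # polygon outline of the radar chart
```

Connecting the seven vertices in order gives the polygon that the CNN later sees as an image; a change in any one variable deforms the polygon along the corresponding axis.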
(2) Implementation of the training and testing method
In order to adapt to the task of fault detection, the output dimension of the final fully connected layer of the AOC-ResNet50 network was set to 2 in this paper. In the training process, the original radar image of 256 × 256 pixels was first scaled to the universal size of 224 × 224 pixels, then expanded to three channels and whitened; no other data augmentation operations were performed. Finally, the image was used as the input for model training. Cross-entropy loss was used as the loss function, the batch size was 64, the stochastic gradient descent (SGD) method was used for optimization and the training framework was PyTorch.
The fault detection results are shown in Tables 3-7 below. In Tables 3 and 5, TP denotes True Positive, which means that the true value is true and the detected/predicted value is true; FN is False Negative, which means that the true value is true and the detected/predicted value is false; FP is False Positive, meaning that the true value is false and the detected/predicted value is true; and TN is True Negative, which means that the true value is false and the detected/predicted value is false. Tables 4 and 6 list the corresponding detection performance indices. Tables 3 and 4 show the fault detection results of the models trained using the variables data recorded 3 days before converter failures. Tables 5 and 6 present the fault detection results of the models trained using the variables data recorded 7 days before converter failures. Table 7 compares the fault detection performance of three AOC-ResNet50 models trained using the variables data recorded 3, 7 and 15 days ahead of converter failures. It is observed that the model performs better when trained using variables data recorded nearer to the time of a converter failure.
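The TP, FN, FP and TN counts defined above can be turned into performance indices with a few lines of Python. Accuracy and recall are named in the text; precision and F1 are included here as commonly used companions and are an assumption about the table contents. The counts in the example are hypothetical and are not results from the paper.

```python
from typing import Dict


def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute standard binary detection indices from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2.0 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=90, fn=10, fp=5, tn=95)
```

Recall depends only on TP and FN, so two models can share a similar recall while differing in accuracy if their false positive counts differ.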
From the fault detection results and detection performance indices shown in Tables 3-6, it can be observed that the AOC-ResNet50 network proposed in this paper has the best performance. Taking the results shown in Table 4 as an example, the fault detection accuracy of the AOC-ResNet50 network model is 5.48% higher than the Oct-ResNet50 model and 7.52% higher than the ResNet50 model, while other performance indices are also superior. In addition, there is no significant difference in the values of the recall rate of the three network models. It is clearly verified that the AOC-ResNet50 outperforms Oct-ResNet50 and ResNet50 in fault detection of the wind turbine converter.
In order to verify the robustness of each obtained model, the training and testing process was repeated five times. Then, the obtained models were applied to fault detection using the same data set. It was found that each model has the same performance in terms of fault detection using the same dataset. Therefore, the model's robustness was confirmed.

Discussion
A data-driven approach was selected for converter fault detection using SCADA system data in this paper. It was found that the proposed AOC-ResNet50 network model is better than other competitive CNN models, such as the ResNet50 and Oct-ResNet50 network models, in converter fault detection using the SCADA system data. This is because, compared with the OctConv structure, the AOC-ResNet50 network avoids the information asymmetry introduced by sampling and the damage to the original features in the high and low frequency domains.
Although the AOC-ResNet50 network can provide higher accuracy in fault detection, it requires a large data sample for training the model, and the detection accuracy also relies on the data quality. Another disadvantage is that the algorithm associated with the AOC-ResNet50 network is more complex and requires the developer to have a strong mathematical background to understand the network structure as well as very good software programming skills.
Selection of an approach to fault detection or diagnosis depends on the available signals and data recordings. If there are direct measurement signals corresponding to a failure mode, the methods based on signal processing and analysis would be preferred. If there are no direct measurement signals but there are indirect measurement data and operation data, a data-driven approach would be selected. In this case, CNN and other neural network models can be selected for testing. In general, if it is a complex system with multiple-dimension data available including operation and condition monitoring data, the newly proposed AOC-ResNet50 network as well as other typical CNN models are recommended for trial.
In the cases where there are direct measurement signals and the data sample size is large enough, the AOC-ResNet50 network and other typical CNN models are also recommended for application. In this situation, one may carry out a comparative study of the model performance by making a comparison between the CNN models and the models developed using signal processing and analysis methods, e.g., to detect and diagnose bearing faults for failure prediction [61].

Conclusions
In this paper, a deep learning approach was employed to develop fault detection models for wind turbine converters. The contribution of this paper includes two aspects: first, it proposes an innovative convolutional neural network structure named the AOC-ResNet50 network based on improvement of the Octave Convolution (OctConv) network by overcoming its two shortcomings; second, the AOC-ResNet50 network model is established and applied to wind turbine converter fault detection using wind turbine SCADA system data, and its effectiveness in fault detection was verified by a comparative study with other competitive CNN models including ResNet50 and Oct-ResNet50 network models.
The algorithm based on the AOC-ResNet50 network first replaces the downsampling and upsampling in the original OctConv structure with max pooling and max unpooling methods, and then introduces the branch of self-attention when the high-frequency domain features are fused to the low-frequency domain features, and the low-frequency domain features are fused to the high-frequency domain features. It controls the fusion process of the two frequency domains. After the AOC-ResNet50 network is developed, it is employed to extract the features of radar charts generated using the seven fault indicator variables data extracted from the wind turbine SCADA system for fault detection of the wind turbine converter. The fault detection performance indices were compared with the ResNet50 and Oct-ResNet50 networks to verify the effectiveness of the improved network. It was found that the fault detection performance using the AOC-ResNet50 network is superior to the ResNet50 and Oct-ResNet50 networks. The fault detection accuracy of AOC-ResNet50 network model can be up to 98.0%, which is 5.48% higher than Oct-ResNet50 and 7.52% higher than the ResNet50 model based on the radar charts generated using the fault indicator variables data that occurred 3 days ahead of converter failures. In the next step of our research work, the AOC-ResNet50 network will be applied to wind turbine converter failure prediction using fault indicator variables data that occur at different time periods before a failure occurs. At the same time, the AOC-ResNet50 network will be applied to fault detection and failure prediction for other wind turbine components.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.