Multiscale Kernel-Based Residual CNN for Estimation of Inter-Turn Short Circuit Fault in PMSM

The diagnosis of an inter-turn short circuit (ITSC) fault at its early stage is very important in permanent magnet synchronous motors as these faults can lead to disastrous results. In this paper, a multiscale kernel-based residual convolutional neural network (CNN) algorithm is proposed for the diagnosis of ITSC faults. The contributions are majorly located on two sides. Firstly, a residual learning connection is embedded into a dilated CNN to overcome the defects of the conventional convolution and the degradation problem of a deep network. Secondly, a multiscale kernel algorithm is added to a residual dilated CNN architecture to extract high-dimension features from the collected current signals under complex operating conditions and electromagnetic interference. A motor fault experiment with both constant operating conditions and dynamics was conducted by setting the fault severity of the ITSC fault to 17 levels. Comparison with five other algorithms demonstrated the effectiveness of the proposed algorithm.


Introduction
Owing to the advantages of high efficiency, high power density, and excellent torque control performance, permanent magnet synchronous motors (PMSMs) have been widely applied in various applications, such as domestic appliances, wind power generators, and electric vehicles [1]. With the diversity of applications, the reliability of the PMSM has gradually attracted more attention. The motor faults would lead to unexpected shutdown or even catastrophic consequences, especially in systems with high-security requirements [2]. Thus, to avoid disastrous results, the fault diagnosis of PMSM is crucial for system safety.
As one of the most common faults in PMSM, the stator winding inter-turn short circuit (ITSC) fault is usually hard to identify [3]. Moreover, without timely and appropriate management, the PMSM runs into a severe short circuit fault or even an open circuit fault [4]. The ITSC fault is formed by winding insulation failure, which is usually caused by overcurrent, thermal stress, mechanical stress, and aging [5]. When an ITSC fault happens, the shorted point forms an extra circuit connection, which is parallel with the fault-winding phase and coupled with other windings and rotor magnets, through flux linkages [6]. Then, an overcurrent is generated in the fault winding, causing a large amount of additional heat generated by ohmic losses, which can further intimidate the adjacent wires or even melt them down. As a consequence of the above description, in some cases, even a slight ITSC fault can quickly expand to neighboring wires, so that a slight fault easily develops into a critical one [7]. Therefore, it is very important to detect and manage an ITSC fault at its early stage.
Traditional machine learning fault diagnosis algorithms, such as artificial neural networks (ANN) and support vector machines (SVM), are usually used in conjunction with characteristic acquisition methods, as they commonly have difficulties in extracting complicated features [8,9]. Characteristic acquisition methods can convert the collected original less efficient regarding time-series signals, especially time-varying signals. Furthermore, the performance of deep networks depends enormously on the amount and distribution of the data set [28]. To enhance the performance of deep networks the data set should cover enough working area and fault conditions, particularly those under dynamic operating conditions, which has been less emphasized in much research. Finally, as the data set increases, the hyperparameter tuning of the deep network consumes lots of time, especially for people who are not familiar with hyperparameter tuning. Therefore, how to speed up the whole process of hyperparameter tuning is also an essential task to be resolved.
To address the aforementioned problems, a multiscale residual dilated CNN architecture is proposed in this article for the ITSC fault diagnosis of the PMSM. The key contributions of this article are concluded as below: (1) A well-designed deep learning network, termed multiscale residual dilated CNN, is proposed for the early-stage severity estimate of ITSC in a PMSM. In multiscale residual dilated CNN, multiscale dilated CNN is used to capture features from huge raw current signals in various scales, which enhances the characterization ability of captured features and the robustness of the network, even under dynamic operating conditions. In addition, the dilated CNN can extend the receptive field of the deep network at an exponential level. Moreover, the residual connection architecture is adopted to solve the performance degradation problem, making rather deep networks possible. (2) The Bayesian optimization algorithm is adopted to settle the hyperparameter tuning problem, which means the tuning process can be automated. What is more, the use of the Bayesian optimization algorithm suggests that comparisons between different methods becomes objective, as the tuning of hyperparameters is implemented automatically. (3) The proposed algorithm has the advantage that it can extract features and diagnose faults from the huge raw current signals without the operating of signal processing, as it belongs to CNN. This also indicates that the relevant learning time of prior knowledge of fault characteristics can be saved. (4) The proposed algorithm is validated by a motor fault experiment with both constant operating conditions and dynamic operating conditions. The result is compared with five other diagnosis algorithms to exhibit the advantages of the proposed algorithm.
The rest of this paper is arranged as follows. Section 2 derives an index to instruct the severity settings for ITSC faults. Section 3 explains the whole structure and components of the proposed algorithm. The experimental setup, severity setting of the ITSC fault, and data set description are described in Section 4. In Section 5, the proposed algorithm and five other deep learning algorithms are adopted to diagnose the same experimental test data set, which exhibits the effect and superiority of the proposed algorithm. Lastly, Section 6 concludes this paper and gives an outlook on the next steps.

ITSC Fault in PMSM
The detection of ITSC faults is very important as an overcurrent and overheating can lead to more serious problems. In previous studies, no index has been especially well-suited for guiding the early-stage fault severity setting of an ITSC fault. In this paper, an index is proposed for guiding the severity setting of the ITSC fault experiment.
A cross-sectional view of a PMSM with an ITSC fault is shown in Figure 1a. The motor is a concentrated winding structure with 8 poles and 36 slots. Wherever the shorting point occurs in the coil, the wires in the corresponding slots are shorted accordingly, as shown by the red wires in Figure 1a. The Pc-t represents the unique numbering of every wire in one slot. Take A1-3, for example, it represents the third turn of wire in the first coil within phase A. Assuming the ITSC fault occurs in the first coil of phase A, the equivalent circuit model is shown in Figure 1b. When the fault occurs, an additional circuit that is in parallel with the fault windings of the same phase is formed. According to Figure 1b, the equivalent circuit model can be described as follows.  A cross-sectional view of a PMSM with an ITSC fault is shown in Figure 1a. The motor is a concentrated winding structure with 8 poles and 36 slots. Wherever the shorting point occurs in the coil, the wires in the corresponding slots are shorted accordingly, as shown by the red wires in Figure 1a. The Pc-t represents the unique numbering of every wire in one slot. Take A1-3, for example, it represents the third turn of wire in the first coil within phase A. Assuming the ITSC fault occurs in the first coil of phase A, the equivalent circuit model is shown in Figure 1b. When the fault occurs, an additional circuit that is in parallel with the fault windings of the same phase is formed. According to Figure 1b, the equivalent circuit model can be described as follows.
(a)   v a , v b , and v c denote the phase voltage. v N denotes the voltage of the neutral point. R a , R b , and R c denote the phase resistance. L aa , L bb , and L cc denote the inductance of phase A, phase B, and phase C respectively. M ab , M bc , and M ca denote the mutual inductance between phase A, phase B, and phase C. The value µ denotes the shorted turn ratio, which indicates the ratio of shorted turns N c to the total number of turns N p in one phase. R f Sensors 2022, 22, 6870 5 of 23 denotes the contact resistance between the shorted turns, i a , i b , and i c denote the phase current, i f denotes the fault current in the shorted path, and λ denotes the amplitude of the permanent magnet flux linkage. It can be seen from the model that the severity of the ITSC fault is influenced by the fault resistance R f and the shorted turn ratio µ.
According to Kirchhoff's current law, the relationship of the currents that flow through the neutral point of the winding can be expressed as: Combining (1) and (2), the fault current i f can be derived as: Define (3) can be rewritten as: As the amplitude of v N is much smaller than v a in the early stage of an ITSC fault, then v a ≈ v a − v N . Assuming v a = V a sin(ωt), the solution of (4) can be described as: In the early stage of the fault, |d 1 | >> |d 2 |, the ratio of d 1 to d 2 tends to infinity while the ratio of d 2 to d 1 tends to 0. Combined with the above analysis, meanwhile, substituting the expression of d 1 and d 2 into (5), the amplitude of i f can be rewritten as: Moreover, the amplitude of the three-phase voltage is proportional to the speed of the motor [29]. Then the relationship among parameters I f , R f , µ, and the speed of the motor ω r can be expressed as: where R a can be treated as a known parameter. It can be seen from (7) that the parameters µ, R f and ω r directly influence I f . Among the related parameters, only ω r does not affect the ITSC fault, If divided from (7), a new expression that can reveal the severity of the fault to some extent can be derived: where FI represents the fault index of the ITSC fault. When the motor is healthy, the index is 0. In the early stage of the fault, the index is almost invariant to the motor speed, and increases as R f decreases or µ increases. Each degree of ITSC fault can be seen as a unique combination of R f and µ. Since it is difficult to measure R f and µ during the operation of a motor, the index is not suitable to directly do the severity estimate of an ITSC fault. However, it can be used as an indicator to guide the severity setting of an ITSC fault.

Dilated Convolutional Neural Networks
The dilated CNN is a variant of conventional CNN, which contains dilated convolutional layers, RELU layers, a Batch normalization layer, and a dropout layer.
As a variant of conventional CNN, the dilated CNN inherits the characteristics of local connectedness and weight sharing and the result of the loss function can be optimized by a backpropagation algorithm [30]. The purpose of a convolution operation is to extract diverse levels of hierarchical features from the input signals. The top layers can only extract shallow features. The deeper a convolutional layer is, the more complicated the features that can be extracted from the input signals. The CNN achieves sparse connectivity by connecting neurons of neighboring layers using a local connection mode and adopting kernels of different sizes [26]. Thus, not only does it efficiently represent the complicated relationship, but it can also reduce the potential of overfitting.
The use of dilated CNN can eliminate the need for pooling layers, which allows the deep network to enlarge the receptive field, while minimizing the loss of coverage or resolution, making possible a rather deep network architecture [14,27,31]. For a 1-D series input data S ∈ R n and the kernel f : {0, 1, . . . , k − 1} → R, the expression of the dilated convolution F on input data segment x can be defined as: where d stands for the dilation factor, k denotes the filter size, and x-d·i indicates the element of segment x traversing the i-th convolution operation. Thus, the dilated convolution represents running convolution operations on elements of the series data by kernels whose filters are separated by a step size of d. Conventional convolution is a special case where the dilation factor d = 1. As the depth of the network grows, the dilated factor increases accordingly, and the receptive field of the final output layer becomes wider. The rest layer in CNN helps to enhance the performance of the deep networks. The normalization layer is used to get rid of possible gradient disappearance and explosion problems, and the method here adopted is Batch Normalization. The activation layer is used to extend the nonlinearity of a neuron. We adopted rectified linear unit (ReLU) as our activation function to speed up the training process. To address the overfitting problem of the networks, some proportion of neurons and their connections can be randomly dropped, namely the dropout layer.

Residual Learning Block
The object that is analyzed for the diagnosis of an ITSC fault is the three-phase current of the PMSM. Current signals are usually long 1-D signals, which are highly susceptible to changeable operating conditions and electromagnetic interference, making the motor fault diagnosis more difficult. If the CNN is adopted to do the severity estimate of the ITSC fault, deeper 1-D architecture CNN is required, because normally the deeper a CNN architecture is, the more complex the features it can extract. Whereas former experiments proved that the CNN had a degradation difficulty, namely with the increase of CNN depth, the performance of the network tended to saturate and even degrade. That is to say, the addition of network depth decreases its performance, and the phenomenon is different from overfitting.
The degradation difficulty indicates that it is hard to train a CNN well. Theoretically speaking, if an added CNN layer just repeats the features of former layers, rather than learning new features, namely, the identity map, the performance of the network should not decrease [32]. Inspired by this instinct, the residual learning block was employed in the proposed algorithm. For a residual learning block, if the input is defined as x, the learned feature is denoted as F(x), which should be added to the input of the residual learning block. The output of the residual learning block is expressed as: where y stands for the output of the residual learning block, and σ denotes the activation function of the residual learning block.
Assuming the result of a residual learning block is greater than 0, the performance of the CNN can be further enhanced by increasing the depth of layers. On the contrary, if the result of the residual learning block is 0, the newly added layer does not influence the performance of the residual learning block, which is called an identity map. Hence, a deeper CNN architecture can be formed by adopting residual learning blocks to prevent the degradation difficulty.
The residual learning block is shown in Figure 2. As Figure 2a shows, the residual learning block is realized by stacking several layers of dilated convolution layer, Batch Normalization layer, RELU layer, and dropout layer. In a standard Residual Neural Network, the output of the block is added to the input without any transformation, (as shown in Figure 2b). Whereas, for the condition of 1-D CNN, an extra 1 × 1 convolution is used to avoid the situation of tensor disagreement between the input and output of the block. different from overfitting.
The degradation difficulty indicates that it is hard to train a CNN well. Theoretically speaking, if an added CNN layer just repeats the features of former layers, rather than learning new features, namely, the identity map, the performance of the network should not decrease [32]. Inspired by this instinct, the residual learning block was employed in the proposed algorithm. For a residual learning block, if the input is defined as x, the learned feature is denoted as F(x), which should be added to the input of the residual learning block. The output of the residual learning block is expressed as: where y stands for the output of the residual learning block, and σ denotes the activation function of the residual learning block.
Assuming the result of a residual learning block is greater than 0, the performance of the CNN can be further enhanced by increasing the depth of layers. On the contrary, if the result of the residual learning block is 0, the newly added layer does not influence the performance of the residual learning block, which is called an identity map. Hence, a deeper CNN architecture can be formed by adopting residual learning blocks to prevent the degradation difficulty.
The residual learning block is shown in Figure 2. As Figure 2a shows, the residual learning block is realized by stacking several layers of dilated convolution layer, Batch Normalization layer, RELU layer, and dropout layer. In a standard Residual Neural Network, the output of the block is added to the input without any transformation, (as shown in Figure 2b). Whereas, for the condition of 1-D CNN, an extra 1 × 1 convolution is used to avoid the situation of tensor disagreement between the input and output of the block.

Multiscale Kernel Dilated CNN
The essence of ITSC fault diagnosis is the use of time-series signals to achieve fault identification or regression. However, there are still some problems that need to be solved: firstly, the conventional convolution usually extracts features in the same scale size, namely single-scale. Whereas the signals of the motor are always changing with system structure, electromagnetic interference, and sampling rates, which indicates that a single-scale convolutional kernel size is not enough for extracting the signal features [33]. Secondly, the operating conditions of motors are always dynamic, even the setting of constant operating conditions is highly susceptible to changing loads, power supply stability, and environmental factors. Thus, the potential to evaluate signals over dynamic operating conditions should be taken into account if the fault diagnosis algorithm is intended to be adapted in more situations.
To address these difficulties, a multiscale kernel-based dilated CNN architecture is proposed in this paper. In contrast to the multiscale of conventional CNN, which can only change the size of the extracted features by the size of the convolutional kernel, dilated CNN can also change the extracted features by changing the dilated factor d [34]. By influencing the receptive field of the network, the kernel size f and dilated factor d achieve extraction of different sized features. The formula for the receptive field is defined as: where r f represents the size of the receptive field. As the dilated convolution does not sample the dilated parts, this results in the extracted information not being continuous. Hence, to extract the feature information of the dilated parts, a multiscale kernel-based dilated convolution architecture is proposed in this paper. As shown in Figure 3, the conventional convolution, with kernel sizes of 1 × 1, 1 × 3, and 1 × 5, is added before the dilated convolution of each branch is separately added. The red boxes of the figure show the features extracted by the dilated convolution, and the rest are the features extracted by conventional convolution. After adding the conventional convolution, the information of the dilated parts can be extracted by the conventional convolution, which can enhance the continuity of the information and obtain different sizes of receptive fields, realizing the extraction features at different scales. X is the input feature vector, and Y is the output feature vector. The purpose of conventional convolution with a kernel size of 1 × 1 is to decrease the number of channels, which can also reduce the amount of computation. The dilated convolution size of each branch is set to 1 × 3, and the dilated factor of each branch is set to d = 1, d = 3, and d = 5. The features of the three branches are spliced together after extracting different scale sizes of features by using different sizes of dilated factors. The expression of the features extracted by the three branches are: where X 1 , X 2 , and X 3 are the features of each branch separately, the symbol * is the convolution operation, and C 1×1 , C 1×3 , and C 1×5 represent the conventional convolution with kernel sizes of 1 × 1, 1 × 3, and 1 × 5 separately. D f=1 , D f=3 , and D f=5 represent the dilated convolution with a kernel size of 1 × 3 and dilated factors of 1, 3, and 5. In this architecture, the way of feature fusion is layer-by-layer stacking. Thus, the output of the architecture is expressed as: where the symbol {+} represents the element-wise addition operation, and {·} represents the splicing operation of different channels. The operation of adding up the features from dilated factor d = 1, to d = 3 and d = 5, layer-by-layer, makes the branch of d = 3 and d = 5 not only contain the feature with larger dilated factors but also contain the feature with small dilated factor, which enhances the network's ability in extracting different scales of features. In this architecture, the way of feature fusion is layer-by-layer stacking. Thus, the output of the architecture is expressed as: where the symbol {+} represents the element-wise addition operation, and {·} represents the splicing operation of different channels. The operation of adding up the features from dilated factor d = 1, to d = 3 and d = 5, layer-by-layer, makes the branch of d = 3 and d = 5 not only contain the feature with larger dilated factors but also contain the feature with small dilated factor, which enhances the network's ability in extracting different scales of features.

Bayesian Optimization for Hyperparameter Tuning
The performance of the deep networks is highly dependent on an appropriate set of hyperparameters. However, there are interactions between different hyperparameters, and it is difficult to tune a set of appropriate hyperparameters without experience, and, even if you could, it takes a lot of time. Moreover, when compared with other deep learning algorithms, if the function of automatic hyperparameter tuning can be implemented, the introduction of subjective intent can be avoided, which makes the comparison of different algorithms more objective. Therefore, it is necessary to realize the automatic hyperparameter tuning by using the optimization algorithm that not only has the capability of global optimization but can also get good results in fewer iterations or taking less time.
The Bayesian optimization algorithm can realize highly efficient parameter optimization with a fast optimization rate and few iterations [32]. By iterating and evolving a global statistical model that has no clear target function, the Bayesian optimization algorithm achieves the estimation and evaluation of a task [35]. It mainly consists of two parts: one is the acquisition function and the other is the Bayesian statistical model [36]. The acquisition function is utilized to determine the optimal sampling point, where the optimal solution is most likely to exist, or the area that has not yet been sampled. As a prior over function, the Bayesian statistical model can estimate the hypothesis posterior distribution of the function to be optimized using the prior information and observed results. To maximize the expected benefit of the new measurement points, compared with the historical optimal value, the Excepted Improvement is adopted as the acquisition function. Owing to the merits of flexibility and tractability, the Gaussian process is adopted as an a priori function of the algorithm.
The process of hyperparameter tuning using the Bayesian optimization algorithm is graphically shown in Figure 4. The whole process can be divided into two parts: the CNN process, and the Bayesian optimization process. The blue box in Figure 4 is the Bayesian optimization process, which mainly completes the initialization selection of the hyperparameter combination, and then updates it according to the previously gathered results. The red box in Figure 4 is the CNN process, which mainly completes the training and testing of deep network parameters. When reaching the termination conditions, the CNN process passes the test accuracy of the network to the Bayesian optimization process. The hyperparameters that are chosen for optimization include the Momentum (M), the InitialLearnRate (L init ), the dropoutProb (P), and the L2Regularization (L 2R ). During the execution of the optimization, the two processes interact with each other until reaching the termination condition of the optimization, and then the best test accuracy, and its corresponding hyperparameters, are output as the result of the whole optimization.
CNN process passes the test accuracy of the network to the Bayesian optimization process The hyperparameters that are chosen for optimization include the Momentum (M), th InitialLearnRate (Linit), the dropoutProb (P), and the L2Regularization (L2R). During th execution of the optimization, the two processes interact with each other until reaching the termination condition of the optimization, and then the best test accuracy, and it corresponding hyperparameters, are output as the result of the whole optimization.

Multiscale Kernel-Based Residual CNN Architecture
The structure of the multiscale kernel-based residual CNN architecture is shown in Figure 5, which is inspired by the multiscale feature extraction in image processing and speech recognition. As the figure shows, the structure of the algorithm can be divided into three parts, namely, the input part, the feature extraction part, and the output part.

Multiscale Kernel-Based Residual CNN Architecture
The structure of the multiscale kernel-based residual CNN architecture is shown in Figure 5, which is inspired by the multiscale feature extraction in image processing and speech recognition. As the figure shows, the structure of the algorithm can be divided into three parts, namely, the input part, the feature extraction part, and the output part.
In this architecture, time-series three-phase current segments are adopted as the input signal. Each current signal has a fixed size of 1 × 3000 × 3. The feature extraction part consists of two structures: one is the basic network layer, and the other is the multiscale layer. Since the dilated CNN has the advantage in feature extraction and enlarging receptive field, it was chosen for the construction of feature extraction. The basic network layer, which consists of three dilated CNN layers, is mainly responsible for the extraction of basic features. The three dilated CNN layers are used to extract the shallow, medium, and high levels of features with a layer depth of 5, 9, and 6, respectively, and a kernel size of 1 × 3. The multiscale layer, which consists of four branches, each of which represents a different feature scale, is used to extract the extra multiscale features. In each branch, the conventional convolutions are added before the dilated convolution as they can extract the information of the dilated parts, which enhances the continuity of the information. Branch 1 is used to extract the shallow level of features with a kernel size of 1 × 3 and dilated factor d = 1. Branch 2 is used to extract the medium-level features by a conventional convolution layer with a kernel size of 1 × 3, which can then be extracted by a dilated convolution layer with a kernel size of 1 × 3 and dilated factor d = 3. Branch 3 is used to extract the high-level features by a conventional convolution layer with a kernel size of 1 × 5, which can then be extracted by a dilated convolution layer with a kernel size of 1 × 5 and dilated factor d = 5. Branch 4 represents the residual learning result of the basic network layer features. In this architecture, time-series three-phase current segments are adopted as the input signal. Each current signal has a fixed size of 1 × 3000 × 3. The feature extraction part consists of two structures: one is the basic network layer, and the other is the multiscale layer. Since the dilated CNN has the advantage in feature extraction and enlarging receptive field, it was chosen for the construction of feature extraction. The basic network layer, which consists of three dilated CNN layers, is mainly responsible for the extraction of basic features. The three dilated CNN layers are used to extract the shallow, medium, and high levels of features with a layer depth of 5, 9, and 6, respectively, and a kernel size of 1 × 3. The multiscale layer, which consists of four branches, each of which represents a different feature scale, is used to extract the extra multiscale features. In each branch, the conventional convolutions are added before the dilated convolution as they can extract the information of the dilated parts, which enhances the continuity of the information. The features of the four branches are fused by the feature vector concatenate layer, and then pass the features of different scales to the output of the whole architecture, which consists of a fully connected layer, softmax layer, and output layer.
The flowchart of the multiscale kernel-based residual CNN algorithm is shown in Figure 6. There are four procedures in the process: (1) Data set preparing. The three-phase current signal is collected in experiments on the motor with an ITSC fault using the data collection equipment. Then, the acquired current signals are processed into segments of equal length and labeled with corresponding markers. After that, all the segments are split into two disjointed sets, namely, the training set and the testing set.
(2) Algorithm initialization. The value range of the hyperparameters that are optimized are determined. Then, a set of hyperparameters is randomly selected as the starting point for Bayesian optimization. (3) Implementation of the proposed algorithm. The given starting point of hyperparameters training and testing of the proposed deep network is used, and then the hyperparameters are updated according to the Bayesian optimization algorithm. The procedure is repeated until the termination condition is met. (4) Output results. When the maximum number of the optimization is reached, the best test accuracy and its corresponding hyperparameters of the proposed deep network are selected as the result and output.
motor with an ITSC fault using the data collection equipment. Then, the a current signals are processed into segments of equal length and label corresponding markers. After that, all the segments are split into two disjoin namely, the training set and the testing set.

Experiment and Data Description
To demonstrate the effectiveness of the proposed algorithm, an experiment on a PMSM with an ITSC fault was conducted. As shown in Figure 7, the experimental setup consisted of a dynamometer, the tested motor, a torque sensor, a data recorder, etc. The yellow box in Figure 7c was the severity setting unit of the tested motor, which contained both many shorted points of the winding (as shown in Figure 7a) and the fault resistance with its heat sink (as shown in Figure 7b). The experimental data (three-phase current of the stator) was collected with a sampling rate of 1 MHz by the DL850EA oscilloscope recorder. The tested motor was driven by a VFD037C23A inverter at a switching frequency of 15 kHz. FOC was used for motor control, where the test motor ran under constant load and the speed was determined by the dynamometer. The tested motor was wye-connected with a concentrated winding of 108 turns per phase. The main specifications of the tested motor are shown in Table 1.
the stator) was collected with a sampling rate of 1 MHz by the DL850EA osci recorder. The tested motor was driven by a VFD037C23A inverter at a sw frequency of 15 kHz. FOC was used for motor control, where the test motor ra constant load and the speed was determined by the dynamometer. The tested m wye-connected with a concentrated winding of 108 turns per phase. Th specifications of the tested motor are shown in Table 1.  The experiment of the tested motor was conducted under various o conditions. The operating conditions were combinations of two-load torques rotational speeds, as shown in Table 2. The two-load torques were both constan among five speeds, four were constant and one was dynamic. There were 10 o conditions, while every operating condition was a combination of one torque and As shown in Figure 8, the dynamic rotational speed ranged from 850 to 1550 r/mi was controlled automatically. The motor was tested in health and fault states.  The experiment of the tested motor was conducted under various operating conditions. The operating conditions were combinations of two-load torques and five rotational speeds, as shown in Table 2. The two-load torques were both constant, while among five speeds, four were constant and one was dynamic. There were 10 operating conditions, while every operating condition was a combination of one torque and a speed. As shown in Figure 8, the dynamic rotational speed ranged from 850 to 1550 r/min, which was controlled automatically. The motor was tested in health and fault states.   The severity of the ITSC fault in the tested motor could be set according to (8). In our experiment, there was 1 health state and 16 fault states, as shown in Table 3. The ITSC faults were set in phase A and the fault resistance was connected between two shorted points. There were four fault resistances and four shorted turn ratios. Each label was a combination of fault resistance and shorted turn ratio. The labels represented the state of the tested motor. "HL" represented the healthy state of the motor. "R5", "R1", "R0.5", and "R0.1" represented the fault resistances of 5 Ω, 1 Ω, 0.5 Ω, and 0.1 Ω correspondingly. "A2", "A4", "A5", and "A6" represented the shorted turn ratios of 4.6%, 8.3%, 10.2%, and 13.8% correspondingly. The 17 labels were sorted in ascending order according to the result calculated by (7). To prevent low-frequency aliasing interference, the acquired raw current signals were filtered by a zero-phase filter, and, then, the sampling frequency was down-sampled to 15 kHz, which was consistent with the switching frequency of the inverter. The current signals were normalized and cut into segments of the same length. The comparison of the collected three-phase current signal before and after preprocessing is shown in Figure 9. There were 20,400 segments of the acquired data, and each label was 1200. The segments in each label were randomly divided into two disjoint sets: one was the training set with 840 segments, and the other was the testing set with 360 segments.  The severity of the ITSC fault in the tested motor could be set according to (8). In our experiment, there was 1 health state and 16 fault states, as shown in Table 3. The ITSC faults were set in phase A and the fault resistance was connected between two shorted points. There were four fault resistances and four shorted turn ratios. Each label was a combination of fault resistance and shorted turn ratio. The labels represented the state of the tested motor. "HL" represented the healthy state of the motor. "R5", "R1", "R0.5", and "R0.1" represented the fault resistances of 5 Ω, 1 Ω, 0.5 Ω, and 0.1 Ω correspondingly. "A2", "A4", "A5", and "A6" represented the shorted turn ratios of 4.6%, 8.3%, 10.2%, and 13.8% correspondingly. The 17 labels were sorted in ascending order according to the result calculated by (7). To prevent low-frequency aliasing interference, the acquired raw current signals were filtered by a zero-phase filter, and, then, the sampling frequency was down-sampled to 15 kHz, which was consistent with the switching frequency of the inverter. The current signals were normalized and cut into segments of the same length. The comparison of the collected three-phase current signal before and after preprocessing is shown in Figure 9. There were 20,400 segments of the acquired data, and each label was 1200. The segments in each label were randomly divided into two disjoint sets: one was the training set with 840 segments, and the other was the testing set with 360 segments.

Results and Comparisons
The proposed multiscale kernel-based residual CNN algorithm i]was utilized to accomplish the procedures of training, testing, and optimization offline using the current signal data sets. The optimized hyperparameters included InitialLearnRate (Linit), Momentum (M), L2Regularization (L2R), and dropoutProb (P). The corresponding search intervals, data types, and best results of each hyperparameter are shown in Table 4. The item "Transform" indicated whether the hyperparameter was optimized on a logarithmic scale. According to the experience during the tuning process, the optimization number was set to 80, and the training epoch number was set to 90 in every single optimization. The

Results and Comparisons
The proposed multiscale kernel-based residual CNN algorithm was utilized to accomplish the procedures of training, testing, and optimization offline using the current signal data sets. The optimized hyperparameters included InitialLearnRate (L init ), Momentum (M), L2Regularization (L 2R ), and dropoutProb (P). The corresponding search intervals, data types, and best results of each hyperparameter are shown in Table 4. The item "Transform" indicated whether the hyperparameter was optimized on a logarithmic scale. According to the experience during the tuning process, the optimization number was set to 80, and the training epoch number was set to 90 in every single optimization. The best result of each hyperparameter is shown in Table 4. To visualize the features learned by the trained network, a data dimensionality reduction visualization algorithm, called the t-distribution stochastic neighbor-embedding algorithm (T-SNE), was used to demonstrate the learned features. The features of the input layer and the output layer were compared in a two-dimensional (2-D) form, which not only simplified the comparison but also visualized the features, as shown in Figure 10. It can be seen that there were 17 colors in both feature maps; each color corresponded to a label of the ITSC fault, which represented the fault severity. Figure 10a is the feature map of the input layer, from which we can see that the features of different severity states were heavily overlapped. It was very hard to distinguish the labels from each other in this situation. Then, as shown in Figure 10b, the features of different severity states could be easily separated from each other after the feature extraction training of the network, which exhibited the effect of the proposed algorithm. by the trained network, a data dimensionality reduction visualization algorithm, called the t-distribution stochastic neighbor-embedding algorithm (T-SNE), was used to demonstrate the learned features. The features of the input layer and the output layer were compared in a two-dimensional (2-D) form, which not only simplified the comparison but also visualized the features, as shown in Figure 10. It can be seen that there were 17 colors in both feature maps; each color corresponded to a label of the ITSC fault, which represented the fault severity. Figure 10a is the feature map of the input layer, from which we can see that the features of different severity states were heavily overlapped. It was very hard to distinguish the labels from each other in this situation. Then, as shown in Figure 10b, the features of different severity states could be easily separated from each other after the feature extraction training of the network, which exhibited the effect of the proposed algorithm. To assess the performance of the proposed algorithm, four indicators, namely, the recall ratio (r), the precision ratio (p), the overall accuracy (acc), and the F1 score, were introduced to do the evaluation [37]. In massive data, the indicator of recall ratio and precision ratio are mutually constrained. The indicator of the F1 score simultaneously considers the effect of the recall ratio and precision ratio, which can better demonstrate the performance of the proposed algorithm [38]. The definitions of the evaluation indicators were expressed as (14) shown: where TP represented true positive, which stood for the positive samples that were rightly classified as positive. FP represented false positive, which stood for the negative samples that were wrongly classified as positive. FN represented false negative, which stood for To assess the performance of the proposed algorithm, four indicators, namely, the recall ratio (r), the precision ratio (p), the overall accuracy (acc), and the F1 score, were introduced to do the evaluation [37]. In massive data, the indicator of recall ratio and precision ratio are mutually constrained. The indicator of the F1 score simultaneously considers the effect of the recall ratio and precision ratio, which can better demonstrate the performance of the proposed algorithm [38]. The definitions of the evaluation indicators were expressed as (14) shown: where TP represented true positive, which stood for the positive samples that were rightly classified as positive. FP represented false positive, which stood for the negative samples that were wrongly classified as positive. FN represented false negative, which stood for the positive samples that were wrongly classified as negative. TN represented true negative, which stood for the negative samples that were rightly classified as negative. Moreover, the positive samples stood for the samples of the current label being diagnosed, while the negative samples stood for the samples that were not part of the current label.
The overall test accuracy of the proposed algorithm, based on the best result of optimization hyperparameters, was 98.0%. To give a more detailed illustration, the confusion matrix of the beat overall test accuracy is shown in Figure 11. The labels of the confusion matrix were ranked in ascending sequence, based on the severity results, which were calculated by (8). The leftmost labels of the matrix were the true classes, which represented the true labels of the number in each row. The bottom labels of the matrix were the predicted classes, which were classified by the proposed algorithm and represented the predicted labels of the number in each column. The bottom of the confusion matrix shows the precision ratio for each label, while the recall ratio for each label is located at the rightmost of the matrix. The numbers in the blue diagonal box indicate the true positive of each label. The numbers in each column, except those on the diagonal, were false positive, while the numbers in each row, except those on the diagonal, were false negative. The precision ratio for each label was calculated as the ratio of the number on the diagonal to the sum of the rest numbers in the column. Similarly, the recall ratio for each label was calculated as the ratio of the number on the diagonal to the sum of the rest numbers in the row.
confusion matrix were ranked in ascending sequence, based on the severity results, whic were calculated by (8). The leftmost labels of the matrix were the true classes, whic represented the true labels of the number in each row. The bottom labels of the matri were the predicted classes, which were classified by the proposed algorithm an represented the predicted labels of the number in each column. The bottom of th confusion matrix shows the precision ratio for each label, while the recall ratio for eac label is located at the rightmost of the matrix. The numbers in the blue diagonal bo indicate the true positive of each label. The numbers in each column, except those on th diagonal, were false positive, while the numbers in each row, except those on the diagona were false negative. The precision ratio for each label was calculated as the ratio of th number on the diagonal to the sum of the rest numbers in the column. Similarly, the reca ratio for each label was calculated as the ratio of the number on the diagonal to the sum of the rest numbers in the row. To demonstrate the superiority of the proposed multiscale architecture in handin the severity estimate of the ITSC fault, the performance of the proposed algorithm wa compared with five other deep learning methods: conventional CNN, dilated CNN (d CNN), dilated residual CNN (d-ResCNN), Long-Short Term Memory (LSTM), and B directional Long-Short Term Memory (Bi-LSTM). In particular, only the method concerned with RNN and CNN were involved in the comparison, because deep network have an unmatched advantage over traditional machine learning methods in dealing wit classification problems, while the chosen methods perform much better than other dee learning methods in dealing with time-series problems [14]. All the compared method were optimized by Bayesian optimization, which could ensure the objectivity of th comparison, eliminating the influence of perceived subjective factors. The iteratio number of Bayesian optimization was set to 80, and the result with the highest overall te To demonstrate the superiority of the proposed multiscale architecture in handing the severity estimate of the ITSC fault, the performance of the proposed algorithm was compared with five other deep learning methods: conventional CNN, dilated CNN (d-CNN), dilated residual CNN (d-ResCNN), Long-Short Term Memory (LSTM), and Bi-directional Long-Short Term Memory (Bi-LSTM). In particular, only the methods concerned with RNN and CNN were involved in the comparison, because deep networks have an unmatched advantage over traditional machine learning methods in dealing with classification problems, while the chosen methods perform much better than other deep learning methods in dealing with time-series problems [14]. All the compared methods were optimized by Bayesian optimization, which could ensure the objectivity of the comparison, eliminating the influence of perceived subjective factors. The iteration number of Bayesian optimization was set to 80, and the result with the highest overall test accuracy was elected for comparison after the optimization was finished in each method. All the methods were trained and tested in the same dataset, while the number was set to 90 for each optimization of the training epochs. The structure and training parameters of each algorithm after Bayesian optimization are shown in Table 5. The highest overall test accuracy of each method, and the F1 score of each label in each method, are shown in Table 6. The overall test accuracy of each method was 58.9%, 74.1%, 74.1%, 86.9%, 95.2%, and 98.0%, respectively, as shown in Table 6. The proposed method had the best performance in both the overall test accuracy and the test accuracy in each label. The performances of LSTM and Bi-LSTM were both very poor, maybe because the training of RNN was more complicated than that of CNN, requiring a longer training time or more training epoch numbers. To prove this, we increased the training epoch number to 200. After the training, the two RNN algorithms had not yet fully converged, and the Bi-LSTM performed better than the LSTM, while the overall test accuracy was less than 90%. Of course, this result did not necessarily indicate that the RNN performed worse than the CNN, but it did indicate that the RNN was much worse than CNN in terms of convergence speed. It can also be seen from Table 6 that the implementation of dilated convolution, residual learning architecture, and multiscale architecture could help enhance the performance of the conventional CNN, as the overall test accuracies of these algorithms performed better than that of the conventional CNN. In addition, the effect of dilated convolution was more pronounced than that of residual learning architecture in improving the overall test accuracy to some extent, while the residual learning architecture and multiscale architecture could further enhance the performance when the performance of the dilated CNN reached its limit. When simultaneously analyzing Figure 11 and Table 6, it could be concluded that in every method existed the same severe "false alarm rate" difficulty, which misclassified the healthy labels into faulty ones. A more serious problem was the "concealed alarm rate", which misclassified some faulty labels into healthy ones, which might lead to catastrophic consequences. The "concealed alarm rate" was obvious in labels with slight severities as the difference of the fault features between slight severities and healthy was very little, which increased the difficulty of the classification. As the severity of the fault increased, the distinction between different fault features in each label became more apparent, making classification easier and reducing the "concealed alarm rate" significantly. Hence, in most cases, the test accuracy for each label under every algorithm increased along with the aggravation of the fault severity. In addition, it could be concluded from the ranking of the labels that the fault resistance had more effect than that of the shorted turn ratio in the early stage of the fault. While the "false alarm rate" and the "concealed alarm rate" problems expose the fault diagnosis of an ITSC fault to some risk of misclassification, in practice these problems can be solved by continuous sampling, as the adopted dataset consists of many segments, which are portions of the data taken during a sampling period. When doing the fault diagnosis in practice, the proposed algorithm can be implemented for sequential analysis of the acquired signals, which means the result of the fault diagnosis is decided by more than one data segment. With an overall test accuracy of over 98%, the probability of two consecutive classification errors is less than 0.1%, so the fault diagnosis accuracy can be enhanced by comprehensively analyzing the diagnosis result of consecutive multiple data segments.
To make a comprehensive analysis of the compared six algorithms, the trends of the overall test accuracy and the loss function with the growth of iterations are shown in Figure 12. The compared results in Figure 12 are the best result of each algorithm with hyperparameters tuned by the Bayesian optimization. During the complete training process, the data of each algorithm was recorded for every epoch. The trends of the overall test accuracy are shown in Figure 12a, while the trends of the loss function are shown in Figure 12b, and the corresponding trends of each algorithm in both figures were consistent. It can be seen from the figure that the overall test accuracies of LSTM and Bi-LSTM were both slowly increasing, and, up to the set maximum number of iterations, neither of the two algorithms tended to converge. The performance of Bi-LSTM was better than that of LSTM, both in the overall test accuracy and the growth rate during the whole process, and the two algorithms had potential in enhancing the performance if the iteration times could be increased. The growth rate of the conventional CNN was better than that of the two RNN algorithms, but the final overall test accuracy was no more than 80%, which was limited by its receptive field and degradation problems. The performances of ResCNN and dilated CNN were better than that of conventional CNN in terms of growth rate and the final overall test accuracy, while the dilated CNN did better than ResCNN, which proved that dilated convolution was more useful than that of residual learning connection in improving the performance of the CNN. The proposed algorithm had the best performance, with a final overall test accuracy close to 98%, which outperformed the dilated CNN algorithm by more than 2.7%. The result of the comparison demonstrated that the residual learning connection and the architecture of multiscale kernels could both help to improve the limits of a CNN architecture in feature extraction, while the architecture of multiscale kernels outperformed the residual learning connection.
outperformed the dilated CNN algorithm by more than 2.7%. The result of the comparison demonstrated that the residual learning connection and the architecture of multiscale kernels could both help to improve the limits of a CNN architecture in feature extraction, while the architecture of multiscale kernels outperformed the residual learning connection. In order to avoid overfitting, resulting in excessive accuracy of the algorithm, this paper also investigated the noise immunity of the algorithm to different noise levels [23]. Gaussian white noise with signal noise ratios of 15 dB, 20 dB, and 25 dB were added to the original testing dataset, respectively, to simulate the data collected from different environments [26]. The definition of the SNR is expressed as: where Psignal represents the power of the current signal, and Pnoise represents the power of the noise signal.
To visualize the effect of noise on the three-phase current signal, the comparison before and after the addition of Gaussian white noise with SNR = 20 dB is shown in Figure  13. From the figure, we can see the original signal was heavily influenced by noise and it was difficult to consider the two signals as the same one.
To test the overall test accuracy at different noise levels, the proposed algorithm was tested under the dataset with Gaussian white noise of 15 dB, 20 dB, and 25 dB, and without noise. The results are shown in Figure 14. As can be seen from the figure, the performance of the proposed algorithm decreased as the noise level increased, but the reduction was small. The results proved that the high overall test accuracy of the proposed algorithm was not due to overfitting, and the proposed algorithm had high anti-noise capability with an SNR greater than 15 dB. In order to avoid overfitting, resulting in excessive accuracy of the algorithm, this paper also investigated the noise immunity of the algorithm to different noise levels [23]. Gaussian white noise with signal noise ratios of 15 dB, 20 dB, and 25 dB were added to the original testing dataset, respectively, to simulate the data collected from different environments [26]. The definition of the SNR is expressed as: SNR(dB) = 10 log 10 ( P signal P noise ) (15) where P signal represents the power of the current signal, and P noise represents the power of the noise signal.
To visualize the effect of noise on the three-phase current signal, the comparison before and after the addition of Gaussian white noise with SNR = 20 dB is shown in Figure 13. From the figure, we can see the original signal was heavily influenced by noise and it was difficult to consider the two signals as the same one.  To test the overall test accuracy at different noise levels, the proposed algorithm was tested under the dataset with Gaussian white noise of 15 dB, 20 dB, and 25 dB, and without noise. The results are shown in Figure 14. As can be seen from the figure, the performance of the proposed algorithm decreased as the noise level increased, but the reduction was small. The results proved that the high overall test accuracy of the proposed algorithm was not due to overfitting, and the proposed algorithm had high anti-noise capability with an SNR greater than 15 dB.

Conclusions
In this paper, a novel multiscale kernel-based residual CNN algorithm was proposed for the ITSC fault diagnosis of a PMSM. The proposed algorithm can be implemented both in constant operating conditions and in dynamic conditions. Firstly, the dilated convolution was used to extend the receptive field of the deep network. Secondly, the residual learning connection was applied to solve the performance degradation problem with the growth of network depth. Thirdly, a multiscale kernel architecture was proposed to extract more scale features from the acquired data. Then, the Bayesian optimization algorithm was introduced to overcome the hyperparameter tuning difficulties. Furthermore, a motor fault experiment with both constant operating conditions and dynamic operating conditions was conducted by setting the fault severity of the motor to 17 levels. The proposed multiscale kernel-based residual CNN algorithm was employed to analyze the current signals that were acquired in the experiment. LSTM, Bi-LSTM, conventional CNN, ResCNN, and d-CNN were also applied to the same data set for comparison. The results illustrated the capability and impressive performance of the proposed algorithm. The results also demonstrated that the proposed algorithm is not only a fault diagnosis method, that can directly extract features from raw current signals, but can also do the severity estimate missions of the classification data under dynamic operating conditions. The multiscale kernel architecture can analyze the signals on different scales. Meanwhile, the residual connection enables the proposed algorithm to construct a deeper network, despite the degradation difficulty. Unfortunately, the proposed algorithm is only concerned with the identification accuracy of fault severity

Conclusions
In this paper, a novel multiscale kernel-based residual CNN algorithm was proposed for the ITSC fault diagnosis of a PMSM. The proposed algorithm can be implemented both in constant operating conditions and in dynamic conditions. Firstly, the dilated convolution was used to extend the receptive field of the deep network. Secondly, the residual learning connection was applied to solve the performance degradation problem with the growth of network depth. Thirdly, a multiscale kernel architecture was proposed to extract more scale features from the acquired data. Then, the Bayesian optimization algorithm was introduced to overcome the hyperparameter tuning difficulties. Furthermore, a motor fault experiment with both constant operating conditions and dynamic operating conditions was conducted by setting the fault severity of the motor to 17 levels. The proposed multiscale kernel-based residual CNN algorithm was employed to analyze the current signals that were acquired in the experiment. LSTM, Bi-LSTM, conventional CNN, ResCNN, and d-CNN were also applied to the same data set for comparison. The results illustrated the capability and impressive performance of the proposed algorithm. The results also demonstrated that the proposed algorithm is not only a fault diagnosis method, that can directly extract features from raw current signals, but can also do the severity estimate missions of the classification data under dynamic operating conditions. The multiscale kernel architecture can analyze the signals on different scales. Meanwhile, the residual connection enables the proposed algorithm to construct a deeper network, despite the degradation difficulty. Unfortunately, the proposed algorithm is only concerned with the identification accuracy of fault severity estimation under a given inverter control method, so the generalization capability of the proposed algorithm under different control methods needs to be improved. The next research will focus on extracting some features independent of the control method through preprocessing to enhance the generalization ability of the proposed algorithm to different control methods.

Conflicts of Interest:
The authors declare no conflict of interest.