Self-Supervised Voltage Sag Source Identification Method Based on CNN

Abstract: A self-supervised voltage sag source identification method based on a convolutional neural network (CNN) is proposed in this study. A self-supervised CNN voltage sag source identification model is constructed on the basis of the convolutional neural network and the AutoEncoder. The convolution and pooling layers in the CNN are used to extract the voltage sag characteristics.


Introduction
With the development of industrial equipment and the electrical automation and intellectualization of buildings, the influence of voltage sag on the production and operation of large industrial and commercial users is becoming more and more prominent [1][2][3][4], especially in industries that apply power electronic devices extensively and are sensitive to voltage sags, such as semiconductor manufacturing, precision instrument processing, and automotive manufacturing. As a common power quality problem, voltage sag can be caused by many factors, such as motor starting, transformer switching, and short-circuit faults [5,6]. The production interruptions and delays caused by voltage sag disturbances show an obvious upward trend [7]. The direct and indirect economic losses caused by voltage sag disturbances are becoming more and more serious, which puts forward higher requirements for the quality of power supply. Accurate identification of sag sources and their fault phases facilitates the analysis, compensation, and suppression of local voltage sags. It can also be

Convolutional Neural Networks
Convolutional Neural Networks (CNN) [28][29][30][31][32] contain two modules, a feature extraction network and a classifier, in which the weights are updated continuously during the training iterations. The feature extraction network is mainly composed of multiple convolution layers and pooling layers. The main function of a convolution layer is to convolve the input signals to generate feature maps. The pooling layer merges adjacent data and compresses the data dimensions. The classification network classifies the input signals according to the final features extracted by the feature extraction network. The CNN structure is shown in Figure 1.

Convolution Layer
The convolution layer consists of several feature maps, and the number of output features depends on the number of feature maps. Each feature map performs the following operations on the input signal: the convolution operation is performed on the input sample with a convolution kernel, the results are processed with an activation function, and a feature mapping is output. The convolutional layer structure is shown in Figure 2. The convolution kernel is a two-dimensional matrix which can be expressed by numerical values or gray pixels, typically 5 × 5 or 3 × 3 in size. The convolution kernel is continuously updated over the iterations of the training process, which can be compared with the weight update process of an ordinary neural network.
Given the input sample size and convolution kernel size, the feature size can be expressed by Equation (1). S i×j represents an input sample matrix of size i × j; K n×n represents a convolution kernel of size n × n. Therefore, the feature size is F ((i−n)/h+1)×((j−n)/h+1), where h denotes the sliding step size of the convolution kernel on the sample.
If the sliding step is always assumed to be 1, the convolution operation can be expressed by Equation (2). S i×j is the input sample; K n×n is the convolution kernel; and F(r,c) is the element value of the rth row and cth column of the feature mapping.
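As an illustration of Equations (1) and (2), the sliding convolution described above can be sketched in NumPy. This is a minimal sketch rather than the authors' implementation; the function name `conv2d_valid` is ours, and the cross-correlation form common in CNN practice is used (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(S, K, h=1):
    """Slide an n x n kernel K over sample S with step h and sum the
    element-wise products, producing a feature map of the size given
    by Equation (1): ((i - n)/h + 1) x ((j - n)/h + 1)."""
    i, j = S.shape
    n = K.shape[0]
    rows, cols = (i - n) // h + 1, (j - n) // h + 1
    F = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            # element-wise product of the current patch with the kernel
            F[r, c] = np.sum(S[r * h:r * h + n, c * h:c * h + n] * K)
    return F
```

For the 9 × 27 sample matrices and 3 × 3 kernels used later in the case study, this yields 7 × 25 feature maps at stride 1, consistent with Equation (1).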
The value of each element in the feature mapping depends on how well the local region of the input matches the shape of the convolution kernel [33].
Activation modifies the features: it suppresses smaller feature values, improves the sparsity of the network, reduces the interdependence of parameters, and alleviates over-fitting. In this study, the sigmoid function is used for the activation of the CNN. The expression of the function is shown in Equation (3).
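Equation (3), the sigmoid activation, can be written directly as a one-line sketch:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation of Equation (3): f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))
```

Its output lies in (0, 1), which also matches the [0, 1] normalization of the samples described later.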

Pooling Layer
The pooling layer is a scaling mapping of the output of the previous convolution layer. It combines adjacent pixels of the input into a single representative value, which reduces the size of the features and the number of network parameters, and compensates for objects that deviate from the center or are tilted in the sample [34]. Commonly used pooling operators include max pooling and average pooling; their formulaic expressions are shown in Equations (4) and (5). a(u,v) represents the value of the uth row and the vth column in the input matrix of the pooling layer; p(i,j) represents the value of the ith row and jth column in the output matrix of the pooling layer; and w is the boundary value of the participating pooled region.
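The two pooling operators of Equations (4) and (5) can be sketched as follows. This is a minimal sketch, not the authors' code; rectangular regions (`wh` × `wv`) are allowed because the case study later uses 1 × 5 average pooling.

```python
import numpy as np

def pool(A, wh, wv, mode="avg"):
    """Pool non-overlapping wh x wv regions of A: average pooling
    (Equation (5)) by default, or max pooling (Equation (4))."""
    rows, cols = A.shape[0] // wh, A.shape[1] // wv
    P = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = A[i * wh:(i + 1) * wh, j * wv:(j + 1) * wv]
            P[i, j] = block.max() if mode == "max" else block.mean()
    return P
```

Applied to the 7 × 25 feature maps of the case study with 1 × 5 average pooling, this produces the 7 × 5 final feature matrices described below.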

Classification Network
In this study, the classical BP (Back Propagation) neural network is used as the classifier. The BP neural network is a multi-layer feedforward network trained by error back propagation. It consists of a forward propagation of signals and a back propagation of errors. Gradient descent is the basic idea of the classical BP neural network: it uses a gradient search technique to minimize the mean square error between the actual output value of the network and the expected output value.
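The forward/backward pass just described can be sketched for a single hidden layer. This is a generic illustration of BP with gradient descent on the mean square error, not the paper's specific classifier; the function name `bp_step` and the network sizes in the usage below are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(x, t, W1, W2, eta):
    """One forward pass and one error back-propagation step of a
    1-hidden-layer BP network, updating W1 and W2 in place by
    gradient descent on the squared error between output and target t."""
    h = sigmoid(W1 @ x)              # forward: hidden activations
    y = sigmoid(W2 @ h)              # forward: network output
    e = y - t                        # output error
    d2 = e * y * (1 - y)             # output-layer delta (sigmoid derivative)
    d1 = (W2.T @ d2) * h * (1 - h)   # hidden-layer delta, back-propagated
    W2 -= eta * np.outer(d2, h)      # gradient-descent weight updates
    W1 -= eta * np.outer(d1, x)
    return 0.5 * np.sum(e ** 2)
```

Repeating `bp_step` over the training set drives the squared error down, which is exactly the gradient-search behavior described above.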

AutoEncoder
AutoEncoder (AE) is an artificial neural network used to reproduce its input signal as faithfully as possible. It has a good ability to learn data characteristics. The essence of AE is to find a set of optimal weights (or vectors) with which to project the input data, and thereby obtain a coding [35,36]. The basic structure is shown in Figure 3. It consists of an encoder and a decoder: the former compresses the data and extracts the features, while the latter reconstructs the input signal. AE is trained by the error between the original input and the reconstructed input, and the weights of the neurons are adjusted over multiple iterations by the criterion of minimizing the loss function of Equation (6). X is the input signal; W is the encoding weight; U is the decoding weight; Φ is the nonlinear activation function; J is the neuron weight adjustment function; R is the regularization function; and λ is the regularization coefficient.
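Although Equation (6) itself is not reproduced here, a plausible reconstruction consistent with the listed symbols is the following (a hedged sketch, not necessarily the authors' exact form):

```latex
J(W, U) = \left\| X - U\,\Phi(W X) \right\|^2 + \lambda\, R(W, U)
```

That is, the reconstruction error between the input X and the decoded output UΦ(WX), plus a regularization term weighted by λ.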
Once the reconstruction error converges within the threshold range, the features extracted by the encoder can express the data characteristics well. The advantage of AE is that it does not depend on any expertise, but only attends to the characteristics of the input sample. The extracted features can restore the original sample to the greatest extent. The whole process can be carried out without the original data having any classification label. Moreover, the whole process requires no manual intervention, which further increases automation.

Self-Supervised Voltage Sag Source Identification Method Based on CNN
Based on the basic principles and main structures of the AE, CNN and BP neural network, the convolution layer and pooling layer were used to extract features, which were then classified by the BP classification network. No correct labeling of the training samples was needed for supervision throughout the training process. Instead, self-supervised training was carried out using the error between the reconstructed samples and the input samples. The trained system can extract features from the input voltage sag waveform samples and identify the sag sources [37,38]. The structure of the self-supervised voltage sag source identification method based on CNN is shown in Figure 4.

Data Preprocessing
Step 1: The time-domain monitoring signals of various types of voltage sags can be obtained by using the voltage sag monitoring system.
Step 2: Each sag is classified in advance according to the type of voltage sag sources.
Step 3: All signals are resampled and the length of the waveform sequence is unified.
Step 4: Samples are normalized, and each sample is arranged into a two-dimensional matrix in a three-phase order, with each element in [0,1].
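Steps 3 and 4 above can be sketched as follows. This is an illustrative sketch under assumptions of our own: linear-interpolation resampling, min-max normalization over the whole sample, and a unified length of 27 points per phase (the length used later in the case study); the function name `preprocess` is ours.

```python
import numpy as np

def preprocess(va, vb, vc, length=27):
    """Steps 3-4: resample each phase to a common length, stack the three
    phases as rows, and min-max normalize into [0, 1]."""
    def resample(v):
        v = np.asarray(v, float)
        idx = np.linspace(0, len(v) - 1, length)
        return np.interp(idx, np.arange(len(v)), v)
    m = np.vstack([resample(v) for v in (va, vb, vc)])  # 3 x length matrix
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)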

Extraction of Voltage Sag Waveform Characteristics
Step 1: Based on the basic principle and main structure of AE, the CNN model is embedded in the encoder, and a convolutional encoder is constructed to extract characteristics of voltage sag waveform.
Step 2: The pre-processed training samples are input into the convolutional encoder and convolved with the convolution kernels in the convolution layer.
Step 3: The sigmoid function is used as the activation function of the CNN to obtain the corresponding feature mappings.
Step 4: The sizes of the features are reduced through the pooling layer to compensate for voltage sag data that deviate from the center, and the corresponding final features are obtained.
Step 5: The obtained features are reconstructed by a convolutional decoder, and the convolution kernels in the convolution layer and the convolutional decoder are trained in the iterative process using the reconstruction error between the reconstructed sample and the input sample.
Step 6: Finally, good features which can accurately reflect the characteristics of voltage sag waveform are obtained.
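The self-supervised training loop of Steps 2-5 can be sketched as follows. As a deliberate simplification, a fully-connected encoder/decoder stands in for the convolutional pair; the point illustrated is that the weights are updated only by the reconstruction error, with no labels anywhere. The function name `train_autoencoder` and all sizes are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(X, n_hidden=8, eta=0.5, iters=300, seed=0):
    """Self-supervised training driven only by reconstruction error:
    encode, decode, measure the error against the input, and apply
    gradient-descent updates to encoder and decoder weights."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, n))  # encoder weights
    U = rng.normal(scale=0.1, size=(n, n_hidden))  # decoder weights
    for _ in range(iters):
        H = sigmoid(X @ W.T)        # encode: extract features
        Xr = H @ U.T                # decode: reconstruct the input
        E = Xr - X                  # reconstruction error (no labels used)
        dU = (E.T @ H) / m
        dW = (((E @ U) * H * (1.0 - H)).T @ X) / m
        U -= eta * dU               # gradient-descent weight updates
        W -= eta * dW
    return W, U, float(np.mean(E ** 2))
```

Over the iterations the mean reconstruction error falls, which is the convergence criterion described in Step 5.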

Voltage Sag Source Identification
Step 1: BP neural network is constructed as the classification model in this study.
Step 2: An initial classification label of 1 × N (N is the number of BP neural network output layer units) can be obtained by inputting the features obtained by the above method into the classification network.
The number of BP neural network output layer units depends on the total number of categories of expected results. The label is meaningless without supervised iteration.
Step 3: According to the actual waveforms of the various voltage sags, standard sample waveforms of all kinds of sags are fitted. Each element in the label is regarded as the weight of the corresponding standard waveform. The reconstructed sample is constructed by accumulating the standard samples under the corresponding weights, as shown in Equation (7). RS is the reconstructed sample waveform, W n is an element in the label matrix, S n is a standard sample waveform, N is the total number of voltage sag source categories, and n is the current sag category.
Step 4: According to the reconstruction error between the original input sample and the reconstructed sample, the classification network is trained in the process of reverse propagation. After several iterations, the mean square error between the reconstructed sample and the original sample is minimized.
This transforms the problem of voltage sag source identification into the problem of finding the optimal weights with which a fixed set of standard waveforms represents the waveform to be identified. The back-propagation training process of the network is supervised by the reconstruction error, and the reconstructed samples are updated in each iteration to gradually approach the input samples.
Step 5: After training, the classification network determines the type of sag corresponding to the largest element in the classification label matrix, thus realizing the identification of the voltage sag source.
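Equation (7) and the final decision of Step 5 can be sketched directly (the function names are ours):

```python
import numpy as np

def reconstruct(label, standards):
    """Equation (7): RS = sum over n of W_n * S_n, i.e. the reconstructed
    sample is the weighted sum of the N standard sag waveforms, with the
    classification-label entries W_n used as weights."""
    return sum(w * s for w, s in zip(np.asarray(label, float), standards))

def identify(label):
    """Step 5: the identified sag type is the category whose label entry
    is largest after training."""
    return int(np.argmax(label))
```

With a one-hot label the reconstruction collapses to a single standard waveform, which is the ideal outcome the reconstruction-error supervision drives the labels toward.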

Indicators for Model Evaluation
To evaluate the classification performance of the CNN after each iteration under different parameters, an index describing the distance between the output results of each type of sag sample is defined in this study, named the Degree of Discreteness of the Classification Label (D). It is used to evaluate the impact of η1, η2 and the maximum iteration number on the CNN model.
Each label can be considered an N-dimensional vector, where N equals the total number of label categories. D is defined as the maximum distance between all label-vectors and the center label-vector. The expression of D is shown in Equation (8); the selection principle of the center label-vector (p vc) is shown in Equation (9).
Here, u is the total number of categories of data; n is the number of samples of the same type; v indicates that the label distance of the vth class is currently being calculated; p vm, p vi, and p vj are the label vectors of the mth, ith, and jth waveforms of the vth class, respectively; and p vc is the center label-vector of the vth class. The principle of D is illustrated in Figure 5. In the figure, p represents the label-vectors; p c is the label vector that minimizes the sum of its distances to all other label vectors, that is, the center label-vector. The distance marked by red lines is the maximum distance between all label-vectors and the center label-vector, which is equal to D. Generally, it is hoped that the classification labels of the same class of data are as close to each other as possible after classification. Therefore, the value of D is negatively correlated with the classification effect of the network.
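For one class of samples, the selection of the center label-vector and the computation of D can be sketched as follows. Since the norm used in Equations (8) and (9) is not reproduced here, the Euclidean norm is assumed; the function name `discreteness` is ours.

```python
import numpy as np

def discreteness(labels):
    """Degree of Discreteness of the Classification Label (D) for one class:
    the center label-vector minimizes the summed distance to all other
    label-vectors (Equation (9)); D is the maximum distance from any
    label-vector to that center (Equation (8)). Euclidean norm assumed."""
    P = np.asarray(labels, float)
    # pairwise distance matrix between all label-vectors
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    center = int(np.argmin(dist.sum(axis=1)))  # Equation (9)
    return dist[center].max()                  # Equation (8)
```

A small D means the labels of same-class samples cluster tightly, which is why D is negatively correlated with classification quality.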

Voltage Sag Type Classification and Data Preprocessing
Induction motor starting, transformer switching and short-circuit faults are the main causes of voltage sag. The voltage sag waveforms caused by different types of short-circuit faults, such as single-phase grounding, two-phase short-circuit and three-phase short-circuit, are also different. The above voltage sags can be divided into nine types: motor starting, transformer switching, three-phase short circuit, single-phase grounding (including A-phase grounding, B-phase grounding, C-phase grounding) as well as two-phase short circuit (including AB-phase short circuit, BC-phase short circuit and CA-phase short circuit).

The processor used in the experiments was an Intel(R) Celeron(R) 2957U @ 1.40 GHz. The data used in this example were all real-time monitoring data. According to the monitoring data, the standard sample waveforms of the nine types of sags were established and recorded as S1, S2, ..., S9, respectively (Table 1).

After preprocessing, each phase of the three-phase voltage was treated as a row, and each sample was normalized as a 3 × 27 two-dimensional matrix with each element in [0,1]. Each row of data was repeated three times to obtain a 9 × 27 sample matrix, in order to allow more convolution operations with the convolution kernel and to extract the characteristics of the samples fully. Table 2 shows the typical voltage sags after graying.

Model Establishment and Optimal Parameter Selection
In the present study, a self-supervised CNN model was built on the basis of the classical three-layer CNN model structure. Since the input is a 9 × 27 two-dimensional matrix, the input layer contains 9 × 27 units. The CNN network for feature extraction consists of one convolution layer and one pooling layer. The convolution layer is composed of 16 convolution kernels of size 3 × 3, and the output of the convolution layer is activated by the sigmoid function and enters the pooling layer. The pooling layer applies average pooling over 1 × 5 sub-matrices and outputs 16 final feature matrices of size 7 × 5. The output of the feature extraction network is flattened into a 1 × 560 one-dimensional matrix and passed to a transition layer that serves as the input layer of the classification network. The transition layer contains no other operations. The classification network of the model consists of one hidden layer and one output layer. The hidden layer has 100 units and uses a sigmoid activation function. The number of output layer units is equal to the number of voltage sag source categories; in the case study, the output layer has nine units, since the voltage sag sources are divided into nine categories according to the fault phase. The constructed self-supervised CNN model is thus a six-layer model. The structural framework is shown in Figure 6. The momentum parameters of the convolution feature extraction network and the classification network are both taken as 1.
To select the best parameters for the above model, a simulation experiment on the learning rate of the CNN feature extraction network (η1) and the learning rate of the BP classification network (η2) was carried out first. The maximum number of iterations was continuously adjusted to calculate D of the final result under different learning rates, and the appropriate maximum number of iterations, η1 and η2, were selected by comparing the size of D. The simulation results are shown in Figure 7. Analysis of the results showed that as the maximum number of iterations increased, D decreased gradually to a minimum. When the maximum number of iterations was 75, D was less than 1 for η1 ≥ 0.5 and η2 ≥ 0.5. Therefore, the maximum number of iterations was set to 75, with η1 = 0.5 and η2 = 0.5, according to the simulation experiment.

Training Process of Feature Extraction
Three-hundred-and-sixty samples of measured voltage sags (40 each of S1 through S9) were pre-processed and input into the CNN model for training. In the case study, the status and output of each layer are examined for an input sag sample caused by a phase-B ground fault (Figure 8).
The sample is convolved with the convolution kernels of the convolution layer, and the results are activated by the sigmoid function to obtain the feature mappings. The 16 feature mappings are shown in Figure 9. The feature mappings are then compressed through the pooling layer, and their offset centers are compensated. The model in this study used average pooling to obtain the final features (Figure 10).
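The feature extraction network is trained without labels. As a rough illustration of the reconstruction-style (autoencoder) objective mentioned for the self-supervised training, the following sketch runs tied-weight autoencoder updates on one flattened sample; the hidden size, initialization, and update rule are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(9 * 27)          # flattened 9 x 27 sag sample
W = rng.standard_normal((64, 243)) * 0.05  # tied encoder/decoder weights (assumed size)
b_enc = np.zeros(64)
b_dec = np.zeros(243)

eta = 0.5                                # learning rate, as selected above
losses = []
for _ in range(200):
    h = sigmoid(W @ x + b_enc)           # encode
    x_hat = W.T @ h + b_dec              # decode with tied weights (linear decoder)
    err = x_hat - x
    losses.append(0.5 * np.mean(err ** 2))
    # Gradients of the mean-squared reconstruction error
    g_h = (W @ err) * h * (1 - h) / err.size
    grad_W = np.outer(g_h, x) + np.outer(h, err) / err.size
    W -= eta * grad_W
    b_enc -= eta * g_h
    b_dec -= eta * err / err.size

print(losses[0], "->", losses[-1])       # reconstruction error decreases
```

No class label appears anywhere in the update: the network learns features purely from the waveform itself, which is what allows the classification labels to be derived afterwards.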

Results of Classification Labels
After training the classification network with the 360 voltage sag waveform samples, 360 corresponding weight labels are obtained. The average weight labels for each kind of sag are shown in Table 3, with the maximum value in each row shown in bold.
The sag category corresponding to the largest element in the classification label matrix is determined from Equation (7); that is, the final identification result is given by the maximum element in the classification label. For example, the standard sample S1 corresponds to a motor-starting voltage sag, and W1 is the maximum element in its weight label, which verifies the results in Table 3.
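The decision rule of Equation (7) amounts to taking the argmax over the weight label. A minimal illustration, where the numeric label values are hypothetical stand-ins for one row of Table 3:

```python
# Hypothetical weight label for one test sample: one activation per
# sag category S1..S9 (the values below are illustrative, not from Table 3).
categories = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
weight_label = [0.81, 0.12, 0.03, 0.05, 0.02, 0.04, 0.01, 0.02, 0.03]

# Equation (7): the identified source is the category with the largest weight.
best = max(range(len(weight_label)), key=weight_label.__getitem__)
identified = categories[best]
print(identified)  # S1 -> motor starting
```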

Results and Comparison with Other Methods
One hundred test samples (15 S1, 15 S2, and 10 each of S3 through S9) were input into the model. The accuracy of the voltage sag source identification method based on the self-supervised CNN model is shown in Table 4. To verify the superiority of this method, the example was compared with the SVM-based method of Reference [27]. The method in Reference [27] cannot distinguish the fault phase, so the comparison only involves the identification accuracy of the sag sources; results are shown in Table 5. The experimental results show that the accuracy of sag source identification is 83% with the SVM method and 97% with the CNN-based method. In Table 4, the identification accuracy for S3 to S9 reaches 100%. Among them, S3 represents a three-phase short circuit, S4 to S6 represent single-phase grounding, and S7 to S9 represent two-phase short circuits; all of these are faults. Since the sag region is more distinctive than other fault characteristics, these types are easier to distinguish. S3 differs clearly from the other faults because it lacks the temporary-rise characteristic, while S4 to S9 can be separated according to whether a sag is present in each phase; the per-phase information can be read directly from the extracted features (Figure 9). The accuracy for motor starting and transformer switching is lower. As can be seen from Table 2, S1 and S2 are very similar: they are easily distinguished from the other types, but there is some confusion between them. Nevertheless, the total accuracy of the proposed method is still much higher than that of the method in Reference [27], for which the accuracy of each type is lower than 90%. Moreover, the proposed method reaches 100% accuracy in judging the fault phase of asymmetric sags.
The time for each step of the two methods is shown in Table 6. The most time-consuming step of the SVM method is feature extraction: based on expert experience, the method in Reference [27] sets target features including 60 equally spaced rms values, the remaining feature components, the 2nd harmonic magnitude, two odd harmonic magnitudes (5th and 9th), and the total harmonic distortion (THD) with respect to the fundamental, and the extracted features are then input into the SVM for training. The proposed method needs neither labels for the training samples nor a separate feature extraction step, since feature extraction and machine training are carried out simultaneously. Its machine-training time is therefore longer than that of training the SVM alone, but its overall running time is significantly shorter, which shows that the proposed method is simpler.

Conclusions
In this paper, a self-supervised voltage sag source identification method based on CNN is proposed, and its superiority in feature extraction and sag source identification is verified by a case study.
According to the analysis of the principle of the self-supervised CNN model and the results of the practical examples, the method has the following advantages: (1) Based on the structure of the convolutional neural network, the convolution operation of the convolution layer and the pooling operation of the pooling layer are used to extract the voltage sag features. Manual feature extraction, which relies heavily on expertise, is highly sensitive to unknown features, and lacks universality, is replaced by automatic feature extraction. (2) The self-supervised training process does not require a large number of training sets with correct labels to be supplied in advance. Unlike traditional methods, it can correctly identify unknown sag waveforms, which better meets the requirements of sag source identification for timeliness, practicability, diversity, and versatility in the context of modern big data. (3) The method feeds the three-phase waveform samples into the classification model as a two-dimensional matrix instead of a one-dimensional matrix. This preserves the information of the different fault types and expands the sag classes to nine categories, thus enabling fault phase identification for three-phase asymmetric faults. (4) Compared with the SVM method, the self-supervised voltage sag source identification method based on CNN is shown to achieve higher accuracy (97%) in identifying measured sag data.