A Study on Deep Neural Network-Based DC Offset Removal for Phase Estimation in Power Systems

Abstract: The purpose of this paper is to remove the exponentially decaying DC offset in fault current waveforms using a deep neural network (DNN), even under harmonics and noise distortion. The DNN is implemented using the TensorFlow library based on Python. Autoencoders are utilized to determine the number of neurons in each hidden layer. Then, the number of hidden layers is experimentally decided by comparing the performance of DNNs with different numbers of hidden layers. Once the optimal DNN size has been determined, intensive training is performed using both the supervised and unsupervised training methodologies. Through various case studies, it was verified that the DNN is immune to harmonics, noise distortion, and variation of the time constant of the DC offset. In addition, it was found that the DNN can be applied to power systems with different voltage levels.


Introduction
While the consumption of electrical energy is increasing day by day, other factors such as power quality and reliability also need to be improved. Protective devices have been improved dramatically over time to protect power systems and to provide a continuous power supply to users. However, a decaying component superimposed on the fault current, known as the DC offset, degrades the accuracy of current phasor estimation, which is fundamental to the operation of modern protective devices, and can therefore lead to mal-operation or unnecessary power outages. In order to improve estimation accuracy, this paper proposes a method for constructing, training, and testing a deep neural network (DNN) for removing the exponentially decaying DC offset in the current waveform when a power system fault occurs.
Another approach is to remove the DC offset directly in the time domain [12][13][14][15][16][17]. Most time-domain algorithms remove the DC offset significantly faster than phasor-domain algorithms. Phasor estimation can also be achieved in the time domain, either simultaneously with or directly after removing the DC offset [12][13][14]. These methods provide a fast response, but are prone to errors when the fault current waveform contains high-frequency components. To overcome this drawback, some algorithms remove the DC offset in the time domain, then apply the DFT for phasor estimation [15][16][17]. Similar to phasor-domain algorithms, these newer algorithms effectively eliminate the DC offset. However, the time delay imposed by performing both the DC offset removal and the DFT means that more than one cycle is needed to estimate the phasor.
In order to eliminate the time delay while removing the DC offset in the time domain, the DNN method is proposed in this paper. The DNN output is the current waveform after removing the DC offset, without any time delay. The offset-free output is then applied as an input to a DFT-based phasor estimation algorithm. In addition to the nonlinearity of the DC offset, the phasor estimation accuracy is usually affected by random noise, which is frequently caused by the analog-to-digital conversion (ADC) process [18]. As the DNN is one of the most effective solutions for nonlinear applications [19][20][21][22][23], it was chosen to deal with the nonlinearity of the DC offset. In this study, in order to improve the training speed through parallel computation [24], an NVIDIA GTX 1060 graphics processing unit (GPU) was used for training the DNN.
The details of the proposed method, experiment procedures, and results are discussed in the following sections. Section 2 explains the preparation of the training data, while the structure and training of the DNN are demonstrated in Section 3. The results and conclusion are provided in Sections 4 and 5, respectively.

Data Acquisition
There are four stages of training data preparation: data generation, moving window formation, normalization, and random shuffling. As the training datasets have a considerable influence on the performance of DNN, there should be a sufficient amount of data of the required quality, such that the training process can find a good local optimum that satisfies every training dataset given to the DNN. In this study, the theoretical equations for each component (DC offset, harmonics, and noise) were modeled using Python code and training data were acquired directly through iterative simulations.

Data Generation
We modeled a simple single-phase RL circuit (60 Hz) to simulate power system faults with DC offset, harmonics, and noise based on theoretical equations. It was designed to allow manual selection of the time constant, fault inception angle, number of harmonics, and amount of noise. The modeling result was then compared with that of a PSCAD/EMTDC simulation. The results show that the single-phase RL (60 Hz) model is adequate for simulation under the theoretical conditions.

Exponentially Decaying DC Offset
It is well known that a DC offset occurs in the current waveform during faults in a power system. The fault current with DC offset after fault inception time t0 is given in (1), and its illustration is shown in Figures 1 and 2:

i(t) = Im[sin(ωt + α − φ) − sin(α − φ)e^(−(t−t0)/τ)],   (1)

where φ is the impedance angle of the RL circuit, which is close to 90°; α is the inception angle; and τ is the time constant of the power system.
A sudden increase in current causes a discontinuous section in the waveform. However, according to Faraday's Law of Induction, the current flowing in the line cannot change instantaneously in the presence of line inductance. This is the main cause of the DC offset; the initial value of the DC offset depends on the fault inception angle, while its decay rate depends on the time constant of the power system. In particular, Figure 2b shows the fault current when the inception angle is 90°; its waveform is similar to a fault current waveform without DC offset. Assuming that there is no current just before the fault inception time (i(t0−) = 0), the DC offset given in (1) is almost zero because sin(α − φ) is near zero.
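The fault current model in (1) can be sketched directly in NumPy. This is a minimal illustration, not the paper's simulation code; the parameter values (amplitude, impedance angle, time constant) are placeholders:

```python
import numpy as np

def fault_current(t, t0=0.0, Im=1.0, f=60.0, alpha=0.0,
                  phi=np.deg2rad(85.0), tau=0.05):
    """Fault current per Eq. (1): a steady-state sinusoid plus an
    exponentially decaying DC offset whose initial value is set by the
    inception angle alpha and impedance angle phi."""
    w = 2.0 * np.pi * f
    steady = Im * np.sin(w * (t - t0) + alpha - phi)
    dc = -Im * np.sin(alpha - phi) * np.exp(-(t - t0) / tau)
    return steady + dc

t = np.arange(1536) / 3840.0      # one 0.4 s run at 64 samples/cycle
i = fault_current(t)
# At t = t0 the two terms cancel, so the current starts from zero,
# consistent with i(t0-) = 0 and the inductance argument above.
```

Note that with alpha = 90° and phi near 90°, sin(α − φ) ≈ 0 and the DC term vanishes, matching the Figure 2b discussion.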


Harmonics
Harmonics are defined in [25] as sinusoidal components of a periodic wave or quantity having a frequency that is an integral multiple of the fundamental frequency, and can be expressed as follows:

i_h(t) = Σ (n=2 to 5) A_n sin(nωt),   (2)

where n is the harmonic order and A_n is its amplitude. The second, third, fourth, and fifth harmonic components were considered in this paper to maintain an acceptable training speed.
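The harmonic sum in (2) can be sketched as follows; the amplitudes used here are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def harmonic_components(t, f=60.0, amps=None):
    """Sum of harmonic terms A_n * sin(n*w*t) per Eq. (2).
    Placeholder amplitudes for orders 2-5."""
    if amps is None:
        amps = {2: 0.10, 3: 0.08, 4: 0.05, 5: 0.04}
    w = 2.0 * np.pi * f
    return sum(A * np.sin(n * w * t) for n, A in amps.items())

t = np.arange(64) / 3840.0        # one fundamental cycle, 64 samples
h = harmonic_components(t)
```

Because every order is an integer multiple of the fundamental, each term averages to zero over one full cycle.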

Additive Noise
There are many types of noise interference; however, the typical noise type in power system measurements is quantization noise [26] in AD conversion. As noise components have no specified frequency, the noise was approximated as additive Gaussian white noise (AGWN) and the amount of noise was adjusted by setting the signal-to-noise ratio (SNR), which can be expressed as follows:

SNR = 10 log10(P_signal / P_noise),   (3)

where P_signal and P_noise are the powers of the signal and the noise, respectively. The noise component will disrupt the normal operation of digital filtering algorithms, as discussed in earlier sections.
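Injecting AWGN at a target SNR per (3) reduces to scaling the noise power relative to the measured signal power. A minimal sketch (the function name and seed are illustrative):

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that
    10*log10(P_signal / P_noise) = snr_db, per Eq. (3)."""
    rng = np.random.default_rng(0) if rng is None else rng
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

t = np.arange(64) / 3840.0
clean = np.sin(2.0 * np.pi * 60.0 * t)
noisy = add_awgn(clean, snr_db=40.0)
# The realized SNR fluctuates slightly around 40 dB because the noise
# power is only matched in expectation.
```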

Simulation
On the basis of the theories discussed above, we modeled a simple RL circuit in Python code using NumPy for the simulation of current waveforms. Then, we iterated over the parameters shown in Table 1 to obtain a range of situations that can occur in a power system. As shown in Table 1, the total number of simulations performed is 6480. For this rehearsal training, the parameters were sampled less densely to maintain an acceptable training speed. The validation datasets were generated using the same methodology as the training datasets; however, the validation datasets do not include any of the training datasets. Their generation was conducted by iterating over the parameters given in Table 2.

Pre-Processing
Pre-processing includes moving window formation, normalization, and random shuffling. This process was carried out to generate training data in a form suitable for implementing a DNN. In this paper, the DNN was planned to reconstruct an output of one cycle corresponding to the fundamental current waveform when the DNN received one cycle of the current waveform including DC offset, harmonics, and noise. As 64 samples/cycle was decided as the sampling rate, the size of the input and output layers of the DNN was also determined to be 64. After performing 6480 simulations, we used the moving window technique to prepare the training datasets. Because each simulation lasted 0.4 s with a sampling frequency of 3840 Hz (64 samples/cycle), the total number of training datasets was 9,545,040.
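The moving window step can be sketched as below. Each 0.4 s run at 3840 Hz yields 1536 samples, so a 64-sample window advancing one sample at a time gives 1536 − 64 + 1 = 1473 windows per run, and 6480 × 1473 = 9,545,040 windows in total, matching the figure above:

```python
import numpy as np

def moving_windows(x, size=64):
    """Slice a waveform into overlapping windows advancing one sample
    at a time; each window is one 64-sample DNN input (and, for the
    clean signal, one label)."""
    n = len(x) - size + 1
    return np.stack([x[k:k + size] for k in range(n)])

samples = np.arange(1536.0)       # one simulated 0.4 s run
w = moving_windows(samples)       # shape (1473, 64)
```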
Next, we performed data normalization such that all training data lay between −1.0 and +1.0. This was done so that the final DNN could be generally applied to different power systems under different conditions (voltage level, source impedance, etc.) [27]. For normalization, every value in the input data was divided by the maximum absolute value, and the same procedure was applied to the labels. Finally, we randomly shuffled the entire datasets so that training would not focus on specific data. If random shuffling is not performed, the mini-batches formed in the stochastic gradient descent (SGD) algorithm would have high variance, making the cost minimization process unstable [28].
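One reading of the normalization and shuffling steps is sketched below. It assumes each input window's maximum absolute value scales both the window and its label (so the pair stays consistent); the function names are illustrative:

```python
import numpy as np

def normalize_pair(x, y):
    """Scale an input window and its label by the input's maximum
    absolute value, so the input lies in [-1, 1]."""
    scale = np.max(np.abs(x))
    return x / scale, y / scale

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 64))          # stand-in windows
labels = rng.normal(size=(1000, 64))          # stand-in clean labels

pairs = [normalize_pair(a, b) for a, b in zip(inputs, labels)]
inputs = np.stack([p[0] for p in pairs])
labels = np.stack([p[1] for p in pairs])

# Shuffle inputs and labels with the same permutation so that each
# mini-batch in SGD sees a representative mixture of cases.
perm = rng.permutation(len(inputs))
inputs, labels = inputs[perm], labels[perm]
```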

Design of a DNN and Its Training
The size of a DNN is decided by the number of hidden layers and the number of neurons in each layer. Generally, DNNs tend to have better performance when their size is larger [29]; however, too many layers and neurons may induce the DNN to overfit on the training datasets, leading to it being inappropriate for applications in different environments. Thus, the size of a DNN should be determined carefully to ensure its generality. In this study, we used autoencoders for determining the number of neurons in each hidden layer. Then, the number of hidden layers was experimentally decided by comparing the performance of DNNs with different numbers of hidden layers.
In addition to contributing to the determination of the number of neurons in each layer, the autoencoder also plays a significant role in pre-training [30]. By using autoencoders for pre-training, we can start training from well-initialized weights, so that it is more likely to reach better local optima faster than the training process without pre-training. In other words, pre-training contributes to an improvement in training efficiency.

Autoencoder
The autoencoder can be described in brief as a self-copying neural network used for unsupervised feature extraction. The autoencoder consists of two parts: the encoding layer and the decoding layer. It is possible to restore the original input using only the features extracted by a well-trained autoencoder. This implies that the extracted features are distinct characteristics representing the original input. We can also see that these features may reflect nonlinear characteristics of the input, as the activation function of the encoding layer is selected to be a rectified linear unit (ReLU) function. After one autoencoder is trained sufficiently, we can train the second autoencoder using the extracted features from the first autoencoder as an input. By repeating this process, we can obtain gradational, multi-dimensional features of the original input. After training several autoencoders in this way, the encoding layers of every trained autoencoder are stacked to form a stacked autoencoder. Figure 3 shows the overall process of the DNN training for this simulation. For every training step, the same cost function, optimizer, and training algorithm are utilized.
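A single autoencoder of the kind described here can be sketched with the modern tf.keras API (the paper's original TensorFlow code is not shown, so this is an assumed reimplementation; 64 inputs for one cycle, 50 hidden neurons as in AE1, ReLU encoder):

```python
import numpy as np
import tensorflow as tf

# Encoding layer (ReLU) and decoding layer (linear) of one autoencoder.
inputs = tf.keras.Input(shape=(64,))
encoded = tf.keras.layers.Dense(50, activation="relu")(inputs)
decoded = tf.keras.layers.Dense(64, activation=None)(encoded)
autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Unsupervised training: the input is its own label.
x = np.random.default_rng(0).normal(size=(256, 64)).astype("float32")
autoencoder.fit(x, x, epochs=1, batch_size=64, verbose=0)

# The encoder alone yields the 50 extracted features, which would be
# fed as input to the next autoencoder in the stack.
encoder = tf.keras.Model(inputs, encoded)
features = encoder.predict(x, verbose=0)
```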


Root mean square deviation (RMSD) is used as the cost function, as given in (4):

RMSD = sqrt( (1/N) Σ (k=1 to N) (y_k − ŷ_k)² ),   (4)

where y_k is the label, ŷ_k is the DNN output, and N is the window size. Among the several optimizers offered in the TensorFlow library [31], the Adam optimizer was selected. This optimizer is a mixture of the RMSProp optimizer and the Momentum optimizer, both of which are based on the general gradient descent (GD) algorithm. In the training loop, the RMSProp optimizer takes a different update step for each weight, which makes the training process more adaptive; this accelerates and refines training. The Momentum optimizer adds inertia to the weight update, accelerating convergence and allowing the possibility of escape from bad local minima. The Adam optimizer is considered one of the most efficient tools for achieving our purpose, as it combines the best features of both: it is effective, straightforward, requires little memory, is well suited to problems that are large in terms of data or parameters, and is appropriate for problems with noisy or sparse gradients [32].
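The RMSD cost of (4) and the Adam optimizer can be wired together in tf.keras as follows; this is an illustrative sketch, not the paper's code:

```python
import tensorflow as tf

def rmsd(y_true, y_pred):
    """Root mean square deviation, Eq. (4): square root of the mean
    squared difference between label and DNN output."""
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

# A one-hidden-layer stand-in model compiled with RMSD + Adam.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(64, activation=None),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=rmsd)
```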

Pre-Training Hidden Layers
Selecting a method to train the hidden layers is significant for a DNN. Many techniques are available, such as the autoencoder, principal component analysis (PCA), the missing values ratio, the low variance filter, and others. However, as the goal of this paper is to increase the accuracy of phase estimation, the autoencoder was selected because it can replicate the structure of the original data with very high accuracy [33]. The key point regarding this process is the necessity of training each autoencoder sufficiently before moving on to the next one. If an autoencoder is not trained sufficiently, a certain amount of error would remain and be delivered to the next autoencoder, causing cascading errors. In short, the pre-training is useless if the autoencoders are not properly trained. On this basis, we determined the number of neurons in the hidden layers.
In the layer-wise training shown in Figure 4, several autoencoders were implemented with different numbers of neurons. After training (unsupervised training, using the input as the label), the training accuracy is compared to select the optimal size of the layer. Although more neurons give better performance in a neural network, there is a limit beyond which the reconstruction performance of the autoencoder does not improve further. In addition, the number of neurons in each layer may be chosen different from the input size to prevent overfitting. After the size of the first autoencoder is determined, the same procedure is repeated on the second autoencoder, with the extracted features from the first autoencoder used as its input. These steps are repeated until there is no further noticeable improvement in the reconstruction performance.

Pre-Training Output Layer
The output layer is called the regression layer in this study, because we are performing regression through the DNN, not classification. Training the output layer is identical to training an autoencoder, except for the activation function. For an autoencoder, the ReLU function was used to reflect nonlinear characteristics in the extracted features. However, ReLU cannot be used as the activation function for the output layer, as it only passes positive values and cannot express negative values. We aim to obtain a sinusoidal current waveform, which contains negative values, from the output layer; thus, the output layer was set as a linear layer, meaning that it has no activation function. By setting the output layer as a linear layer, it becomes possible to express negative values, and training can proceed. As the signal without interference (DC offset, harmonics, and noise) was used as the label, the training process of the output layer is supervised.
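The reason ReLU is unsuitable for the output layer can be seen in one line of NumPy: ReLU zeroes the negative half-cycle of a sinusoid, while a linear (identity) output keeps the full range:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: passes positives, zeroes negatives."""
    return np.maximum(z, 0.0)

z = np.array([-1.0, -0.5, 0.5, 1.0])   # samples spanning both polarities
clipped = relu(z)                       # negative samples are lost
linear = z                              # identity activation keeps them
```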

Supervised Fine Tuning
DNN training includes the input layer, hidden layers, and output layer. After completing the pre-training of each layer, every layer is stacked together. To improve the performance of the DNN, it is necessary to connect the pre-trained layers naturally, and to adjust the pre-trained weights precisely while taking every layer into account. The DNN is optimized using this process, which is called supervised fine tuning.
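The stacking and fine-tuning step can be sketched as below using tf.keras. The hidden-layer sizes follow the paper's DNN3 (50-50-60 plus a 64-neuron linear output); loading the actual pre-trained weights is omitted, and the random data stands in for the normalized windows and labels:

```python
import numpy as np
import tensorflow as tf

# Stack the pre-trained ReLU encoding layers and the linear output
# layer, then train every weight end-to-end against the clean labels.
inputs = tf.keras.Input(shape=(64,))
h = tf.keras.layers.Dense(50, activation="relu")(inputs)   # from AE1
h = tf.keras.layers.Dense(50, activation="relu")(h)        # from AE2
h = tf.keras.layers.Dense(60, activation="relu")(h)        # from AE3
outputs = tf.keras.layers.Dense(64, activation=None)(h)    # linear regression layer
dnn = tf.keras.Model(inputs, outputs)
dnn.compile(optimizer="adam", loss="mse")

x = np.random.default_rng(0).normal(size=(128, 64)).astype("float32")
y = np.random.default_rng(1).normal(size=(128, 64)).astype("float32")
dnn.fit(x, y, epochs=1, batch_size=32, verbose=0)   # supervised fine tuning
pred = dnn.predict(x, verbose=0)
```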

Determination of the DNN Size
To determine the DNN size, including the number of neurons in each layer and the number of hidden layers, we used partial datasets instead of the whole datasets. Using all 9,545,040 datasets would be time-consuming and ineffective, and since the purpose of this step is to investigate the reconstruction ability of autoencoders of different sizes, there is no issue with using partial datasets. The validation datasets were used as the partial datasets to determine the DNN size.
The DNN size was decided according to an experimental procedure based on applying the training process repeatedly. We implemented DNNs with different numbers of hidden layers, so that we could determine the optimal number of layers by analyzing the training results, similar to the determination of the number of neurons in each layer. After completing the determination of the DNN size, we re-initialized all weights before the pre-training process.

Number of Neurons in Each Layer
As described in Section 3.2.1, for the first hidden layer we implemented autoencoders with different numbers of neurons and recorded the average cost of each epoch during training. Figure 5 shows the effect of the number of neurons on the cost reduction. In every case, regardless of the number of neurons, there was no significant change in cost after about 150 epochs. However, the point of convergence differed depending on the size; in fact, except for sizes 10 and 20, every case converged to a similar cost value. This implies that 30 neurons may be an adequate number for AE1. To confirm this, we analyzed the training accuracy of each case by calculating the maximum cost, the average cost, and its standard deviation. Figures 6 and 7 show the waveforms corresponding to the maximum RMSD for different sizes of AE1, and the results of experimental training for AE1 are summarized in Table 3. From the cost reduction curve shown in Figure 5, we expected that AE1 with 30 neurons would be optimal. However, from the RMSD analysis given in Table 3, it was found that AE1 with 50 neurons is optimal; with more than 40 neurons, the RMSD started to fall within the acceptable range.
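The RMSD summary statistics used to rank the candidate sizes (maximum, average, and standard deviation, as in Table 3) reduce to a few NumPy calls; the values here are placeholders, not the paper's results:

```python
import numpy as np

def rmsd_stats(errors):
    """Summarize per-window RMSD values the way Table 3 does:
    maximum, average, and standard deviation."""
    e = np.asarray(errors, dtype=float)
    return {"max": e.max(), "avg": e.mean(), "std": e.std()}

stats = rmsd_stats([2.0, 4.0, 6.0])   # placeholder per-window RMSDs
```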

This result implies that the performance of the autoencoder does not depend entirely on the final cost of training. Among the eight candidates that fall within the desired range of RMSD, 50 neurons were chosen as the size of AE1, leaving some margin given that partial datasets rather than the whole datasets were used in this step.

Table 3. Root mean square deviation (RMSD) analysis of autoencoder 1 (AE1).

As the size of AE1 was determined to be 50, the size of the first hidden layer of the DNN is also 50. This is the entire procedure for deciding a hidden layer size. We performed three more experiments under the same scenario to determine the sizes of the next hidden layers, with the following results: hidden layer 1: 50; hidden layer 2: 50; hidden layer 3: 60; and hidden layer 4: 50.

Number of Hidden Layers
After the four DNNs with different numbers of hidden layers were implemented in TensorFlow, as shown in Figure 8, each output layer was trained before performing the fine tuning. The training of each output layer was performed by the layer-wise training method; the training parameters and results for each output layer are shown in Figure 9. Every DNN except DNN4 showed similar training performance in terms of cost reduction. For a more precise comparison between the different numbers of hidden layers, we performed an RMSD analysis, with the results shown in Figure 10. In Figure 10, DNN1 had the lowest average RMSD; however, DNN2 and DNN3 performed better in terms of the maximum and standard deviation of the error. DNN4 showed the worst performance, even though it had the most hidden layers.

Figure 10. Maximum, average, and standard deviation of costs depending on the number of hidden layers after output layer training; the two cases with the best performance are marked in red.
Fine tuning was then conducted to confirm the best-performing DNN. Unlike the layer-wise training, fine tuning is a training process that involves all layers of the DNN; thus, we connected every layer in the DNNs, as shown in Figure 8. The results of the fine tuning are shown in Figures 11 and 12. The results of the RMSD analysis after fine tuning were significantly different from those before. Before fine tuning, the performance of the DNN appeared to be independent of the number of hidden layers. After fine tuning, however, the maximum and standard deviation of the error decreased as the number of hidden layers increased; DNN3 and DNN4 were within an acceptable range of error. Finally, DNN3 was chosen because it performed best in terms of the average error and has the simplest structure capable of obtaining the desired accuracy.

DNN3 Training Result
Figure 13 shows the cost reduction curves of the three autoencoders and the output layer, and Figure 14 shows the cost reduction curve of fine tuning. Comparing the average cost between the layer-wise trained DNN and the fine-tuned DNN3, the latter exhibited a lower final average cost. This demonstrates the effectiveness of the fine-tuning step for improving performance.

After every training step had been completed, a validation test was performed. The purpose of the validation test was to examine the generality of DNN3. Even if the training is conducted successfully, DNN3 may not operate accurately in new situations it has not experienced before, a phenomenon known as overfitting. The maximum, average, and standard deviation of the RMSD were 11.3786, 2.1417, and 1.3150, respectively; there was no sign of overfitting. The average value of the RMSD was low, and the standard deviation was also sufficiently low. A more detailed analysis of the validation accuracy is provided in Figures 15 and 16. DNN3 shows good accuracy even in situations that were not experienced during the training process. Finally, the training process of DNN3 was considered to have been completed successfully.

Performance Tests and Discussions
PSCAD/EMTDC is used to model the power system shown in Figure 17 and generate datasets for performance testing.

After modeling was completed, single line-to-ground faults were simulated to acquire fault current waveforms with DC offset. By adding harmonic components and additive noise, we were able to prepare test datasets to verify the performance of DNN3. For a comparative analysis between the proposed DNN3 and digital filtering algorithms, we implemented a second-order Butterworth filter and a mimic filter (known as a DC-offset filter), which is commonly used in existing digital distance relays [34]. Even if the noise component is removed using an analog low-pass filter before the ADC, the influence of noise generated after the ADC cannot be eliminated from the current waveform; we have taken this into account while generating the test datasets.

Response to Currents without Harmonics and Noise
In this test, the current waveform includes only the fundamental current and the DC offset. Regarding the DC offset, three cases were considered: DC offset in a positive direction, DC offset in a negative direction, and a relatively small amount of DC offset. Figures 18 and 19 show the input, DNN3 output, and DC-offset filter output for the three cases. Every DNN3 output (a window of 64 samples) overlaps in the graph, and this demonstrates the distinct features of the DNN3 output. As the DNN3 had no opportunity to be trained on transient states, it exhibits unusual behavior near the fault inception time. However, subsequent tests confirmed that this phenomenon has no adverse effect on the phasor estimation process based on the full-cycle discrete Fourier transform (DFT). As the output of the DNN3 is in the form of a data window, we plotted the most recent values using a red line, as shown in Figure 20. From this point forward, the instantaneous current value will be plotted in this way.
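For reference, the DC-offset (mimic) filter used as the comparison baseline can be sketched in its standard digital form, y[n] = K·((1 + τ_b)·x[n] − τ_b·x[n−1]), where τ_b is the assumed time constant in samples and K normalizes the gain at the fundamental. The sampling parameters and time constant below are assumptions, not the paper's exact settings.

```python
import numpy as np

FS, F0 = 3840, 60.0                  # assumed: 64 samples/cycle at 60 Hz
SAMPLES_PER_CYCLE = int(FS / F0)     # 64

def mimic_filter(x, tau_seconds=0.05):
    """Digital mimic (DC-offset) filter with an assumed time constant."""
    tau_b = tau_seconds * FS                      # time constant in samples
    w = 2 * np.pi * F0 / FS                       # fundamental, rad/sample
    # K normalizes the filter to unity magnitude at the fundamental frequency.
    k = 1.0 / np.hypot(1 + tau_b - tau_b * np.cos(w), tau_b * np.sin(w))
    y = np.empty_like(x, dtype=float)
    y[0] = k * (1 + tau_b) * x[0]
    y[1:] = k * ((1 + tau_b) * x[1:] - tau_b * x[:-1])
    return y

# When the assumed tau matches the waveform's tau, the decaying DC offset
# is suppressed almost completely after the first sample.
t = np.arange(4 * SAMPLES_PER_CYCLE) / FS
x = np.sin(2 * np.pi * F0 * t) + 0.8 * np.exp(-t / 0.05)
y = mimic_filter(x, tau_seconds=0.05)
```

Note that this filter's accuracy depends directly on how well `tau_seconds` matches the actual time constant, which is exactly the sensitivity examined in the later tests.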

The three signals shown in Figure 19 are used to simulate their performances using the full-cycle DFT algorithm. While applying the DFT to the three signals, the DNN signal gave the fastest convergence speed. Table 4 summarizes the amplitude convergence time for each method in the three different cases. According to the table, the DNN method has the best convergence time for all cases studied.

We have discussed the test results for the case in which the time constant is predicted accurately; however, when an inaccurate time constant is used in the DC-offset filter, its performance is adversely affected, as shown in Figure 23. In the case of the DC-offset filter, some oscillations appear in the current amplitude after fault inception, depending on the extent of the inaccuracy in the time constant. On the other hand, the DNN shows robust characteristics and a shorter convergence time.
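The full-cycle DFT amplitude estimation applied to the three signals can be sketched as a sliding one-cycle window; the sampling parameters are assumptions.

```python
import numpy as np

FS, F0 = 3840, 60.0                   # assumed: 64 samples/cycle at 60 Hz
N = int(FS / F0)                      # samples per fundamental cycle
n = np.arange(N)
cos_k = np.cos(2 * np.pi * n / N)     # fundamental correlation templates
sin_k = np.sin(2 * np.pi * n / N)

def fullcycle_dft_amplitude(x):
    """Fundamental amplitude for every full-cycle window ending at sample n."""
    amps = []
    for end in range(N, len(x) + 1):
        w = x[end - N:end]
        re = (2.0 / N) * np.dot(w, cos_k)
        im = (2.0 / N) * np.dot(w, sin_k)
        amps.append(np.hypot(re, im))
    return np.array(amps)

t = np.arange(4 * N) / FS
x = 100 * np.sin(2 * np.pi * F0 * t)  # clean fundamental, amplitude 100
amp = fullcycle_dft_amplitude(x)
```

For a clean fundamental the estimate is exact for every window; any residual DC offset left by the preprocessing stage shows up as the amplitude oscillations described above.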

Case Study
For further performance analysis of the proposed DNN3, we will discuss three case studies in this section. Figure 24 shows the test results of CASE A, which are summarized in Table 5. DNN3 has a faster convergence speed and lower standard deviation compared with the DC-offset filter. Compared with the original input (where the DC offset was not removed), the DC-offset filter shows remarkable improvement, though it still performs slightly less effectively than the DNN3. However, Figure 24a shows at a glance that the performance of the DC-offset filter is very poor in estimating the instantaneous current waveform. As a result, it was found that the DNN3 performed much better, even though the test was conducted with the DC-offset filter given a precise time constant. Thus, in the case of a line-to-line fault, the proposed method outperforms the DC-offset filter in terms of both convergence time and standard deviation.
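The two comparison metrics used throughout these case studies can be computed as sketched below. The 2% tolerance band and the one-cycle averaging span are assumptions; the paper's exact convergence criterion is not stated in this excerpt.

```python
import numpy as np

FS_ASSUMED = 3840  # assumed sampling rate, samples/s

def convergence_time(amp, fs=FS_ASSUMED, band=0.02):
    """Seconds until the amplitude estimate stays within a tolerance band.

    Walks backward to find the first index after which the estimate never
    leaves +/- band of its final (converged) value.
    """
    final = amp[-1]
    within = np.abs(amp - final) <= band * abs(final)
    idx = len(amp) - 1
    while idx > 0 and within[idx - 1]:
        idx -= 1
    return idx / fs

def converged_std(amp, tail=64):
    """Standard deviation of the last `tail` amplitude estimates."""
    return float(np.std(amp[-tail:]))

# Illustrative amplitude trace: a ramp that settles at 100.
amp = np.concatenate([np.linspace(0.0, 100.0, 32), np.full(100, 100.0)])
t_conv = convergence_time(amp)
```

A smaller `convergence_time` and a smaller `converged_std` correspond to the "faster convergence speed and lower standard deviation" reported for the DNN3.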

The generality of the proposed DNN3 is verified through this test, as the results are close to those of CASE A, as shown in Figure 25 and summarized in Table 6. Even under an increased level of harmonics compared with the former test cases, DNN3 exhibited robust performance, with a shorter convergence time and lower standard deviation, even though a different power system model was used for the simulations.

In this section, we have observed how the proposed DNN3 and the DC-offset filter respond to the presence of fault resistance. The details of the studied system are given below:

In Figure 26a, it can be seen that oscillations appear in the result of the DC-offset filter when a fault resistance exists. This oscillation is caused by the fault resistance, which affects the time constant of the DC offset. As summarized in Table 7, the oscillation had an adverse effect on the convergence time of the filter. In contrast, DNN3 was not influenced by the fault resistance, which shows its potential to solve this problem.

Conclusions
In this paper, we developed a DNN to effectively remove DC offset. Autoencoders were used to determine the optimal size of the DNN. Subsequently, intensive training for the DNN was performed using both the supervised and unsupervised training methodologies.
Even under harmonics and noise distortion, the DNN showed accurate and robust performance in instantaneous current reconstruction and phasor estimation, compared with the DC-offset filter. It was also confirmed that the errors due to inaccurate time constants of the DC offset were significantly reduced compared with the DC-offset filter. These results confirmed that the method of determining the DNN size using the autoencoder was appropriate. Therefore, the optimal DNN size in other deep learning applications could be determined based on this methodology. As the performance of the DNN is largely affected by the quality of the training datasets, it would be possible to train the DNN more precisely if more sophisticated training datasets could be prepared.
Furthermore, it is expected that it would be possible to reconstruct the secondary current waveform of the current transformer distorted by saturation, by modeling the current transformer mathematically and applying the methodology used in this paper.