Online Measurement Error Detection for the Electronic Transformer in a Smart Grid

Abstract: With the development of smart power grids, electronic transformers have been widely used to monitor the online status of power grids. However, electronic transformers suffer from poor long-term stability, which leads to a requirement for frequent error measurement. Aiming to monitor the error state frequently and conveniently, we propose an attention mechanism-optimized Seq2Seq network to predict the error state of transformers, which combines an attention mechanism, a Seq2Seq network, and bidirectional long short-term memory networks to mine the sequential information from the online monitoring data of electronic transformers. We implemented the proposed method on the monitoring data of electronic transformers in a certain electric field. Experiments showed that the proposed attention mechanism-optimized Seq2Seq network not only greatly improves the training efficiency of the model but also achieves high prediction accuracy. Therefore, the proposed method is versatile and practical for solving electronic transformer error prediction problems.


Introduction
Currently, modern power grids are experiencing intensification and informatization. Owing to the rapid construction of intelligent power grids, a large number of electronic transformers (ETs) are used in applications [1]. An ET consists of sensors and signal processing units and is used to measure electric current and voltage. Compared with traditional magnetic transformers, it has advantages such as low cost, high bandwidth, and good insulation performance, and it adapts to the development trend of digitalization and intelligence. However, the current problem with ETs is their poor long-term stability [2]. As ETs are typical electrical measuring devices, the online monitoring and evaluation of their measurement error status are of great importance for the safety and reliability of the power grid.
Due to technical limitations, the evaluation methods for ET error status are mainly divided into three types. The first is regularly checking transformers that are in operation [3]: a high-accuracy transformer is connected to the loop of the measured transformer, and the ratio and angular difference of the tested transformer are then measured. However, the operation is very complicated, as it requires taking the tested transformer out of service and de-energizing the power line. In addition, examining the error state of transformers regularly is only a short-term procedure, which is unsuitable for online evaluation and monitoring of the long-term operational status of transformers. Therefore, it cannot accurately perceive the error change of transformers while the grid runs, and it is difficult to discover potential error change trends. The second method is to run high-standard transformers in parallel with the measured transformers and monitor the error of the measured transformers over a long time [4,5], which realizes long-term error state monitoring. However, it still has the weakness of low traceability of the standard transformer: to achieve accurate monitoring of the measured transformers, reliable standard transformers are required, so the standard transformers also need to be checked regularly. At present, connecting and disconnecting the standard transformers requires an outage of the line under test, which is quite complicated. Therefore, the long-term parallel operation of standard transformers also brings more complex maintenance problems. Additionally, the long-term integration of standard transformers into power system operation affects the operating parameters of the grid, increasing the risk to grid security and stability and decreasing the reliability of the power grid. Therefore, this method is difficult to implement in production [6][7][8].
The third method is the condition assessment of electronic transformers based on data-driven approaches or model analysis. The former is mainly aimed at evaluating the variation of transformers, but not the long-term, gradual error, which is an important performance evaluation index for electronic transformers. The latter must rely on accurate physical and mathematical models [9], which is not suitable for field engineering applications.
In summary, existing methods only support short-term online verification of transformers, which is far from sufficient for ensuring the performance of transformers [10][11][12]. Therefore, existing methods cannot satisfy the requirements of smart grid development for intelligent equipment maintenance [13,14]. There are still several shortcomings in the process of evaluating and monitoring transformers, as follows. (1) Current research focuses mainly on offline periodic maintenance and short-term online verification, which cannot accomplish long-term online monitoring and evaluation. (2) There are few studies on the long-term online monitoring of transformer status and on accurate acquisition technology for the primary line voltage and current signals. (3) It remains unclear how to accurately evaluate the operating state of the transformer without high-precision standard transformers, where the difficulty lies in extracting and judging the feature quantities that characterize transformer operation. (4) It is hard to ensure the accuracy of state inspection and evaluation results, where the difficulty lies in the design and implementation of sampling testing methods. (5) It is difficult to accurately evaluate the error state of transformers without exemplary transformers, and to separate the change in transformer error caused by grid faults from that caused by transformer failure.
Based on the evaluation of the error state of electronic transformers, we propose an attention mechanism-optimized sequence-to-sequence (Seq2Seq) network to predict the error state of transformers, which can be used for fault location and early warning of transformers. In Section 1.1, we review long short-term memory (LSTM) networks and the Seq2Seq model. Then, we present the attention mechanism-optimized Seq2Seq network for prediction in Section 1.2. In Section 2, we implement extensive experiments to illustrate the effectiveness of the proposed algorithm. Finally, we draw a conclusion in Section 3.

Long Short-Term Memory Networks
LSTM is a kind of cyclic neural network, which aims to solve the problem of long-term dependence in time series models. All cyclic neural network models process timing sequence information. The essence of LSTM is an improvement based on the basic cyclic neural network structure: three important gate control functions are introduced into the memory unit module, named the forget gate [15], input gate, and output gate. The standard network structure of LSTM is shown in Figure 1 below.

In the forward propagation process of the LSTM, information preservation and interaction are controlled through the three gate structures of the memory unit of the hidden layer.
(1) Forget gate f_t: The forget gate is used to control the proportion of input information. When the time sequence information passes through the forget gate, part of the information is discarded so that the time span of each batch of data is the same and the data volume is not too large. This ratio control outputs a value between 0 and 1 through the sigmoid layer, with 0 representing "complete abandonment" and 1 representing "complete retention". The implementation diagram of the forget gate is shown in Figure 2. The calculation formula of the forget gate is as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

where h_{t−1} denotes the output at time t − 1, x_t denotes the information in the network at time t, σ is the sigmoid function, W_f is the weight matrix for the forget gate, and b_f is the bias term for the forget gate.
(2) Input gate i_t: The input gate controls the input process of the current moment information: it completes the updating of the current moment information and, at the same time, superimposes the output of the hidden layer at the previous moment onto the current state. The input gate function includes a sigmoid function. The implementation diagram of the input gate is shown in Figure 3.
The superposition process of the output and current input at one time on the hidden layer is shown as follows:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

where h_{t−1} denotes the output at time t − 1, x_t denotes the information in the network at time t, σ is the sigmoid function, W_i is the weight matrix for the input gate, and b_i is the bias term for the input gate.
The calculation method for updating the current moment information is shown in the following formula:

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)

where C̃_t is the candidate for the memory unit, h_{t−1} denotes the output at time t − 1, x_t denotes the information in the network at time t, W_C is the weight matrix for the cell, and b_C is the bias term for the cell.
At the current moment t, the memory unit of the hidden layer completes the multiplication of information and realizes the memory output unit C_t through the joint action of the forget gate and the input gate. The output calculation formula is shown in the following formulation:

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t

where f_t is the output of the forget gate, i_t is the output of the input gate, C_{t−1} is the output of the cell at time t − 1, and C̃_t is the candidate for the memory unit.
(3) Output gate O_t: The output gate controls the output information and the timing information returned to the hidden layer before the memory unit information is output. By using the output gate, the cell state is updated, while the state of h_{t−1} is retained in the time unit operating under the hidden layer. The implementation diagram of the output gate is shown in Figure 4 below. The calculation formula of the output gate is shown in the following formulation:

O_t = σ(W_O · [h_{t−1}, x_t] + b_O)

where W_O is the weight matrix for the output gate and b_O is the bias term for the output gate. The h_t information returned to the hidden layer is computed as follows:

h_t = O_t ∗ tanh(C_t)

where O_t is the output of the output gate and C_t is the output of the cell at time t. In this way, we get the information returned to the hidden layer.
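The three gate formulas above can be sketched as a single LSTM forward step. The dimensions and parameter initialization below are purely illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_O, b_O):
    """One LSTM forward step following the gate formulas above."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_C @ z + b_C)       # candidate memory unit
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state
    o_t = sigmoid(W_O @ z + b_O)           # output gate
    h_t = o_t * np.tanh(c_t)               # information returned to the hidden layer
    return h_t, c_t

# Tiny usage example with random, purely illustrative parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = [p for _ in range(4) for p in (rng.standard_normal((n_hid, n_hid + n_in)) * 0.1,
                                        np.zeros(n_hid))]
h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), *params)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```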

Seq2Seq Network Model
The Seq2Seq network model is also a kind of cyclic neural network, which can be used to deal with sequence prediction problems. It is widely used in text summarization, question answering systems, and other fields. The Seq2Seq network model structure for machine translation is shown in Figure 5. In Figure 5, each rectangular block contains a neural network unit such as an LSTM. It can be observed from the structure diagram that the path from the input of the model to the output of the model is a complete stream, in which ABC is the input of the model's encoder part. After that, the "end of sentence" (EOS) sign in the input indicates the termination of the sentence. Then, WXYZ is the input and output of the sequential models in the decoder part. The output state of the last hidden layer in the encoder part is compressed into a fixed-dimension word vector expression of a specified size, and this fixed-dimension word vector expression is then taken as the input of the first hidden layer in the decoder part. The advantage is that the encoder part and decoder part can be regarded as a linear work flow.

It is mentioned in [16] that if the model receives input in the order of CBA for calculation, the prediction performance of the model can be improved to some extent because the distance between A and X becomes smaller. The structure diagram of the encoder part is shown in Figure 6, which corresponds to the left part of Figure 5.
From Figure 6, we can see that the encoder section [17] is composed of several stacked long short-term memory network units. Each LSTM cell accepts the input of a single element from the input sequence, and each LSTM unit propagates the information collected about the element at this time step forward to the next LSTM unit through calculation. The calculation formula of the state of each hidden layer is shown in the following formula:

h_t = f(W · [h_{t−1}, x_t])

where h_t is the current hidden layer state, W is the network weight, h_{t−1} is the previous hidden layer state, x_t is the input vector of the current hidden layer, and f(·) is the activation function.
The structure of the decoder model is shown in Figure 7 below. As can be seen from Figure 7, the decoder section also consists of several stacked LSTM network units. The LSTM unit of each time step receives the output of the previous LSTM unit and the state of the previous hidden layer. Candidate symbols are obtained through an activation function and a softmax layer, and the one with the highest probability is selected as the output of the current time step:

y_t = softmax(W · h_t)

where h_t is the current hidden layer state and W is the network weight. The output value of each time step is computed by using the hidden layer state of the current time step and the corresponding weight.
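The encoder-decoder flow described above can be sketched roughly as follows, using a plain tanh recurrence in place of full LSTM units and illustrative weight shapes; the encoder compresses the input sequence into a fixed-dimension vector, and the decoder unrolls from it:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, W_enc):
    """Encoder: h_t = tanh(W [h_{t-1}; x_t]); the final state is the fixed-size context."""
    h = np.zeros(W_enc.shape[0])
    for x in xs:
        h = np.tanh(W_enc @ np.concatenate([h, x]))
    return h

def decode(context, W_dec, W_out, steps):
    """Decoder: unroll from the context; each step emits y_t = softmax(W_out h_t)."""
    h, y = context, np.zeros(W_out.shape[0])
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_dec @ np.concatenate([h, y]))   # feed back previous output
        y = softmax(W_out @ h)
        outputs.append(y)
    return outputs

rng = np.random.default_rng(1)
d, n_in, n_out = 4, 2, 3
ctx = encode([rng.standard_normal(n_in) for _ in range(3)],
             rng.standard_normal((d, d + n_in)) * 0.1)
ys = decode(ctx, rng.standard_normal((d, d + n_out)) * 0.1,
            rng.standard_normal((n_out, d)) * 0.1, steps=2)
print(len(ys))  # 2; each ys[t] is a probability vector summing to 1
```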

Construction of Prediction Model Based on Optimized Neural Network

Attention Mechanism (AM)
When analyzing time series data, the characteristics of each element in the sequence should be calculated. When we extract information from sequence data, we need to pay attention to the characteristics of the entire sequence, while simply adding or averaging the features of each element may decrease the independence of elements in the input sequence. Therefore, the attention mechanism needs to be introduced into the model [18][19][20][21].
The introduction of the attention mechanism can also address the limitations of the model's long-term dependency and lead to efficient usage of memory during computation. As an intermediate layer between the encoding part and decoding part, the attention mechanism aims to capture information from the tag sequence related to the sentence content [22].
The attention mechanism-based network model first computes a set of attention weights and creates a combination of weights by multiplying them with the vectors output by the encoder. The calculated result should contain information about specific parts of the input sequence to help the decoder select the correct representation for the output. Therefore, the decoder can use different parts of the encoder sequence as the context until all sequences are decoded [23]. The framework of the attention mechanism is displayed in Figure 8. Unlike the encode-decode model, which uses the same context vector for each hidden layer state of the decoder part, the attention mechanism computes a vector C_t and an output y_t for each time step t in the decoding stage [24]. The corresponding calculation formula is shown as follows:

C_t = Σ_j a_tj · h_j

where h_j is the hidden layer state of input vector x_j, and a_tj is the weight of h_j for prediction y_t. Vector C_t is also known as the expected attention vector, and the weights a_tj can usually be calculated by a softmax function, as follows:

a_tj = exp(attentionScore(s_{t−1}, h_j)) / Σ_k exp(attentionScore(s_{t−1}, h_k))

Among them, the attentionScore function takes the hidden state s_{t−1} of the decoding part and the hidden state h_j of the encoding part to calculate a score used to compute the weight.
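The weight and context computations can be sketched directly. The bilinear form s_{t−1}ᵀ W h_j used for attentionScore below is one common choice and only an assumption, since the paper does not fix a specific score function:

```python
import numpy as np

def attention_context(s_prev, H, W_score):
    """Compute weights a_tj = softmax_j(score(s_{t-1}, h_j)) and C_t = sum_j a_tj h_j."""
    scores = np.array([s_prev @ W_score @ h_j for h_j in H])  # attentionScore per h_j
    e = np.exp(scores - scores.max())
    a = e / e.sum()                        # attention weights, sum to 1
    C_t = (a[:, None] * H).sum(axis=0)     # expected attention (context) vector
    return a, C_t

# Illustrative example: three encoder states with equal scores -> uniform weights.
H = np.eye(3)
a, C_t = attention_context(np.ones(3), H, np.eye(3))
print(a)    # [0.333... 0.333... 0.333...]
print(C_t)  # [0.333... 0.333... 0.333...]
```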

Construction of Seq2Seq Model
Based on the basic introduction of the AM above, introducing the attention mechanism into the Seq2Seq network model is expected to reveal the potential relationship between sequence data [16,25], so as to improve the prediction accuracy. In the Seq2Seq model, cyclic neural networks are generally used in the process of encoding, such as LSTM and gated recurrent units (GRU) [26]. In this study, the encoding and decoding stages used bidirectional LSTM [27].
The online operation state prediction of the transformer is based on time series. The standard cyclic neural network often ignores future context information when dealing with problems in a time series. If the network can access the context information of past or future time steps, it benefits the prediction [28,29]. To address this, a delay is added between the input and the target so that the future information of time frame M is used together to predict the output of the current time step. Theoretically, the size of M can be increased to capture all future information available for the current time step prediction. However, if M is adjusted to be too large, the prediction result becomes worse, because the network model focuses on so much input information that the joint modeling ability for prediction declines. Therefore, the size of M has to be adjusted manually during the experiment. Although introducing the M time frame into the model is beneficial, the context information of future time steps still cannot be fully obtained, and the training and prediction tasks of the network model are inefficient due to the manual adjustment of M. Therefore, the network model selected in the encoding process of this study was the bidirectional LSTM [30,31].
Bidirectional LSTM (BLSTM) is similar in network structure to the LSTM network because it is constructed with LSTM units. The special feature of BLSTM networks is that they improve long-term dependence without retaining redundant context information. Different from LSTM networks, the BLSTM has two network layers that propagate in two parallel directions. The forward and backward propagation modes of each layer are similar to the basic neural network propagation mode. Meanwhile, these two network layers carry all the information in the two directions before and after the sequence [32,33]. Therefore, the corresponding formula is adjusted as follows: h_f ∈ R^d represents the output vector of the forward neural network layer, and h_b ∈ R^d represents that of the reverse neural network layer. Different from the LSTM, the final output of the BLSTM is the combination of the two parts, y_t = [h_f_t, h_b_t], y_t ∈ R^{2d}. The structure of the Seq2Seq network model optimized by AM is shown in Figure 9 below.
The bottom half of Figure 9 is the encoder part of the model, which is a stack of BLSTM units with a length of T_x. In this paper, each BLSTM unit is called Pre_attention Bi_LSTM. Its output is represented by a_t, which means a_t = [a⃗_t; a⃖_t]: it combines the activation value of the forward propagation of the BLSTM with the activation value of the backward propagation. The output a_t and the decoder state of the previous time step are then used together in the calculation of the attention mechanism to obtain the context variable context_t of each time step. The upper part of Figure 9 is the decoding part. The prediction result of each time step is obtained by inputting the hidden layer state of the previous time step and the context variable context_t of the current time step into the BLSTM unit of the decoder.
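The bidirectional pass and the concatenation y_t = [h_f_t, h_b_t] can be sketched minimally as follows; for brevity each direction uses a simple tanh recurrence standing in for a full LSTM unit, and all weights are illustrative:

```python
import numpy as np

def bidirectional(xs, step_fwd, step_bwd, d):
    """Run forward and backward recurrent passes and concatenate them per step,
    giving y_t = [h_f_t, h_b_t] in R^{2d}."""
    h_f, h_b = np.zeros(d), np.zeros(d)
    fwd, bwd = [], []
    for x in xs:                 # forward layer: left to right
        h_f = step_fwd(h_f, x)
        fwd.append(h_f)
    for x in reversed(xs):       # backward layer: right to left
        h_b = step_bwd(h_b, x)
        bwd.append(h_b)
    bwd.reverse()                # align backward states with forward time order
    return [np.concatenate(pair) for pair in zip(fwd, bwd)]

rng = np.random.default_rng(2)
d, n_in = 3, 2
make_step = lambda W: (lambda h, x: np.tanh(W @ np.concatenate([h, x])))
xs = [rng.standard_normal(n_in) for _ in range(4)]
ys = bidirectional(xs,
                   make_step(rng.standard_normal((d, d + n_in)) * 0.1),
                   make_step(rng.standard_normal((d, d + n_in)) * 0.1), d)
print(len(ys), ys[0].shape)  # 4 (6,)
```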

Experimental Environment and Data Set
The experiment was carried out with the simulated monitoring data of a capacitor voltage transformer (CVT) in an electric field. The schematic diagram of the CVT is shown in Figure 10. C1 and C2 are the high-voltage capacitance and medium-voltage capacitance of the capacitor voltage divider. The magnetic unit consists of intermediate transformer T1, compensation reactor Z, damping device D, and overvoltage protection device G. After the CVT is connected to the high-voltage system, the capacitor voltage divider transforms the primary high-voltage signal into a lower intermediate-voltage signal, which reduces the insulation requirements of the magnetic unit; the intermediate transformer then converts this signal into the required small secondary signal, which is used for metering, measurement and control, protection, communication, and other applications. The secondary output of the CVT has several windings according to different demands, of which 1a1n (2a2n, 3a3n) are the main secondary winding terminals and dadn is the residual voltage winding terminal. "Amplitude" is extracted from 15 test points: "main variant I group A phase", "main variant I group B phase", "main variant I group C phase", "main variant II group A phase", "main variant II group B phase", "main variant II group C phase", "main variant III group A phase", "main variant III group B phase", "main variant III group C phase", "5449 Line A phase", "5449 Line B phase", "5449 Line C phase", "5450 Line A phase", "5450 Line B phase", "5450 Line C phase". Since the "frequency" is consistent across points, there are a total of 16 characteristic dimensions, and together with the 15 label data, a total of 35,718 records are obtained. The obtained data are then divided into a training set and a test set in an 8:2 ratio. Table 1 shows the environment in which our experiments were implemented.
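The 8:2 split described above can be sketched as follows; the `data` array here is a random stand-in for the real 35,718 × 16 monitoring matrix, which is not reproduced in this paper.

```python
import numpy as np

# Hypothetical stand-in for the 35,718-record monitoring set with 16 feature
# dimensions per record; the real amplitude/frequency data would be loaded here.
rng = np.random.default_rng(42)
data = rng.normal(size=(35718, 16))

rng.shuffle(data)                      # shuffle rows before splitting
split = int(0.8 * len(data))           # 8:2 train/test split
train, test = data[:split], data[split:]
print(len(train), len(test))           # 28574 7144
```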

Evaluating Indicator
In our study, the prediction performance of the network model was evaluated by using different calculation formulas of prediction deviation. To a certain extent, the size of the prediction deviation reflects the quality of the prediction performance of the network model: when the prediction deviation is larger, the prediction performance of the model is worse; when it is smaller, the prediction effect of the model is better. The commonly used evaluation indexes include mean absolute error, mean absolute percentage error, and mean square error.

(1) Mean absolute error (MAE): Refers to the average of the absolute deviation between the predicted value and the real value. MAE reflects the error of the model's predictions to a certain extent. It is calculated as follows:

MAE = (1/n) Σ_{i=1..n} |ŷ_i − y_i|,

where ŷ_i is the predicted value and y_i is the actual value (the same notation is used below).
(2) Mean absolute percentage error (MAPE): Represents the average relative deviation between the predicted results and the actual results. It is calculated as follows:

MAPE = (100%/n) Σ_{i=1..n} |(ŷ_i − y_i)/y_i|.

(3) Mean square error (MSE): Represents the average squared deviation between each predicted value and the real value, used to evaluate the degree of data variation. The smaller the MSE, the higher the accuracy of the prediction model. It is calculated as follows:

MSE = (1/n) Σ_{i=1..n} (ŷ_i − y_i)^2.

(4) Root mean square error (RMSE): The square root of the MSE, likewise used to evaluate the extent of variation in the data. The smaller the RMSE, the higher the accuracy of the model. It is calculated as follows:

RMSE = sqrt((1/n) Σ_{i=1..n} (ŷ_i − y_i)^2).

(5) Coefficient of determination R^2: Its range is [0, 1], and it represents how well the predicted values fit the true values. It is calculated as follows:

R^2 = 1 − Σ_{i=1..n} (ŷ_i − y_i)^2 / Σ_{i=1..n} (y_i − ȳ)^2,

where ŷ_i is the predicted value, ȳ is the average of the true values, and y_i is the actual value.
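The five indexes can be computed in a few lines. The sketch below follows the formulas above; the example values are made up, and the MAPE term assumes no zero targets.

```python
import numpy as np

def metrics(y_true, y_pred):
    """Compute MAE, MAPE, MSE, RMSE, and R^2 as defined above."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0   # assumes no zero targets
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mape, mse, rmse, r2

# made-up example values, for illustration only
y_true = np.array([100.0, 102.0, 98.0, 101.0])
y_pred = np.array([100.5, 101.0, 98.5, 100.0])
print(metrics(y_true, y_pred))
```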

Experimental Process and Analysis
The main flow of online measurement error prediction based on the optimized neural network proposed in this paper is as follows:
Step 1: Divide the original data set into a training set, verification set, and test set according to a certain proportion;
Step 2: Initialize the network model hyperparameters;
Step 3: Complete the relevant calculation of the Seq2Seq model encoder, and work out the attention variable corresponding to each BLSTM unit;
Step 4: Calculate the context variable corresponding to each time step according to the calculated attention variables;
Step 5: Calculate the predicted value of the current time step from the calculated context variable and the output value of the decoder at the previous time step;
Step 6: Repeat the above steps until the specified number of iterations is completed, ending the training of the network model;
Step 7: Test the model and judge its quality by the evaluating indicators;
Step 8: Reverse-normalize the predicted results, compare them with the real data, and evaluate the prediction performance.
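Steps 3-5 (attention over the encoder outputs and the per-step context variable) can be sketched as follows. The dot-product alignment score through a single matrix W is a simplification, since the paper does not spell out the exact alignment network, and all shapes are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shapes: T_x encoder steps, encoder outputs a_t in R^{2d},
# decoder hidden state s_prev in R^m from the previous time step.
rng = np.random.default_rng(1)
T_x, two_d, m = 6, 8, 5
a = rng.normal(size=(T_x, two_d))      # stacked Pre_attention Bi_LSTM outputs
s_prev = rng.normal(size=m)            # decoder state of the previous step
W = rng.normal(size=(m, two_d))        # illustrative alignment matrix

scores = a @ W.T @ s_prev              # alignment score for each encoder step
alpha = softmax(scores)                # attention weights, sum to 1
context = alpha @ a                    # context_t = sum over t of alpha_t * a_t
print(context.shape)                   # (8,)
```

The context vector is then concatenated with the decoder input for the current step, which is the role `context_t` plays in Figure 9.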

Parameter Selection
(1) Selection of batchsize and epochs. The data set used in this paper is large. Inputting the entire data set into the network model in a single batch approaches the direction of the extremum more accurately, but the memory requirements are generally high. Inputting only one sample at a time, on the other hand, made the loss function difficult to converge in our experiments. Therefore, we adopted neither of these two extremes, but instead selected an appropriate batchsize for mini-batch gradient descent. We tried different combinations of batchsize and epochs while monitoring the mean square error of the model and, from the loss curves drawn with this index, selected the most suitable curve and took its batchsize and epochs as the combination to use. Since numbers are stored in binary form in the computer, the batchsizes we tested were 128, 256, 1024, 2048, 3000, and 4096 (the upper limit was constrained by the GPU memory of the experimental equipment). To avoid model over-fitting, the number of epochs was set to 50 based on common practice. The experimental results are shown in Figure 11.
According to the experimental results in Figure 11, although the loss curves for batchsizes 2048, 3000, and 4096 all converged by epoch 50, the curves for batchsizes 2048 and 3000 dropped sharply, so the loss function could not be guaranteed to reach its minimum value. For the curve with a batchsize of 4096, the loss decreases quickly at the beginning and then more and more slowly as the iterations increase, which ensures that the loss function can reach its minimum value. Therefore, a batchsize of 4096 and 50 epochs were chosen. (2) Learning rate. The learning rate determines how quickly the network model built in this study accumulates information over time, and choosing an appropriate learning rate determines whether the model can converge. If the learning rate is set too high, the model may not converge; if it is set too low, the model can converge, but it will take a long time to reach the global minimum. In the field of deep learning, relatively simple first-order methods are commonly used, such as the gradient descent algorithm.
The basic formula of gradient descent is shown as follows:

θ ← θ − α · ∂J(θ)/∂θ,

where α is the learning rate and J(θ) is the loss function. According to this formula, if the learning rate is low, the loss of the network declines very slowly. On the contrary, if the learning rate is set high, the parameter updates of the network model become very large, which may cause the network to converge to a local optimum, or the loss function to suddenly increase.
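A toy example of this update rule on f(θ) = θ², whose gradient is 2θ, shows the three regimes discussed above: a too-small step makes slow progress, a moderate step converges, and a too-large step diverges.

```python
# Gradient descent on f(theta) = theta^2 (gradient 2*theta), illustrating the
# update theta <- theta - alpha * grad for different learning rates alpha.
def descend(alpha, steps=50, theta=1.0):
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
    return theta

small, good, large = descend(0.001), descend(0.1), descend(1.1)
print(small, good, large)   # slow progress / near zero / diverging
```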
The learning rate is typically varied during network training. In the initial stage of training, the model parameters are essentially random, so, following common practice, a relatively high learning rate is selected to ensure that the loss decreases quickly; after the model has been trained for a period of time, the parameters no longer require large updates, and a smaller learning rate is more suitable for fine adjustments.
However, there is no fixed rule for choosing the initial learning rate. The traditional approach is the "trial value method", which is relatively inefficient; for a complex network model, it consumes a great deal of computation time. Therefore, we used the dragonfly optimization algorithm to search for a suitable initial learning rate. The number of dragonflies selected in this experiment was 20 and the number of iterations was 30. The batchsize of the model was 4096, as determined in the previous experiment, with 50 epochs.
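A greatly simplified population search over the learning rate is sketched below. It is only a stand-in for the full dragonfly optimization algorithm (which also models separation, alignment, cohesion, and enemy terms), and the quadratic `score()` is a toy surrogate for the real objective, the validation R² of a trained model, assumed here to peak near 0.046.

```python
import numpy as np

def score(lr):
    """Toy surrogate objective; the real one trains the model and returns R^2."""
    return -((lr - 0.046) ** 2)

rng = np.random.default_rng(0)
pop = rng.uniform(1e-4, 0.5, size=20)      # 20 "dragonflies"
for _ in range(30):                        # 30 iterations, as in the experiment
    best = pop[np.argmax([score(p) for p in pop])]
    # each candidate drifts toward the current best plus a small random step
    pop = np.clip(pop + 0.5 * (best - pop) + rng.normal(0, 0.01, pop.size),
                  1e-4, 0.5)

best_lr = pop[np.argmax([score(p) for p in pop])]
print(round(best_lr, 3))
```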
The experimental results of the learning-rate search are shown in Table 2. The model reference index of this experiment is the coefficient of determination R²: the closer R² is to 1, the better the prediction accuracy and fitting effect of the model. Table 2 shows that the coefficient of determination of the model is highest when the learning rate is 0.0459. To guarantee the prediction accuracy while also speeding up the final convergence of the model's loss function, 0.045913361 was selected as the initial learning rate of the network model.

Data Prediction Experiment of Transformer
In the initial stage of training the network model, the original data set must be divided into a training set and a test set. The test set is independent of the training set: it does not participate in the training process, and its purpose is to detect over-fitting of the prediction model (over-fitting means that the model fits the training set well but cannot fit unseen data well). Therefore, to improve the prediction effect of the model, 80% of the original data set was taken as training samples and the remaining 20% as the test set in the transformer operation data prediction problem studied here. A part of the training set was further set aside as the verification set, which does not participate in model training and is used only to measure the training effect of the model objectively. Cross-validation divides the training data into k groups (k-fold): each group takes a turn as the verification set while the remaining k−1 groups serve as the training set, and the k mean squared errors are averaged to obtain the cross-validation error. Here, to reflect the difference in prediction indexes before and after network model optimization, the Seq2Seq network model and the Seq2Seq network model optimized by AM were selected for comparison experiments, and k-fold cross-validation was adopted (k = 1 in this experiment). The batchsize and epochs values were 4096 and 50, respectively.
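The k-fold procedure can be sketched as follows, with a toy mean-predictor standing in for the Seq2Seq model and k = 5 for illustration (a single fold leaves no training data in this formulation):

```python
import numpy as np

def kfold_mse(data, targets, k, fit_predict):
    """Each fold takes a turn as the verification set; the remaining k-1 folds
    form the training set, and the per-fold MSEs are averaged."""
    folds = np.array_split(np.arange(len(data)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(data[trn], targets[trn], data[val])
        errs.append(np.mean((pred - targets[val]) ** 2))
    return np.mean(errs)               # cross-validation error

def mean_model(X_train, y_train, X_val):
    """Toy stand-in for the Seq2Seq model: always predicts the training mean."""
    return np.full(len(X_val), y_train.mean())

X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)
print(kfold_mse(X, y, 5, mean_model))
```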
As the experimental results in Table 3 show, for the MAPE index, the Seq2Seq network model optimized by AM reduces the error by 0.59% compared with the plain Seq2Seq network model. For the MSE index, a smaller value indicates a more accurate prediction model; the AM-optimized Seq2Seq model is 791.53 lower than the Seq2Seq model. For the MAE index, the AM-optimized model is 0.71 lower than the Seq2Seq model, and for the RMSE index, it is 3.83 lower. For the R² index, the closer the value is to 1, the better the prediction performance and the fitting of the model; the AM-optimized Seq2Seq model improves on the Seq2Seq model by 0.004. The analysis of these indicators shows that the Seq2Seq network model optimized by AM is superior to the plain Seq2Seq network model in prediction performance. Next, the AM-optimized Seq2Seq model was used to analyze the fitting errors of the 15 label data in detail. Figure 12 shows the predicted results of the 15 tag data. In Figure 12, the red circles represent the 15 actual ratio values and the black asterisks represent the predicted values. Among the 15 points, most of the predicted values are consistent with the actual values; only two points, #10 ("5449 line A phase") and #11 ("5449 line B phase"), show a certain deviation, with mean absolute errors of 0.0057% and 0.0014%, respectively.
The complete error of each point is shown in Table 4. Experimental results show that the prediction error is small.

Conclusions
In this paper, an attention mechanism-optimized Seq2Seq network is proposed. By introducing the attention mechanism for network optimization, we address the limitations of long-term dependency and the low efficiency of the usage of memory during computation. In the Seq2Seq model construction process, the proposed method effectively achieves long-term dependence without retaining much redundant context information. Through comparative experiments based on the transformer monitoring data set in an electric field, we demonstrate that the proposed method not only greatly improves the training efficiency of the model but also shows good performance in prediction accuracy. Therefore, the proposed method is more versatile and practical in solving electronic transformer error prediction problems.