TransFNN: A Novel Overtemperature Prediction Method for HVDC Converter Valves Based on an Improved Transformer and the F-NN Algorithm

Appropriate cooling of the converter valve in a high-voltage direct current (HVDC) transmission system is highly significant for the safety, stability, and economical operation of a power grid. The proper adjustment of cooling measures is based on the accurate perception of the valve’s future overtemperature state, which is characterized by the valve’s cooling water temperature. However, very few previous studies have focused on this need, and the existing Transformer model, which excels in time-series predictions, cannot be directly applied to forecast the valve overtemperature state. In this study, we modified the Transformer and present a hybrid Transformer–FCM–NN (TransFNN) model to predict the future overtemperature state of the converter valve. The TransFNN model decouples the forecast process into two stages: (i) The modified Transformer is used to obtain the future values of the independent parameters; (ii) the relation between the valve cooling water temperature and the six independent operating parameters is fit, and the output of the Transformer is used to calculate the future values of the cooling water temperature. The results of the quantitative experiments showed that the proposed TransFNN model outperformed other models with which it was compared; with TransFNN being applied to predict the overtemperature state of the converter valves, the forecast accuracy was 91.81%, which was improved by 6.85% compared with that of the original Transformer model. Our work provides a novel approach to predicting the valve overtemperature state and acts as a data-driven tool for operation and maintenance personnel to use to adjust valve cooling measures punctually, effectively, and economically.


Introduction
With the increase in global power demand, the security and stability of power systems are more dependent on high-voltage direct current (HVDC) transmission systems [1], in which the converter valve is the core component. If the operating temperature of the converter valve is too high, its internal thyristors and other components will be damaged, thus affecting the normal operation of the converter station and even the regional power grid [2]. The converter valve is equipped with a cooling system. Among its many operating parameters, the cooling water temperature is an important criterion for determining whether the converter valve is in an overtemperature state [3], and different types of cooling systems have diverse overtemperature thresholds. If the temperature of the valve cooling water exceeds the corresponding threshold, the converter valve will operate in the overtemperature state and will be irreversibly damaged if stronger cooling measures are not taken [4].
At present, the operation and maintenance personnel of converter stations mainly judge a valve's future overtemperature state through manual experience and adopt the following emergency cooling measures [5,6]: (i) increasing the cooling power of the water circulation system; (ii) putting the standby auxiliary cooling equipment into operation; (iii) increasing the water level of the cooling water tank; (iv) distributing the load in the area to other converter stations. However, due to the lack of quantitative perception of overtemperature, operation and maintenance personnel are unable to take timely and appropriate cooling measures, which leads to two negative consequences: On one hand, for the sake of safety, they tend to adopt cooling measures that are stronger than actually needed, but this wastes energy and water resources and imposes unnecessary pressure on the operation of the regional power grid [7]; on the other hand, manual experience is not reliable enough to address the occasional overtemperature of a converter valve, so the core equipment cannot be protected to the maximum extent. Therefore, research on methods for predicting the overtemperature state of an HVDC converter valve can improve operation and maintenance personnel's perception of future overtemperature states, which is of great significance for resource conservation and the stable operation of power grids.
Currently, the research on the overtemperature state of HVDC converter valves is insufficient. Berg et al. [8] and Zhou et al. [9] simulated thermodynamic models of two specific types of converter valves by using 3D simulation and numerical calculation methods so as to obtain a novel valve layout for better heat dissipation efficiency, but converter valves of other types need to be re-simulated before analysis, which is not universally applicable. Liu et al. [10] studied the structure of an internal reactor and other components of a converter valve and proposed a method for reducing the heat generated during the operation of a converter valve by reducing the valve's power loss, but this method is still incapable of improving the perception of operation and maintenance personnel of the overtemperature of converter valves. Some researchers proposed a new series structure for cooling systems [11], which can reduce power loss and make the temperature of a converter valve more easily perceived and controlled, but it has no guiding significance for a large number of converter stations that are already equipped with the original models of cooling systems. Ma et al. [12] studied the relation between the cooling water temperature of a valve and the DC load, but the approximated relation was vague, and the study lacked further research on the relation between the cooling water temperature and other operating parameters of the converter valve. To sum up, the studies mentioned above lacked adequate predictive research and complete operating parameter analyses for existing converter valves; therefore, they cannot enable operation and maintenance personnel to punctually and appropriately adjust valve cooling measures by analyzing future overtemperature states of converter valves. Therefore, it is necessary to study methods of time-series prediction for the cooling water temperature of converter valves.
In the field of time-series prediction, many AI methods have relatively excellent performance. They can avoid the difficulty of mastering physical mechanisms and excel in building implicit mappings between input and output values. Many algorithms have been developed for time-series prediction, e.g., the back-propagation neural network (BPNN) [13], random forest (RF) [14], support vector regression (SVR) [15], recurrent neural network (RNN) [16], long short-term memory (LSTM) [17], and gated recurrent unit (GRU) [18]. Generally speaking, more recent algorithms have certain improvements in prediction accuracy and training complexity compared with earlier ones. The attention mechanism [19] is the basis of the novel Transformer encoder-decoder architecture [20], which was originally applied in the field of natural language processing (NLP). This Transformer model has powerful sequence-to-sequence prediction abilities; it can establish the global dependence between the input and output, and it has an excellent prediction effect on time series [21].
To the best of our knowledge, the Transformer model has different performance when processing time series in different scenarios [21]. Under the application background of this study, the valve cooling water temperature is not a typical time series, as it is influenced by multi-source parameters under different operating conditions. Therefore, we need to modify the Transformer model in order to equip it with better competence in valve overtemperature prediction. In this context, we propose a novel TransFNN model that coordinates an improved Transformer model and the F-NN framework (composed of the FCM algorithm and artificial neural networks). On one hand, the Transformer model can forecast future values of the independent parameters of a converter valve. On the other hand, F-NN can accurately fit the relations between several independent parameters and one dependent parameter of the converter valve, and it can calculate the future values of the valve's cooling water temperature, i.e., the dependent parameter, by using future values of the independent parameters obtained by the Transformer. The major contributions and innovations of this study can be stated from three perspectives: (i) The proposed TransFNN model is capable of more accurately forecasting the overtemperature state of a converter valve, and it can act as a data-driven tool for operation and maintenance personnel to use to effectively and economically adjust cooling measures. (ii) We improved the positional encoding method in the Transformer to make it suitable for predictions of time series with strong periodicity in different time dimensions. (iii) Our research provides an effective framework for time-series predictions with multisource data, and it has satisfying performance in scenarios in which the investigated object has a variety of complex operating conditions.
The remainder of this paper is organized as follows. Section 2 presents the mathematical model for the prediction of a valve's cooling water temperature. In Section 3, the Transformer-based model is modified to address the problem of independent operating parameter prediction. Section 4 is devoted to introducing the method for the calculation of a valve's cooling water temperature based on the F-NN algorithm. Section 5 involves quantitative experiments, as well as an evaluation of the result and an analysis. Section 6 summarizes the crucial conclusions and declares the research prospects.

Mathematical Model for Valve Cooling Water Temperature Predictions
The prediction of a valve's cooling water temperature is essentially a combination of time-series prediction and function approximation, and the main stages of model construction are as follows: Firstly, the influencing factors of a valve's cooling water temperature are identified as the independent parameters; secondly, a time-series prediction model is established to forecast future values of the independent variables; thirdly, the relation between the valve's cooling water temperature and the independent variables is approximated by using the historical operating data of the converter valve; finally, the two models established in the second and third stages are combined to form the complete model. The model is trained with historical operating data so that it can be used to predict the future values of the valve's cooling water temperature. A flowchart of the overall scheme is shown in Figure 1.

A valve cooling system can be divided into its internal and external parts. The internal cooling water system is a closed cycle and has the functions of heat absorption, water circulation, water treatment, etc. The cooling water in the internal system absorbs the heat generated during the operation of the converter valve and is sent to the external cooling water system. The external cooling water system is an open cycle with the functions of water softening, water purification, water make-up, etc. A cooling tower sprays the internal cooling water pipe for heat dissipation, and the heat of the internal cooling water system is discharged into the atmospheric environment through a fan. The internal and external cooling water systems work together to perform continuous cooling for the converter valve.
Generally, when the operation and maintenance personnel of a converter station patrol the cooling system of a converter valve, eight operating parameters are recorded, and they can be divided into three categories: (i) Electrical data: DC load; (ii) Environmental data: external ambient temperature, valve hall temperature; (iii) Water cooling system data: inlet water temperature of the valve's cooling system, outlet water temperature of the valve's cooling system, water level of the expansion tank, water level of the spray tank, and conductivity of the main water circuit.
Both the inlet and outlet water temperatures can be regarded as parameters of the valve's cooling water temperature, which indicates whether the converter valve is overheated. The inlet water temperature and the cooling measures are the two main factors that determine the outlet water temperature. Since the cooling measures of a converter valve are not the main focus of this study and the inlet and outlet water temperatures show a high correlation (the correlation analysis is presented in Section 5), we ignore the parameter of outlet water temperature and use the inlet water temperature to represent the valve's cooling water temperature, i.e., the dependent parameter. The other six operating parameters are defined as independent variables, and they have a decisive impact on the dependent variable, as shown in Table 1. For parameter predictions, we can take the DC load (P) as an example. We denote the historical sequence of P at time t as x_t, and our target is to predict the future P time series y_t = (y_{t+1}, y_{t+2}, ..., y_{t+q}) at time t, where q represents the maximum number of prediction steps and y_{t+1} is the value of the DC load at time (t + 1).
The prediction model for the time series is expressed by Equation (1):

y_t = f(θ, x_t), (1)

where θ represents the model's parameter vector, and f(·) is the implicit function of the time-series prediction model to be introduced in Section 3; x_t represents the input historical sequence, as shown in Equation (2):

x_t = (x_{t−n+1}, x_{t−n+2}, ..., x_t), (2)

where n and x_t represent the length of the historical sequence and the value of the DC load at time t, respectively. The prediction model is trained on the historical dataset. In a given dataset consisting of N input-output pairs {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, the optimal parameter vector θ̂ of the trained model is obtained by minimizing the loss function as follows:

θ̂ = argmin_θ Σ_{i=1}^{N} ℓ(y_i, f(θ, x_i)), (3)

where ℓ(·, ·) is the loss function. In addition to the DC load, the above model can also be used to predict the future values of the external ambient temperature and the valve hall temperature. However, some independent parameters, i.e., the water level of the expansion tank, the water level of the spray tank, and the conductivity of the main water circuit, are directly controlled by the operation and maintenance personnel [5], so they do not have typical temporal characteristics. Therefore, when predicting the future values of these three parameters, the values preset by the operation and maintenance personnel are adopted directly, and the other three parameters, which have typical time-series characteristics, can be predicted by using the improved Transformer model, which will be introduced in Section 3.
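As a concrete illustration of how the N input-output pairs used to train the model can be built from one long operating-data series, the sliding-window sketch below may help; the function name, window sizes, and toy series are ours, not from the paper.

```python
import numpy as np

def make_windows(series, n, q):
    """Split one long series into (history, future) pairs:
    each x_i has length n and each y_i has length q."""
    pairs = []
    for t in range(n, len(series) - q + 1):
        x = series[t - n:t]          # historical sequence x_t
        y = series[t:t + q]          # future sequence y_t
        pairs.append((x, y))
    return pairs

# toy hourly DC-load series (hypothetical values)
load = np.arange(10.0)               # 10 samples
pairs = make_windows(load, n=4, q=2)
print(len(pairs))                    # number of input-output pairs N
print(pairs[0][0], pairs[0][1])
```

Each pair (x_i, y_i) then feeds the training objective: the model's parameters are adjusted so that f(θ, x_i) is close to y_i on average over all N pairs.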
For this multi-input and single-output prediction problem, we also need a model to approximate the relation between the several inputs and the one output. The value of the dependent variable at time t is denoted as o_t, and the vector of independent variables is denoted as i_t, which includes g different types of features. Then, the implicit function between the independent and dependent variables can be represented by F(·), as shown in Equation (5):

o_t = F(i_t), i_t ∈ R^g. (5)

All independent and dependent parameters are substituted into Equation (5):

o_t = F(P_t, T_{S,t}, T_{P,t}, L_{P,t}, L_{S,t}, σ_t), (6)

where g = 6, indicating that there are 6 independent variables; F(·) is the implicit function representing the relation between the 6 independent variables and the dependent variable, which is to be introduced in Section 4. It is worth mentioning that a simple time-series prediction of a valve's cooling water temperature is of little reference value. It can be seen in Equation (6) that the valve's cooling water temperature is determined by the six independent variables, but only P, T_S, and T_P are typical time-series parameters, while L_P, L_S, and σ have few temporal characteristics. Therefore, if the valve's cooling water temperature is directly predicted as a time series, it is implicitly assumed that all the corresponding independent variables have typical time-series characteristics. This will lead to relatively large errors in the prediction results because this forecasting method runs contrary to the temporal characteristics of the operating parameters; a detailed verification of this will be carried out in Section 5.
Hence, we adopt the following prediction strategies: P, T_S, and T_P, which have obvious time-series characteristics, will participate in the time-series prediction, while L_P, L_S, and σ, which are directly controlled by the operation and maintenance personnel, will be preset before the forecasting process; when the values of the independent variables at time t are given, the predicted value of the corresponding valve cooling water temperature can be calculated by using Equation (6). The process of predicting a valve's cooling water temperature introduced in this section is demonstrated in Figure 2.
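The decoupled two-stage strategy can be sketched as follows. Both stand-in models below are purely hypothetical: the first mimics the time-series forecaster for P, T_S, and T_P, and the second uses an illustrative linear form in place of the fitted relation F(·), which in this work is approximated by the F-NN algorithm of Section 4.

```python
def predict_independent(history):
    """Stage 1 stand-in: forecast the time-series parameters P, T_S, T_P."""
    return {"P": 1010.0, "T_S": 26.0, "T_P": 31.0}   # toy future values

def F(P, T_S, T_P, L_P, L_S, sigma):
    """Stage 2 stand-in: an illustrative linear form of Equation (6);
    the real relation is fit from historical operating data."""
    return 0.01 * P + 0.3 * T_S + 0.2 * T_P - 0.05 * (L_P + L_S) + sigma

# L_P, L_S, and sigma are preset by the operation and maintenance personnel
preset = {"L_P": 60.0, "L_S": 55.0, "sigma": 0.2}
future = predict_independent(history=None)
t_water = F(future["P"], future["T_S"], future["T_P"], **preset)
print(round(t_water, 2))
```

The key design point is that only the genuinely temporal parameters are forecast; the operator-controlled parameters enter as preset values, so the final temperature prediction never treats them as time series.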


Time-Series Prediction Model Based on the Improved Transformer
The encoder-decoder architecture is based on the attention mechanism, and it was originally used in the field of NLP. The Transformer is a more novel model based on the encoder-decoder architecture, and it has great application prospects in the domain of time-series prediction. In this section, the attention mechanism, the modified Transformer model, and the process of time-series prediction by the Transformer model will be introduced.

Fundamentals of the Attention Mechanism
The attention mechanism fundamentally maps a query to a series of key-value pairs, which are regarded as the constituent elements within a source. There are three steps in computing attention [20]: Firstly, the similarity of the Query and Key is calculated by using a perceptron, dot product, or concat operation; secondly, the similarity calculated above is normalized by a softmax function; finally, the sum of the weights multiplied by the corresponding Value is calculated so that the attention value can be obtained. The three steps can be summarized as follows:

Attention(Query, Source) = Σ_{i=1}^{L_x} softmax(Sim(Query, Key_i)) · Value_i, (7)

where L_x represents the length of the source. The scaled dot-product attention model [20], as shown in Figure 3a, provides a more practical attention calculation method. Its input contains a d_k-dimensional Query, a d_k-dimensional Key, and a d_v-dimensional Value, and its output is

Attention(Q, K, V) = softmax(QK^T / √d_k) V, (8)

where √d_k is used to scale the inner product; Q ∈ R^{n×d_k}, K ∈ R^{m×d_k}, and V ∈ R^{m×d_v} represent the Query, Key, and Value, respectively; QK^T is used to evaluate the similarity between the Query and Key. In the attention mechanism, K is generally identical to V, and the normalized similarity can be obtained after QK^T is processed with a softmax function. In the scaled dot-product attention mechanism, Q, K, and V are calculated as follows: Q = W_q X, K = W_k X, and V = W_v X, where X is the input matrix of the attention calculation, and W_q, W_k, and W_v are the weight matrices, which are randomly initialized and updated during the training of the model, so the values of the output of Equation (8) can better reflect the importance of each element within the sequence.
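The scaled dot-product calculation of Equation (8) can be sketched in a few lines of numpy; the helper names, shapes, and random inputs below are ours, chosen only to make the matrix dimensions concrete.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Equation (8): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between Query and Key
    weights = softmax(scores, axis=-1)   # normalized similarity
    return weights @ V                   # weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # n = 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # m = 6 keys
V = rng.normal(size=(6, 3))   # m = 6 values, d_v = 3
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # one d_v-dimensional output per query
```

Each row of the output is a convex combination of the rows of V, with weights determined by how similar the corresponding query is to each key.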
The multi-head attention model [20] is an improvement of the scaled dot-product attention model, as shown in Figure 3b. In this model, Q, K, and V are mapped through the projection matrices, after which they are used to perform the attention calculation. Then, there are r repetitive processes, whose results are finally concatenated, as shown in Equation (9):

head_p = Attention(Q W_p^Q, K W_p^K, V W_p^V), p = 1, 2, ..., r, (9)

where W_p^Q ∈ R^{d_model×d_k}, W_p^K ∈ R^{d_model×d_k}, and W_p^V ∈ R^{d_model×d_v} are projection matrices that project Q, K, and V into r different dimensions, so different attention models can be trained, leading to an improvement in the accuracy of the attention calculation. Then, the value of attention of the r heads is concatenated into one, as shown in Equation (10):

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_r) W^O, (10)

where W^O ∈ R^{rd_v×d_model}.

The Transformer Model
The Transformer model is a novel architecture based on the attention mechanism, and it excels in processing sequence-to-sequence problems. It is composed of an encoder and a decoder, and the architecture is shown in Figure 4.

The encoder consists of an input layer, a positional encoding layer, and several stacked encoder layers that share the same structure and parameters. Each encoder layer contains a multi-head attention layer and a feed-forward layer, both of which are followed by a normalization layer. The input sequence is mapped to be d_model-dimensional and coded to get the order information in the positional encoding (PE) layer. The PE method of our proposed model is different from the absolute positional encoding method adopted in the original Transformer model [20]. The DC load, external ambient temperature, and valve hall temperature are the three independent parameters to be predicted, and they all have obvious periodicity in the fluctuations in their values over 24 h per day, 30 days per month, and 12 months per year. Therefore, it is necessary to separately perform positional encoding in the three dimensions, and an example of hourly positional encoding is shown in Equations (11) and (12):

PE^1_sin(·) = sin(2π · hour / max(hour)), (11)

PE^1_cos(·) = cos(2π · hour / max(hour)), (12)
where PE^1_sin(·) and PE^1_cos(·) represent the sine and cosine forms of PE in the first dimension (hourly) of a certain element from the input sequence, respectively; hour represents the hour of the day corresponding to the time when this element was sampled, and max(hour) = 24. When positional encoding needs to be performed daily or monthly, max(day) and max(month) are 30 and 12, respectively. Thus, the input sequence is mapped into 7 dimensions, with the original value in the first dimension, and PE^1_sin(·), PE^1_cos(·), PE^2_sin(·), PE^2_cos(·), PE^3_sin(·), and PE^3_cos(·) as the elements of the other six dimensions. By improving the PE, the periodicity of the independent parameters of the converter valve in different time dimensions can be better learned by the Transformer model. The output sequence of the encoder contains abundant information, and it is fed into the decoder for further processing.
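A minimal sketch of the 7-dimensional periodic mapping follows, under the assumption that each cycle is encoded as sin/cos of 2π · position / period; the text states only the periods (max(hour) = 24, max(day) = 30, max(month) = 12), so the exact functional form and the function name here are ours.

```python
import math

def periodic_pe(value, hour, day, month):
    """Map one sample to 7 dimensions: the original value plus
    sin/cos encodings for the hourly, daily, and monthly cycles."""
    feats = [value]
    for pos, period in ((hour, 24), (day, 30), (month, 12)):
        feats.append(math.sin(2 * math.pi * pos / period))
        feats.append(math.cos(2 * math.pi * pos / period))
    return feats

# toy DC-load sample taken at 06:00 on the 15th day of June
x = periodic_pe(value=1000.0, hour=6, day=15, month=6)
print(len(x))         # 7 dimensions
```

Because the sin/cos pair is continuous across cycle boundaries (hour 23 is close to hour 0 in this encoding), the model can learn daily, monthly, and yearly periodicity without the discontinuities of a raw index.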
In order to use the encoder's output for the attention mechanism, the decoder layer inserts an encoder-decoder multi-head attention layer between the multi-head attention layer and the feed-forward layer. After each sublayer in the decoder, there is also a normalization layer, and the linear mapping layer processes the output of the last decoder layer to obtain the decoder's output.
In Figure 4, the input of the encoder is a sequence x_t = (x_{t−n+1}, x_{t−n+2}, ..., x_t), i.e., historical DC load data, where t and n represent the current time step and the length of the input sequence, respectively. The decoder's input is y_{t−1} = (y_t, y_{t+1}, ..., y_{t+q−1}), i.e., future DC load data, where q is the maximum prediction step. The decoder's output is y_t = (y_{t+1}, y_{t+2}, ..., y_{t+q}), which is offset from the decoder's input sequence by one step. The parameters of the Transformer model are obtained through training, e.g., the weights and biases of the feed-forward layers and the values of W_q, W_k, and W_v in the attention layers.
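The one-step offset between the decoder's input and output sequences can be illustrated with toy values (the numbers below are purely illustrative):

```python
# Sketch of the one-step offset between decoder input and output
# sequences, as used when training with known future values.
future = [11.0, 12.0, 13.0, 14.0, 15.0]   # y_{t+1}, ..., y_{t+q}, q = 5
current = 10.0                             # y_t, last known value

decoder_input = [current] + future[:-1]    # (y_t, ..., y_{t+q-1})
decoder_output = future                    # (y_{t+1}, ..., y_{t+q})

for inp, out in zip(decoder_input, decoder_output):
    assert out == inp + 1.0                # toy series: output leads input by one step
print(decoder_input)
```

At each position the decoder thus sees values up to some step and is trained to emit the value one step later.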

Process of Parameter Prediction Based on the Transformer
The process of independent parameter predictions with the Transformer model is very similar to that of other time-series prediction models, such as LSTM [22]. The overall prediction process is shown in Figure 5, and all variables to be predicted, including the DC load, ambient temperature, and valve hall temperature, need to go through the same process, but by using three trained Transformer models with different parameter vectors θ̂.

Taking the prediction of the DC load as an example, the steps are as follows: (i) The historical DC load data are collected as the input, and they are mapped to [0, 1]. The normalization formula is shown in Equation (13):

x_N = (x − x_min) / (x_max − x_min), (13)

where x is the original sample data and x_N is its normalized form; x_max and x_min represent the maximum and minimum values of the sample dataset, respectively. (ii) The sample dataset is divided into a training set, a validation set, and a test set. (iii) Candidate Transformer models with different hyperparameters are obtained by using manual experience and grid search methods [23]. (iv) The candidate models are trained on the training set.
(v) The trained candidate models are evaluated on the validation set and the optimal Transformer model that produces the lowest error is selected.
(vi) The optimal Transformer model is tested on the test set. The output of the optimal Transformer model is denormalized to obtain the predictive values, and they are analyzed with some evaluation indicators [24].
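The normalization in step (i) and the denormalization in step (vi) can be sketched as follows; this is a minimal NumPy implementation of Equation (13) and its inverse.

```python
import numpy as np

def minmax_normalize(x):
    """Map sample data to [0, 1] as in Equation (13):
    x_N = (x - x_min) / (x_max - x_min)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def minmax_denormalize(x_norm, x_min, x_max):
    """Invert the normalization to recover predictions on the original scale."""
    return np.asarray(x_norm) * (x_max - x_min) + x_min
```

The stored x_min and x_max from the training data are reused to denormalize the model's outputs, so predicted values return to the original physical units.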

Calculation of a Valve's Cooling Water Temperature Based on F-NN
In Equation (6), the implicit function F(·) represents the relation between the valve's cooling water temperature and the six independent parameters. These seven variables involve complex thermodynamic and electromagnetic relations; therefore, it is difficult to obtain analytical expressions. However, deep learning methods can be used for function approximation so as to fit the relation between the valve's cooling water temperature and the six independent variables. A converter station operates in different seasons and time periods, and it may be in different operating conditions due to the influence of the external environment and the peaks and valleys of power consumption [1]. Therefore, we put forward the F-NN algorithm, which coordinates the fuzzy C-means clustering (FCM) algorithm and artificial neural networks (ANNs). Firstly, the FCM algorithm is used to cluster the data of the valve's operating parameters and to form several typical databases belonging to different typical operating conditions of the converter valve. Then, a neural network is applied to fit the relationship between the independent variables and dependent variables under different operating conditions obtained by FCM. By using F-NN, we can approximate the relation represented by the implicit function F(·) and then realize the lean calculation of the cooling water temperature of the converter valve.

Fundamentals of the FCM Algorithm
FCM is a partition-based clustering algorithm that ensures that the samples clustered in the same class have the highest similarity and that the samples between different classes have the lowest similarity [25]. The FCM algorithm introduces the concept of the membership degree, making it distinct from the K-means clustering algorithm. The membership degree matrix is used to judge the degree of belonging of each sample to each category. The probability of samples belonging to each category is between 0 and 1, so the FCM clustering algorithm is deemed to be soft clustering. The core idea of the FCM algorithm is to constantly update the membership matrix and clustering center matrix until the objective function reaches the minimum, and then the clustering center of the final iteration will be output.
The original operating database of the converter valve can be divided into an input matrix and an output sequence. We denote the historical data matrix of the independent parameters in the valve cooling system as the input matrix I, as shown in Equation (14):

I = [i_1, i_2, ⋯, i_6]

where i_k ∈ ℝ^{N×1} is the input sequence of the k-th type of independent parameter, and N is the length of the historical data sequence. The diagram of the original operating database of the converter valve is shown in Figure 6.
The objective function of the FCM algorithm is:

J(U, V) = Σ_{h=1}^{c} Σ_{j=1}^{N} (μ_{hj})^m (d_{hj})^2

where c represents the total number of clustering categories, which must be set in advance and has a great impact on the validity of the clustering; h = 1, 2, ⋯, c is the category index; U = [μ_{hj}]_{c×N} is the membership matrix, in which all elements in a column add up to 1 and every element μ_{hj} ∈ [0, 1] stands for the membership of the j-th sample in the h-th category; V = [v_{hk}]_{c×6} is the clustering center matrix, in which any element v_{hk} stands for the clustering center of the h-th category for the k-th variable feature; m ∈ (1, +∞) is the fuzzy index, and the clustering effect is best when m = 2 [26]; d_{hj} = ‖i_j − v_h‖ is the Euclidean distance between the input vector of the j-th sample and the h-th clustering center. Both U and V are randomly initialized. Lagrange multipliers are used to optimize the objective function under the constraint that each column of U sums to 1, and the necessary first-order conditions for optimality are expressed in Equations (17)-(19). According to Equations (17)-(19), the relationship between the membership of each element and the Euclidean distances can be obtained, as shown in Equation (20):

μ_{hj} = 1 / Σ_{s=1}^{c} (d_{hj} / d_{sj})^{2/(m−1)}
Hence, the clustering center can be derived from Equations (19) and (20), as shown in Equation (21):

v_h = Σ_{j=1}^{N} (μ_{hj})^m i_j / Σ_{j=1}^{N} (μ_{hj})^m

Thus, by Equations (20) and (21), the clustering center matrix and its corresponding membership matrix can be obtained. Iterations are performed to update U and V with Equations (20) and (21) until the change in the value of the objective function is smaller than the preset iteration accuracy ε (ε > 0), as shown in Equation (22), where U^(l) and V^(l) represent the membership matrix and the clustering center matrix in the l-th iteration, respectively. FCM is an unsupervised algorithm, and it is necessary to preset a specific clustering category number c so that the sample data will be clustered into c categories. Based on the fundamental principle of the clustering algorithm, the smaller the overall distance within a cluster and the larger the distances between clusters, the more valid the clustering result. The clustering category number c has a decisive influence on the validity of the clustering results; therefore, a function of c is required to characterize the clustering validity, which is named the clustering validity function [26], as shown in Equation (23). The greater the value of P(c, U, V), the more effective the clustering result. Before determining the clustering category number, we let c traverse the integers c ∈ {1, 2, ⋯, c_max}. The FCM calculation is then performed for each value of c so as to obtain the optimal c* corresponding to the maximum value of P(c, U, V), as shown in Equation (24):

c* = argmax_c P(c, U, V)

The pseudocode of the FCM algorithm is illustrated in Algorithm 1.

Algorithm 1: Pseudocode of the FCM algorithm
Set the fuzzy index m, iteration accuracy ε, and maximum category number c_max
for c = 1, 2, ⋯, c_max
    l = 0
    Initialize V^(0) ∈ ℝ^{c×g} and U^(0) ∈ ℝ^{c×N} randomly
    while True
        l = l + 1
        Update U^(l) with Equation (20)
        Update V^(l) with Equation (21)
        if Equation (22) is satisfied
            break while
        end if
    end while
    U = U^(l), V = V^(l)
    Calculate the clustering validity function P(c, U, V) with Equation (23)
end for
Choose the optimal c (c*) with Equation (24)
return c* and the corresponding V, U
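The inner loop of Algorithm 1 can be sketched in runnable form as follows. This is a minimal NumPy version for a fixed category number c; the validity function of Equation (23) and the outer search over c are omitted, and the stopping rule monitors the change of the objective function as in Equation (22).

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-means on the rows of X (N samples x g features).
    Returns the membership matrix U (c x N) and center matrix V (c x g)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0)                      # each column of U sums to 1
    J_prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)            # Eq. (21)
        # Euclidean distances d_hj between each sample and each center
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        # Membership update, Eq. (20): sum over the category index s
        U = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
        J = ((U ** m) * d ** 2).sum()       # objective function value
        if abs(J_prev - J) < eps:           # stopping rule, Eq. (22)
            break
        J_prev = J
    return U, V
```

On two well-separated blobs, the returned centers recover the blob locations and each column of U remains a probability vector.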

Construction of Typical Operating Databases
By using the optimal c* that was obtained, the N samples of historical operating data can be clustered into c categories corresponding to c typical databases, which contain N_1, N_2, ⋯, N_c samples, respectively, each sample consisting of the six independent variables and one dependent variable from Table 1. The h-th typical database belongs to the h-th typical operating condition, which is mathematically characterized by the clustering center vector v_h. The data vectors in a typical operating database have the highest similarity with each other and the greatest distinction from those in other typical operating databases. A diagram of the construction of a typical database is shown in Figure 7.

Relation Approximation on Different Operating Conditions Based on an ANN
For each database representing a typical operating condition, we need to fit the relation between the independent parameters and the dependent parameter so that when the input vector is given under a certain operating condition, we can accurately calculate its corresponding output value, i.e., the valve's cooling water temperature. In this context, of the many types of artificial neural networks, the back-propagation neural network (BPNN) algorithm is of high applicability.
Since each operating condition indicates a distinct relation between the dependent and independent parameters, c neural networks should be separately trained to approximate the relations. They share the same network structure and hyperparameters, e.g., the number of hidden layers and the number of neurons in each hidden layer, and they all have 6 inputs and 1 output, but the weight and bias parameters of each neural network are different. Once the hyperparameters and network parameters of the c neural networks are determined, we can use the FCM-based division of typical operating conditions and the ANN-based relation approximation of the different operating conditions to calculate the valve's cooling water temperature when a set of independent parameters is given.
When we need to calculate the value of the valve's cooling water temperature T_j = o_j corresponding to a new input vector i_j, we should first determine the type of typical operating condition to which i_j belongs by finding the optimal h (h*) whose clustering center v_h has the minimum Euclidean distance to i_j, as shown in Equation (25):

h* = argmin_h ‖i_j − v_h‖

Then, i_j is fed into the NN trained on the h*-th typical database, and the output of the NN is the valve's cooling water temperature to be predicted. So far, the implicit function F(·) in Equation (6) has been realized with F-NN, and we can use TransFNN, which coordinates the improved Transformer model and F-NN, to predict future values of the valve's cooling water temperature.
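The two-stage F-NN inference described above can be sketched as follows. Here, `nets` is a hypothetical list of per-condition regressors with a `predict` method, standing in for the c trained BPNNs; it is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

def fnn_predict(i_j, V, nets):
    """F-NN inference: pick the typical operating condition whose clustering
    center is nearest to the input vector (Eq. (25)), then evaluate the
    neural network trained on that condition's typical database."""
    d = np.linalg.norm(V - i_j, axis=1)   # distance to each of the c centers
    h_star = int(np.argmin(d))            # optimal condition index h*
    return nets[h_star].predict(i_j)      # cooling water temperature T_j
```

Routing each input to its own condition-specific network is what lets each small BPNN specialize on one typical operating condition.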

Experiments
In this section, we use the proposed TransFNN model to predict a valve's cooling water temperature. The section covers the dataset division, correlation analysis, model hyperparameter settings, quantitative experiments, and result analysis.

Dataset Division
This experiment used the 2012-2014 operating data from 12 converter valves of three HVDC converter stations in different regions of China. The historical operating data included the eight types of feature quantities introduced in Section 2, and the sampling interval of the data was 4 h. The 12 converter valves were denoted as No. 1-12. For each converter valve, the dataset produced by the operation and maintenance personnel contained 6576 records (from 1 January 2012 at 00:00 to 31 December 2014 at 20:00). In the case of failure or maintenance, a converter valve was shut down, so the data of each converter valve within the 3 years had several breakpoints in time, and the 3 years of data were thereby cut into several continuous data sequences.
There were 12 prediction tasks corresponding to 12 converter valves. Taking the time breakpoints of the dataset of each converter valve as the dividing points, each dataset was divided into a training set, a validation set, and a test set, with a ratio of approximately 3:1:1. All training sets were used for model training, the validation sets were devoted to selecting the candidate models with the optimal hyperparameters, and the test sets were used to evaluate the prediction results.
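The breakpoint-respecting division can be sketched as follows: continuous segments are allocated greedily toward the 3:1:1 ratio without cutting any segment. The exact assignment rule used in the experiment is not specified in the text, so this is only one plausible scheme.

```python
def split_segments(segments, ratios=(3, 1, 1)):
    """Assign continuous data segments (separated by shutdown breakpoints)
    to train/val/test so the sample counts approximate the given ratio,
    without splitting any segment."""
    total = sum(len(s) for s in segments)
    targets = [total * r / sum(ratios) for r in ratios]
    buckets, counts = ([], [], []), [0, 0, 0]
    for seg in segments:
        # place the segment where the deficit relative to the target is largest
        k = max(range(3), key=lambda i: targets[i] - counts[i])
        buckets[k].append(seg)
        counts[k] += len(seg)
    return buckets
```

Because whole segments are assigned, the resulting ratio is only approximately 3:1:1, which matches the "approximately" wording above.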

Correlation Analysis of Independent Variables
In Section 2, we introduced eight operating parameters that are generally recorded by the operation and maintenance personnel of converter stations, i.e., the DC load, external ambient temperature, valve hall temperature, water level of the expansion tank, water level of the spray tank, main water circuit conductivity, valve inlet water temperature, and valve outlet water temperature. However, there can be some correlations between one parameter and another. If there are parameters that are highly correlated, the complexity of the model will be pointlessly magnified, and the training speed of the model will, thus, be reduced. Therefore, we used the Pearson correlation coefficient [27] to perform a correlation analysis of the operating parameters, and the parameters with high correlation coefficients with others were eliminated.
The results of Pearson's correlation analysis are shown in Figure 8. It is apparent that the inlet water temperature and the outlet water temperature were highly correlated, with a correlation coefficient as high as 0.96. Therefore, the outlet water temperature of the converter valve was eliminated from the list of operating parameters that we studied, and the inlet water temperature was used to represent the valve's cooling water temperature. In addition, all independent parameters appeared competent to act as input variables of the proposed prediction method: on the one hand, the Pearson correlation coefficient between any two of the independent parameters was between −0.5 and 0.5, which meant that the six independent parameters mentioned in Table 1 were not highly correlated with each other; on the other hand, each of the independent parameters showed a strong link with the inlet water temperature, as their Pearson correlation coefficients were all greater than 0.5 or smaller than −0.5. Therefore, the six independent parameters were used to predict the valve's cooling water temperature.
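The screening step can be sketched as follows: a minimal NumPy routine that lists parameter pairs whose correlation magnitude exceeds a threshold (the 0.9 default here is illustrative; in the analysis above, the inlet/outlet pair reached 0.96 and triggered elimination).

```python
import numpy as np

def highly_correlated_pairs(data, names, threshold=0.9):
    """List parameter pairs whose |Pearson r| exceeds the threshold;
    one member of each returned pair is a candidate for elimination.
    `data` has shape (samples, parameters)."""
    r = np.corrcoef(data, rowvar=False)   # pairwise Pearson coefficients
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs
```

Dropping one member of each flagged pair keeps the model input compact without losing information, since the retained parameter carries nearly the same signal.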

Model Hyperparameter Settings
In the proposed TransFNN model, there are three types of hyperparameters to be set: (i) the clustering category number c of the FCM algorithm, (ii) the length of the historical data sequence n and the maximum prediction step q, (iii) and the number of hidden layers, the number of neurons in each hidden layer of the BPNN, and the structural and learning parameters of the improved Transformer model.
We first used the 12 training sets of the 12 converter valves to calculate the value of the clustering validity function according to Equation (23). We let the category number c traverse from 1 to 10, and the results for each converter valve are shown in Figure 9. For valves No. 6 and No. 12, the values of the clustering validity function corresponding to c = 4 were marginally greater than those for c = 3, and for the other converter valves, the validity of the clustering results was the highest when c = 3. Therefore, the optimal clustering number c was set to 3 for this case.
The length of the historical data sequence n is of great significance in the Transformer model. To determine n, we should take both the completeness of the temporal information and the training complexity into consideration. If n is too large, the training time will be exceedingly long; if it is too small, the input data sequence will not contain sufficient temporal information, which will cause a reduction in prediction accuracy. Therefore, the principle for determining n is that the training complexity should be reduced as much as possible on the premise of ensuring the sufficiency of the temporal information in the input historical data sequence. Trappenberg et al. [28] proposed a method for calculating the correlation of the mutual information between x(t) and x(t + δ), where x(t) and x(t + δ) represent a time series and its corresponding delay sequence, respectively, and δ is the length of the lag. In the case that we studied, the mutual information correlation (MIC) between x(t) and x(t + δ) showed a downward trend overall, indicating that the two sequences contained more distinct temporal information as δ increased. When δ was between 40 and 112, the value of the MIC was almost constant. For the Transformer, the value of n was determined by a preset threshold of the MIC. When the threshold of the MIC was set to approximately 8, the corresponding n was between 40 and 112; the setting of the MIC threshold was based on manual experience [23]. Given that any n ∈ [40, 112] causes the input historical data sequence to have the same degree of temporal information sufficiency, and considering the training complexity, n should be as small as possible, so we set n to 40. In addition, the larger q is, the longer in advance the overtemperature state of the converter valve can be monitored, but the predicted values have decreasing reference value as the time step increases.
In practical engineering applications, 2 days are adequate for the operation and maintenance personnel of a converter station to adjust the future cooling measures, and there were six samples within a day in this case, so we set q to 12.
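The lag-based reasoning behind the choice of n can be sketched as follows: a histogram estimate of the mutual information between x(t) and x(t + δ). The binning scheme and bin count here are our assumptions for illustration, not the exact estimator of [28]; one would scan δ and pick the smallest lag on the plateau where the value stops changing.

```python
import numpy as np

def lagged_mutual_information(x, delta, bins=16):
    """Histogram estimate of the mutual information between x(t) and
    x(t + delta), used to judge how much temporal information a lagged
    copy of the series still shares with the original."""
    a, b = x[:-delta], x[delta:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()                       # joint distribution
    px = pxy.sum(axis=1, keepdims=True)             # marginal of x(t)
    py = pxy.sum(axis=0, keepdims=True)             # marginal of x(t+delta)
    nz = pxy > 0                                    # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

For a strongly autocorrelated series, the estimate is large at short lags and decays toward zero as the lag grows, which is the downward trend described above.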
In the BPNN and the Transformer model, there are eight main hyperparameters to be determined. Using manual experience and grid search methods, the eight structural and learning parameters were set as follows: (i) the number of hidden layers in the BPNN was 2; (ii) the numbers of neurons in the two hidden layers were 10 and 5, respectively; (iii) the number of encoding/decoding layers was 3; (iv) the number of neurons in each hidden layer of the feed-forward layer was 32; (v) the learning rate was 0.01; (vi) the number of epochs was 32; (vii) the batch size was 16; (viii) the dropout rate was 0.5.

Quantitative Analysis of TransFNN: Evaluating the Impacts of Individual Modules
In this experiment, historical data sequences of 40 time steps were used to predict future sequences of 12 time steps. It should be noted that the DC load, external ambient temperature, and valve hall temperature have obvious time-series characteristics, so their future values were given by the output of the Transformer model; the water level of the expansion tank, the water level of the spray tank, and the main water circuit conductivity were directly controlled by the operation and maintenance personnel, so their future values in the experiment were obtained by adding Gaussian white noise with a mean of 0 to the actual values, where the Gaussian white noise was used to simulate the error of reading the indications of the equipment.
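The noise injection for the manually controlled parameters can be sketched as follows; the noise standard deviation is an assumption here, since the text only specifies a zero mean.

```python
import numpy as np

def simulate_reading(actual, noise_std, seed=None):
    """Simulate a manually read parameter (e.g., the spray tank water level)
    by adding zero-mean Gaussian white noise to the recorded values, as done
    for the directly controlled parameters in the experiment."""
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, dtype=float)
    return actual + rng.normal(0.0, noise_std, size=actual.shape)
```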
The quantitative experiments were performed with the following models: (i) TransFNN: the complete proposed model; (ii) TransFNN − FCM: the TransFNN model without the FCM module, so a single neural network handled the data of all operating conditions; (iii) TransFNN − ANN: the TransFNN model with the ANN module replaced by a simpler fitting model; (iv) TransFNN − PE: the TransFNN model without the improved positional encoding method; (v) Transformer only: equivalent to the TransFNN model without the FCM and ANN modules, in which the valve's cooling water temperature was directly predicted with the Transformer.
Each converter valve corresponded to a separate prediction task. We evaluated the five models on all test sets of every converter valve with statistical measures, i.e., the RMSE, MAE, sMAPE, and R². The results are shown in Table 2, which indicates that the rankings in the 12 prediction tasks were not identical, but the TransFNN model always performed the best among the five models. For the sake of clarity, converter valve No. 7 was chosen to display the illustrative results, as shown in Figure 10, and we also present a scatterplot of the measured and predicted values of the valves' cooling water temperatures in Figure 11.
It was indicated after comparing the deviation between the actual values and the predicted values that the proposed TransFNN was the best model for predicting the future values of valves' cooling water temperatures among the five models. The reasons for the reductions in prediction accuracy of the other four models were as follows: (i) TransFNN − FCM: A single neural network needed to process the data of all operating conditions because the valve operating conditions were not clustered into several typical operating conditions due to the absence of the FCM module. The learning and fitting abilities of the single NN were insufficient. (ii) TransFNN − ANN: The complex nonlinear relation between the six independent variable parameters and the valves' cooling water temperatures was too complicated for simpler models to approximate. (iii) TransFNN − PE: Without the improved positional encoding method, the model was incapable of effectively learning the periodicity of the predicted variables in the dimensions of hours, days, and months. (iv) Only Transformer: Not all impact factors of the valves' cooling water temperature had temporal characteristics because some factors were directly controlled by the valve operation and maintenance personnel. Therefore, the data sequence of valves' cooling water temperatures cannot be regarded as a fully typical time series; otherwise, there would be a relatively large error in the predicted values. This also confirmed the discussion of the temporal characteristics of valves' cooling water temperatures in Section 2.
A converter valve will be in an overtemperature state if the valve's cooling water temperature exceeds a threshold. In order to evaluate the accuracy of the five models in predicting the overtemperature state of a converter valve, we preset the threshold of the cooling water temperature to 38 °C. When the converter valve was actually in the overtemperature state but the predicted value of the valve's cooling water temperature did not exceed 38 °C, or when the converter valve was not actually in the overtemperature state but the predicted value indicated one, the predicted result was considered to be wrong. We counted the error rate of the overtemperature prediction of each model on the test sets of the 12 converter valves, as shown in Figure 12. We can see that removing any module from the proposed TransFNN model led to an increase in the error rate of overtemperature prediction, and the proposed TransFNN model improved the prediction accuracy by 6.58% compared with the original Transformer model.
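The error-rate computation described above can be sketched as follows: a prediction is wrong whenever it puts the valve on the wrong side of the 38 °C threshold.

```python
import numpy as np

def overtemperature_error_rate(actual, predicted, threshold=38.0):
    """Fraction of time steps where the predicted cooling water temperature
    puts the valve on the wrong side of the overtemperature threshold
    (missed overtemperature states plus false alarms)."""
    actual_over = np.asarray(actual) > threshold
    pred_over = np.asarray(predicted) > threshold
    return float(np.mean(actual_over != pred_over))
```

The complement of this error rate is the overtemperature prediction accuracy reported for each model.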

Comparative Analysis of TransFNN: Assessing the Performance against Other Time-Series Prediction Algorithms
The Transformer model has demonstrated excellent performance in time-series prediction, but there are other traditional machine learning algorithms and deep learning models that offer certain advantages in this task, such as faster computing speeds or higher prediction accuracy. Compared to traditional time-series prediction models, such as random forest (RF), RNN, LSTM, and GRU, the Transformer model has two main advantages. Firstly, it uses a unique self-attention mechanism that balances the importance of different time positions in the sequence, thereby better capturing long-term dependencies in the data. This is particularly useful in time-series forecasting, as past values strongly influence future values. Secondly, the Transformer model is fully parallelizable, enabling it to process all elements of an input sequence simultaneously, making it much faster than a model that processes data sequentially.

Based on these advantages, we used the Transformer model as the module for implementing time-series prediction in our proposed TransFNN model. To evaluate the performance of TransFNN, we compared it with counterpart hybrid models in which the time-series prediction module was instead implemented with RF, RNN, LSTM, and GRU, respectively.
We conducted the experiment with the same dataset and experimental setup as in the previous section. For the RMSE index, we calculated each model's average RMSE over the cooling water temperatures of all converter valves in the prediction test set. For the prediction-time index, we calculated the average time each model took to predict the valves' cooling water temperatures at a single time point. We then calculated each model's RMSE reduction relative to RF and its prediction-time reduction relative to the RNN, as shown in Table 3, and summed the two percentages to obtain a total optimization percentage for each model.

It should be noted that the weights given to RMSE reduction and prediction-time reduction when evaluating time-series prediction models depend on the specific application and its requirements. In some cases, such as real-time systems or applications in which speed is critical, the prediction time may matter more than the RMSE; in others, such as scientific research or financial forecasting, the RMSE may matter more than the prediction time.

Note: For the RMSE reduction, the benchmark is RF, and for the calculation-time reduction, the benchmark is the RNN. The total reduction is the sum of the RMSE reduction and the calculation-time reduction. Bold values denote the best performance among the models.

Table 3 summarizes the performance of the different time-series prediction models. As shown, when RMSE reduction and prediction-time reduction were weighted equally, the Transformer outperformed the other modules in our proposed framework for predicting the cooling water temperature of a converter valve. Because the real-time prediction of the overtemperature state was not required in this scenario, we also gave a greater weight to the RMSE reduction; this further lowered the overall ranking of the RF-based prediction model, while the Transformer remained the best-performing module in the hybrid model. The comprehensive performance of the TransFNN model proposed in this paper was therefore satisfactory.
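The weighted scoring described above can be sketched as follows. This is a minimal illustration of the calculation, not the paper's evaluation code, and the numeric inputs below are hypothetical placeholders rather than values from Table 3:

```python
def total_optimization(rmse_model, rmse_base, time_model, time_base,
                       w_rmse=0.5, w_time=0.5):
    """Weighted sum of the RMSE reduction (vs. the RF benchmark) and the
    prediction-time reduction (vs. the RNN benchmark), both as percentages.

    Equal weights correspond to the default ranking; raising w_rmse mirrors
    the choice to favour accuracy over speed when real-time prediction is
    not required.
    """
    rmse_reduction = 100.0 * (rmse_base - rmse_model) / rmse_base
    time_reduction = 100.0 * (time_base - time_model) / time_base
    return w_rmse * rmse_reduction + w_time * time_reduction

# Hypothetical numbers purely to show the arithmetic:
# 20% RMSE reduction and 50% time reduction, equally weighted.
score = total_optimization(rmse_model=0.8, rmse_base=1.0,
                           time_model=5.0, time_base=10.0)
print(round(score, 1))  # 35.0
```

Shifting the weights, e.g. `w_rmse=0.8, w_time=0.2`, penalizes a fast but less accurate module such as RF, which is how giving greater weight to the RMSE reduction changes the ranking.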

Conclusions
This study proposed a novel approach to predicting the overtemperature state of converter valves, which was achieved by implementing the TransFNN model. The model combined an F-NN module, which coordinated the FCM algorithm and an ANN, with an improved Transformer model. This approach overcame the issue of low prediction accuracy in the original Transformer model when processing the cooling water temperature of a converter valve. The TransFNN model was able to accurately predict the overtemperature state of converter valves, which is essential for the effective maintenance and operation of converter stations.
The experimental results showed that the proposed TransFNN model outperformed the four other models in terms of prediction accuracy. When TransFNN was applied to predict the overtemperature state of 12 converter valves, the accuracy was 91.81%, which was 6.85% higher than that of the original Transformer model. The improved accuracy of the TransFNN model provides valuable information to the operation and maintenance personnel of converter stations, enabling them to punctually and appropriately adjust power dispatching and cooling measures, thereby avoiding equipment damage and resource waste. Additionally, this study enhanced the positional encoding method in the Transformer, making it suitable for predicting time series with strong periodicity across different time dimensions. The study also provided a new framework for predicting time series from multi-source data, with data clustering used as a preprocessing step to increase the similarity of data from the same operating conditions; this yielded a deep learning model with stronger learning ability and an overall improvement in prediction accuracy.