An Interval Forecasting Model Based on Phase Space Reconstruction and Weighted Least Squares Support Vector Machine for Time Series of Dissolved Gas Content in Transformer Oil

: Transformer state forecasting and fault forecasting are important for the stable operation of power equipment and the normal operation of power systems. Forecasting of the dissolved gas content in oil is widely conducted for transformer faults, but its accuracy is a ﬀ ected by data scale and data characteristics. Based on phase space reconstruction (PSR) and weighted least squares support vector machine (WLSSVM), a forecasting model of time series of dissolved gas content in transformer oil is proposed in this paper. The phase spaces of time series of the dissolved gas content sequence are reconstructed by chaos theory, and the delay time and dimension are obtained by the C-C method. The WLSSVM model is used to forecast time series of dissolved gas content, the chemical reaction optimization (CRO) algorithm is used to optimize training parameters, the bootstrap method is used to build forecasting intervals. Finally, the accuracy and generalization ability of the forecasting model are veriﬁed by the analysis of actual case and the comparison of di ﬀ erent models.


Introduction
A power transformer is among the key equipment of a power system. During the long operation of the transformer, due to equipment aging, discharge fault, thermal fault, and other reasons, a small amount of gas will be produced in the insulation oil, and the content of various components of dissolved gas and the proportion of components in the oil are closely related to the operation condition of the transformer. Through dissolved gas analysis (DGA) [1][2][3][4][5], some latent faults in the transformer and their development degree can be found. DGA is an internationally recognized and effective method of diagnosing early transformer faults that has been proven in practice by many fault diagnoses. According to the changing trend of dissolved gas content in the oil, the operation state of the transformer can be tracked, and any abnormality in the equipment can be further determined. Then, the fault type can be inferred, and the early faults in the transformer can be found in time. If the transformer is forecasted to fail, the maintenance plan can be arranged in advance to ensure reliable and stable operation of the power system. prediction of dissolved gas content in oil in terms of model optimization and data processing, and achieved good results, but ignored the sample size problem and the lack of related factors in the actual engineering application. Support vector machine (SVM) relies on the principle of minimizing structural risk, which considers both empirical risk and the complexity of the learning machine, and therefore it is good for solving small samples and optimal problems, and has good generalization ability [41,42], so it is an effective approach for solving forecasting problems [43]. Least squares support vector machine (LSSVM) was introduced in [44] as a reformulation of the standard SVM [45,46] that simplifies the model to a great extent by applying linear least squares criteria to the loss function instead of the traditional quadratic programming method. The simplicity and inherited advantages of SVM, such as its being based on the principle of minimizing structural risk and its kernel mapping, promote the application of LSSVM in many pattern recognition and regression problems [47][48][49][50]. Zheng et al. [51] introduced LSSVM to dissolved gas content forecasting and made forecasts for the five gases. However, while improving the standard SVM model, LSSVM has lost its robustness.
In terms of data preprocessing, some scholars decompose the original sequence based on the nonlinear and nonstationary characteristics of the time series. Zeng et al. [52] used the empirical mode decomposition (EMD) method to process the DGA data and decompose nonstationary signals into characteristic frequency function components of different frequencies, and modeled each subsequence component separately, reconstructing its prediction by superposition. As a result, the combined prediction results, which meet the accuracy requirements, were obtained. The method is simple, convenient, and easy to operate, but it is prone to modal aliasing problems [53][54][55]. Lin et al. [56] used the kernel principal component analysis (KPCA) method to select the characteristic parameters and used generalized regression neural network (GRNN) to forecast the gas concentration in transformer oil, which improved the accuracy compared with the model without pretreatment. According to the above research, we know that data preprocessing and feature extraction are necessary for forecasting of time series of dissolved gas content, because the signal contains a large amount of information effectively. Therefore, if more linear and nonlinear features are mined through the historical time series of dissolved gas content, it will be more helpful for the forecasting. Chaos theory has been applied in power load forecasting, wind speed forecasting, equipment fault diagnosis, and other fields, but it is seldom used in forecasting of dissolved gas content in transformer oil. Liu et al. [57] analyzed the chaotic characteristics of power load through phase space reconstruction to extract the effective information on power load, and applied the reconstructed results to the forecasting model. Sun et al. [58] applied phase space reconstruction technology to inverter fault diagnosis, in which the phase space reconstruction method was used to obtain current characteristic trajectories of inverters in various operating states. Zhang et al. [59] applied chaos theory to recognition of partial discharge of Geographic Information System (GIS) equipment, extracted the chaotic features of partial discharge signals with four typical defects as the feature quantity, and then applied these chaotic features to pattern recognition and obtained good recognition performance. Qi et al. [60] applied chaos theory to the selection of data length and collection period of dissolved gas content in oil monitoring equipment, and gave suggestions on the collection period, which was of great help to the life of monitoring equipment and effective use of storage space.
Above all, the current research on forecasting methods of time series of dissolved gas content focuses on the selection of artificial intelligence models and optimization methods, which directly or indirectly improve the effectiveness of power transformer fault forecasting. However, the problem of small sample size and the inability to obtain associated parameters have not been well solved. At the same time, the traditional point forecasting method has difficulty reflecting the uncertainty of forecasting results, and these problems limit the application of artificial intelligence algorithms or models in dissolved gas content forecasting. Therefore, a model based on bootstrap and PSR-CRO-WLSSVM is proposed to solve these problems.

Phase Space Reconstruction
A chaotic system is a nonlinear system under the action of certain evolution law. It is universal, and its long-term behavior is reflected in spatial distribution with certain regularity [61,62]. The components of a chaotic system interact with each other and develop together, so they show the phenomenon of partial information containing other components. Theoretically, due to the correlation between components, the analyzing of the time series of any variable can restore the basic dynamic characteristics of the system.
In the 1980s, Packard and Takens [63] proposed the theory of phase space reconstruction. Its main purpose is to restore time series of chaotic attractors, and its principle is to choose the appropriate delay time and embedding dimension. Using chaos theory with the phase space reconstruction method, low-dimensional time series can be mapped to high-dimensional phase space and keep the differential homeomorphism while using less information, extracting more abundant structural characteristics.
Phase space reconstruction is an effective method to analyze nonlinear time series. The basic idea is to take a time series as a component of a nonlinear dynamical system. The variation law of this component can be used to reconstruct the equivalent high-dimensional phase space of the dynamic system, and the time series can be projected into the variable point trajectories of the high-dimensional phase space. If there is a time series of dissolved gas content in oil {x i }(i = 1, 2, . . . , N), where N is the number of data points in the time series, then the time series set reconstructed by phase space can be expressed by Equation (1): where, m is the embedding dimension of the time series, τ is the delay time, M is the number of vectors that the time series embedded in the phase space and N = M + (m − 1)τ. For a certain time series, in theory the existence of an optimal τ and m, can neither be too big nor too small, otherwise they will be unable to accurately capture the dynamic characteristics of the original signal.
At present, there are mainly the following methods to obtain the delay time: autocorrelation, average displacement, complex autocorrelation, mutual information, and C-C method. The C-C method is a method that can not only keep its nonlinear characteristics, but also calculate a small amount of delay time. Besides, it is easy to operate and has strong anti-noise ability. In this paper, the optimal delay time is obtained by the C-C method. The detailed description is as follows: Investigate a pair of phase points in phase space: Assuming the distance between them is r ij (m), it is obvious that r ij (m) is a function of the embedding dimension m of phase space: Energies 2020, 13, 1687 6 of 28 Given a critical distance r, the proportion of the logarithm of points whose distance is less than r in the logarithm of points is denoted as the associated integral: where r is the neighborhood radius, and H(·) is the Heaviside function and is described by Equation (6): The time series {X 1 , X 2 , . . . , X N } is divided into τ non-intersecting time series, whose length is INT(N/τ), and INT is an integer. The statistics of each subsequence can be calculated by Equation (7): where C l is the correlation integral of the l-th subsequence. For N → ∞ , Equation (7) can be deformed to Equation (8): The deviation is defined as Equation (9): ∆S(m, τ) measures the maximum deviation of ∆S(m, τ) from radius r. When the maximum deviation ∆S(m, τ) is the minimum value, the points in the reconstructed phase space are closest to uniform distribution, the reconstructed dynamical system orbit is fully displayed in the phase space, and the time series correlation is closest to zero. Therefore, the ∆S(m, τ) ∼ τ curve also reflects the autocorrelation of the original time series (parameter m is fixed). Since ∆S(m, τ) is always positive, the optimal delay time can be the time point corresponding to the first local minimum value of ∆S(m, τ).
According to the BDS statistical conclusion, when N ≥ 3000, 2 ≤ m ≤ 5, and σ/2 ≤ r ≤ 2σ, the S(m, r, τ) ∼ τ curve better reflects the autocorrelation of the original time series, where σ is the standard deviation of the time series. Take m = 2, 3, 4, 5 r i = i * 0.5σ, i = 1, 2, 3, 4, S(τ) and ∆S(τ) can be described by Equations (10) and (11): It can be seen from the above two expressions that, S(τ), ∆S(τ) reflect the autocorrelation characteristics of the original time series. Considering that S(τ) values can be positive or negative, ∆S(τ) values are always positive; finding the first zero crossing of S(τ) or the first local minimum point of ∆S(τ) is the optimal delay. The index S cor (τ) can be defined by Equation (12): Energies 2020, 13, 1687 7 of 28 Looking for S cor (τ) of the global minimum value of τ can get the best embedding window τ w , and according to τ w = (m − 1) τ, embedded figures m can be obtained.

Least Squares Support Vector Machine
Suppose there is a training set x k, y k N k=1 , k = 1, 2, . . . , N, where x k ∈ R n is n-dimensional input data, and y k ∈ R n is output data. Therefore, in the original weight space, the prediction model can be considered as the following optimal problem in Equations (13) and (14): where, w ∈ R n is the weight vector of the original weight space, e k ∈ R n is an error variable, ϕ(·) : R n → R n h is a nonlinear mapping function that maps input spatial data to higher-dimensional (possibly infinite dimensional) feature spaces, b ∈ R n is the offset value, and γ > 0 is the regularization parameter (also called penalty coefficient). Thus, in the original weight space, there is the following nonlinear model: Lagrange multipliers are introduced for Equation (15); a k ∈ R n , Lagrange function defined as Equation (16): By taking partial derivatives of the variables in Equation (16) and sorting out and eliminating w and e, the optimal problem is transformed into a linear system by Equation (17): where, y = [y 1 , y 2 , . . . , y N ] T , 1 = [1, 1, . . . , 1] T , a = [a 1 , a 2 , . . . , a N ] T , Ω ∈ R N×N , Ω i,j = ϕ(x i ) T ϕ x j = K x i , x j , i, j = 1, 2, . . . , N, and K(·, ·) is a kernel function satisfying Mercer's theorem.
Commonly used kernel functions are linear kernel function K x i , The kernel function of RBF is used in this paper, a and b can be calculated by Equation (17), so the nonlinear prediction model of LSSVM can be obtained as Equation (18):

Weighted LSSVM
While improving the standard SVM model, LSSVM loses its robustness. As a result, therefore, the weight of all training data in the objective function is γ and all samples play the same role in training, which is inconsistent with the actual situation.
For each piece of sample data, due to its position in the whole sample and the degree of influence by noise, its importance differs. In order to regain robustness by treating different training data differently, weighted LSSVM is used [64].
Based on the LSSVM model, each error quantity e k = a k /γ is given a different weight factor v k , and the optimization problem is described as Equation (19): The Lagrange function can be described as Equation (20): Similarly, according to the KKT condition, the system of linear Equation (21) can be obtained.
where, V γ = diag 1 γv 1 , . . . , 1 γv 2 is a diagonal matrix, and the weight V k is determined by the error variable e k= a k /γ.
where, the robust estimate is the standard deviation of the amount of error,ŝ = IQR 2×0.6745 , IQR is the quartile spacing of error e k , which is the difference between the 0.75n value and the 0.25n value if arranged in numerical order, andŝ measures how far e k deviates from the Gaussian. The constants c 1 and c 2 are generally 2.5 and 3 [65].
The specific steps of the weighted LSSVM algorithm are as follows: Step 1: Given training data set x k, y k N k=1 , k = 1, 2, . . . , N find the optimal parameters (through the following chemical reaction optimization algorithm). For the optimal parameter, e k = a k /γ is calculated by Equation (20).
Step 2: The robust estimateŝ is calculated based on the distribution of error e k Step 3: According to Equation (21), the corresponding weight V γ value is determined by e k .
Step 4: According to Equation (21), a* and b* are solved, and the final nonlinear prediction model is given as Equation (23): The LSSVM model solved by Equation (17) is the optimal solution under the assumption that error e k obeys Gaussian distribution, and WLSSVM corrects the deviation of e in the case of non-Gaussian fractions by the weight defined in Equation (22), which makes the WLSSVM regression robust.

Chemical Reaction Optimization
Chemical reaction optimization (CRO) is a meta-heuristic algorithm proposed by Lam in 2010. Inspired by the interaction between molecules in chemical reactions to find the lowest potential energy in potential energy surface, the algorithm adopts four primary reactions and follows the law of energy conservation [66]. From the perspective of microscopic analysis of chemical reactions, we can see that at the initial stage of a chemical reaction, the state of the molecules in the container is unstable due to Energies 2020, 13, 1687 9 of 28 excessive energy in the molecules. In order to reach a stable state, each molecule will be led to the lowest possible energy state by the collision between molecules and the chemical reaction after the collision. The result is the product of the chemical reaction, and the formation process is the process of gradual reduction of reaction potential energy. In simple terms, CRO is an optimization process to search for the minimum potential energy of the system [67][68][69][70].
The basic operation units involved in the CRO algorithm are composed of molecules (ω) and container walls (buffer), where the molecules possess both potential energy (PE) and kinetic energy (KE) as the container walls create the environment in which the reaction occurs. The molecular PE is the ultimate criterion for evaluating a chemical reaction and thus becomes the objective function of the question of interest while KE is a quantized value for determining whether the system can initiate a molecular reaction. In a chemical reaction, there are four basic reaction operators: single-molecule collision, single-molecule decomposition, intermolecular collision, and molecular synthesis.
Single-molecule collision is a process that changes the molecular KE and PE due to collisions between molecules. The energy change in the process can be described by Equation (24): where ω is the original molecule, ω' is the new molecule after structural change, δ ∈ [KELossRate, 1] is a random number, KELossRate is the upper limit (in percentage) of the monomolecular collision loss rate, a constant, and PE ω = f (ω) is molecular potential energy, with f (·) as the objective function of the problem considered. Monomolecular collision enables local searching in the problem space. Single molecule decomposition is a reaction process in which a molecule collides with a wall, and breaks down or splits into two new molecules. The energy difference E dec before and after the collision is passed to the two new molecules in a random manner. The energy change in the reaction process can be expressed by Equation (25): where δ 1 , δ 2 are of uniform distribution in [0, 1] and δ 3 is a random value in [0, 1]. Compared to monomolecular collision, monomolecular decomposition is capable of local search in a larger scope. Intermolecular collision describes the process of two new molecules after collision and energy exchange. A new molecule can be taken out of the original molecular structure domain. Since the molecule does not collide with the wall, there is no energy loss, so the total energy is unchanged after the reaction. The two new molecules have a total KE of E inter , which is distributed between them randomly, and the energy change can be described by Equation (26).
where δ 4 is a random value in the range of [0, 1]. Molecular synthesis is the phenomenon of two molecules colliding to produce a new molecule. It is a wall-free collision, so the energy remains constant before and after the collision. The energy change is shown in Equation (27): Energies 2020, 13, 1687 10 of 28 Molecular synthesis greatly increases the diversity of molecules, and the synthesized new molecules are obviously different from the originals, which usually have higher molecular activity. Molecular synthesis improves the searching ability of molecules in new regions, thus improving the global searching performance of CRO.
In this paper, the CRO algorithm is used to optimize the parameters of WLSSVM, and the optimal penalty coefficient γ and kernel width σ are obtained. The optimization process of CRO-WLSSVM is as follows: Step 1: Initializes the chemical reaction optimization algorithm. It is necessary to determine the number of initial molecules in the container (PopSize), the upper limit (KELossRate) of the percentage of KE loss in the wall-hitting reaction, the determinants of molecular reaction type (MoleColl), the determinants of monomolecular reaction type (α), the determinants of multi-molecular reaction type (β), the maximum iteration times (Iteration), etc.
Step 2: Calculate the initial potential energy of each molecule, and take the molecular KE initial value as the initial kinetic energy.
Step 3: Iteratively optimize the molecules in the container through four basic reaction operators. Only one basic reaction operator is executed in each iteration. The optimization process of each iteration consists of three judgment processes, which are reaction type, monomolecular reaction type, and intermolecular reaction type.
Step 4: Set the objective function. If the molecule meets the stop condition of the algorithm, the optimization calculation will be terminated. The smallest PE molecule is the global optimal solution, and the corresponding solution is the initial kernel width and penalty coefficient of the optimized WLSSVM, which can be assigned to WLSSVM to obtain the forecasting model of dissolved gas content.

Bootstrap
The bootstrap method is a statistical inference method of simulated sampling based on original data proposed by professor Bradley in 1979 [68] for statistical inference under small sample conditions. It belongs to one of the nonparametric estimation methods in statistics and can be used for statistical inference under small sample conditions.
Bootstrap is generally used for statistical inference when the model is difficult to obtain or difficult to assume. For example, when the data distribution is unknown, the bootstrap method does not need to make any distribution assumptions, and can indirectly obtain the distribution from the data generated by the original sample.
Therefore, the bootstrap method was used to resample the original test samples of dissolved gas in oil. Considering that the dissolved gas data in oil has time self-correlation, the resampling method of a single data point will destroy this dependent structure. However the block bootstrap method can preprocess the original time series data in blocks which are used as the basic unit for resampling to generate several new subsamples, thereby avoid the failure of the general self-service method to the greatest extent [69,70], and achieve a forecasting effect closer to the true distribution [71].
According to different block methods, nonoverlapping block bootstrap (NBB) method [72], moving block bootstrap (NBB) method [73], circular block bootstrap (CBB) method [74], and stationary bootstrap (SB) method [75] have been successively generated. Among them, the first three methods are relatively similar, they all use fixed-length block lengths and there are only slight differences in whether the subblocks overlap and whether the original sequence is processed end-to-end, meanwhile, compared with the simplest NBB method, MBB and CBB have higher estimation accuracy. The SB method uses the block length with geometric distribution and the randomized starting point of the block to ensure the stability of each subblock. However, some scholars have shown [76] that under the condition of similar block lengths, the variance estimated by the SB method is twice that of the NBB method, and three times that of the MBB method and the CBB method. Therefore, the MBB method is selected as the resampling method for the dissolved gas. The resampling process of the MBB method is as follows: Step 1: Assume that the original sample is X = {X 1 , X 2 , . . . , X N } the original sample is divided into N − L + 1 partially overlapping blocks which can be expressed as Z j = X j , X j+1 . . . , X j+L−1 , j = 1, 2, . . . , N − L + 1, where the length of the block is L, and the generated block set can be expressed as Z = Z 1 , Z 2 , . . . , Z q , q = N − L + 1. The selection of L will affect the estimation result. According to [77], L is related to the number of samples N, and the empirical calculation method of L [78] is shown as Equation (28): Step 2: Extract R blocks from Z with replacement, and construct a sample Step 3: Define the newly generated subsample as Step 4: Repeat steps 1 to 3 M times to construct M subsamples in sequence.
The correct choice of resampling number M is the key to ensure the accuracy of model prediction and the efficiency of equilibrium model calculation. A. Khosravi et al. [79] showed that there is no significant positive correlation between the size of the resampling number M and the width of the prediction interval, that is, an excessively large number of resampling M cannot significantly improve the quality of the prediction interval. E. Zio [80] believes that, in general, setting the resampling number M between 20 to 200 can meet the application requirements of most practical projects.
Based on the resampled subsamples, a prediction interval can be constructed. Many engineering examples have proved that the forecasting interval constructed by the bootstrap method can not only accurately represent the uncertainty degree of the forecasting results, but also analyze the range of future time series, which can effectively compensate for the limitation that the forecasting method of points can only obtain the deterministic forecasting value.
In view of the above advantages, bootstrap was introduced in this paper to construct the forecasting interval of dissolved gas in oil. In the process of forecasting, subjective factors such as the setting of forecasting model parameters and objective factors such as the noise of data collection are the main reasons affecting the uncertainty of forecasting results [81]. The bootstrap method takes into account the uncertainty caused by model error and data noise error, and constructs the forecasting interval based on the above factors.
The construction principle of forecasting interval is summarized as follows: Suppose the sample set of dissolved gas in transformer oil is D r = {x i , t i } N i=1 : x i is the input variable, t i is the ith actual sample value, the error between results of forecasting modelŷ(x i ) and actual sampling value t i can be described as Equation (29): where y(x i ) is the true regression value of the forecasting model, t i −ŷ(x i ) is total error between actual values and forecasting values, (x i ) −ŷ(x i ) represents model error, and ε(x i ) represents data noise. M subsamples were obtained by means of heavy sampling, and the subsamples were used for training in the corresponding forecasting model of dissolved gas.
The real value of output of forecasting modelŷ(x i ) is the average of all WLSSVM models forecasting results, which can be described by Equation (30): where,ŷ l (x i ) is forecasting value obtained by the lth data set and the corresponding WLSSVM model, and σ 2 y (x i ) is the variance of the forecasting model error, which can be expressed by Equation (31): is the variance of data noise, which represents the uncertainty of samples caused by random noise in the measurement process, and can be described by Equation (32): In order to effectively forecast σ 2 ε (x i ), the square residual sequence of the model needs to be constructed and combined into a new sample set . By training the new sample set with the corresponding model, the forecasting result of σ 2 ε (x i ) can be obtained. The square residual sequence r 2 i can be obtained by Equation (33): In order to maximize the probability of noise variance in the sample, the maximum likelihood estimation method is used to train the new model to ensure the confidence level of the forecasting [70]. The objective function is shown in Equation (34): After training of the model, the forecasting interval of dissolved gas in transformer oil with a significance level of a can be constructed as Equation (35): where z 1−a/2 is the 1 − a/2 quantile in a standard Gaussian.

Parameter Selection
The purpose of forecasting dissolved gases is to support fault warning and fault forecasting of transformers. Transformer fault forecasting usually combines the results of gas forecasting with fault diagnosis theory or fault diagnosis model to realize fault forecasting at a certain time in the future. Therefore, the selection of forecasting parameters depends on the parameters needed for transformer fault diagnosis [82].
During normal operation of the transformer, due to aging and cracking of insulation oil and solid insulation, a very small amount of gas will be decomposed, mainly including hydrogen (H 2 ), methane (CH 4 ), ethane (C 2 H 6 ), ethylene (C 2 H 4 ), acetylene (C 2 H 2 ), carbon monoxide (CO), carbon dioxide (CO 2 ), etc. [2][3][4]. When a fault or abnormality occurs inside the transformer, the contents of some components in these gases will increase rapidly. For example, when the insulating oil is overheated, CH 4 and C 2 H 4 are the main increased gas components and show a strong correlation. In the case of high energy discharge, the content of H 2 and C 2 H 2 increases and shows a strong correlation. Based on the correlation between dissolved gas contents and variation and transformer faults, fault diagnosis methods or models recommended by International Electrotechnical Commission (IEC) and Institute of Electrical and Electronics Engineers (IEEE) have been produced, which are divided into two categories. The input parameters of the IEC method, Roger method, and Doernenburg method are gas ratios and of the key gas method and David triangle method are gas content, and the input parameters of these fault diagnosis methods are all based on the above gas parameters.
Therefore, H 2 , CH 4 , C 2 H 6 , C 2 H 4 , C 2 H 2, CO, and CO 2 were selected as the forecasting objects in this paper.

Evaluation Index
In order to verify the forecasting accuracy of the model, the mean absolute scale error (MASE), was used to evaluate the forecasting results of the proposed algorithm. The MASE was proposed by Hyndman and Koehler (2006) as a generally applicable measurement of forecast accuracy without the problems seen in the other measurements. The MASE can be used to compare forecast methods on a single series, and, because it is scale-free, to compare forecast accuracy across series [83,84].
Prediction interval coverage probability (PICP), prediction intervals normalized averaged width (PINAW), and coverage width-based criterion (CWC) were used to evaluate the results of the forecasting interval [85,86].
a. PICP PICP can be used represent the accuracy of interval forecasting and effectively avoid gas exceeding the forecasted upper and lower limits. The higher the value of PICP becomes, the more real values the predicted interval contains, and the more reliable the constructed interval will be.
where N is the number of target values, t i is the ith target value, and L i and U i are the lower bound and upper bound, respectively, of the forecasting interval corresponding to the ith target value. b. PINAW The quality of the forecasting interval is generally evaluated by PICP. If the target values are within the forecasting interval, the corresponding PICP will reach 100% coverage. The width of prediction intervals determines the amount of information they contain. In practice, it makes no sense to set a too wide prediction interval. PINAW is used to represent the degree of uncertainty of the prediction interval. The narrower the width, the lower the uncertainty of the interval and the better the accuracy of the model.
where R is the range of the target value and represents the difference between the maximum and minimum value of test sample of time series of the dissolved gas. c. CWC PICP and PINAW can only evaluate one aspect of the prediction interval, and these two indices are gain and loss indices, respectively. CWC is a comprehensive evaluation index, which contradicts Energies 2020, 13, 1687 14 of 28 the two monotonicity evaluation indices. Therefore, CWC is considered to comprehensively evaluate the quality of the forecast interval and can be expressed by Equations (40) and (41): where, µ is the confidence interval and η is a super parameter, which is used to enlarge the difference between PICP and µ, and η can be assigned 30 [87].

Modeling Process of bootstrap and PSR-CRO-WLSSVM
The construction process of the bootstrap and PSR-CRO-WLSSVM interval forecasting model constructed in this paper is shown in Figure 1.
The algorithms involved in the model are summarized as follows: the bootstrap method was used to construct the data set and the prediction interval, the PSR method was used to transform the feature space of the sample data, and the CRO algorithm was used to optimize the penalty coefficient and the core width of WLSSVM, and the optimal parameters were obtained through iterative optimization of the four reaction operators. The samples were divided into training samples and test samples. The training samples were used for model training, to generate the global optimal interval forecasting model, and the test samples were used for verification, to obtain the forecasting results and ultimately point forecasting and interval forecasting of the dissolved gas in oil were realized finally.
Step 1: Normalize the data. The original DGA samples were normalized to convert the content of dissolved gas into the relative content within the range of [0,1], which is conducive to reducing the mutual exclusion between gases and avoiding the order of magnitude difference of input parameter values. The normalized treatment is shown in Equation (42): Meanwhile, the original sample set was divided into a training set and test set, and M pseudo-sample sets were constructed by the bootstrap method.
Step 2: Construct and train the PSR-CRO-WLSSVM model, as shown in Figure 2. PSR was used to preprocess the sample data, and the C-C method was used to calculate the optimal delay time t and embedded dimension m of the dissolved gas time series, as shown in Equations (9) and (10). Then the sample time series was reconstructed according to the embedded dimension and the delay time was obtained, as shown in Equation (12).
Then the phase space matrix was used as the training set and CRO was used as the optimization algorithm to train WLSSVM. Global optimization of kernel width and penalty factor is obtained by iterative optimization of the four reaction operators. The details of parameter optimization are introduced in Section 3.2.
In order to avoid model overfitting, the five-fold cross-validation method was adopted, and the predicted mean root mean square error under the five-fold cross-validation method was taken as the objective function of the model [6,38].

Modeling Process of bootstrap and PSR-CRO-WLSSVM
The construction process of the bootstrap and PSR-CRO-WLSSVM interval forecasting model constructed in this paper is shown in Figure 1.  Energies 2020, 13, x; doi: FOR PEER REVIEW www.mdpi.com/journal/energies mutual exclusion between gases and avoiding the order of magnitude difference of input parameter values. The normalized treatment is shown in Equation (42): Meanwhile, the original sample set was divided into a training set and test set, and M pseudosample sets were constructed by the bootstrap method.
Step 2: Construct and train the PSR-CRO-WLSSVM model, as shown in Figure 2.  Figure 2. Construction process of PSR-CRO-WLSSVM model.
Step 3: Calculate the variance of model error. The WLSSVM model was constructed with optimal super parameters, and the model was imported into M pseudo-sample sets for training. Equation (27) was used to construct the point forecasting results of the model with the expected value of the pseudo-sample set. According to Equation (28), the error variance sequence of the model was constructed.
Step 4: Calculate the variance of data noise error. Equation (29) was used to construct the variance sequence and variance data set of data noise. Equation (31) was used as the fitness function of noise variance, and a new forecasting model of noise variance was trained by the CRO algorithm.
Step 5: Construct the forecasting interval. According to Equation (32), the forecasting interval of dissolved gas in transformer oil was constructed.
Step 6: Evaluate the results. Finally, the forecasting model was used to forecast the test set, and the forecasting values were compared with actual values to analyze the accuracy and uncertainty of the model by the evaluation indices of point forecasting and interval forecasting.

Forecasting Examples
The training and test sample set used in the bootstrap and PSR-CRO-WLSSVM model were taken from the DGA data of a transformer with the voltage of 750 kV from a substation of the state grid of China, and the transformer model was ODFPS-700000/750GY. The time series of the data samples was from 23 May 2011 to 8 August 2012 and the monitoring period was once a day. Monitoring data from 23 May 2011 to 9 July 2012 were selected as training samples, and monitoring data from 10 July 2012 to Energies 2020, 13, 1687 17 of 28 8 August 2012 were selected as test samples. Considering the similarity of each gas, the forecasting results of H2 are shown as an example, the special treatment of each gas is also described below. The original data of H 2 is shown in Figure 3.
Energies 2020, 13, x FOR PEER REVIEW 3 of 28 Training sample Test sample A test sample of H2 with a length of 415 from 23 May 2011 to 9 July 2012 was selected as the original sample, and based on the MBB resampling method, a subsample set was set, the number of subsamples was set to 100, and the block length was 7, the number of resampling was 59, and 100 subsamples with a length of 413 were formed. The construction method of the sample set is in Section 3.4.
Before model training, the data were normalized and transformed by PSR. The C-C method was used to calculate the delay time and embedding dimension of the reconstructed sample phase space. The results are shown in Figure 4. ∆ S (τ) of the first minimum point corresponding τ is 11, so the delay time is 11. Figure 5 shows that when = 20, the (τ) is equal to the global minimum, because = ( − 1) , = 3 can be obtained.
The calculation results of reconstruction parameters of different gases are shown in Table 1, where S1 to S3 represent ∆ (τ), (τ), (τ), respectively. The results of optimal values of τω, τ, and m are summarized in Table 2. It is important to note that the phase space reconstruction of the optimal embedding dimension and delay time of dissolved gases are different, because although all gases are produced by the transformer oil, but the rate of change, tendency of change, influence factors and chaos characteristic are not the same, so the dissolved gases cannot use the same set of parameters for the phase space reconstruction, instead, they need to be calculated separately. A test sample of H 2 with a length of 415 from 23 May 2011 to 9 July 2012 was selected as the original sample, and based on the MBB resampling method, a subsample set was set, the number of subsamples M was set to 100, and the block length L was 7, the number of resampling R was 59, and 100 subsamples with a length of 413 were formed. The construction method of the sample set is in Section 3.4.
Before model training, the data were normalized and transformed by PSR. The C-C method was used to calculate the delay time and embedding dimension of the reconstructed sample phase space. The results are shown in Figure 4. ∆ S (τ) of the first minimum point corresponding τ is 11, so the delay time is 11. Figure 5 shows that when t w = 20, the S cor (τ) is equal to the global minimum, because t w = (m − 1) τ, m = 3 can be obtained.
The calculation results of reconstruction parameters of different gases are shown in Table 1, where S1 to S3 represent ∆S(τ), S(τ), S cor (τ), respectively. The results of optimal values of τ ω , τ, and m are summarized in Table 2. It is important to note that the phase space reconstruction of the optimal embedding dimension and delay time of dissolved gases are different, because although all gases are produced by the transformer oil, but the rate of change, tendency of change, influence factors and chaos characteristic are not the same, so the dissolved gases cannot use the same set of parameters for the phase space reconstruction, instead, they need to be calculated separately.   The preliminary analysis result of the case is as follows: (1) The forecasting model proposed in this paper can effectively predict the gas change process and development trend of dissolved gas.
The test samples selected in this paper were from 10 July 2012 to 8 August 2012, and the content of the H2 was in a state of great overall fluctuation, with the minimum and maximum values ranging from 36.32 to 48.41 μL/L. Figure 5 shows that the point forecasting results of the model for this time   The reconstructed data set was input into the WLSSVM model for prediction, and the CRO optimization algorithm was used for optimization to obtain the optimal hyperparameters of the WLSSVM. The key parameter settings of the prediction model are shown in Table 3. The optimal parameter arrangement of WLSSVM is shown in Table 4, where D 1 represents the original training Energies 2020, 13, 1687 20 of 28 data set of H 2 . The results of its optimal parameters were used to forecast the M subsample sets. D 2 represents the squared residual set, and its optimal parameters were used for the forecasting of data noise in M + 1th model.  For transformer condition monitoring, high reliability information is usually used for decision-making and analysis to ensure accurate judgment of the results. According to the principles of security and reliability, a 95% confidence level was selected for modeling. The results of point forecasting and interval forecasting of H 2 are shown in Figure 5.
The preliminary analysis result of the case is as follows: (1) The forecasting model proposed in this paper can effectively predict the gas change process and development trend of dissolved gas.
The test samples selected in this paper were from 10 July 2012 to 8 August 2012, and the content of the H 2 was in a state of great overall fluctuation, with the minimum and maximum values ranging from 36.32 to 48.41 µL/L. Figure 5 shows that the point forecasting results of the model for this time period are basically consistent with the actual gas change trend, which can effectively forecast the future change of dissolved gas in transformer oil.
(2) The forecasting interval constructed by the model proposed in this paper can effectively reflect the degree of uncertainty of the forecasting results.
The gas trend and change in the two periods from 10 July 2012 to 15 July 2012 and 26 July 2012 to 29 July 2012 were relatively stable and in a slow and stable change process. The uncertainty of gas change was small, and the range width of the forecasting interval was generally small, with the minimum value of 0.88 µL/L and the average value of 1.17 µL/L. Since 24 July 2012 and 2 August 2012, the gas has been in a state of great fluctuation, and the gas value continues to rise rapidly and fluctuates repeatedly, this phenomenon indicates that in this time period, the uncertainty of gas change is large, because there might be some external disturbance, such as temperature mutation or internal change in harsh environment, partial discharge, or local high temperature. Meanwhile, the average width of the forecasting interval for these two periods was 1.94 µL/L, exceeding the overall average width of 1.63 µL/L and reaching a maximum of 2.64 µL/L. Therefore, the decision maker should be prompted to pay more attention to the transformer status.
It can be seen that the change process of interval width is related to the uncertainty of gas prediction and the uncertainty of the result can be analyzed according to the interval. When the interval width increases, it means that the forecasting value of the dissolved gas of the transformer oil indicates the risk, and the reason for the change should be analyzed in time as external interference, latent faults of the transformer, problems with the model itself, or other conditions. The correlation does not directly diagnose and locate faults, but it provides a new way of thinking for the early warning and analysis of transformer fault risk.
(3) The forecasting results of this model can be used as input of fault diagnosis models to support transformer fault forecasting.
The 30 sets of forecasted results of dissolved gas in oil were input into the transformer fault diagnosis model in [88] for fault diagnosis. As a result, the diagnosis results of 24 sets of data were low-energy discharge fault, and the other six sets of data were normal state, in which five of these six sets of normal data appear in the first eight sets of data at an earlier time, and this shows that the fault state of the transformer is still in the incubation period and the fault characteristics are not obvious.
According to the analysis report after the transformer is disassembled, it is found that the cause of the fault was the intermittent discharge caused by the looseness between the fasteners of the grounding system caused by vibration, which is consistent with the phenomenon reflected in the prediction of this paper.

Comparison Results of Different Models
In order to further test the accuracy and generalization performance of the forecasting model in this paper, BPNN, LSSVM, WLSSVM, PSO-WLSSVM [12], CRO-WLSSVM were selected for testing and comparing.
Back propagation neural network (BPNN) and LSSVM are classical and widely used machine learning methods, which can be used as the basis and reference for prediction accuracy and generalization ability indexes. The WLSSVM model was selected to verify the effect of the model on the performance improvement of LSSVM. The CRO-WLSSM model was selected for comparison with PSR-CRO-WLSSVM to verify the performance of PSR, and the parameters of the model are shown in Table 3. The particle swarm optimization (PSO)-WLSSVM was selected to compared with CRO-WLSSVM to verify the performance of CRO, in which the population is 40, the maximum speed is 1.0, the learning factors C 1 and C 2 are 1.5, and the number of iterations is 5000.
The above six models were constructed by bootstrap method and trained by five-fold cross validation method. Data samples were selected from H 2 samples in Section 5.1 for forecasting analysis, and corresponding point forecasting and interval forecasting results were obtained as shown below.
The comparison results of the performance indicators predicted by the six models are shown in Figure 7. The MASE of the model in this paper is 1.861, which is lower than the other five models, indicating that the model in this paper has better prediction accuracy. For further analysis and comparison, Figure 6 shows the predicted results of six models. For example, at two time points of 24 July 2012 and 2 August 2012, when the dissolved gas content of H 2 showed a large fluctuation, compared with other models, the model in this paper could change the trend rapidly and maintain a small degree of deviation, which verified that the model in this paper could effectively predict the dynamic change process of dissolved gas. model in this paper reached more than 95%, which met with the performance requirements of the forecasting interval. This shows that the model in this paper is consistent with the five comparison models in the prediction of dissolved gas in oil. At the same time, in the aspect of point forecasting and interval forecasting, the model presented in this paper has the best performance in the five indexes such as MASE, CWC, etc., which indicates that the model presented in this paper has certain advantages for the forecasting of dissolved gas with small samples.  In terms of interval prediction, Figure 8 shows the statistical results of interval forecasting performance indexes of each model. From the perspective of the PICP index, BPNN and the model in this paper reached 96.67% coverage, which is the highest among the six models and meets the requirement of 95% confidence. However, from the PINAW index, the interval width of BPNN is obviously too large, which indicates that the uncertainty of BPNN model is high. The reason is that in the modeling process of BPNN, due to the large randomness of network initialization parameters, the results of multiple calculations under the same parameters and data are different, and the uncertainty degree of prediction results is large, which leads to the widening of the interval and the increase of coverage.
According to the indexes of CWC, the forecasting interval constructed by the method in this paper is optimal in terms of interval coverage and interval width, which not only meets the reliability requirements of confidence level, but also has the lowest uncertainty degree and the highest quality.
From the perspective of single factor comparison, the analysis is as follows: According to the comparison between WLSSVM and LSSVM in point forecasting and interval forecasting results, WLSSVM in this paper improved by 12.1% and 56.7% in MASE and CWC.
According to the comparison between PSO-WLSSVM and CRO-WLSSVM in point forecasting and interval forecasting results, CRO improved by 4.3% and 66.7% in MASE and CWC. At the same time, through the statistics of 5000 iterations of training, the training time of CRO is 4326 ms and the optimal number of iterations is 1125, training time of PSO is 4817 ms and the optimal number of iterations is 1407. The performance of CRO is better on the convergence speed and convergence accuracy.
According to the comparison results between the model in this paper and the CRO-WLSSVM model in point forecasting and interval forecasting results, the model in this paper improved by 18.2% and 39% on two indices of MASE and CWC, which indicates that for the time series of dissolved gas in oil with chaotic characteristics, the phase space reconstruction method can effectively extract features to improve the prediction performance of the model.
Comparing the six models as a whole, in terms of point forecasting, it can be seen from Figure 6 that all the six models can accurately forecast the variation trend of dissolved gas in transformer oil and according to Figure 7, most of the models have good performance on MASE. In terms of interval forecasting, Figure 8 shows that the interval coverage of the six models all reached more than 90%, which was close to the performance requirements of the forecasting interval and among them, the model in this paper reached more than 95%, which met with the performance requirements of the forecasting interval. This shows that the model in this paper is consistent with the five comparison models in the prediction of dissolved gas in oil. At the same time, in the aspect of point forecasting and interval forecasting, the model presented in this paper has the best performance in the five indexes such as MASE, CWC, etc., which indicates that the model presented in this paper has certain advantages for the forecasting of dissolved gas with small samples.

Conclusion
In order to overcome the problem that the data volume of transformer samples is small, and the uncertainty of prediction results cannot be represented by the traditional point forecasting models, A forecasting model of dissolved gas content in transformer oil based on bootstrap and PSR-CRO-WLSSVM was proposed and can be summarized as follows: (1) WLSSVM is used as a forecasting model to predict small samples, and PSR method is introduced into the forecasting of oil-dissolved gas of transformer. The PSR method based on chaos theory considers the autocorrelation of gas time, fully excavates the inherent laws and characteristics

Conclusion
In order to overcome the problem that the data volume of transformer samples is small, and the uncertainty of prediction results cannot be represented by the traditional point forecasting models, A forecasting model of dissolved gas content in transformer oil based on bootstrap and PSR-CRO-WLSSVM was proposed and can be summarized as follows: (1) WLSSVM is used as a forecasting model to predict small samples, and PSR method is introduced into the forecasting of oil-dissolved gas of transformer. The PSR method based on chaos theory considers the autocorrelation of gas time, fully excavates the inherent laws and characteristics contained in historical data, and realizes the preprocessing and feature extraction of gas data. Meanwhile, the global search advantage of CRO is used to optimize the forecasting model. The results

Conclusions
In order to overcome the problem that the data volume of transformer samples is small, and the uncertainty of prediction results cannot be represented by the traditional point forecasting models, A forecasting model of dissolved gas content in transformer oil based on bootstrap and PSR-CRO-WLSSVM was proposed and can be summarized as follows: (1) WLSSVM is used as a forecasting model to predict small samples, and PSR method is introduced into the forecasting of oil-dissolved gas of transformer. The PSR method based on chaos theory considers the autocorrelation of gas time, fully excavates the inherent laws and characteristics contained in historical data, and realizes the preprocessing and feature extraction of gas data. Meanwhile, the global search advantage of CRO is used to optimize the forecasting model. The results show that the above method can effectively help the WLSSVM model to improve the forecasting accuracy of dissolved gas in oil. (2) By combining the bootstrap method with the PSR-CRO-WLLSVM, a model for both point forecasting and interval forecasting was constructed. This method considers the data noise error and model error, which can describe the accuracy of the forecasting and the uncertainty of the forecasting. Compared with BPNN and LSSVM in the aspect of point forecasting and interval forecasting, the model presented in this paper has the best performance in five indexes such as MASE, CWC, etc. (3) The actual case analysis proves that by combining the results of point forecasting and interval forecasting, the model in this paper can closely follow the change trend of dissolved gas, and discover potential risks through the change of uncertainty of interval. At the same time, the output result of the model can be used as the input parameter of fault diagnosis method for real-time fault forecasting, which provides more comprehensive decision support for the development trend, hidden risks, and fault analysis of dissolved gas.
There are still shortcomings and new opportunities in the current research. Our future work will focus on further study on the application of the forecasting interval in transformer fault forecasting. Meanwhile, the influence of the selection of the oil chromatographic data acquisition interval and the selection of prediction data length on the forecasting accuracy will also be the next research focus.