Research on the Self-Repairing Model of Outliers in Energy Data Based on Regional Convergence

Abstract: The need for statistically stable data is increasing as data resources have become an ever more important production factor. In this study, a set of general identification and correction models is established for data outlier modification. The research object is per capita energy consumption data. Based on the joint diagnosis method of outliers and the regional convergence theory, abrupt outliers are identified and corrected. The study finds that there is an outlier in the data of the Ningxia Hui Autonomous Region. According to the club grouping method, 30 provinces in China are divided into two clubs and the Ningxia Hui Autonomous Region is determined to be in the first club. We calculate the convergence rate and obtain the corrected value by combining it with the half-life cycle model.


Introduction
With the rapid development of information technology, data resources have become an important factor of production, and more attention is being paid to their stability. The energy industry is the foundation of national economic and social development and is also deeply influenced by big data. Energy statistics are the cornerstone of energy research and policymaking. Efforts so far have focused on making energy data as complete as possible, while few attempts have been made to verify and correct abnormal data. In this paper, we use regional convergence theory and the half-life cycle model to correct outliers in energy statistics. To identify the outliers, we use the joint diagnosis method based on the autoregressive moving average (ARMA) model of a time series.
The study of regional convergence theory began in the mid-20th century and developed from Solow's neoclassical growth model [1]. The main idea is that economies tend toward the same steady state over time because of diminishing marginal returns. In the 1980s, research on convergence expanded and convergence was divided into four main types: σ convergence, β convergence, γ convergence and club convergence. Baumol (1986) [2] conducted the first empirical study on convergence and showed that the growth rate of productivity in industrialized countries is negatively correlated with its productivity level. Barro and Sala-i-Martin (1991) [3] proposed the econometric definitions of σ convergence and β convergence. σ convergence refers to the narrowing of the average income gap over time. β convergence means that the growth rate of per capita income in underdeveloped regions is higher than that in developed regions. [...] an identification method for structural changes in time series, taking the ARMA model as the basic framework. Chen and Liu (1993) [27] proposed a joint estimation method for the identification of outliers. Based on the Bayesian framework, Chen (2015) [28] proposed a method to comprehensively detect the location of outliers in the time-series model and demonstrated its effectiveness through simulation. Yan (2018) [29] proposed a method based on the Gaussian process prediction model to detect outliers in time-series data with time-varying disturbances. Sathe (2018) [30] proposed a simple subspace outlier detection algorithm; the results show that the method is much faster and more accurate than traditional algorithms.
This paper introduces the theory of regional convergence and the concept of big data into the study of the energy model, combining the time-series analysis method and the half-life cycle method to establish a methodology for the self-correction of abrupt energy data. The paper is organized as follows: Section 2 introduces the main research methods, Section 3 presents the data used in the study, Section 4 analyzes the research results, and Section 5 concludes.

Methodology
This paper uses a joint estimation method to identify outliers, and combines regional convergence theory with the half-life cycle method to correct them.

Definition and Classification of Outliers
Time-series analysis refers to analyzing time series systematically and establishing reasonable models. Outliers often appear in time series and can seriously affect data analysis. Outliers are usually points that deviate markedly from the other observations. The first definition and classification of outliers were given by Fox (1972), who divided outliers into additive outliers (AO) and innovational outliers (IO). An AO affects only one observation, which means that the time series returns to its normal path immediately afterwards, while an IO affects a run of subsequent observations.

ARMA Model
In order to avoid the biases in energy statistics, it is necessary to identify the outliers. The identification process is based on the ARMA model of time series. The ARMA model, known as the autoregressive moving average model, can be subdivided into three categories: autoregressive (AR) model, moving average (MA) model and ARMA model.
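The structure of the ARMA(p, q) model is as follows:
$$Y_t = \phi_0 + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$
where $a_t$ is a white-noise disturbance. When $\phi_0 = 0$, the delay operator $B$ (with $B Y_t = Y_{t-1}$) is introduced and the model can be written as:
$$\phi(B) Y_t = \theta(B) a_t$$
where $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ is the polynomial of the p-order autoregressive coefficients and $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$ is the polynomial of the q-order moving average coefficients.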

Outliers Joint Estimation Method
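Assuming $\{Y_t\}$ is a stationary time series that follows an ARMA process, it is defined as:
$$\phi(B) Y_t = \theta(B) a_t, \quad t = 1, \ldots, n$$
where n is the number of observations in the sequence and $a_t$ is white noise with standard deviation $\sigma_a$.
When the time series is affected by a non-repeatable event at time $t_1$, the observed series $Y_t^*$ becomes:
$$Y_t^* = Y_t + \omega L(B) I_t(t_1)$$
where $Y_t$ is the same as in the above equation, $\omega$ is the magnitude of the outlier, and $I_t(t_1) = 1$ when $t = t_1$ and $I_t(t_1) = 0$ otherwise. The dynamic pattern $L(B)$ depends on the type of outlier:
$$\text{IO}: \; L(B) = \frac{\theta(B)}{\phi(B)}$$
$$\text{AO}: \; L(B) = 1$$
Suppose that there are m outliers of different types in the time series; then the model containing all outliers is expressed as:
$$Y_t^* = \sum_{j=1}^{m} \omega_j L_j(B) I_t(t_j) + \frac{\theta(B)}{\phi(B)} a_t$$
which is called the joint estimation diagnostic model. The residual sequence is:
$$\hat{e}_t = \pi(B) Y_t^* = \sum_{j=1}^{m} \omega_j \pi(B) L_j(B) I_t(t_j) + a_t, \qquad \pi(B) = \frac{\phi(B)}{\theta(B)} = 1 - \pi_1 B - \pi_2 B^2 - \cdots$$
The standardized test statistics for a single outlier at $t_1$ are $\hat{\tau}_{IO}(t_1) = \hat{\omega}_{IO}(t_1)/\hat{\sigma}_a$ and
$$\hat{\tau}_{AO}(t_1) = \frac{\hat{\omega}_{AO}(t_1)}{\hat{\sigma}_a} \left( \sum_{t=t_1}^{n} x_{2t}^2 \right)^{1/2}$$
where $\hat{\omega}_{IO}(t_1)$ and $\hat{\omega}_{AO}(t_1)$ are the least-squares estimates of the outlier effect, with $x_{2t} = 1$ for $t = t_1$ and $x_{2,t_1+k} = -\pi_k$ for $k \geq 1$.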
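To make the difference between the two outlier types concrete, the following minimal Python sketch computes the effect that a single outlier adds to an AR(1) series. It assumes the AR(1) special case, where $L(B) = 1$ for an AO and $L(B) = 1/(1 - \phi B)$ for an IO; the function name `outlier_effect` and its parameters are illustrative and do not come from the original study.

```python
def outlier_effect(kind, omega, phi, t1, n):
    """Effect added to an AR(1) series of length n by one outlier at index t1.

    AO: L(B) = 1, so only the observation at t1 is shifted by omega.
    IO: L(B) = 1 / (1 - phi*B) = sum_k phi^k B^k, so the shock decays
        geometrically over the following observations.
    """
    if kind not in ("AO", "IO"):
        raise ValueError("kind must be 'AO' or 'IO'")
    effect = [0.0] * n
    for t in range(t1, n):
        k = t - t1  # number of periods since the outlier occurred
        if kind == "AO":
            effect[t] = omega if k == 0 else 0.0
        else:
            effect[t] = omega * phi ** k
    return effect
```

With `omega = 2.0` and `phi = 0.5`, the AO changes one observation only, whereas the IO also shifts the following observations by 1.0, 0.5 and so on.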

Outliers Identification Process
After the establishment of the time-series ARMA model, the identification process of outliers can be further carried out.
The first stage: preliminary parameter estimation and outlier test.
1.1 Calculate the model parameters of the time series through maximum likelihood estimation, based on the original or adjusted sequence, and obtain the residuals.
1.2 For $t = 1, \ldots, n$, use the residuals obtained in 1.1 to calculate $\hat{\tau}_{IO}(t)$, $\hat{\tau}_{AO}(t)$ and $\eta_t = \max\{|\hat{\tau}_{IO}(t)|, |\hat{\tau}_{AO}(t)|\}$. Let c be the critical value. If $\max_t \eta_t = |\hat{\tau}_{tp}(t_1)| > c$, then there may be an outlier at time $t_1$, whose type tp may be IO or AO.
1.3 If no outlier is found, go to step 1.4. Otherwise, remove the identified outliers from the sequence and return to 1.2 to check for additional outliers.
1.4 If no outlier is found in the first iteration, the original sequence is not affected by outliers. Otherwise, the set of identified outliers goes to the second stage.
The second stage: joint estimation of outlier effects and model parameters.
2.1 Suppose there are m outliers of various types, located at $t_1, t_2, \ldots, t_m$. The effects $\omega_j$ are estimated by multiple regression using the residual sequence equation (Equation (8)).
2.2 Calculate the $\hat{\tau}$ statistics by $\hat{\tau}_j = \hat{\omega}_j / \mathrm{std}(\hat{\omega}_j)$, $j = 1, \ldots, m$. If $\min_j |\hat{\tau}_j| = |\hat{\tau}_v| \leq c$ (c is the critical value in 1.2), delete the outlier at $t_v$ from the m outliers and return the remaining m − 1 outliers to 2.1 to continue the iteration. Otherwise, go to 2.3.
2.3 The set of outliers obtained through calculation is considered to be the real outliers of the time series. Remove them and return to 1.1 to estimate the model parameters and calculate the residual sequence.
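As an illustration of the test in step 1.2, the sketch below scans a zero-mean AR(1) series once and returns the most extreme candidate outlier. It rests on several assumptions that are not taken from the original study: the AR(1) coefficient `phi` is treated as already estimated, the residual scale is estimated with the median absolute deviation, and the function name `scan_ar1_outlier` is hypothetical. It covers a single pass of stage one, not the full iterative procedure.

```python
import math
import statistics

def scan_ar1_outlier(y, phi, c=3.0):
    """Return (t, type, tau) for the most extreme candidate outlier, or None.

    y   : observations of a zero-mean AR(1) series
    phi : AR(1) coefficient, assumed to be estimated beforehand
    c   : critical value for the tau statistics
    """
    n = len(y)
    # Residuals e_t = y_t - phi * y_{t-1}, available for t = 1 .. n-1.
    resid = {t: y[t] - phi * y[t - 1] for t in range(1, n)}
    # Robust residual scale from the median absolute deviation (assumed choice).
    med = statistics.median(resid.values())
    sigma = 1.483 * statistics.median(abs(e - med) for e in resid.values())
    if sigma == 0:
        raise ValueError("residual scale is zero; cannot standardize")
    best = None
    for t in range(1, n):
        tau_io = resid[t] / sigma
        if t + 1 < n:
            # For AR(1), pi(B) = 1 - phi*B, so an AO at t appears in e_t
            # with weight 1 and in e_{t+1} with weight -phi.
            omega_ao = (resid[t] - phi * resid[t + 1]) / (1.0 + phi ** 2)
            tau_ao = omega_ao * math.sqrt(1.0 + phi ** 2) / sigma
        else:
            tau_ao = tau_io  # at the last point AO and IO coincide
        if abs(tau_io) >= abs(tau_ao):
            kind, tau = "IO", tau_io
        else:
            kind, tau = "AO", tau_ao
        if best is None or abs(tau) > abs(best[2]):
            best = (t, kind, tau)
    if best is not None and abs(best[2]) > c:
        return best
    return None
```

In a full implementation this scan would be repeated after removing each detected effect, and the detected set would then be passed to the joint estimation of the second stage.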

Regional Convergence Theory
Unconditional β Convergence
Unconditional β convergence refers to the fact that the growth rate of per capita income in underdeveloped regions is higher than that in developed regions. Therefore, the lower the level of economic development, the higher the growth rate of per capita income, and developing regions show a "catch-up effect" relative to developed regions. The unconditional β convergence model is expressed as follows:
$$\frac{1}{t - t_0} \ln \frac{PEC_{it}}{PEC_{it_0}} = \alpha - \beta \ln PEC_{it_0} + \varepsilon_{it}$$
where i represents the region, t is the year, $t_0$ is the starting year, α is a constant and $\varepsilon_{it}$ is the error term. In this paper, the left-hand side is the average growth rate of per capita energy consumption (PEC) and β is the rate of convergence. The sign of β indicates whether unconditional β convergence exists: β > 0 means that per capita energy consumption in China converges, whereas β < 0 means that there is no convergence.
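The regression behind the unconditional β convergence test can be sketched in a few lines of Python. The sketch assumes the linear specification in which the average log growth rate is regressed on the log of the initial level with slope −β; the function name `beta_convergence` and the input layout (one initial and one final value per region) are illustrative assumptions, not the authors' implementation.

```python
import math

def beta_convergence(initial, final, years):
    """Estimate (alpha, beta) in  g_i = alpha - beta * ln(initial_i) + error.

    initial, final : per capita levels of each region in the first and last year
    years          : number of years between the two observations
    g_i            : average annual log growth rate of region i
    """
    x = [math.log(v) for v in initial]
    g = [math.log(f / s) / years for s, f in zip(initial, final)]
    n = len(x)
    mean_x, mean_g = sum(x) / n, sum(g) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxg = sum((xi - mean_x) * (gi - mean_g) for xi, gi in zip(x, g))
    slope = sxg / sxx              # OLS slope on ln(initial)
    beta = -slope                  # convergence when beta > 0
    alpha = mean_g - slope * mean_x
    return alpha, beta
```

A positive `beta` means that regions with lower initial levels grew faster on average, which is the catch-up effect described above.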

Conditional β Convergence
Conditional β convergence theory adds conditioning variables as influencing factors to the unconditional convergence model. The model is:
$$\frac{1}{t - t_0} \ln \frac{PEC_{it}}{PEC_{it_0}} = \alpha - \beta \ln PEC_{it_0} + c_i IF_i + \varepsilon_{it}$$
where each indicator has the same meaning as in the above equation, $IF_i$ is the influencing factor of convergence and $c_i$ is the coefficient of the influencing factor.

Club Convergence
Club convergence means that with similar initial conditions and structural characteristics, the economic growth of a group of regions will converge to the same steady state. The regions in the same club have homogeneity, so it is of practical significance to examine the club's convergence. Grouping by club convergence theory can make a more reasonable analysis of the similarities among different provinces within the club. This paper utilizes a log-t method to study club convergence.

Nonlinear Time-Varying Factor Model
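Phillips and Sul (2007) proposed a nonlinear single-factor model:
$$X_{it} = \delta_{it} \mu_t, \qquad \delta_{it} = \delta_i + \sigma_i \xi_{it} L(t)^{-1} t^{-\alpha}$$
where $X_{it}$ is the observed variable of region i at time t, $\mu_t$ is the common factor and $\delta_{it}$ is the time-varying factor loading. $\delta_i$ is fixed, $\sigma_i$ is a scale parameter, $\xi_{it}$ is independent and identically distributed across the cross-section and weakly dependent over time, and α is the decay rate. $L(t)$ is a slowly varying function with $L(t) \to \infty$ as $t \to \infty$.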

Log-T Regression
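Phillips and Sul proposed a simple regression-based method for detecting convergence. The steps are as follows:
(1) Calculate the cross-sectional variance ratio $H_1/H_t$, where
$$H_t = \frac{1}{N} \sum_{i=1}^{N} (h_{it} - 1)^2, \qquad h_{it} = \frac{X_{it}}{N^{-1} \sum_{i=1}^{N} X_{it}}$$
(2) Run the regression:
$$\log \frac{H_1}{H_t} - 2 \log L(t) = \hat{a} + \hat{b} \log t + \hat{u}_t$$
In this equation, $t = [rT], [rT] + 1, \ldots, T$, where T is the time span and r = 0.3 is the recommended value. The slowly varying function is $L(t) = \log(t + 1)$. The HAC t-statistic $t_{\hat{b}}$ of $\hat{b}$ is obtained from the OLS regression, and the null hypothesis of convergence is rejected when $t_{\hat{b}} < -1.65$.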
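The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the HAC standard error uses a Newey-West estimator with a Bartlett kernel, the default lag length follows a common rule of thumb, and the function name `log_t_regression` and the `panel[i][t]` layout are hypothetical.

```python
import math

def log_t_regression(panel, r=0.3, lags=None):
    """Log-t regression; returns (b_hat, t_stat).

    panel : list of per-unit series, panel[i][t] > 0, all of equal length T
    r     : fraction of initial periods discarded before the regression
    lags  : Newey-West truncation lag (rule-of-thumb default, an assumption)
    """
    n_units, n_periods = len(panel), len(panel[0])
    # Cross-sectional variance H_t of the relative transition paths h_it.
    H = []
    for t in range(n_periods):
        mean_t = sum(row[t] for row in panel) / n_units
        H.append(sum((row[t] / mean_t - 1.0) ** 2 for row in panel) / n_units)
    first = max(int(r * n_periods), 1)       # [rT] in 1-based time
    times = range(first, n_periods + 1)
    y = [math.log(H[0] / H[t - 1]) - 2.0 * math.log(math.log(t + 1))
         for t in times]
    x = [math.log(t) for t in times]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    if lags is None:
        lags = int(4 * (n / 100.0) ** (2.0 / 9.0))
    # Newey-West long-run variance of the slope's score (x_t - mean_x) * u_t.
    z = [(xi - mean_x) * ui for xi, ui in zip(x, resid)]
    s = sum(zi * zi for zi in z)
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1.0)    # Bartlett kernel weight
        s += 2.0 * weight * sum(z[k] * z[k - lag] for k in range(lag, n))
    se = math.sqrt(s) / sxx
    return b, b / se
```

A t-statistic above −1.65 is read as no evidence against convergence for the units in `panel`, while a value below −1.65 rejects it.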

Club Grouping
Based on the log-t method, Phillips and Sul (2007) proposed an endogenous identification algorithm for convergence clubs. The specific steps are as follows:
Sort: the panel units are sorted from largest to smallest according to the cross-section data of the last year.
(1) Form the core group: starting from the first-ranked unit, add one unit at a time and run the log-t regression. Each calculated $t_{\hat{b}}$ is compared with −1.65, and the procedure continues until $t_{\hat{b}}$ falls below −1.65 for the first time. Assuming that k (2 ≤ k < N) cross-section units satisfy the condition, the number of members k* (k* ≤ k) in the core group is determined by:
$$k^* = \arg\max_k \{ t_{\hat{b}}(k) \} \quad \text{subject to} \quad \min \{ t_{\hat{b}}(k) \} > -1.65$$
If k* = N, no separate convergence club exists and the entire panel converges. If the constraint $\min t_{\hat{b}}(k) > -1.65$ does not hold for k = 2, remove the highest-ranked unit and repeat the above steps for the remaining units.
(2) Club members: the cross-section units outside the core group are added to the core group one at a time for the log-t regression, and $t_{\hat{b}}$ is calculated. When it is greater than the critical value c (usually 0), the unit is added to the convergence club.
(3) Stop rule: after the first convergence club is formed, perform the log-t test on the remaining units. If the null hypothesis $H_0$ is not rejected, the remaining units form another convergence club. If the null hypothesis is rejected, repeat steps (1)-(3) for the remaining units.
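The core-group rule in step (1) can be written compactly once the sequential t-statistics are available. The sketch below assumes that the caller has already computed $t_{\hat{b}}(k)$ for the first k ordered units, k = 2, 3, ...; the function name `core_group_size` is hypothetical.

```python
def core_group_size(t_stats, crit=-1.65):
    """Choose the core-group size k* from sequential log-t statistics.

    t_stats[j] is the t-statistic of the log-t regression on the first
    k = j + 2 ordered units. Units are added until the statistic first
    fails to exceed `crit`; k* maximizes the statistic over the admissible k.
    Returns None when even k = 2 fails, in which case the highest-ranked
    unit should be dropped and the search repeated.
    """
    admissible = []
    for j, t in enumerate(t_stats):
        if t <= crit:
            break  # stop at the first failure of the constraint
        admissible.append((t, j + 2))
    if not admissible:
        return None
    best_t, best_k = max(admissible, key=lambda pair: pair[0])
    return best_k
```

For the sequence 1.2, 3.4, 2.0, −2.0 the search stops at k = 5 and selects k* = 3, the size with the largest admissible statistic.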

Half-Life Cycle
The half-life cycle model was put forward by Art Schneiderman. Its main idea is that, within a certain time range, the defect rate level declines linearly with time on a logarithmic scale. Given the initial defect rate and the lowest achievable defect rate, the half-life is the time required to halve the gap between the current level and the lowest level. The half-life cycle can also be calculated from the rate of convergence, in which case it is the time a country needs to reduce its income gap by half.
$Y_t$ represents the defect level at any time t, γ is the steady value, $Y_{t_0}$ is the initial defect level, and $t_{0.5}$ is the half-life cycle. The half-life cycle model can then be described by:
$$Y_t - \gamma = (Y_{t_0} - \gamma) \exp\left( -\frac{\ln 2 \, (t - t_0)}{t_{0.5}} \right)$$
After the first half-life cycle, the remaining descending space is:
$$Y_{t_0 + t_{0.5}} - \gamma = \frac{1}{2} (Y_{t_0} - \gamma)$$
After i half-life cycles ($i = (t - t_0)/t_{0.5}$, $i \in R_0^+$), the remaining improvement space is:
$$Y_t - \gamma = \left( \frac{1}{2} \right)^i (Y_{t_0} - \gamma)$$
The steady-state equation in the classic β convergence theory is:
$$Y_t - \gamma = e^{-\beta (t - t_0)} (Y_{t_0} - \gamma)$$
By combining the half-life cycle and the convergence rate, the following equation is obtained:
$$t_{0.5} = \frac{\ln 2}{\beta}$$
Therefore, the time required to reduce the gap by half can be calculated from the rate of convergence.
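The relation between the convergence rate and the half-life cycle is easy to check numerically. The short Python sketch below assumes the exponential form in which the gap to the steady value decays at rate β; the function names `half_life` and `gap_after` are illustrative.

```python
import math

def half_life(beta):
    """Years needed to halve the gap to the steady value at rate beta."""
    return math.log(2.0) / beta

def gap_after(initial_gap, beta, elapsed):
    """Remaining gap to the steady value after `elapsed` years."""
    return initial_gap * math.exp(-beta * elapsed)

if __name__ == "__main__":
    # Convergence rates reported for the two clubs in this paper.
    for name, beta in (("club 1", 0.0450), ("club 2", 0.0612)):
        print(name, round(half_life(beta)), "years")
```

With the rates of 4.50% and 6.12% reported for the two clubs, the formula gives about 15.4 and 11.3 years, which round to the 15-year and 11-year half-life cycles stated in the results.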

Data and Resources
To make the results more convincing, we need to use data from more provinces and more years.

Identification of Outliers
The data for the time-series analysis in this paper are the per capita energy consumption (PEC) of 30 provinces. After a preliminary inspection of all the sequences, the Ningxia Hui Autonomous Region is taken as an example for analysis. The per capita energy consumption data of the Ningxia Hui Autonomous Region from 1995 to 2016 are shown in Table 1.
First, a sequence diagram is drawn from the data, as shown in Figure 1. There is an obvious abrupt change in 2003, which suggests a possible outlier that must be confirmed by further analysis. The identification process of outliers mainly adopts the joint estimation diagnosis method proposed by Chen and Liu (1993). The good statistical and anti-interference properties of the model allow it to identify outliers correctly under normal circumstances. The first step is to establish the ARMA model by preliminarily judging whether the data fit the AR(1) or MA(1) model based on the autocorrelation function and partial autocorrelation function. The result of establishing the ARMA model is shown in Table 2. According to the Akaike Information Criterion (AIC), the AR(1) model, a first-order autoregressive model, is selected and its parameters are estimated. The location and type of the outlier are then determined by the joint estimation method. For the critical value, Chang (1988) proposed c = 4 when the sensitivity is low, c = 3.5 when the sensitivity is medium and c = 3 when the sensitivity is high. In general, we set the critical value c to 3 when the sample size is less than 200.
The inspection shows that the τ value of the Ningxia Hui Autonomous Region in 2003 (3.9683) is greater than the critical value of 3. Therefore, there is an outlier in Ningxia's per capita energy consumption in 2003, and its type is an innovational outlier (IO). With the identification of outliers completed, the next section turns to data correction.

Data Correction
In view of the data anomalies identified above, this paper adopts the regional convergence theory and then makes corrections according to the spatial correlation and geographical distribution characteristics between provinces and regions.

Club Grouping
The log-t method is used to divide the per capita energy consumption of 30 provinces into convergence clubs. First, the per capita energy consumption of the 30 provinces in 2016 is ranked from largest to smallest; Ningxia ranks highest. Taking Ningxia as the benchmark, we group Ningxia, Inner Mongolia, Qinghai, Xinjiang, Tianjin, Shanxi and Shanghai as the first core group through the core group test. Using the critical value c = 0, Liaoning, Hebei, Shandong, Jiangsu, Fujian, Shaanxi and Chongqing are added to the first club. The remaining provinces form the second club. The division and order of clubs in the 30 provinces are shown in Table 3, and the grouping of clubs is shown in Table 4 and Figure 2.
Table 3. Division and order of clubs in 30 provinces (columns: serial number; province; first club, steps 1 and 2; second club, steps 1 and 2).
Based on the β convergence theory in 2.2.1, we conduct a regional convergence test within the divided clubs. The corresponding data are substituted into the equation for calculation, and the results are shown in Table 5. As the results show, when the whole country is taken as the research object and clubs are not divided, the convergence rate of the 30 provinces is 3.05%. When the clubs are grouped, the β value of the first club is 0.0450, so its convergence rate is 4.50%. The β value of the second club is 0.0612, so its convergence rate is 6.12%. Compared with the whole country, both clubs show a higher convergence rate and a higher goodness of fit ($R^2$).

Half-Life Cycle Correction
Based on the per capita energy consumption data of 30 provinces and cities in China from 1997 to 2016, the convergence rate of the whole country and of each club is calculated, and then the half-life cycle is calculated. The results are shown in Table 6. Both clubs show unconditional β convergence. The convergence rate of club 1 is 4.50% and its half-life cycle is 15 years, indicating that the gaps between regions decrease at a rate of 4.50% and that it takes 15 years for the initial gap to be halved. Similarly, for club 2, the convergence rate is 6.12% and the half-life cycle is 11 years, indicating that regional differences decrease at a rate of 6.12% and that it takes 11 years to reduce the initial gap by half.
The outlier identified in 4.1 is the data point of the Ningxia Hui Autonomous Region in 2003, which is in the first club with a convergence rate of 4.50%. The steady-state value calculated is 9.3628.
For the Ningxia Hui Autonomous Region, the per capita energy consumption in 2003 is revised to 2490.53 kce/person according to the half-life cycle equation with data from 1997. The data change condition after the correction is shown in Figure 3.
After the correction, the data fluctuations become stable and no sharp changes occur. The identification and correction process of outliers used in this study is a general system. For the identification process based on the ARMA model, the larger the data volume, the stronger its applicability and the better the data fit. The theory of regional convergence groups regions with similar initial conditions and structural characteristics according to their geographical distribution and analyzes the development status within each group, which gives it good applicability.

Conclusions
Based on the theory of regional convergence, combined with the half-life cycle theory and the joint estimation method for outliers in time-series models, this paper first identifies and corrects the outlier in the 2003 per capita energy consumption data of the Ningxia Hui Autonomous Region, and then presents a reference correction model for energy outliers based on regional convergence theory and the characteristics of spatial distribution. The main conclusions of this paper are as follows:
(1) The time-series data fit an AR(1) model. Using the joint estimation diagnostic method for outliers, we calculate a τ value of 3.97 for the Ningxia Hui Autonomous Region in 2003, which is greater than the critical value, so the observation is identified as an abrupt outlier.
(2) β convergence exists nationwide with a convergence rate of 3.05%. According to the nonlinear time-varying factor model (log-t method), the 30 provinces are divided into two convergence clubs. The convergence rate of the first club (high per capita energy consumption) is 4.50%, and that of the second club (low per capita energy consumption) is 6.12%. The convergence rates of both clubs are higher than that of the whole country.
(3) Based on the half-life cycle model and the convergence rate, the half-life cycle of the first club is 15 years and that of the second club is 11 years. By constructing the half-life cycle model of β convergence theory, the revised value for the Ningxia Hui Autonomous Region is 2490.53 kce/person.
(4) The outliers identified in the paper are likely to be caused by human error, instrument failure and other errors in the process of data collection. This is a reminder to attach importance to data collection, strengthen its supervision, and take measures such as repeated calculations to keep the data accurate and valid.
This study takes China's per capita energy consumption as the research object, identifies the outlier and corrects it based on regional convergence theory. Taking into account the geographical distribution characteristics in regional convergence theory, the similarity within a club is used in the correction, which provides a reference method for improving the stability of data. The identification and correction process established in this paper is a general procedure and is applicable to other data. However, the correction model proposed here is only a reference method for outliers based on regional convergence theory and is not an absolute procedure. The stability of specific correction results also needs further study.