Stream Data Cleaning for Dynamic Line Rating Application

The maximum current that an overhead transmission line can continuously carry depends on external weather conditions, most commonly obtained from real-time streaming weather sensors. The accuracy of the sensor data is very important in order to avoid problems such as overheating. Furthermore, faulty sensor readings may cause operators to limit or even stop the energy production from renewable sources in radial networks. This paper presents a method for detecting and replacing sequences of consecutive faulty data originating from streaming weather sensors. The method is based on a combination of (a) a set of constraints obtained from derivatives in consecutive data, and (b) association rules that are automatically generated from historical data. In smart grids, a large amount of historical data from different weather stations are available but rarely used. In this work, we show that mining and analyzing this historical data provides valuable information that can be used for detecting and replacing faulty sensor readings. We compare the result of the proposed method against the exponentially weighted moving average and vector autoregression models. Experiments on data sets with real and synthetic errors demonstrate the good performance of the proposed method for monitoring weather sensors.


Introduction
In smart grids, large renewable sources are integrated into the grid by using overhead transmission lines. As renewable sources have a variable production capacity, the transmission lines often operate close to their maximum current limit. Reaching the maximum limit increases the temperature in the conductor [1], and uncontrolled high conductor temperatures can weaken the line and decrease the cable's elongation.
The thermal current limit (TCL) is defined as the maximum amount of electrical current that a cable's conductor can carry before deterioration [2]. In the calculation of the static rating (SR), or static thermal current limit, the conductor is considered to be operating under presumed atmospheric conditions; in dynamic line rating (DLR), the conductor is considered to be operating under real atmospheric conditions.
DLR is a technique that allows the TCL of power transmission lines to be increased without damaging their conductor [3]. In DLR, the current transmission capability (also known as ampacity) of a line is calculated in real time using weather information. This dynamic management of ampacity considers the physical and electrical properties of power cables to estimate the maximum allowable conductor temperature for a particular set of weather parameters [4,5].
DLR is commonly used when renewable sources with variable production capacity (such as solar power stations, wind farms, tidal or wave power plants) have to be integrated into the grid [3,6,7]. The production of these sources is not constant and depends on environmental conditions, such as wind speed or solar radiation. Employing dynamic ampacity management allows power companies to temporarily, as needed, increase the lines' transmission capability over the limits defined by static design in order to reduce the curtailment of electricity supplied from renewable energy sources.
One of the main problems in implementing such dynamic management is the presence of erroneous or missing weather sensor data. A lack of data forces the operator to return to static ampacity management, and it may result in reducing or even disconnecting renewable energy production. Faulty data may deceive the operator into allowing a current that heats the conductor above the maximum temperature that the cable was designed for. This could lead to serious damage to the conductor. Being able to detect and correct this faulty data in real time will allow power companies to use their resources efficiently, without putting the smart grids' components at risk.
Figure 1 shows examples of temperature sensor readings with two types of faulty observations: spike errors, and a sequence of consecutive errors. Spike errors are faulty sensor readings that occur over a short time, e.g., a single time stamp. A sequence of consecutive errors refers to continuous faulty sensor readings, e.g., over several hours. Real-world data is usually incomplete, noisy, and inconsistent [8]. Employing faulty data can lead to incorrect decisions and unreliable analysis. Therefore, data cleaning (data cleansing) has been a key area in data analysis and machine learning.
Most of the existing faulty data detection techniques are designed for offline applications where the entire data set is available. Moreover, the methods that have been proposed for online data cleaning are mainly effective for repairing spike errors [9].
Smoothing techniques are widely used for online data cleaning to eliminate noisy data [9]. Moving average (MA) [10], weighted moving average (WMA) [11], exponentially weighted moving average (EWMA) [11], and sliding window bottom-up (SWAB) [12] are examples of smoothing methods. MA smooths time series data by computing the unweighted mean of the last k points. In WMA, the data at different timestamps are given different weights. EWMA assigns exponentially decreasing weights over time. SWAB uses linear interpolation or linear regression to find the approximating model for the data.
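As a minimal illustration of the two simplest smoothers mentioned above, the following Python sketch implements MA over the last k points and EWMA with a smoothing factor. The function names, the parameter `alpha`, and the sample readings are assumptions made for this example only; they are not taken from the cited implementations.

```python
from collections import deque


def moving_average(values, k):
    """Unweighted mean of the last k points (fewer while the window fills)."""
    window = deque(maxlen=k)
    smoothed = []
    for v in values:
        window.append(v)
        smoothed.append(sum(window) / len(window))
    return smoothed


def ewma(values, alpha):
    """Exponentially weighted moving average; alpha in (0, 1] weights the newest point."""
    smoothed = []
    state = None
    for v in values:
        # The first point initializes the state; later points are blended in.
        state = v if state is None else alpha * v + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


# Hypothetical temperature readings with a persistent 4-degree jump.
readings = [20.0, 20.0, 24.0, 24.0, 24.0, 24.0]
ma = moving_average(readings, k=2)
ew = ewma(readings, alpha=0.5)
```

With these inputs, both smoothers move toward the new level within a few samples, which is the desired behavior for noise but also means that a persistent faulty offset is quickly absorbed into the smoothed signal.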
The detection and replacement of faulty data can also be done by autoregressive methods, such as AR [13][14][15], ARX [13,16], or ARIMA [13,17]. ARIMA consists of an autoregressive process and a moving average process. A data point is considered faulty if its prediction is significantly different from the observation.
Constraint-based replacement is also one of the techniques used for detecting and cleaning faulty data [18][19][20]. These methods are based on defining some constraints that the data should satisfy. In [20], a method for stream data cleaning based on speed constraints was proposed. According to this method, the derivative of the signal (the change in consecutive values over the time difference) should be bounded.
The above-mentioned methods are commonly used for univariate time series analysis, i.e., the models consider only one variable at a time. In many applications, such as environmental monitoring, several variables are continuously collected. In these applications, the univariate methods are limited by their inability to capture and model important dynamic interrelationships between the variables of interest.
Multidimensional models take into account the correlation and dependencies between different variables to improve data replacement. A multivariate EWMA control chart [21] is a technique commonly used to simultaneously monitor several correlated variables. However, this control chart is based on only the most recent observation. Another example of multidimensional models is the vector autoregression (VAR) model [22,23], which is a generalization of the univariate autoregressive model for forecasting a collection of variables, i.e., a vector of time series. VAR captures only the linear interdependencies among multiple variables.
The detection and replacement of faulty data based only on the change in a recent sensor reading is useful when dealing with spike errors. However, in the case of sequences of faulty data (e.g., lasting several hours), this method fails [9]. In this case, the replaced values for the faulty data gradually increase or decrease and eventually cause a large offset error after several inputs.
In smart grids, a large amount of historical data related to measurement readings is available but rarely used. In this work, we propose a method for exploiting historical data to detect and replace sequences of consecutive faulty observations originating from streaming weather sensors. The proposed data-driven method does not require any assumption about the underlying population from which the data are obtained. It is a combination of: (a) a set of constraints on the derivatives of sensor data (locally), and (b) a set of association rules automatically generated from historical data (globally). Therefore, it is not based only on the recent sensor readings. To generate the association rules, 3 years of historical data from the weather sensors of a power station in the north of Spain were used.
To evaluate the proposed method, experiments on data sets with real and synthetic errors were performed. From Figure 1b, we can observe that there are two types of consecutive errors: (1) a sequence of erroneous samples with an offset increase, and (2) zero-value samples. In this work, we considered the faulty data of type 1 and, in generating synthetic faults, we created a sequence of samples with an offset error.
The rest of the paper is structured as follows: Section 2 describes the calculation of ampacity and the proposed method; Section 3 demonstrates the results; Section 4 is devoted to the discussion; and Section 5 concludes the paper.

Weather Information and Ampacity
The ampacity limit is computed using both the physical characteristics of the cables and the weather conditions surrounding the conductor. According to CIGRE TB601 [5], the maximum current that can be transmitted by a cable is computed considering the steady-state heat balance equation that is shown in Equation (1):
$$P_J + P_M + P_S + P_I = P_C + P_R + P_W \qquad (1)$$
where P_J is the Joule heating, P_M is the magnetic heating, P_S is the solar heating, P_I is the corona heating, P_C is the convective cooling, P_R is the radiative cooling, and P_W is the evaporative cooling.
The heat balance equation according to IEEE 738 [4] does not consider magnetic heating, corona heating, and evaporative cooling, because their impact is usually insignificant compared to the other terms. The simplified IEEE 738 equation for non-steady-state (transient) heat balance includes the total heat capacity of the conductor, mC_p:
$$P_J + P_S = P_C + P_R + m C_p \frac{dT_c}{dt}$$
In addition, in steady-state conditions, dT_c/dt = 0, and, therefore, writing the Joule heating as P_J = I^2 R(T_c), with R(T_c) the conductor resistance at the conductor temperature T_c, the current rating I_DR can be computed by the following:
$$I_{DR} = \sqrt{\frac{P_C + P_R - P_S}{R(T_c)}}$$
The computation in both CIGRE TB601 and IEEE 738 is based on the weather conditions surrounding the overhead transmission lines. Figure 2 illustrates the basic architecture of a weather measurement system for ampacity computation.
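To make the steady-state rating concrete, the sketch below solves the simplified balance for the current, assuming that the Joule heating is written as I²R with R the conductor resistance per unit length at the conductor temperature. The function name and the numerical values are hypothetical and chosen only so that the arithmetic is easy to follow; computing the cooling and heating terms themselves from weather data is outside the scope of this sketch.

```python
import math


def steady_state_ampacity(p_c, p_r, p_s, resistance):
    """Current (A) from the steady-state balance I^2 * R + P_S = P_C + P_R.

    p_c, p_r, p_s: convective cooling, radiative cooling and solar heating (W/m).
    resistance: conductor resistance at the conductor temperature (ohm/m).
    """
    net_cooling = p_c + p_r - p_s
    if net_cooling <= 0:
        # Solar heating alone already exceeds the cooling: no current margin.
        return 0.0
    return math.sqrt(net_cooling / resistance)


# Hypothetical values: 80 W/m convective, 30 W/m radiative, 10 W/m solar, 1e-4 ohm/m.
rating = steady_state_ampacity(80.0, 30.0, 10.0, 1e-4)  # about 1000 A
```

Because the cooling terms depend on wind speed and ambient temperature, an error in either sensor propagates directly into the rating through the numerator.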
The weather stations include temperature, humidity, wind, radiation, rain, and atmospheric pressure sensors. However, prototype versions of the stations usually lack rain and pressure sensors, as they are not used for the calculation of ampacity, and so these measurements are not present in the long-term data. It is worth pointing out the need to use ultrasonic wind sensors, as recommended by CIGRE TB299 [24]. The reason is the high precision required at very low wind speeds, as the change between natural convection cooling and forced convection cooling happens when the wind speed is lower than 1 m/s [4]. Usually, the weather sensors are installed in substations to facilitate access to the hardware if needed. However, sometimes it is required to install weather stations at midpoint locations along the power lines. Past experience shows that locating the sensors along the lines can cause problems, such as loss of signal, SIM card deterioration, and sensor power failure. An example of this type of installation can be seen in Figure 3.
The accuracy of the ampacity calculation depends on the accuracy of the weather sensor readings. Figure 4 illustrates the general relation between the accuracy of the input data and the ampacity calculation. In this figure, the uncertainty of the ampacity computation process is shown by the gray area, which depends on the weather parameter that is faulty or unavailable. In this work, we had access to the data of only one substation in the north of Spain. For this station, we assumed that sensors fail independently of each other. Furthermore, problems such as communication issues or power loss, which corrupt many or all of the data at the same time, were not considered.

Approach
Figure 5 illustrates the system architecture of our proposed method. The method requires historical environmental conditions from the location where the weather sensors are installed. The historical data are used for two purposes: (a) calculating the derivative for every two consecutive data inputs, and (b) generating a set of rules to estimate the correlation between the sensor readings under different weather conditions. Furthermore, to combine the results of (a) and (b), we accounted for the confidence of the rules.

Signal Derivative Constraints
To identify which data points are faulty, we exploited the observation that the derivative of the signal is often bounded (the so-called speed constraints [18,20]). Given that weather variables are slow changing, the difference between two consecutive samples should not be very large.
Consider the historical data Z with columns j ∈ [1, ..., k] representing the numerical attributes of observations i ∈ [1, ..., n]. Each z_{i,j} has a timestamp t_i. For any consecutive observations z_{i−1,j}, z_{i,j}, the derivative is defined as the absolute ratio of the change in the value z over time t as
$$S^j_i = \left| \frac{z_{i,j} - z_{i-1,j}}{t_i - t_{i-1}} \right|$$
A derivative constraint for attribute j (δ_j) is defined as the maximal derivative over all of its consecutive observations, δ_j = max(S^j_2, S^j_3, ..., S^j_n). The derivative constraint was calculated separately for each attribute j ∈ [1, 2, ..., k], i.e., ∆ = {δ_1, δ_2, ..., δ_k}.
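The computation of the derivative constraints can be sketched in a few lines of Python. The function name, the list-of-lists data layout, and the sample values are assumptions for illustration; the sketch simply takes, for every attribute, the largest absolute change per unit time between consecutive historical observations.

```python
def derivative_constraints(timestamps, rows):
    """Return [delta_1, ..., delta_k]: the largest absolute change per unit time
    observed between consecutive rows, computed separately for each attribute."""
    k = len(rows[0])
    deltas = [0.0] * k
    for i in range(1, len(rows)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        for j in range(k):
            derivative = abs(rows[i][j] - rows[i - 1][j]) / dt
            deltas[j] = max(deltas[j], derivative)
    return deltas


# Hypothetical history: timestamps in minutes, columns = (temperature, wind speed).
timestamps = [0, 1, 2, 4]
rows = [[20.0, 3.0], [20.2, 3.5], [20.1, 2.9], [20.5, 3.1]]
deltas = derivative_constraints(timestamps, rows)  # roughly [0.2, 0.6]
```

Dividing by the time difference rather than assuming a fixed sampling period keeps the constraint meaningful when the history contains gaps from missing values.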

Association Rule Mining
Usually, association rule mining is applied to discrete data [25]. Therefore, the historical data need to be transformed into discrete values before generating the rules. In data discretization, the numerical attributes are replaced by interval labels (e.g., 0-3) or conceptual labels (e.g., Spring). There are several techniques for data discretization, such as using experts' knowledge, binning (equal-width or equal-frequency), histogram analysis, clustering, decision trees, and correlation analysis [8].
Figure 6 illustrates part of the data set before and after discretization. Here, we used the same notation as before for representing the historical data Z, which is a two-dimensional real-valued matrix with size n × k. After the data discretization, matrix Z was transformed into a two-dimensional Boolean matrix Z̃ with size n × m, where m > k. A value of 1 in the Boolean matrix Z̃ indicates the presence of a feature (item) in an observation, and a value of 0 indicates the absence of the feature.
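The transformation from a real-valued n × k matrix to a Boolean n × m matrix can be sketched as follows, assuming equal-width binning (one of the discretization options listed above, and not necessarily the one used for every attribute in this work). All function names and sample values are hypothetical.

```python
def equal_width_edges(values, n_bins):
    """Bin edges that split [min, max] of the values into n_bins equal intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + b * width for b in range(n_bins + 1)]


def bin_index(value, edges):
    """Index of the interval containing value; out-of-range values go to the edge bins."""
    n_bins = len(edges) - 1
    for b in range(n_bins):
        if value < edges[b + 1]:
            return b
    return n_bins - 1


def to_boolean_matrix(rows, n_bins):
    """One-hot encode every attribute, so that m = sum(n_bins) > k columns result."""
    k = len(rows[0])
    edges = [equal_width_edges([r[j] for r in rows], n_bins[j]) for j in range(k)]
    boolean_rows = []
    for r in rows:
        items = []
        for j in range(k):
            one_hot = [0] * n_bins[j]
            one_hot[bin_index(r[j], edges[j])] = 1
            items.extend(one_hot)
        boolean_rows.append(items)
    return boolean_rows, edges


# Hypothetical data: columns = (wind speed, ambient temperature).
rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
boolean_rows, edges = to_boolean_matrix(rows, n_bins=[2, 3])
```

Each row of the result contains exactly one 1 per original attribute, so every observation can equally be read as a set of k items drawn from the m available ones.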
We define the set of items of Z̃ as I = {I_1, I_2, ..., I_m}. Each observation z̃_i in the data set may or may not contain a specific item, e.g., z̃_1 = {I_1, I_2, I_5} means that observation z̃_1 only contains items I_1, I_2, and I_5.
The objective of mining association rules is to find the most frequently occurring combinations of items in a data set [26]. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A, B are disjoint itemsets, i.e., A ∩ B = ∅. In this case, the itemset A = {a_1, a_2, ...} is the prior and the itemset B = {b_1, b_2, ...} is the posterior of the rule. We define X_A as X_A = {z̃_i ∈ Z̃ that contains items A}. In this case, the support of itemset A, represented by S(A), is the ratio of the cardinality of X_A over the cardinality of the data set Z̃ (number of observations) [26]:
$$S(A) = \frac{|X_A|}{|\tilde{Z}|}$$
The support of a rule, denoted as S(A ⇒ B), is the percentage of observations in the data set that contain both A and B:
$$S(A \Rightarrow B) = \frac{|X_A \cap X_B|}{|\tilde{Z}|}$$
The confidence of an association rule is the percentage of examples containing A that also contain B; or, in other words, a fraction that shows how frequently B occurs among all the observations containing A:
$$C(A \Rightarrow B) = \frac{S(A \Rightarrow B)}{S(A)}$$
The confidence value indicates how reliable the rule is.
The lift of an association rule is the ratio of the confidence of the rule to the frequency of observations containing B. It is a value between 0 and infinity that measures the deviation of a rule from statistical independence:
$$L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{S(B)}$$
A lift value smaller than 1 indicates a negative correlation, equal to 1 indicates no correlation, and greater than 1 indicates a positive correlation between features A and B among all the observations.
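The three rule measures defined above can be checked by hand on a tiny example. The sketch below represents each discretized observation as a Python set of item labels; the labels and the five observations are invented for illustration and do not come from the data set used in this work.

```python
def support(transactions, itemset):
    """Fraction of observations that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, prior, posterior):
    """How often the posterior occurs among observations containing the prior."""
    both = set(prior) | set(posterior)
    return support(transactions, both) / support(transactions, prior)


def lift(transactions, prior, posterior):
    """Confidence of the rule relative to the base frequency of the posterior."""
    return confidence(transactions, prior, posterior) / support(transactions, posterior)


# Hypothetical discretized observations.
transactions = [
    {"hour=12-15", "srad=high", "temp=warm"},
    {"hour=12-15", "srad=high", "temp=warm"},
    {"hour=12-15", "srad=high", "temp=mild"},
    {"hour=0-3", "srad=low", "temp=cold"},
    {"hour=0-3", "srad=low", "temp=mild"},
]
prior = {"hour=12-15", "srad=high"}
posterior = {"temp=warm"}

rule_support = support(transactions, prior | posterior)    # 2/5
rule_confidence = confidence(transactions, prior, posterior)  # (2/5) / (3/5)
rule_lift = lift(transactions, prior, posterior)           # confidence / (2/5)
```

Here the lift is greater than 1, so a warm temperature is more likely under the prior conditions than in the data set as a whole. The helpers assume that the prior and the posterior each occur at least once; otherwise the divisions are undefined.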

Detection of Faulty Data
We used both the derivative constraints and the association rules for detecting faulty data. For the derivative constraint, only the last two observations are considered, i.e., each attribute in the new observation O^j_t is evaluated "locally" against the value of the same attribute from the previous observation. If, for any feature (e.g., feature j, where j ∈ [1, ..., k]), the difference between the two values exceeds the constraint (δ_j), the current observation will be labeled as faulty. Furthermore, each attribute in the new observation is also compared against the association rules generated from all the historical data (globally). If the current observation from the streaming data falls outside the intervals identified by the rules, it will be labeled as a faulty observation.
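A minimal sketch of this two-part check is given below. It assumes that an attribute is flagged when either test fails, and that the interval implied by the matching rule has already been looked up and is passed in as a plain dictionary; both the function signature and this data layout are hypothetical.

```python
def flag_faulty(previous, current, dt, deltas, rule_intervals):
    """Return the indices of the attributes in `current` that look faulty.

    previous, current: lists of k attribute values at consecutive timestamps.
    dt: time difference between the two observations.
    deltas: derivative constraint per attribute (local check).
    rule_intervals: dict mapping attribute index -> (low, high) taken from the
        matching association rule (global check); attributes without a rule are
        only checked locally.
    """
    flagged = []
    for j, (prev, cur) in enumerate(zip(previous, current)):
        local_violation = abs(cur - prev) / dt > deltas[j]
        interval = rule_intervals.get(j)
        global_violation = interval is not None and not (interval[0] <= cur <= interval[1])
        if local_violation or global_violation:
            flagged.append(j)
    return flagged


deltas = [0.25, 0.75]          # hypothetical constraints: temperature, wind speed
intervals = {0: (18.0, 23.0)}  # hypothetical rule interval for temperature

# A sudden jump violates the local constraint.
jump = flag_faulty([20.0, 3.0], [24.5, 3.2], 1.0, deltas, intervals)
# A slow drift passes the local check but leaves the rule interval.
drift = flag_faulty([24.4, 3.0], [24.5, 3.2], 1.0, deltas, intervals)
```

The second call shows why the global check matters for sequences of consecutive errors: once the signal has settled at a wrong level, consecutive differences are small again and only the rule interval reveals the fault.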

Replacing of Faulty Data
When a new observation of one attribute is labeled as faulty, the "estimation" of the correct value needs to be performed. First, we define C = {c_1, c_2, ...} to be the combination of all items except the faulty attribute. In the list of association rules, we search for the rules which have C as prior, i.e., X_C = {z̃_i ∈ Z̃ that contains items C}. The rule with the highest confidence specifies the intervals for the faulty observation.
Then, we search the historical data Z for the observations whose discretized counterparts in Z̃ contain C. Let us call this Y_C = {z_i ∈ Z that corresponds to C in Z̃}. The average of the derivatives in Y_C for the faulty attribute j specifies the change between the previous and current observations. For example, assume the attribute temperature is faulty, and the generated rule with the highest confidence has the temperature interval [24.99, 36.30] as its posterior. According to this rule, the correct value for the temperature sensor would be within the interval [24.99, 36.30]. In this example, the terms before the arrow sign "⇒" belong to the itemset C. In the historical data Z, all the observations that are within the intervals of C correspond to Y_C.
The faulty observation O j t will be replaced by Ôj t using the following formula.
In power companies, if there is a missing value or faulty data input from weather sensors, the experts manually replace it by looking into previous observations (for example, the last 10 days). Formula (8) automatically applies the same concept while using the historical data for 3 years, searching for similar conditions, and considering all the attributes at the same time.
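One plausible reading of the replacement step described above, offered as a sketch rather than as the exact formula used in this work, is to advance the last accepted value by the average change that the faulty attribute showed under similar historical conditions. The use of signed changes, the fallback when no similar history exists, and all names below are assumptions of this example.

```python
def estimate_replacement(prev_value, dt, history_changes, history_items, context):
    """Estimate the current value of a faulty attribute.

    prev_value: last accepted (or previously estimated) value of the attribute.
    dt: time elapsed since that value.
    history_changes[i]: signed change per unit time of the attribute at
        historical observation i (assumed precomputed).
    history_items[i]: set of discretized items of historical observation i.
    context: itemset C, i.e., the items of the current observation without the
        faulty attribute.
    """
    similar = [c for c, items in zip(history_changes, history_items) if context <= items]
    if not similar:
        # No comparable history: hold the previous value (an assumption).
        return prev_value
    mean_change = sum(similar) / len(similar)
    return prev_value + mean_change * dt


# Hypothetical history of temperature changes (degrees per minute).
history_changes = [0.10, 0.20, -0.30]
history_items = [
    {"hour=12-15", "srad=high"},
    {"hour=12-15", "srad=high", "hum=low"},
    {"hour=0-3", "srad=low"},
]
context = {"hour=12-15", "srad=high"}
estimate = estimate_replacement(25.0, 4.0, history_changes, history_items, context)
```

In this example, two historical observations match the context, their mean change is 0.15 degrees per minute, and the estimate after 4 minutes is therefore about 25.6 degrees. Because the estimate is anchored on conditions rather than on the faulty readings themselves, it does not drift toward the offset as a smoother would.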
This estimated value of Ô^j_t should be within the intervals captured from the rules. However, in some cases, the estimated value is not within the limits. In these cases, the confidence of the rule is considered. The confidence value indicates how reliable a rule is and is used as a weight to modify the limits by using the following equation.
In this equation, I_max is the maximum of the rule interval, I_min is the minimum of the rule interval, C_rule is the confidence of the rule, and N_Ô^j_t is the new estimation of the faulty observation. In the next section, we demonstrate that, by using the confidence value as a weight to modify the limits, we can improve the accuracy of replacing faulty data.

Results
The available data set contains 337,035 weather sensor readings from 10 December 2014 until 16 August 2017 for one weather station. The data set does not contain any faulty data; however, there are some missing values. From this data set, 310,000 observations were used for generating association rules and calculating speed constraints. The last part of the data (test set) was used for evaluation. This test data set contains multivariate real-time measurements.
The methodology was developed using the R programming language, with RStudio as the graphical front end. The R code was run on a PC configured with a 2.50 GHz Intel(R) Core(TM) i5-4300U CPU and 8 GB of memory.
Based on 310,000 observations, the maximum change constraints ∆ (within 1 min) for the attributes solar radiation, ambient temperature, humidity, wind speed, and wind direction are δ_Srad = 50, δ_Tamb = 0.25, δ_H = 1.5, δ_Wmag = 0.75, and δ_Wdir = 45 (see Figure 7). In the calculation of the derivative constraints, only the absolute values of the changes were considered. In order to generate association rules, we first discretized the values in the training data set. We used two methods for data discretization: (1) consulting with experts, and (2) using k-means clustering [8]. Using k-means, the numerical attributes ambient temperature, solar radiation, humidity, and wind speed were clustered into 10, 7, 7, and 7 categories, respectively. The number of clusters was chosen based on the quality measures of the final estimations.
In addition to the available attributes, we added Hour and Month based on the timestamps in the data set.The Hour corresponds to the time of the day, which was also discretized into equal-width intervals.The Month corresponds to the month number.
We refer to our proposed association rule-based method as:
• A-rule(EC), when using Experts' knowledge for data discretization and the Confidence is used as a weight to modify the limits based on Equation (9).
• A-rule(E), when using Experts' knowledge for data discretization and confidence is not considered.
• A-rule(K), when using K-means for data discretization and confidence is not considered.
After data discretization, we applied association rule mining using the "apriori" function implemented in the R arules library. The thresholds for confidence and lift were set to 60% and 1, respectively. For each attribute as posterior (e.g., ambient temperature), the rules were generated separately. Accordingly, 1235 rules were generated for predicting the ambient temperature, and 329 rules for predicting wind speed. Some of these rules are presented in Table 1.
Table 1. Example of the generated association rules with confidence greater than 60% and lift greater than 1 for ambient temperature and wind speed as posterior.
The real faulty samples shown in Figure 1b, which were collected from another power station, were used to visually evaluate the performance of the A-rule(EC) method. These data contained several faulty sensor readings corresponding to several hours. In these data, only the temperature sensor was faulty, and the other sensors were correct. The results of applying our method are shown in Figure 8. Since we did not have access to the real temperature, we used historical data from a nearby weather API station as the ground truth. According to the figure, the estimation of the faulty observations is very close to the values we captured from the API.

Discussion
Unfortunately, we did not have access to a sufficient number of real faulty sensor readings to evaluate our method with real data. Therefore, we generated synthetic errors and evaluated the performance of the proposed method using artificial faulty observations. To generate synthetic faults, a sequence of samples from the test set was selected and an offset error was added to them.
Figures 9 and 10 show four examples with synthetic errors. In these figures, in addition to repairing the faulty data using our proposed A-rule(EC) method, we also repaired the data by utilizing the EWMA and VAR techniques. The parameters for the VAR model were calculated using the "VAR" function in the R vars library.
According to Figures 9 and 10, the EWMA technique could not correctly replace the faulty observations. After a few faulty samples, the estimation based on the EWMA became very close to the faulty observations. The VAR method was better than EWMA in the replacement of faulty data, but the estimation was still very far from the actual values. On the other hand, our proposed association rule method outperformed the other two methods. In addition to the examples presented in Figures 9 and 10, we picked several other samples. To evaluate our proposed method, we selected 100 series of 200 observations from the test set. The samples between 70 and 170 in the series were changed by adding an offset to create a sequence of 100 consecutive faulty observations. The offset error for the temperature sensor was set to 4 degrees Celsius. These 100 observations correspond to 400 minutes. Then, these data sets were examined with our association rule-based method and the VAR and EWMA methods.
For all 100 data sets, the average of the difference between the real observations and the corrected values was calculated as the estimation error. Figure 11 shows the results of replacing temperature data in all 100 data sets. Table 2 presents the mean value with the 95% confidence interval (CI) and the minimum and maximum error when the number of faulty samples is 25, 50, 75, and 100.
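The evaluation protocol can be sketched as follows: an offset is added to a window of an otherwise clean series, and the estimation error is the average difference between the true values and the repaired ones over that window. The use of the absolute difference, the half-open index window, and the constant test series are assumptions made for this illustration.

```python
def inject_offset(series, start, stop, offset):
    """Copy of the series with `offset` added to samples start..stop-1."""
    return [v + offset if start <= i < stop else v for i, v in enumerate(series)]


def mean_abs_error(truth, estimate, start, stop):
    """Average absolute difference between truth and estimate over the faulty window."""
    errors = [abs(t - e) for t, e in zip(truth[start:stop], estimate[start:stop])]
    return sum(errors) / len(errors)


# Hypothetical clean series of 200 temperature readings.
truth = [20.0] * 200
faulty = inject_offset(truth, 70, 170, 4.0)

# Error of a "repair" that leaves the faulty readings untouched.
baseline_error = mean_abs_error(truth, faulty, 70, 170)
```

Leaving the readings unrepaired gives an error equal to the injected offset (4 degrees here), which is the level that a repair method approaches once its estimates have converged to the faulty signal.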
According to the results presented in Figure 11 and Table 2, for all the methods, the estimation error rose as the number of faulty samples increased. The EWMA could not correctly replace the faulty data, and its error reached the maximum value of 4 (corresponding to the 4-degree Celsius offset) after a few observations. The VAR method worked well if the number of faulty samples was small, e.g., fewer than 10 samples. However, for more than 25 faulty samples, its error was higher than that of the association rule-based methods. Among the three rule-based methods, the error of A-rule(EC), which uses the confidence of the rules and applies experts' knowledge for data discretization, was the lowest. Furthermore, its 95% CI in estimating the correct temperature was narrower than for all other methods. This shows that the A-rule(EC) method is also more robust in detecting and replacing faulty data compared to the other methods. In Table 2, the gray colored rows correspond to the lowest mean estimation error.
In this work, we only present the results for cleaning two attributes, ambient temperature and wind speed, but the methodology can be applied to other attributes as well.
In general, for most of the examples, our method outperformed EWMA and VAR. There are some situations in which our method failed to replace the faults. Figure 12 demonstrates an example where A-rule(EC) failed to correctly replace the faulty samples, especially after several observations (after time 21:00 in the figure). Further analysis showed that these situations occurred because no rules had been generated from the historical data for the conditions in question. Since we only considered rules with confidence greater than 60%, not all conditions were covered. One way to improve this is to use a larger historical data set and generate the rules based on that.
Moreover, the proposed method performed better when the historical data had been collected at the same location where the method was going to be used. Environmental changes are closely tied to geographical position, and when the method was "trained" with a data set whose environmental conditions were completely different from those of the test data, it failed. However, for stations not very far from each other (such as the station in Figure 8), the method showed very good performance. The main drawback of the proposed method is the assumption that only one sensor reading is faulty at a time and the other sensors are correct. This might be problematic when there is a communication problem and all the data are missing. To continue this work, the authors are considering using the information from neighboring stations to detect and replace faulty observations.
In smart grids, intelligent sensors distributed throughout the grid allow for the continuous collection of valuable data. In addition, large amounts of information related to historical faults, repairs, reported alarms, and so on, are recorded in different ways. This information can be used for many purposes, including data cleaning, fault detection, failure prediction, and load forecasting. However, most electric power companies do not fully utilize these data and often do not realize the benefits of doing so. In this paper, we demonstrate that applying data-driven methods to the historical data provides valuable information that can be used for detecting and replacing faulty sensor readings.

Conclusions
In this paper, we propose a method for detecting and replacing sequences of consecutive faulty data originating from streaming weather sensors. To detect faulty data, a combination of the derivative constraints and the association rules was used. Replacing the faulty data was done based on association rules automatically generated from the historical data. Furthermore, when replacing faulty data, we took into account the confidence of the rules. This means that instead of using the rules as crisp inferences, the rules are weighted by their confidence.
In order to evaluate the method, experiments on real data with real and synthetic errors were performed. The results show that the proposed A-rule-based method outperforms commonly used methods, such as EWMA and VAR, especially for sequences of consecutive faulty weather sensor readings. Furthermore, among the three rule-based methods, the error of A-rule(EC), which uses the confidence of the rules and applies experts' knowledge for data discretization, is the lowest.

Figure 1. Examples of temperature sensor readings with faulty observations.

Figure 3. An example of an installation of a weather station on a transmission tower.

Figure 4. The general relation between the accuracy of input data and ampacity calculation.

Figure 5. System architecture of the proposed method.

Figure 6. Part of the data set before (Z) and after discretization (Z̃).

Figure 7. Histogram plot of the change per minute for the attributes solar radiation, temperature, humidity, and wind speed for all the historical data. The absolute values of the changes were used for estimating the constraints.

Figure 8. Detection and correction of real faulty data from Figure 1b.

Figure 9. Detection and correction of synthetic faulty data from a temperature sensor.

Figure 10. Detection and correction of synthetic faulty data from a wind speed sensor.

Figure 11. The average of the difference between the real observations and corrected values for 100 data sets for cleaning temperature data.

Figure 12. Detection and correction of faulty data from a temperature sensor, when A-rule(EC) does not perform well.

Table 2. Error in cleaning temperature data with a different number of faulty samples for 100 data sets.