To investigate the potential of wind power monitoring and control based on big data surroundings, a monitoring algorithm that incorporates synchrophasor measurements was developed. As described earlier, it has the characteristics of an adaptive and modular application that can easily be installed and commissioned on the existing infrastructure. It also allows later upgrades and integration into large-scale applications.

#### 2.1. Big Data Surroundings

The power system infrastructure produces huge amounts of data. The nonlinear nature of this data makes the extraction of useful information complicated [14]. Compared to standard mathematical models, data mining techniques are non-deterministic and provide a feasible and valid solution that is not exact but is simple to obtain, concise, practical and easy to understand. This characteristic is especially suitable for processing the big data streams that are inevitably involved. As mentioned earlier, large wind power capacities are being installed and connected at different voltage levels. Every wind turbine, wind measuring mast inside the wind park, transformer substation, etc. is a source of large quantities of data every second. These data streams can be further expanded by installing new sensor arrays. Such large quantities of data may be deemed unnecessary, but with the use of different big data algorithms a way to monetize this data can be found.

The most important data that can and should be used in power system data mining algorithms is the data for state estimation and future power system state predictions. These data streams can be classified into three main groups:

Phasor value measurements;

Load and production measurements;

Measurements of other influential variables.

Phasor values such as voltages and currents, together with their phase angles, can be gathered through PMU measurements and can provide valuable insights into system operation. Load and generation data with exact time stamps can also easily be measured and collected for later use in different analyses.

Other variables that are not directly connected to power system monitoring and control can also be highly influential. These include meteorological data from various kinds of measurement systems, of which the most important are wind speed and direction, air temperature, humidity and pressure, and solar irradiance. Together with meteorological data, other measurements such as conductor temperatures, overhead line sags, partial discharges, and the current transmission line capacity obtained by dynamic line rating (DLR) systems can also be collected [15]. All these data series can be used in wind and solar power system monitoring and control as well as in load forecasting and the assessment of power evacuation possibilities. The prerequisite is an efficient solution for data transmission and processing.

#### 2.2. Data Mining Scope

As described earlier, the huge amounts of data inside the power system create the big data surroundings. The non-linear nature of the system makes the definition of new models for extracting useful information from heaps of gathered data even more demanding [16].

The use of data from wind power plants is especially demanding, since these stochastic sources produce even larger amounts of data because of the many variables on which the output power depends.

A good data mining scope therefore integrates a wide range of variables. This paper defines a simplified model comprising:

Wind power plant active and reactive power production (P_{Wind}, Q_{Wind}), at wind power plant point of common coupling (PCC);

Wind power plant active and reactive power settings (P_{Settings}, Q_{Settings}), which are operational decisions for the settings of the wind power controller placed at the wind power plant PCC;

Total system load measurements (P_{L}), expressed as a percentage of nominal load;

Voltage amplitude and angle (phasor) measurements (V_{i}, δ_{i}) on selected nodes in the system;

Line, transformer and generator availability information.

Each operating condition (**OC**) is defined as a mathematical set whose members are the variables of the data mining scope defined above. The index i = 1, 2, 3, …, n runs over the nodes, where n is the number of nodes in the power system with measurements of effective voltage values and voltage angles. The index k = 1, 2, 3, …, m runs over the input states, where m is the total number of input states over which the data mining techniques are analyzed.

The abovementioned data can be expanded by defining the finely tuned fractal structures attached to it:

Total wind power can be divided into the wind power of a single wind turbine or a cluster of turbines;

Total system load can be divided into loads on busbar, consumer, or load area level;

Voltage amplitudes and angles can be enhanced with current amplitudes and angles for each branch as well as Thevenin impedance measurements;

Wind production is determined by wind speed and can be further detailed with wind direction, air temperature and pressure, solar irradiance and air humidity measurements;

Line and transformer availability can further be described through breaker status in line bays and transformer bays or through transformer and line monitoring systems.

All this data needs to be organized into large, well-structured databases for further use in control, planning, asset management and operation and maintenance (O&M) optimization processes. Therefore, to take full advantage of the available data, efficient algorithms for big data analysis are needed.

#### 2.3. Proposed Algorithm Design

The aim of the developed algorithm is to create a new kind of early warning signal (EWS) and recognize the structure of critical transitions for transmission system and wind power operators in the form of a situational awareness (SA) indicator [17]. These signals should warn the operators that an alarming operating condition could be reached and that preventive or corrective actions should be taken (e.g., wind power curtailment or increased reactive power support) to move the system back to the normal operating state, as shown in Figure 4. The EWS, as a situational awareness indicator, serves as the main triggering signal for operating decisions on wind power settings, with the goal of returning the operating condition to the EWS value NORMAL. The EWS could therefore serve as a first line of defense that reduces the risk of total or partial system blackouts and thus the opportunity costs of electric energy not being delivered.

Commonly used data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) are C4.5, k-Means, Support Vector Machine (SVM), Apriori, PageRank, AdaBoost, Neural Networks, Naive Bayes and Classification and Regression Trees (C&RT). These algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development. A review of the applications of data mining in power systems is given in [18].

The approach described here combines elements of classification, clustering and statistical learning in one algorithm. It also provides a combined solution for monitoring and for operating decisions on preventive measures.

A basic workflow diagram of the proposed algorithm is shown in Figure 5. The first step in the algorithm is data management and preparation, which consists of time synchronization, format unification, and ordering of historical raw data from actual power system measurements. Synthetic data produced by various kinds of simulations based on mathematical models is also included in this step. In this paper, the DIgSILENT PowerFactory power system analysis software [19] is used as the tool for producing simulation data.
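The data preparation step can be illustrated with a minimal Python sketch. It assumes that every record is a dictionary with the hypothetical keys `time`, `name` and `value`, that time stamps arrive either as ISO 8601 strings or as Unix epoch seconds, and that time stamps without a zone are in UTC. None of these conventions come from the measurement systems or simulation tools used in this paper.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Return a UTC datetime from an ISO 8601 string or Unix epoch seconds."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: time stamps without a zone are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def prepare(records_by_source):
    """Unify records from several sources into one time-ordered list."""
    unified = []
    for source, records in records_by_source.items():
        for rec in records:
            unified.append({
                "time": parse_timestamp(rec["time"]),  # time synchronization
                "source": source,
                "name": rec["name"],
                "value": float(rec["value"]),          # format unification
            })
    unified.sort(key=lambda r: r["time"])              # ordering
    return unified


# Hypothetical raw data: measured PMU values and one simulated value.
raw = {
    "pmu": [
        {"time": "2020-01-01T00:00:02+00:00", "name": "V_1", "value": "1.02"},
        {"time": "2020-01-01T00:00:00+00:00", "name": "V_1", "value": "1.01"},
    ],
    "simulation": [
        {"time": 1577836801, "name": "P_Wind", "value": 35.5},
    ],
}
prepared = prepare(raw)
```

Measured and synthetic records end up in one list with a common time base, which is the form the later clustering step needs.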

The input data vector in the clustering process consists of the operating condition variables defined in Section 2.2.

In this way, mathematically defined power system states serve as input data to the algorithm. It is important to note that, in addition to the variables defined here, the input set of system states can be extended with a whole range of additional input signals, such as data from devices measuring electrical and non-electrical quantities, meteorological measuring devices, sensors and other devices. The model is therefore adaptive and modular. It is easy to upgrade by simply expanding the operating condition (**OC**) data set.
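The modular character of the operating condition can be sketched in Python as follows. The class name `OperatingCondition`, its field names, the `extra` dictionary and the ordering of the vector are illustrative assumptions, not the data layout used in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OperatingCondition:
    p_wind: float          # P_Wind at the PCC
    q_wind: float          # Q_Wind at the PCC
    p_load: float          # P_L, percent of nominal load
    voltages: List[float]  # V_i for i = 1..n
    angles: List[float]    # delta_i for i = 1..n
    # Optional additional signals (e.g. meteorological data), keyed by name.
    extra: Dict[str, float] = field(default_factory=dict)

    def to_vector(self) -> List[float]:
        """Flatten the operating condition into one input vector."""
        if len(self.voltages) != len(self.angles):
            raise ValueError("each node needs a voltage magnitude and an angle")
        base = [self.p_wind, self.q_wind, self.p_load,
                *self.voltages, *self.angles]
        # Sorted keys keep the extended part in the same order for every OC.
        return base + [self.extra[key] for key in sorted(self.extra)]


# Hypothetical values for a system with two measured nodes.
oc = OperatingCondition(40.0, 5.0, 80.0, [1.01, 0.99], [0.0, -3.2],
                        extra={"wind_speed": 9.5})
vector = oc.to_vector()
```

Adding a new signal only requires a new entry in `extra`; the base variables and the code that consumes the vector stay unchanged.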

The second step is data clustering, with the aim of defining system states on a given database or set of operating conditions. For the algorithm design described in this paper, the analytics software package Statistica [20] was used. The standard variable definition from statistical theory was applied, where an independent variable (also called an experimental or predictor variable) is manipulated in an experiment to observe the effect on a dependent variable (also called an outcome variable). To be a representative sample, the total set of operating conditions in this example needs to be large enough and cover all possible system states. The K-Means algorithm with Euclidean distances was used for clustering the initial data set.

The data were finally grouped into three clusters, which describe normal (NORMAL), transition (WARNING) and problematic (ALARM) conditions. It is important to stress that all three system states must be present in the input datasets for this part of the algorithm to yield a viable solution.
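A minimal sketch of K-Means clustering with Euclidean distances into three clusters is given below. It is a plain Python illustration, not the Statistica procedure used in this paper. The deterministic farthest-point initialization, the two-feature toy data and the rule that names clusters by their mean voltage are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centroids(points, k):
    """Farthest-point initialization (deterministic, illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def k_means(points, k=3, max_iter=100):
    centroids = [list(c) for c in init_centroids(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Toy operating conditions: (voltage in p.u., system loading in p.u.).
points = [(1.00, 0.50), (1.01, 0.55), (0.99, 0.52),
          (0.95, 0.80), (0.94, 0.82),
          (0.88, 1.05), (0.87, 1.10)]
labels, centroids = k_means(points, k=3)

# Assumption: the cluster with the highest mean voltage is called NORMAL.
order = sorted(range(3), key=lambda j: -centroids[j][0])
names = dict(zip(order, ["NORMAL", "WARNING", "ALARM"]))
states = [names[lab] for lab in labels]
```

With real data the features would need scaling before clustering, since Euclidean distance otherwise lets the variable with the largest numeric range dominate.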

After the system states are clustered into groups for normal, warning and alarm operating conditions, the same group definitions serve as inputs for the classification part of the algorithm. With these clustered data, data classes are defined for the later analysis of new measured input data:

C_{A}: the set of data classes in the algorithm

C_{NORMAL}: the data class for the normal operating condition

C_{WARNING}: the data class for the transition operating condition

C_{ALARM}: the data class for the critical operating condition

The third step consists of the classification of new measurement data and the definition of a set of new system operating conditions (OC). The assigned system conditions (NORMAL, WARNING and ALARM) were set as independent variables and the previously defined variables in the data mining scope (P_{Wind}, Q_{Wind}, P_{L}, V_{i}, δ_{i}) as dependent variables. In the classification part of the algorithm, new measurement data are classified into predefined groups according to the values of the parameters taken as input data. The classification groups are the clusters created by the earlier clustering of operating conditions.

The classification and regression trees (C&RT) method was used for this classification analysis, again with the Statistica software [20]. To assign weight factors to the decision-making process, misclassification costs [21] were defined heuristically according to Table 1. The columns hold the predicted variables and the rows the measured variables.
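The effect of misclassification costs on a class decision can be sketched as follows. The cost values below are hypothetical and are not those of Table 1. They only follow the same layout (rows are measured classes, columns are predicted classes) and the intuition that missing an ALARM is the most expensive error.

```python
CLASSES = ["NORMAL", "WARNING", "ALARM"]

# Hypothetical cost matrix, indexed as COSTS[measured][predicted].
COSTS = {
    "NORMAL":  {"NORMAL": 0.0,  "WARNING": 1.0, "ALARM": 2.0},
    "WARNING": {"NORMAL": 3.0,  "WARNING": 0.0, "ALARM": 1.0},
    "ALARM":   {"NORMAL": 10.0, "WARNING": 5.0, "ALARM": 0.0},
}


def expected_cost(predicted, probabilities):
    """Expected cost of predicting one class, given class probabilities."""
    return sum(probabilities[measured] * COSTS[measured][predicted]
               for measured in CLASSES)


def min_cost_class(probabilities):
    """Pick the class with the lowest expected misclassification cost."""
    return min(CLASSES, key=lambda c: expected_cost(c, probabilities))


# Hypothetical class proportions in one terminal node of a tree.
leaf = {"NORMAL": 0.6, "WARNING": 0.3, "ALARM": 0.1}
decision = min_cost_class(leaf)
```

In this example NORMAL is the most frequent class in the node, yet the cost-weighted decision is WARNING, because wrongly predicting NORMAL for a WARNING or ALARM case is penalized heavily.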

To prevent overfitting of the data, V-fold cross-validation is used. 5% of the cases were used as the “v-value” [21]. In V-fold cross-validation, the data set is randomly divided into v equal parts; the learning phase of the algorithm is run on v − 1 parts and the test on the remaining part. This is especially suitable for situations where a small number of cases is used for classification. Furthermore, pruning on variance, which reduces the size of decision trees by removing sections that provide little power to classify instances, was used to take a closer look at the cost sequence for all calculated classification and regression trees. The cost sequence was calculated for re-substitution and cross-validation costs for all generated C&R (classification and regression) trees. In this way, a simpler decision tree can be chosen according to the law of parsimony, which anticipates that things are usually connected or behave in the simplest or most economical way, especially with reference to alternative evolutionary paths [22].

Several operating conditions can fulfil the conditions for reaching a normal system state. The output of the data classification process is therefore a set of possible operating conditions (**OC**s). In the final step, the final wind power plant operating decisions are made by selecting the best possible solution from the vector of possible operating states (**OC**_{P}), with the requirement that each element of the vector **OC**_{P} is also an element of the class C_{NORMAL}.
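The V-fold cross-validation used in the classification step can be sketched as follows. This is a generic illustration rather than the Statistica implementation; the function names, the fixed random seed and the trivial majority-class model in the usage example are assumptions made for the example.

```python
import random


def v_fold_indices(n_cases, v, seed=0):
    """Yield (train, test) index lists for V-fold cross-validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)         # random division of cases
    folds = [indices[i::v] for i in range(v)]    # v parts of near-equal size
    for i in range(v):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


def cross_validated_cost(cases, labels, v, fit, cost):
    """Average test cost over all v folds (learn on v - 1, test on 1)."""
    total = 0.0
    for train, test in v_fold_indices(len(cases), v):
        model = fit([cases[i] for i in train], [labels[i] for i in train])
        total += sum(cost(labels[i], model(cases[i])) for i in test)
    return total / len(cases)


def fit_majority(train_cases, train_labels):
    """Placeholder model: always predict the most frequent training class."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda case: majority


def zero_one(measured, predicted):
    return 0.0 if measured == predicted else 1.0


cases = list(range(10))
labels = ["NORMAL"] * 8 + ["ALARM"] * 2
cv_cost = cross_validated_cost(cases, labels, v=5,
                               fit=fit_majority, cost=zero_one)
```

Replacing `fit_majority` with a tree learner and `zero_one` with the misclassification costs gives the cross-validation cost that is compared with the re-substitution cost when a pruned tree is chosen.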

A final operating decision still needs to be made, meaning that the settings of the wind power plant controller (P_{setting} and Q_{setting}) at the point of common coupling need to be defined. The variable P_{setting} is the setting for output active power. If this setting is lower than the available wind power, the result is wind power curtailment. This variable is continuous. The variable Q_{setting} is the setting of the regime for reactive power regulation. It is a categorical variable (total output Q or cos φ), meaning that one setting represents one possible category (e.g., cos φ = 0.9 lagging or Q equal to 0.5 p.u.). In this way, the reactive power control variable is discretized. Final operating decisions for wind power plants are made by selecting the best possible solution from the set of possible operating conditions (OCs). The final operating condition is chosen to minimize the opportunity costs of wind energy export and thus maximize the produced energy. Also, according to [23], to prolong the lifetime of wind turbines it is necessary to lower reactive power production and its influence on the power electronics in turbine converters. With a wider range of PMU measurements available, the operation can be optimized using both the available measurements and the analysis results [24]. The final decisions can therefore be summarized as the maximization of output active power and the minimization of reactive power (Equations (5) and (6)).

Power transformers at the point of common coupling (PCC) have limited capacity. Therefore, an additional condition needs to be fulfilled so that operational limits are not endangered (Equation (7)), where S_{TR} is the power transformer capacity (MVA).
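The final selection can be sketched in Python under several assumptions: the transformer condition is taken to be the apparent power limit sqrt(P² + Q²) ≤ S_TR, the two objectives are combined lexicographically (maximum active power first, then minimum absolute reactive power), and each candidate already carries the reactive power that results from its chosen regulation regime. The names `Candidate` and `select_setting` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    p_setting: float  # active power setting at the PCC, MW
    q_setting: float  # reactive power resulting from the chosen regime, Mvar
    state: str        # class assigned by the classification step


def select_setting(candidates, s_tr):
    """Return the best feasible candidate, or None if there is none."""
    feasible = [
        c for c in candidates
        # Only conditions classified as NORMAL are eligible.
        if c.state == "NORMAL"
        # Assumed transformer limit: apparent power must not exceed S_TR.
        and math.hypot(c.p_setting, c.q_setting) <= s_tr
    ]
    if not feasible:
        return None
    # Assumed ordering: maximize P first, then minimize |Q| among ties.
    return max(feasible, key=lambda c: (c.p_setting, -abs(c.q_setting)))


# Hypothetical candidates for a 50 MVA transformer.
candidates = [
    Candidate(48.0, 20.0, "NORMAL"),   # exceeds the transformer capacity
    Candidate(45.0, 15.0, "NORMAL"),
    Candidate(45.0, 5.0, "NORMAL"),
    Candidate(49.0, 0.0, "WARNING"),   # not in the NORMAL class
    Candidate(40.0, 0.0, "NORMAL"),
]
best = select_setting(candidates, s_tr=50.0)
```

Here the setting with 45 MW and 5 Mvar is selected: it delivers the highest feasible active power and, among candidates with that active power, the lowest reactive power.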