Wind Turbine Data Analysis and LSTM-Based Prediction in SCADA System

Abstract: The number of wind farms is increasing every year because many countries are turning their attention to renewable energy sources. Wind turbines are considered one of the best alternatives for producing clean energy. Most wind farms have installed supervisory control and data acquisition (SCADA) systems to monitor their turbines and log the information as time-series data. Analyzing these logs and predicting from them demands a powerful information extraction process. In this research, we present a data analysis framework to visualize the data collected from the SCADA system, together with a prediction model based on long short-term memory (LSTM), a recurrent neural network variant. The data analysis is presented in Cartesian, polar, and cylindrical coordinates to understand the relationship between wind and energy generation. Four features (wind speed, wind direction, generated active power, and theoretical power) are predicted and compared with state-of-the-art methods. The obtained results confirm the applicability of our model in real-life scenarios, where it can assist the management team in managing the energy generated by wind turbines.


Introduction
Renewable energy sources play an important role in economic growth and are considered an alternative source of energy for environmental reasons. They can provide a clean and sustainable solution compared to nonrenewable energy, which relies heavily on coal, oil, and other fossil fuels [1,2]. Nonrenewable energy sources are not suitable because of global warming, which causes extreme weather events such as more frequent wildfires, droughts, heatwaves, melting of glaciers, and floods that threaten our planet's ecosystem [3]. Wind turbines are considered one of the primary sources of renewable energy generation, providing clean and sustainable energy in modern power systems. Furthermore, they are environment-friendly and have near-zero CO2 emissions. Wind turbine based energy generation relies completely on the wind to stay operational. However, wind speed and direction fluctuate [4,5]; therefore, the amount of energy produced may vary from one moment to another. This can be a severe setback, since the energy is expected to be delivered and consumed on a real-time basis.
To manage the wind turbines, most wind farms have installed supervisory control and data acquisition (SCADA) systems for monitoring different components and logging the operational data [6,7]. The logged data contains information about the wind speed, its direction, the amount of generated power, etc. It creates an opportunity to process the gathered time-series data for diverse applications, ranging from operation and maintenance to predictive analysis of the energy generated by the wind turbines [8]. The analysis and prediction of energy generation can help the management team that runs the wind farms to make correct decisions about the generated power and its consumption, and to prepare the storage capacity in smart grids.
The research community has already developed statistical [9], machine learning [4], and artificial neural network [10] based approaches for the wind turbine prediction task. In this research, our aim is to perform an exploratory data analysis for a better understanding of the wind and the generated power. The exploratory data analysis provides deep insight into the wind turbine data logs by exploring the available features, namely, wind speed, wind direction, active power, and theoretical power. Consequently, it reduces the complexity of the overall data logs and makes them easier to understand. We also designed a long short-term memory (LSTM) model, a recurrent neural network variant, for wind turbine prediction. LSTMs can capture long-range dependencies and nonlinear dynamics owing to their internal structure. In this research, our contributions are as follows:
• To design and develop a unique visualization platform for the analysis of wind turbine data gathered by the SCADA system.
• To design and develop an efficient deep learning model for short-term time-series prediction (with a time frame of a month).
• To perform a comparative analysis with existing statistical and machine learning approaches to measure the improvements.
The structure of the paper is as follows: Section 2 outlines the related work on SCADA systems and on research studies that forecast wind power. In Section 3.1, the exploratory data analysis is performed to visualize the collected SCADA dataset. The details of the LSTM-based prediction are presented in Section 3.2, while Section 4 presents the implementation details, the obtained results, and the comparison, followed by discussions. The paper is concluded in Section 5 with possible future directions.

Related Works
Wind farms are attached to the SCADA system to constantly collect data about wind turbines. This data can be utilized for different analyses, including failure prediction [8], gearbox malfunction [7], assessment of the wind turbine performance [11], and the wind-turbine wake effect [12][13][14]. There have been many research studies to forecast wind power according to the characteristics of wind farms. In the wind power industry, neural networks are used intensively to generate accurate predictions or for comparison with the classical forecasting methods. In the early days, Xiaodan et al. [9] developed a statistical method to forecast the short-term wind power of one wind farm in western China. Their method is based on the Autoregressive and Moving Average (ARMA) model, and satisfactory results are reported in terms of the average absolute error. In the case of strong randomness of wind speed, their model is less accurate.
Sun et al. [15] developed an artificial neural network (ANN) to model the power of wind turbines. In their network training, they also considered the wake effect based on geographic and wind-turbine information. Their study concludes that wind turbines in different positions should adopt different yaw angle control strategies. The recurrent neural network variant LSTM [16] can learn time-series information more effectively. It can utilize temporal information efficiently for forecasting new data points. It has been used successfully in stock market prediction [17], natural language processing [18], and medicine [19].
In wind power energy, Alencar et al. [10] developed ultra-short, short, medium, and long term models for wind speed prediction. They utilized the Autoregressive Integrated Moving Average (ARIMA), artificial neural network, and hybrid models. The experiments were performed on the dataset obtained in Brazil by the national organization system of environmental data (SONDA). They achieved better results with the hybrid model and reported that neural network based models are better than statistical methods such as ARMA and ARIMA. Similarly, Khosravi et al. [4] applied machine learning algorithms to predict the wind speed for the Osorio wind farm in the south of Brazil. They applied neural networks, support vector regression, and a fuzzy inference system optimized with computational intelligence-based algorithms. They reported that the neural network based model outperformed the other models considered in the study. Furthermore, they concluded that wind speed has a direct influence on the generated power.
Liu et al. [20] developed a short-term wind power prediction method based on the discrete wavelet transform and LSTM. Their study concludes that an LSTM-based model can effectively capture the dynamic behavior of wind energy. They decomposed the nonstationary time-series data using the discrete wavelet transform. After the transformation, they considered each component as independent, and the temporal relations were learned by the LSTM. The comparison showed superior results compared to the recurrent neural network, the multilayered feed-forward neural network with backpropagation, and their combination with the discrete wavelet transform. Similarly, Han et al. [21] developed a short-term wind prediction model based on LSTM for Jeju island in South Korea. They have a vision of realizing a carbon-free Jeju by 2030.
In previously developed models, the visualization of time-series data logs is missing, although it can help the management team to understand and analyze unseen problems. Our research focuses on both the visualization and the prediction aspects of wind energy generation. The proposed framework provides short-term data visualization (i.e., over a time frame of one month), and it can predict the wind and energy generation, which may help to manage smart grids.

The Proposed Model
The proposed model is illustrated in Figure 1 and it consists of: (a) exploratory data analysis; and (b) the prediction. The details of the subcomponents are as follows.

Exploratory Data Analysis
The exploratory data analysis is performed using visualization techniques. It can provide a better understanding of the data logs produced by the SCADA system of the wind turbines. The details are provided in the following subsections.

Dataset
The data analysis and prediction are performed over a publicly available dataset that was collected in the northwestern region of Turkey [22]. The onshore wind farm is monitored by the SCADA system to capture information about the wind and power generation properties of the Yalova region. The measurements were recorded as time-series data, logged at ten-minute intervals during the year 2018. The dataset contains four measurements. (1) Active power: the power generated by the wind turbine. (2) Wind speed: the wind speed at the hub height of the turbine. (3) Wind direction: the direction of the turbine, which turns automatically to face the direction from which the wind blows. (4) Theoretical power: the power value computed by the control system from the current wind speed, using the relation between wind speed and kinetic energy (i.e., Equation (3)). One week of readings is presented in Figure 2. In Figure 2, theoretical and active power are measured in kilowatts (kW), wind speed in meters per second (m/s), and wind direction in degrees (°). The x-axis presents time at intervals of 10 min, while the y-axis presents the unit of each feature of the SCADA system.

Data Analysis
The time-series data is transformed into a visual representation in which all its characteristics and distinguishable elements are clearly noticeable. It can provide useful information and assist the management team in understanding possible issues. Figure 3 presents the data visualization with possible analysis factors.
Figure 3 shows the regions where the wind blows the most, the corroboration of the model in terms of actual energy generation against the theoretical power curve, the anomalies, and the wind behavior in terms of speed and direction. The following sections describe how each visualization is generated and interpreted.

Cartesian Coordinates Analysis
The wind speed and direction can play an important role in analyzing the generated power. The wind direction determines the regions where the wind blows over a certain area, and the wind speed determines the amount of power to be produced. Given the cubic relation between wind speed and generated power (i.e., Equation (3)), the higher the wind speed, the more power is to be expected. In the Cartesian coordinate system, a three-dimensional space is considered that is based on three perpendicular axes (i.e., wind speed, wind direction, and generated active power). Figure 4 presents a three-dimensional data visualization for the whole year. In Figure 4, two dense regions contribute the maximum power generation. Along the wind direction axis, the first region can be observed between 50° and 100°, while the second region is around 200° to 250°. In the case of wind speed, the turbine starts generating active power when the wind blows faster than 5 m/s. Furthermore, the wind behavior is nearly consistent over the whole year. This can be observed by visualizing every month to identify the concentrated regions, as shown in Figure 5. In Figure 5, every month shows a nearly consistent pattern for wind direction as well as speed.

Polar Coordinates Analysis
In polar coordinates, the compass rose is used as a base model to identify the most active regions where the wind blows. Figure 6 presents the wind direction and speed for the whole year. In Figure 6, it can be observed that the wind blows mostly northeast and southwest, with the wind speed plotted along the radial direction of the polar coordinates. The individual patterns for the months of April, May, and June are presented in Figure 7.
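The compass-rose view described above can be approximated numerically by binning the readings into direction sectors. The following is a minimal sketch, not the authors' implementation: the function name `wind_rose`, the number of sectors, and the toy readings are assumptions made for illustration only.

```python
def wind_rose(directions_deg, speeds, n_sectors=16):
    """Summarise wind observations per compass sector.

    Returns (counts, mean_speeds); sector 0 is centred on north (0 degrees)
    and sectors advance clockwise, as on a compass rose.
    """
    width = 360.0 / n_sectors
    counts = [0] * n_sectors
    speed_sums = [0.0] * n_sectors
    for direction, speed in zip(directions_deg, speeds):
        # Shift by half a sector so that sector 0 spans [-width/2, +width/2).
        idx = int(((direction % 360.0) + width / 2.0) // width) % n_sectors
        counts[idx] += 1
        speed_sums[idx] += speed
    mean_speeds = [s / c if c else 0.0 for s, c in zip(speed_sums, counts)]
    return counts, mean_speeds


# Toy readings (illustrative values only, not taken from the dataset).
counts, mean_speeds = wind_rose(
    [0, 359, 45, 225, 226], [4.0, 6.0, 8.0, 10.0, 12.0], n_sectors=8
)
print(counts)
```

Sectors with the highest counts correspond to the most active regions of the rose; plotting the counts or mean speeds against the sector angle gives a polar view of the kind discussed for Figure 6.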

Cylindrical Coordinates Analysis
In cylindrical coordinates, visualization may assist in knowing the generated power for a given wind speed and direction. In Figure 8, the information is presented compactly with the compass rose model, and each month is represented by a circle. The radius of the circle represents the maximum wind speed for that specific month. For instance, wind behavior can be tracked by visualizing its footprint as it evolves through time, which is added as a new dimension. This can report the major changes in the wind for the region and point out where the changes start and where they continue, as shown in Figure 9. Figure 9 provides information about the wind behavior at different moments between the active regions where the wind is blowing; these regions are displayed as wind footprints in green for every month. Similarly, in Figure 10, the wind speed and direction evolve with time for the months of April, May, and June. Every month shows some irregularities at a certain level; for example, a small change in the wind speed happens near the middle of the considered month. In April, the pattern shows a low wind speed value that starts to increase smoothly about halfway through the month. Similarly, in May, the wind speed is greater in the second part of the month, and in June, an unusual change in the wind speed value can be seen near the middle of the month.
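As an illustration of how readings could be placed in cylindrical coordinates with time as the vertical axis, the sketch below maps (speed, direction, time index) to Cartesian points suitable for a 3D plot. It is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and the compass convention (angle measured clockwise from north) is assumed.

```python
import math


def cylindrical_to_cartesian(speed, direction_deg, height):
    """Map (r, theta, z) = (speed, direction, height) to Cartesian (x, y, z).

    Assumes the compass convention: theta is measured clockwise from north,
    so x points east and y points north.
    """
    theta = math.radians(direction_deg)
    return speed * math.sin(theta), speed * math.cos(theta), height


def wind_footprint(speeds, directions_deg):
    """Stack readings along z, using the sample index as the time axis."""
    return [
        cylindrical_to_cartesian(s, d, k)
        for k, (s, d) in enumerate(zip(speeds, directions_deg))
    ]


# Two toy readings: 5 m/s at 0 degrees, then 10 m/s at 90 degrees.
footprint = wind_footprint([5.0, 10.0], [0.0, 90.0])
```

Using the sample index as the z-coordinate is only one choice; the generated active power could be used instead when the goal is to relate power to wind speed and direction.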

Wind Energy Generated Patterns
The dynamical behavior of the wind is related to the basics of its kinetic energy. A drawback is that the efficiency of an energy generation system is always less than 100%, which means that one form of energy cannot be transformed into another without loss. Of the entire energy carried by the wind, only about 59% can be transformed into power. This fact is known as the Betz limit [23], and it is independent of the turbine model. The theoretical power can be calculated from the wind speed, and it is related to the basics of kinetic energy, which can be defined as:

$$E = \frac{1}{2} m v^{2} \qquad (1)$$

The wind consists of a lot of tiny particles, each one having kinetic energy; however, it is difficult to count every single particle. In order to obtain the kinetic energy, the mass flow of all those particles going towards and through the wind turbine area is used instead. This particle mass flow is defined as the density of the air $\rho$ times the swept area of the turbine $A$ times the velocity of the air $v$, i.e.,

$$\dot{m} = \rho A v \qquad (2)$$

By substituting Equation (2) into the kinetic energy, we get:

$$P = \frac{1}{2} \rho A v^{3} \qquad (3)$$

The theoretical power is computed by Equation (3), and it is plotted in Figure 11 together with the active power generated by the wind turbine for the months of April, May, and June. In Figure 11, the active power generated by the wind turbine follows nearly the same relation that is defined by the Betz limit. In April, however, the generated power is even lower than 59%, which shows that some external factors also contribute; these can be tuned to get the maximum output from the wind turbine.
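The cubic dependence of power on wind speed can be checked with a short numerical sketch. The function names, the rotor diameter, and the sea-level air density of 1.225 kg/m^3 are assumptions chosen for illustration; none of them are taken from the dataset or the turbine specification.

```python
import math

BETZ_LIMIT = 16.0 / 27.0  # about 0.593, the maximum extractable fraction


def wind_power_kw(wind_speed_ms, rotor_diameter_m, air_density=1.225):
    """Kinetic power of the air stream through the rotor disc, in kW.

    Implements P = 0.5 * rho * A * v**3 with A = pi * (D / 2)**2.
    """
    swept_area = math.pi * (rotor_diameter_m / 2.0) ** 2
    power_watts = 0.5 * air_density * swept_area * wind_speed_ms ** 3
    return power_watts / 1000.0


def betz_bound_kw(wind_speed_ms, rotor_diameter_m, air_density=1.225):
    """Upper bound on the extractable power according to the Betz limit."""
    return BETZ_LIMIT * wind_power_kw(wind_speed_ms, rotor_diameter_m, air_density)


# Hypothetical 100 m rotor: doubling the wind speed multiplies the power by 8.
p_slow = wind_power_kw(5.0, 100.0)
p_fast = wind_power_kw(10.0, 100.0)
print(round(p_fast / p_slow, 6))
```

Because of the cubic relation, small errors in the wind speed translate into much larger errors in the estimated power, which is one reason the wind speed prediction matters for energy management.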

The Prediction
Let us consider univariate time-series data and represent each feature as $f$. It can be defined as:

$$f = (x_0, x_1, \ldots, x_s) \qquad (4)$$

where the subindex $s$ represents the number of samples of $f$. A user-defined time window $t$ sets the number of past observations that are used to make a prediction; a window of ten observations, recorded at 10 min intervals in our dataset, is used in the following example. It means that the values from $x_0$ to $x_9$ are used as past evidence to predict the value $x_{10}$, which is given by:

$$(x_0, x_1, \ldots, x_8, x_9) \rightarrow X_1 \quad \text{and} \quad x_{10} \rightarrow y_1 \qquad (5)$$

In the next step, the time window is shifted by one element so that it encloses the values from $x_1$ to $x_{10}$ and predicts the value $x_{11}$:

$$(x_1, x_2, \ldots, x_9, x_{10}) \rightarrow X_2 \quad \text{and} \quad x_{11} \rightarrow y_2 \qquad (6)$$

Each new pair is stacked, and the process repeats until the final element $x_s$ of $f$ is reached:

$$\vdots$$

$$(x_{s-t}, x_{s-t+1}, \ldots, x_{s-1}) \rightarrow X_{s-t+1} \quad \text{and} \quad x_s \rightarrow y_{s-t+1} \qquad (7)$$
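The sliding-window construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name `make_windows` is hypothetical, and the toy series simply uses the sample indices as values so that the pairs can be compared with the worked example.

```python
def make_windows(series, window):
    """Turn a univariate series into (X, y) pairs with a sliding window.

    X[k] holds `window` consecutive values and y[k] is the value that follows.
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must satisfy 1 <= window < len(series)")
    X, y = [], []
    for start in range(len(series) - window):
        X.append(list(series[start:start + window]))
        y.append(series[start + window])
    return X, y


# Toy series x_0 ... x_11 whose values equal their indices.
X, y = make_windows(list(range(12)), window=10)
print(X[0], "->", y[0])
print(X[1], "->", y[1])
```

With a window of ten, the first pair uses x_0 to x_9 to predict x_10 and the second uses x_1 to x_10 to predict x_11, matching the two steps written out in the text.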
The dimensions of the above vectors $X$ and $Y$ are:

$$X \in \mathbb{R}^{s \times t} \qquad (8)$$

$$Y \in \mathbb{R}^{s \times 1} \qquad (9)$$

where the subindices $s$ and $t$ represent the number of samples and the number of time steps in the time window, respectively. The pairs of $X$ and $Y$ become the input to the LSTM-based model [24], a recurrent neural network variant. It is a sequence-based model that considers temporal correlations between the previous and the current information [25]. It consists of a single LSTM layer followed by a dropout layer. The considered number of time steps for the regressors was seven. An internal LSTM cell representation is shown in Figure 12. At any time step $t$, an LSTM cell is represented by the neurons inside it (Figure 12). Internally, an LSTM unit consists of two main parts. The first part is the combination of the past with the present, and the second part is the interaction of that combination, after being processed, with the present. As in any other neural network, the operations in the background resemble linear algebra computations, where a weight $W$ is multiplied by the current input and a bias value is added to that computation, which is then passed to the corresponding activation function to be evaluated. The past-present combination can be represented as follows:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (10)$$

In Equation (10), the forget switch $f_t$ and the input switch $i_t$ act as parameters. They regulate whether the previous context $c_{t-1}$ should be forgotten or not, and whether the candidate context $\tilde{c}_t$ should be input or not. These switches are computed as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad (11)$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \qquad (12)$$

In Equations (11) and (12), the sigmoid activation function $\sigma$ is used as an on/off switch, since the range of the sigmoid function goes from zero (i.e., equivalent to forgetting everything) to one (i.e., equivalent to remembering everything), leaving any number in between to be picked. The terms these switches regulate are the previous context $c_{t-1}$ and the candidate context $\tilde{c}_t$.

The previous context is the term that the previous LSTM unit outputs (for the very first term of the process there is no previous context). The candidate context is the extraction of information. Here we cannot use a sigmoid function $\sigma$, because it would clip off the half of the actual values that lies below zero from the input $[h_{t-1}, x_t]$. A hyperbolic tangent $\tanh$ is used instead, and the candidate context $\tilde{c}_t$ is defined as:

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \qquad (13)$$

The second part, which is the interaction between the above combination and (again) the present, is the actual output $h_t$, and it can be represented similarly as:

$$h_t = o_t \odot \tanh(c_t) \qquad (14)$$

Here the information extracted from the past-present combination $c_t$ is regulated by the output switch $o_t$, which is computed in the following way:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \qquad (15)$$

The output $h_t$, along with $c_t$ and the next input $x_{t+1}$, is passed to the next LSTM cell at the next time step $t + 1$, which operates in the same way as the previous one. Since LSTM is sensitive to the scale of the data, we normalized our data to the range [0, 1] using min-max scaling. Table 1 presents the optimal values for our prediction network.
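To make the gate computations described above concrete, the following sketch performs one LSTM time step and the min-max scaling using only the Python standard library. It follows the standard LSTM formulation and is not the authors' implementation; the function names, the dictionary layout of `params`, and the all-zero weights in the usage example are assumptions for illustration.

```python
import math


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; `params` maps a gate name to its (W, b) pair."""
    z = list(h_prev) + list(x_t)  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]      # forget switch
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]      # input switch
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]      # output switch
    c_hat = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate context
    # Past-present combination, then the regulated output.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# One hidden unit and one input feature with all-zero weights: every switch
# evaluates to 0.5 and the candidate context to 0.
zero = {gate: ([[0.0, 0.0]], [0.0]) for gate in "fioc"}
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], params=zero)
scaled = min_max_scale([2.0, 4.0, 6.0])
```

In practice a deep learning framework would provide the LSTM layer, the dropout layer, and the training loop; the hand-written step is only meant to show how the switches interact within a single cell.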

Train and Test Split
The dataset is not grouped in any way other than by its four features, which present the measurements for the entire year. To perform short-term forecasting (i.e., for every month), a filtering process is defined to count the number of elements in each month. Every month of data is then split into train and test sets using a ratio of 70% and 30%, respectively. Figure 13 presents one month of data with the train and test split for all four features collected by the SCADA system.
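The monthly filtering and the 70/30 split can be sketched as follows. This is an illustration under assumptions: the function name `split_by_month` is hypothetical, the split is assumed to be chronological within each month, and the cut point is rounded to the nearest sample.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_by_month(timestamps, values, train_ratio=0.7):
    """Group a series by calendar month and split each month chronologically.

    Returns {month: (train_values, test_values)}; the first `train_ratio`
    share of each month is used for training and the remainder for testing.
    """
    by_month = defaultdict(list)
    for ts, value in sorted(zip(timestamps, values), key=lambda pair: pair[0]):
        by_month[ts.month].append(value)
    splits = {}
    for month, series in by_month.items():
        cut = int(round(len(series) * train_ratio))
        splits[month] = (series[:cut], series[cut:])
    return splits


# Toy series at 10 min intervals that crosses from January into February.
start = datetime(2018, 1, 31, 23, 0)
timestamps = [start + timedelta(minutes=10 * k) for k in range(16)]
splits = split_by_month(timestamps, list(range(16)))
print({month: (len(tr), len(te)) for month, (tr, te) in splits.items()})
```

Keeping the split chronological avoids training on readings that come after the test period, which matters for time-series forecasting.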

Performance Measures
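Three standard metrics, the mean absolute error (MAE), the mean squared error (MSE), and the coefficient of determination ($R^2$), are used to measure the prediction performance of the model [26]. These measures are defined as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (16)$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (17)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (18)$$

In these measures, $y$ represents the observed value, $\hat{y}$ is the predicted value, $\bar{y}$ is the mean of the observed values, and $n$ is the total number of samples. For MAE and MSE, lower values correspond to higher accuracy. The $R^2$ value is at most 1 and typically lies between 0 and 1, with higher values corresponding to better performance.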
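The three measures described above are straightforward to compute directly. The sketch below uses the standard definitions of MAE, MSE, and the coefficient of determination; the function names and the toy values are illustrative and are not results from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(mae(observed, predicted), mse(observed, predicted))
```

Note that `r2_score` is undefined when all observed values are identical, because the denominator is then zero; a production implementation would need to handle that case.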

Wind and Power Prediction
The obtained results are presented for active power, wind speed, its direction, and theoretical power for each month. Furthermore, the performance measures are presented for three months prediction. We also present the difference between actual and predicted value for understanding the actual error of the model.

Wind Speed Prediction
The performance measures for wind speed prediction are presented in Table 2 for the whole year, while Figure 14 presents the predictions for the months of April, May, and June. In Figure 14, the predicted values appear close to the actual values of wind speed. To reveal the subtle differences, the error graph is presented in Figure 15. Figure 15 shows results that are consistent with the actual values of wind speed, apart from a few abrupt changes.

Wind Direction Prediction
The wind direction prediction results for the whole year are shown in Table 3, and the obtained results are accurate. Furthermore, Figures 16 and 17 present the predictions and the obtained error for the months of April, May, and June.

Active Power Prediction
The active power prediction results are shown in Table 4, and the obtained results are accurate. Furthermore, Figure 18 presents the predictions and Figure 19 presents the obtained error for the months of April, May, and June.

Theoretical Power Prediction
The theoretical power prediction results are shown in Table 5. Figure 20 presents the predictions and Figure 21 presents the obtained error for the months of April, May, and June.

Comparative Analysis
A comparative analysis is performed to evaluate the performance between the proposed model and existing statistical and machine learning techniques. The comparison is made taking into account the statistical moving average technique (MA) and a multilayer perceptron (MLP) model. The metric used in this comparative analysis is mean absolute error (MAE). The lower value of MAE is better. The performance of each model is presented in the following subsections.
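As an illustration of the statistical baseline, the sketch below produces one-step-ahead moving average forecasts and scores them with MAE. The window length and the toy series are assumptions; the text does not state which window was used for the MA baseline, and the function names are hypothetical.

```python
def moving_average_forecast(series, window):
    """One-step-ahead forecasts: each prediction is the mean of the
    previous `window` observations."""
    if window < 1 or window >= len(series):
        raise ValueError("window must satisfy 1 <= window < len(series)")
    return [
        sum(series[k - window:k]) / window
        for k in range(window, len(series))
    ]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Toy series for illustration only (not taken from the dataset).
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
window = 2
predictions = moving_average_forecast(series, window)
baseline_mae = mean_absolute_error(series[window:], predictions)
print(predictions, baseline_mae)
```

On a steadily rising series the moving average lags behind the true values, which illustrates why such a baseline tends to produce larger errors when the wind changes quickly.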

Active Power Prediction Comparison
Our model (i.e., LSTM) achieved the lowest error score in the majority of the months, namely in nine out of twelve cases, as can be seen in Figure 22. The MLP achieved the lowest error score in only three cases and the MA in none of them. Figure 22 also shows that, excluding the marginal cases, the results obtained from our proposed technique are consistent across all months.

Wind Speed Prediction Comparison
When predicting the wind speed, our model (i.e., LSTM) achieved the lowest error score in eight out of twelve cases, as can be seen in Figure 23. The MLP achieved the lowest error score in four out of twelve cases, and the MA in none of them.

Wind Direction Prediction Comparison
When predicting the wind direction, the LSTM achieved the lowest error score in all twelve cases, as shown in Figure 24, outperforming the other two methods completely.

Theoretical Power Prediction Comparison
When predicting the theoretical power, the LSTM achieved the lowest error score in ten out of twelve cases, as shown in Figure 25, while the MLP achieved the lowest error score in the remaining two cases.

Conclusions
Wind turbines are one of the primary sources of clean and renewable energy. They can contribute to reducing global warming and saving our planet. In this research, wind turbine data analysis and prediction based on the SCADA system are presented. The SCADA system generates time-series data continuously, which makes it challenging to analyze the data in a timely manner. The visualization-based data analysis reduces the complexity of the overall data logs and makes them easier to understand. From the presented analysis, it can be concluded that wind speed and direction are stable during the first few days of every month. The wind direction determines the regions where the wind blows, and its speed determines the amount of power to be produced. Furthermore, the analysis may assist the management team of a wind farm in understanding the wind patterns along with the speed and the active power generation. The developed LSTM-based model provides short-term predictions of wind and power generation, and it can effectively capture the dynamic behavior of wind energy. The comparison of the prediction results with the moving average (MA) technique and the multilayer perceptron (MLP) shows that our model outperforms both.
Our future work includes the multivariate time-series analysis by considering different factors for wind energy forecasting.