Application of Rough and Fuzzy Set Theory for Prediction of Stochastic Wind Speed Data Using Long Short-Term Memory

Despite the great significance of precisely forecasting wind speed for the development of new and clean energy technology and for stable grid operation, the stochasticity of wind speed makes prediction a complex and challenging task. Accurate short-term wind power forecasting is crucial for improving the security and economic performance of power grids. In this paper, a deep learning model, Long Short-Term Memory (LSTM), is proposed for wind speed prediction. Because the wind speed time series is nonlinear and stochastic, the mutual information (MI) approach was used to find the best subset of the data by maximizing the joint MI between the subset and the target output. To enhance accuracy and reduce input characteristics and data uncertainties, rough set and interval type-2 fuzzy set theory are combined with the proposed deep learning model. Wind speed data from an international airport station in Bandar-Abbas City, on the southern coast of Iran, were used as the original input dataset for the optimized deep learning model. Based on the statistical results, the rough set LSTM (RST-LSTM) model showed better prediction accuracy than the fuzzy and original LSTM models, as well as traditional neural networks, with the lowest error for training and testing datasets over different time horizons. The suggested model can support the optimization of the control approach and the smooth operation of the power system. The results confirm the superior capabilities of deep learning techniques for wind speed forecasting, which could also inspire new applications in meteorological assessment.


Introduction
The depletion of fossil fuel resources, environmental pollution, and greenhouse effect have required the development of clean and safe energy sources for power generation [1,2]. Among different renewable energy alternatives, wind energy is considered as a promising and applied solution for making a renewable society and reducing emissions of greenhouse gas [3]. Wind is also abundant, inexhaustible, and affordable, which makes it a feasible, large-scale energy source alternative and one of the most sustainable ways of electricity generation. Despite being one of the fastest-developing energy sources in the world [1,4], due to the stochastic nature of wind speed, producing practical and sustainable energy becomes challenging for wind farms [3].
Since wind speed largely determines the amount of electricity generated by a turbine, accurate prediction can provide a reliable and secure source for the production of wind energy and also decrease the operating cost of the power system [3,5]. [...] and can be implemented to meet energy needs [28]. Wind turbine technology is developing quickly in the region with respect to siting, efficiency, and management scenarios. Based on preliminary studies, regions located on the southern coasts of Iran are suitable for generating up to about 6500 megawatts of electricity from wind energy. The wind characteristics and associated forecasts must, therefore, be studied in detail in order to provide an efficient wind resource assessment of Iran [29].
Therefore, this paper presents short-term wind speed forecasting using an LSTM network with rough set theory and interval type-2 fuzzy sets, with wind speed data obtained from an international airport station located on the southern coast of Iran in Bandar-Abbas City, the capital of Hormozgan Province. Bandar Abbas is Iran's largest port city and a vital economic and commercial hub. It is also susceptible to tidal action, wave set-up, wind formation, and storm surges along the Persian Gulf and Oman Sea coasts [30]. Additionally, Hormozgan Province is well suited for wind turbine installation, as aerology data indicate that eight areas in the province have adequate wind energy capacity, one of which is Bandar Abbas City [31]. Wind forecasting data are, therefore, essential for a reliable and secure power generation system, especially at airports in developing countries.
The lack of such knowledge may be the primary cause of the airport's poor air traffic flow control and the delay of renewable energy projects. The Bandar Abbas International Airport is situated in a wide-open area with favorable wind patterns for air traffic. Wind activity is influenced by the surrounding topography, which blocks, intensifies, or changes the path of the wind as it passes, resulting in local wind effects that are not resolved by numerical weather prediction models. These characteristics make the airport well suited to studying wind speed downscaling, as the wind at the study site is heavily influenced by the local topography and the nearby sea. Importantly, the lack of knowledge and related studies about wind speed at international airports is a major impediment to the airport's sustainable development, and accurate short-term wind speed forecasts will thus be crucial in fulfilling the safety mandate of Air Traffic Flow Management at the airport. Rough and fuzzy sets are applied to address the uncertainties of the wind speed data and improve the performance of the proposed deep learning model. In order to assess the efficacy of the suggested approach, the forecasting performance of the LSTM is compared with that of regular RNN and multi-layer neural network (MLNN) models using modeling performance criteria.

Dataset
A wind measurement station in Bandar-Abbas City, located at an international airport on Iran's southern coast, was selected, and wind speed data were retrieved from the Iran Meteorological Organization. In this case study, the wind dataset contains wind speed data at 60-min (1 h) intervals for 3 years, from 2016 to 2019. Given the possible seasonal behavior of wind speed, more than one year of data should ideally be used for training, with the remaining data used for testing. In order to assess the proposed models and prevent overfitting, the 2016/01-2017/09 data are used as the training and validation set (60% of the dataset), and the 2017/09-2019/01 data are used as the testing set (40% of the dataset).
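The chronological split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `chronological_split`, the placeholder values, and the exact boundary day (1 September 2017; the text only gives "2017/09") are assumptions.

```python
from datetime import datetime, timedelta

def chronological_split(timestamps, values, boundary):
    """Split a time-ordered series into (train, test) at `boundary`.

    Samples strictly before `boundary` go to training/validation and the
    rest to testing. Nothing is shuffled, so the test period lies entirely
    after the training period.
    """
    pairs = list(zip(timestamps, values))
    train = [(t, v) for t, v in pairs if t < boundary]
    test = [(t, v) for t, v in pairs if t >= boundary]
    return train, test

# Hypothetical hourly grid covering 2016/01 to 2019/01.
start = datetime(2016, 1, 1)
end = datetime(2019, 1, 1)
n_hours = int((end - start).total_seconds() // 3600)
timestamps = [start + timedelta(hours=h) for h in range(n_hours)]
values = [0.0] * n_hours  # stand-in for the measured wind speeds

# Assumed boundary day; only the month is stated in the text.
train, test = chronological_split(timestamps, values, datetime(2017, 9, 1))
```

Splitting by date rather than at random keeps whole seasons together and avoids evaluating the model on hours that are interleaved with its training data.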

Wind Characteristics
The meso-scale structure of the near-surface atmospheric conditions over the Persian Gulf is characterized by low-level winds with a single, coherent land-sea breeze [32]. Wind speed data at 10 m height were obtained from the airport station. While southerly winds predominate in the area, the average hourly wind speed in Bandar Abbas does not vary significantly throughout the year, with the maximum speed reaching 18-21 knots.

Rough Set Theory (RST)
The rough set was proposed in 1982 by Pawlak [33]. Its key principle is to derive decision-making or classification rules for a problem through information reduction while leaving the ability to classify unchanged. Rough set theory has since been successfully applied in various disciplines [34,35] as one of the most powerful mathematical methods for dealing with ambiguity and vagueness, and it can increase the performance of the predictor itself [36].
In this method, the key objectives are attribute reduction, correlation analysis, and significance assessment for ambiguous information systems, using indiscernibility relations to approximate a set of objects through upper and lower approximations [37].
Suppose $U$ is a nonempty finite universe and $R$ is an equivalence relation on $U$; the knowledge base can then be presented as a relation system $K = (U, R)$. For a subset $P \subseteq R$ with $P \neq \emptyset$, the intersection of all the equivalence relations in $P$ is called the $P$-indiscernibility relation, denoted $IND(P)$, with $[x]_{IND(P)} = \bigcap_{R \in P} [x]_R$, where $[x]_R$ represents the equivalence class containing $x \in U$ under relation $R$. The basic concepts of rough set theory are the lower and upper approximations, which help to quantify the description of uncertain information. Suppose $X$ is a subset of $U$; then, the lower approximation $\underline{R}X$ and upper approximation $\overline{R}X$ are $\underline{R}X = \{x \in U : [x]_R \subseteq X\}$ and $\overline{R}X = \{x \in U : [x]_R \cap X \neq \emptyset\}$. The boundary region $BN_R(X) = \overline{R}X - \underline{R}X$, meanwhile, consists of those objects that cannot be classified with certainty as members of $X$ with the knowledge in $R$. If $BN_R(X) \neq \emptyset$, then $\overline{R}X \neq \underline{R}X$, and $X$ cannot be represented precisely by the equivalence classes of $R$. The set $X$ is then called "rough" (or "roughly definable"); otherwise, $X$ is crisp [38]. Figure 1 shows a schematic representation of RST.
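To make the lower and upper approximations concrete, the following minimal sketch computes them for a toy information table. The table contents, attribute names, and function names are invented for illustration and are not taken from the study's data.

```python
from collections import defaultdict

def equivalence_classes(universe, attributes, table):
    """Partition `universe` into blocks of objects that share the same
    values on `attributes` (the indiscernibility relation)."""
    classes = defaultdict(set)
    for obj in universe:
        key = tuple(table[obj][a] for a in attributes)
        classes[key].add(obj)
    return list(classes.values())

def approximations(universe, attributes, table, target):
    """Return (lower, upper, boundary) approximations of the set `target`."""
    lower, upper = set(), set()
    for block in equivalence_classes(universe, attributes, table):
        if block <= target:   # block lies entirely inside X: certainly in X
            lower |= block
        if block & target:    # block overlaps X: possibly in X
            upper |= block
    return lower, upper, upper - lower

# Hypothetical records described by two made-up attributes.
table = {
    1: {"level": "low", "direction": "S"},
    2: {"level": "low", "direction": "S"},
    3: {"level": "high", "direction": "S"},
    4: {"level": "high", "direction": "N"},
    5: {"level": "high", "direction": "N"},
}
universe = set(table)
target = {1, 3, 4}  # X: an arbitrary subset of records to approximate

lower, upper, boundary = approximations(
    universe, ["level", "direction"], table, target
)
```

Here records 1 and 2 are indiscernible, as are 4 and 5, so only record 3 can be placed in `target` with certainty; the nonempty boundary means `target` is rough with respect to these two attributes.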

Interval Type-2 Fuzzy Sets
The definition of type-n fuzzy sets was proposed by Zadeh in 1975 [39], followed by the definition of type-2 fuzzy sets in 1976 [40]. Unlike classical set theory, where an element either belongs to a set or does not, in the fuzzy set approach an element can belong to a set to a certain degree (0 ≤ k ≤ 1). Fuzzy set theory and its applications have developed extensively in recent years and have attracted the attention of practitioners, researchers, and decision makers [41].
Li and Huang have presented relatively new definitions for type-2 fuzzy sets [42], and since then, the capability of fuzzy sets, especially type-2 fuzzy sets, to capture data uncertainty has been one of the most common directions in artificial and computational intelligence [43][44][45]. The type-2 fuzzy set is capable of dealing with uncertainties due to its ability to model them and reduce their impacts. This type of uncertainty is called 'fuzzy uncertainty'. In contrast to 'probabilistic' uncertainty, which 'relates to events with well-defined, unambiguous data,' non-probabilistic or fuzzy uncertainty deals with ambiguities that often rely on qualitative details. Due to the lack of common numerical criteria for determining susceptibility, as well as the absence of "sharp" boundaries between susceptibility and non-susceptibility, mathematical models of susceptibility are difficult to derive. A benefit of using fuzzy set theory is that it enables the creation of inference models that express uncertainty using "natural language" [46].
Centroid, cardinality, fuzziness, variance, and skewness are known measures of uncertainty for interval type-2 fuzzy sets (IT2FSs) [47]. The distinction between type-1 and type-2 fuzzy systems, which share the same basic concept, lies in the type of fuzzy set applied. The IT2 fuzzy membership function is defined for the discrete universe of discourse $x$ and $u$ in [48]; the case of $\mu_{\tilde{A}}(x, u) = 1, \forall u \in J_x \subseteq [0, 1]$ is an interval type membership function. Consequently, the membership grade of each element of an IT2 fuzzy set is an interval. Uncertainty is characterized by the region (commonly known as the footprint of uncertainty (FOU)) bounded by the upper (UMF) and lower (LMF) type-1 membership functions (MFs). The secondary membership function determines the possibilities of the primary MF. As can be seen in Figure 2, the IT2FS scheme is similar to the ordinary fuzzy scheme. Only the defuzzification mechanism is different, as it includes a type reducer block that converts the IT2 output set to a type-1 set prior to the defuzzification step. Defuzzification of a type-2 fuzzy set thus consists of two steps: (a) type reduction, which converts the type-2 fuzzy set to a type-1 fuzzy set, referred to as the type-reduced set (TRS); and (b) defuzzification of the TRS to obtain a crisp number, referred to as the centroid of the type-2 fuzzy set [49]. A more detailed summary is available elsewhere [50,51].
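As an illustration of the footprint of uncertainty and of reducing an IT2 set to a crisp value, the sketch below uses a Gaussian primary membership function with an uncertain width. The Gaussian shape, the parameter values, and the "moderate wind" label are assumptions made for the example; the Nie-Tan method is used here as a simple closed-form stand-in for iterative type reducers and is not necessarily the procedure used in the study.

```python
import math

def gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def it2_membership(x, mean, sigma_low, sigma_high):
    """Return the membership interval (LMF(x), UMF(x)) of an IT2 fuzzy set.

    The set is a Gaussian with a fixed mean and an uncertain width in
    [sigma_low, sigma_high]: the narrow Gaussian is the lower membership
    function and the wide one is the upper membership function.
    """
    lmf = gaussian(x, mean, sigma_low)
    umf = gaussian(x, mean, sigma_high)
    return lmf, umf

def nie_tan_centroid(xs, mean, sigma_low, sigma_high):
    """Crisp value from the centroid of the average of LMF and UMF."""
    num = den = 0.0
    for x in xs:
        lmf, umf = it2_membership(x, mean, sigma_low, sigma_high)
        avg = 0.5 * (lmf + umf)
        num += x * avg
        den += avg
    return num / den

# Hypothetical fuzzy set "moderate wind" centred at 10 knots.
xs = [0.5 * k for k in range(41)]  # sample points from 0 to 20 knots
lmf, umf = it2_membership(12.0, mean=10.0, sigma_low=2.0, sigma_high=3.0)
crisp = nie_tan_centroid(xs, mean=10.0, sigma_low=2.0, sigma_high=3.0)
```

The gap `umf - lmf` at a given point is the local width of the FOU; it vanishes at the mean, where both bounds equal 1, and widens on the flanks where the uncertain width matters most.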

Long Short-Term Memory (LSTM) Network
In 1997, the Long Short-Term Memory (LSTM) network, a version of the RNN, was proposed [52] to build large recurrent networks that can, in turn, be used to tackle difficult sequence problems in machine learning and achieve state-of-the-art outcomes. The basic unit of the hidden layer of the LSTM, unlike in traditional neural networks, is the memory block [53], which includes memory cells with self-connections that memorize the temporal state, and a pair of adaptive, multiplicative gating units that control the information flowing into the block (Figure 3). Two additional gates, the input gate and the output gate, control the input and output activations of the block [21]. Due to its specific structure, the LSTM is capable of effectively resolving the vanishing and exploding gradient problems that occur during RNN training [54]. In Figure 3, the plus sign indicates element-wise addition, the multiplication sign indicates element-wise multiplication, and "con" indicates that vectors are concatenated. The forget gate, input gate, input node, and output gate are denoted by $f_t$, $i_t$, $g_t$, and $o_t$, respectively. The dependencies between the data in the input sequence are captured by the cell.
While the input gate regulates the volume of data that enters the cell, the forget gate regulates the amount of data that remains in the cell. The values in the cell are used to calculate the LSTM's output activation; their extent is determined by the output gate. The calculation formulas of the LSTM structure shown in Figure 3 are given in Equations (6)-(11), in which $W_f$, $W_i$, $W_g$, and $W_o$ are the weight matrices applied to the input signal $[h_{t-1}, x_t]$, and $\odot$ represents element-wise multiplication. $\sigma$ is the sigmoid activation function, and $\tanh$ is the hyperbolic tangent function. The cell state $s_t$ remembers previous values over arbitrary time intervals, and the three gates control the flow of information into and out of the cell. Therefore, the LSTM network is well suited to prediction problems based on time sequences [54]. The overall flowchart of the methodology is depicted in Figure 4.
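A single LSTM time step in the commonly used formulation can be sketched in plain Python as below, using the gate names from the text ($f_t$, $i_t$, $g_t$, $o_t$, $s_t$). The bias vectors `b_*`, the parameter dictionary layout, and the function names are assumptions added for the example; the text only mentions the weight matrices, so the exact form of the original equations may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, s_prev, params):
    """One LSTM time step; returns (h_t, s_t)."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["W_g"], params["b_g"], z)]  # input node
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    # Cell state: new candidate scaled by the input gate plus the old state
    # scaled by the forget gate.
    s_t = [gk * ik + sk * fk for gk, ik, sk, fk in zip(g, i, s_prev, f)]
    # Hidden output: squashed cell state scaled by the output gate.
    h_t = [math.tanh(sk) * ok for sk, ok in zip(s_t, o)]
    return h_t, s_t

# Tiny example: 2 hidden units, 1 input, all weights and biases zero.
hidden, n_in = 2, 1
params = {}
for gate in ("f", "i", "g", "o"):
    params["W_" + gate] = [[0.0] * (hidden + n_in) for _ in range(hidden)]
    params["b_" + gate] = [0.0] * hidden

h_t, s_t = lstm_step([0.3], [0.0, 0.0], [1.0, -2.0], params)
```

With all parameters at zero, every sigmoid gate equals 0.5 and the input node is 0, so the step simply halves the previous cell state. This makes the role of the forget gate easy to see in isolation.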

Evaluation Criteria
The performance of the proposed models was evaluated, considering the uncertainties in model outputs, using model evaluation criteria. The sum of squares error (SSE) and relative error (RE) are used as the two evaluation metrics, where $\hat{y}(t)$ and $y(t)$ are the predicted and measured outputs, respectively. The optimal model is selected based on the minimum statistical error.
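The two metrics can be sketched as follows. SSE is the usual sum of squared residuals. The formula used for RE below (total absolute error divided by total absolute measured value) is an assumption, since several definitions of relative error are in use and the original equation is not reproduced here; the sample values are invented.

```python
def sse(measured, predicted):
    """Sum of squares error between measured y(t) and predicted y_hat(t)."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(measured, predicted))

def relative_error(measured, predicted):
    """Aggregate relative error (assumed form): sum|y - y_hat| / sum|y|."""
    total = sum(abs(y) for y in measured)
    if total == 0:
        raise ValueError("relative error is undefined for an all-zero series")
    abs_err = sum(abs(y - y_hat) for y, y_hat in zip(measured, predicted))
    return abs_err / total

# Made-up wind speeds, for illustration only.
measured = [4.0, 6.0, 10.0]
predicted = [5.0, 6.0, 8.0]
print(sse(measured, predicted))             # 5.0
print(relative_error(measured, predicted))  # 0.15
```

An aggregate ratio is used rather than a per-sample ratio because hourly wind speed can be zero, which would make a per-sample relative error undefined.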


Selection of Input Variable
The selection of input variables is a critical step in deciding the optimal structure of data-driven models. The wind speed time series in this case exhibits excessive variability (Figure 5), and linear regression does not function well for forecasting because the time series is nonlinear [55]. Several papers [56] have used the autocorrelation function (ACF) to determine the correlation of a time series with itself at different points in time. Because the ACF quantifies only a variable's linear dependence on itself and wind speed is a nonlinear time series, mutual information (MI) is used here to efficiently test both linear and nonlinear correlations.
MI's objective is to determine the stochastic dependency between two random variables without making any assumptions about their relationship's nature (e.g., linearity) [57]. In other words, MI evaluates the dependencies between random variables, where each variable contains information about the other [58].
The basic idea behind a variable selection algorithm based on MI is to maximize the joint MI between the subset and the target output [59]. Battiti [60] used MI in data analysis and presented a feature selection algorithm based on this measure. An MI-based input selection algorithm was used by Rashidi Khazaee et al. [61] to select the proper inputs for a prediction model, demonstrating that it can efficiently select relevant input variables.
Although a detailed description of the MI technique can be found elsewhere [62], portions of it are provided here to help elaborate the technique. The MI between two random variables $x$ and $y$ is described as [63]
$$I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (13)$$
where $p(x)$ and $p(y)$ are the probability density functions of $x$ and $y$, and $p(x, y)$ is their joint probability density function. If there is no dependence between the two variables, the joint probability density $p(x, y)$ equals the product of the individual probability densities, so the MI is equal to zero. The MI of $v(t - l + 1)$ and $v(t + 1)$ is calculated using $l$ as the time lag and $v(t)$ as the wind speed time series value at time $t$. Figure 6 illustrates the MI for lags ranging from 1 to 100. The correlation between wind speed measurements decreases as the time lag increases. To address the correlation in the wind speed data, the input set includes wind speed values corresponding to time lags with MI greater than a threshold of 0.35. Input variable selection using MI results in an 81-dimensional input vector that is fed to the suggested deep learning models.
After each training epoch, the trained model's output is evaluated by calculating the model's performance on unseen data from the validation set. Although there are no criteria for the percentage of data used in the validation and training phases, the decrease in validation error during the training process shows that overfitting is avoided [64].
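The lag selection step can be sketched with a simple histogram (plug-in) estimate of Equation (13). The estimator, the number of bins, the use of natural logarithms, and the synthetic autoregressive series are all assumptions made for the example; the estimator used in the study is not specified here, and the numerical value of a threshold such as 0.35 depends on those choices.

```python
import math
import random
from collections import Counter

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of I(x; y) in nats."""
    def discretize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0  # guard against a constant series
        return [min(int((a - lo) / width), bins - 1) for a in v]

    bx, by = discretize(x), discretize(y)
    n = len(bx)
    pxy = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # sum of p(x, y) * log(p(x, y) / (p(x) p(y))) over occupied cells
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )

def select_lags(series, max_lag, threshold, bins=8):
    """Return the lags l for which the MI between v(t - l + 1) and
    v(t + 1) exceeds `threshold`."""
    selected = []
    for lag in range(1, max_lag + 1):
        past = series[:-lag]    # v(t - l + 1)
        future = series[lag:]   # v(t + 1), i.e. `lag` steps later
        if mutual_information(past, future, bins) > threshold:
            selected.append(lag)
    return selected

# Synthetic autocorrelated series standing in for hourly wind speed.
rng = random.Random(0)
series = [0.0]
for _ in range(4999):
    series.append(0.9 * series[-1] + rng.gauss(0.0, 1.0))

lags = select_lags(series, max_lag=100, threshold=0.35)
```

Because the estimate makes no assumption about the form of the dependence, it also responds to nonlinear relationships that the ACF would miss, which is the motivation given in the text for preferring MI.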

Wind Speed Forecasting Models
The RST/FST generated decision rule was provided as input to the LSTM. The LSTM analyzes the attributes of the data with decision rule for wind speed prediction. It means that the features selected by the RST/FST are provided as input to the LSTM for the analysis.
The wind speed predicted by RST-LSTM in the simulation experiment is shown in Figure 7. The black line is the observed wind speed, and the dashed red line is the predicted wind speed. The optimal structure was determined by trial simulations comparing the observed and predicted datasets; the resulting LSTM prediction model comprises an 81-node input layer, a 25-node hidden layer, and a 1-node output layer, RST-LSTM (81-25-1), with a minimum RE of 0.065 and 0.085 during the training and testing phases, respectively.
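How an 81-dimensional input vector and a 1 h ahead target might be assembled from the hourly series is sketched below. The function name `make_samples`, the assumption that the 81 inputs are the lags 1 to 81, and the placeholder series are illustrative only; the rough set and fuzzy preprocessing steps are not reproduced.

```python
def make_samples(series, lags, horizon=1):
    """Build (inputs, targets) for supervised learning.

    For each time index t, the input vector holds v(t - l + 1) for every
    lag l in `lags`, and the target is v(t + horizon).
    """
    max_lag = max(lags)
    inputs, targets = [], []
    for t in range(max_lag - 1, len(series) - horizon):
        inputs.append([series[t - l + 1] for l in lags])
        targets.append(series[t + horizon])
    return inputs, targets

# Placeholder series; a real run would use the measured wind speeds.
series = [float(k) for k in range(100)]
lags = list(range(1, 82))  # assumed: lags 1..81 give the 81 inputs

inputs, targets = make_samples(series, lags, horizon=1)
```

Each row of `inputs` would feed the 81-node input layer and each entry of `targets` the single output node; longer prediction horizons only change the `horizon` argument.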

With RST-LSTM being slightly superior, estimates from both the RST-LSTM and IT2F-LSTM models are close to the observed wind speed variations, possibly because highly correlated input data with less uncertainty are supplied to these models. In terms of curve fitting, the curve of the predicted data in both models is very similar to the curve of the actual data. Despite some deviation at the crests and troughs in training and testing, due to the uneven occurrence of data, the predictions agree closely with the observations elsewhere.
Throughout the training and testing phases of the 1 h ahead prediction, the majority of the points lie along the diagonal line in the scatterplots of the observed versus predicted wind speed time series produced by both the RST-LSTM and IT2F-LSTM models (Figure 9a-d). The prediction results of the proposed models were, therefore, in good agreement with the corresponding measured wind speeds. To assess the prediction performance of the proposed models, the modeling results were compared with those of the original LSTM (Figure 10), multilayer neural network (MLNN), and recurrent neural network (RNN) models (Table 1), using the same input combinations as those used in the RST-LSTM and IT2F-LSTM models.
The overall results indicate that the RNN, MLNN, and original LSTM models cannot surpass the performance of the improved RST-LSTM and IT2F-LSTM models in wind speed prediction (Tables 1 and 2). The RST-LSTM model demonstrates the best performance, with the lowest SSE (Sum of Squares Error) and RE (Relative Error) on the training and testing datasets for different time steps.
The best structures of the prediction models developed in this study are shown in Table 3. As can be seen in Table 3, the lowest error for all proposed models was observed for the 1 h ahead prediction, with RST-LSTM (81-25-1) being the most capable deep learning model for satisfactorily estimating the overall behavior and accurately predicting the short-term variations of the wind speed series.
RST-LSTM networks have the benefits of optimizing the selection and determining the significance of the variables affecting the dataset's internal relationships. This method overcomes the disadvantages of quantitative interpretation of intelligent prediction models and guarantees the objectivity of prediction model analysis [65].
In addition, compared to traditional neural networks, the benefit of the LSTM is its ability to learn long-term dependencies between the supplied network input and output, which are important for modeling difficult sequence problems with excessive fluctuations and achieving state-of-the-art outcomes. In terms of uncertainty incorporation, the main advantage of rough set theory is that it does not require any preliminary or supplementary knowledge about the data, such as the probability distributions needed in statistics, the basic probability assignments needed in evidence theory, or the grade of membership or value of possibility needed in fuzzy set theory [66]. Although the performance of both the RST-LSTM and FST-LSTM models is superior to that of the traditional neural network prediction approaches, the inter-comparison of the obtained results shows that the RST-LSTM model outperforms the FST-LSTM model in wind speed forecasting. Benefiting from efficient input variable selection using mutual information, RST-LSTM, with its property of pattern remembrance, can be further applied by managers and decision makers to conduct better and more effective wind speed predictions and fill the gaps between power generation and power utilization. It can help with the smooth operation of the power system as well as the optimization of the control strategy. The uncertainty of the variables that influence wind speed is reduced by fuzzy and rough set theory, which simplifies the input of the neural network prediction model and improves accuracy and speed.

Conclusions
Wind power generation is playing an increasingly important role in providing renewable and clean energy alternatives. The intermittency and stochastic nature of wind speed, however, make its prediction a difficult issue for the construction and management of wind farms. Precise wind speed forecasting is, therefore, critical for the efficient operation and planning of large-scale wind turbine integration. In this paper, a deep learning framework is applied to the prediction of wind speed, using an LSTM model hybridized with rough set theory (RST-LSTM) and fuzzy set theory (FST-LSTM) to generate a more accurate model with efficient prediction. The short-term wind speed forecasting model is an important contribution to reliable large-scale wind power forecasting and integration in areas influenced by Mediterranean wind. Moreover, since wind speed is a nonlinear time series, and unlike many other studies that applied the ACF, which measures only a variable's linear dependency, MI is used here for input variable selection so that both linear and nonlinear correlations in the highly fluctuating wind speed time series are evaluated.