Transductive Feature Selection Using Clustering-Based Sample Entropy for Temperature Prediction in Weather Forecasting

Entropy measures have long been of major interest to researchers as a way to quantify the information content of a dynamical system. One of the well-known methodologies is sample entropy, a model-free approach that can be deployed to measure the information transfer in time series. Sample entropy is based on conditional entropy, where a major concern is the number of past delays in the conditioning term. In this study, we deploy a lag-specific conditional entropy to identify the informative past values. Moreover, considering the seasonal structure of the data, we propose a clustering-based sample entropy to exploit the temporal information. Clustering-based sample entropy follows the sample entropy definition while taking into account the clustering information of the training data and the membership of the test point to the clusters. In this study, we utilize the proposed method for transductive feature selection in black-box weather forecasting and conduct experiments on minimum and maximum temperature prediction in Brussels for 1–6 days ahead. The results reveal that considering the local structure of the data can improve the feature selection performance. In addition, despite the large reduction in the number of features, the performance is competitive with the case of using all features.


Introduction
Entropy measures have been used for many years to exploit the amount of information that a system contains. They play a significant role in interpreting and describing the dynamics of real-life complex networks such as climate, financial, physiological, Earth and medical systems [1][2][3][4][5][6]. There can be model-based or model-free approaches to evaluate the entropy measures. While model-based approaches benefit from the prior knowledge about the probability distribution of the data, model-free methods estimate the probability distribution based on the data. Since in many real-life applications, the probability distribution of the data is unknown, in this study, we use a model-free approach known as sample entropy, which is one of the popular methods for analyzing the complexity of a dynamical system. Moreover, in time series analysis, entropy measures can be utilized to illustrate the strength and the direction of causality. The authors in [7] investigate a bivariate dynamical system and suggest conditional entropy to evaluate the amount of information in one particular state of a process when the history of the other one is known. One major concern while using conditional entropy is the number of previous values, known as lag or delay, in the conditioning term. Furthermore, it is important to indicate which lags are more influential. In [8], a lag-specific transfer entropy method was proposed, which evaluates the causality between two time series only based on the informative lags; i.e., the informative delays are selected and the others are discarded.
In this study, we focus on a weather forecasting application as a complex system. Reliable weather forecasting is a central issue, since weather conditions can affect our daily life and activities in different ways. State-of-the-art methods make use of Numerical Weather Prediction (NWP), which requires thousands of CPUs for its simulations and is consequently computationally intensive [9]. In recent years, black-box modeling has been used to address the issue of reliable weather forecasting. Some studies take into account the spatial and temporal properties of the dataset; e.g., Geographically Weighted Regression (GWR) explores the variation of regression coefficients considering spatial information [10]. Other studies have taken advantage of the locality structure of weather conditions to obtain a better performance. In global learning methods, the same weights are considered for all data points in the training data, while transductive learning algorithms assume that the samples in the test point's vicinity are more influential for model fitting [11]. In [12], a clustering-based feature selection for temperature prediction is proposed in which the feature selection and model fitting are done for each cluster independently, and the trained models are used afterwards based on the membership values of the test point to each cluster. In [13], Moving Least Squares Support Vector Machines (M-LSSVM) was proposed as a soft localization of Least Squares Support Vector Machines (LSSVM) for temperature prediction in Brussels. In this study, we propose a transductive approach for measuring the sample entropy in dynamical systems.
In a data-driven approach, weather forecasting can be seen as a Nonlinear AutoRegressive eXogenous (NARX) model; i.e., the historical data of some nearby cities are taken into account as input features. One may use a feature vector that is the concatenation of the weather variables from different cities. Taking into account several lags for each variable leads to a high-dimensional feature vector. Different studies have deployed information theory to find relevant features in static or dynamic problems [14][15][16][17][18][19]. In this paper, we investigate a global and a transductive feature selection for a weather forecasting application. Note that the terms "local" and "global" are considered here in the machine learning sense as in [11] and do not refer to the geographical location of the weather stations. In the global approach, the same weights are considered for all data points in the training data for feature selection, while in the case of the transductive method, the samples in the test point's vicinity in feature space are more influential. For the purpose of feature selection, in this study, we deploy the lag-specific information transfer idea to find relevant features in our problem as the globally selected features for the prediction task. In addition, we propose a clustering-based sample entropy methodology, which can be beneficial for transductive feature selection when the local structure of the data is taken into account. In this approach, depending on the clustering information of the training data and the membership values of the test point to the clusters, the samples have different impacts on the sample entropy. Deploying hard clustering means that only part of the training data is used to measure the sample entropy, all with the same impact, while the other samples are discarded. Using soft clustering, by contrast, one may exploit the information of the whole dataset while assigning different weights to the training samples.
In this study, Soft Kernel Spectral Clustering (SKSC) is utilized for the clustering task. Least Squares Support Vector Machines (LSSVM), one of the popular learning methods, is used to model the data using the globally- and transductively-selected features.
In this study, the experiments are carried out for temperature prediction in weather forecasting. The data have been collected from the Weather Underground website [20] and include real measurements of weather elements such as temperature, dew point, humidity and wind speed for 10 cities: Brussels, Liege, Antwerp, Amsterdam, Eindhoven, Dortmund, London, Frankfurt, Groningen and Dublin. To evaluate the performance of the proposed method, there are two test sets in different periods of the year: (i) from mid-November 2013 to mid-December 2013 (November/December) and (ii) from mid-April 2014 to mid-May 2014 (April/May). Temperature forecasting is done for both minimum and maximum temperature prediction for 1–6 days ahead in Brussels.
The remaining part of the paper proceeds as follows: first, we explain the background and the proposed method. Then, we present and discuss the results for the application of temperature prediction in weather forecasting, and finally, concluding remarks are presented.

Materials and Methods
In this section, first we will explain the background of the methodologies that are used as baselines for transductive feature selection using clustering-based sample entropy. Afterwards, the proposed methods will be described in detail.

Background
In this section, we cover the methods used in our algorithm. First, we explain different entropy measures in static and dynamic cases. Secondly, we describe a lag-specific information transfer method, which is used as the main idea of feature selection in our proposed method. Later, Soft Kernel Spectral Clustering (SKSC), which is utilized to find the local structure in our data, and Least Squares Support Vector Machines (LSSVM), which is deployed as a learner, are explained respectively.

Entropy and Information Transfer
Entropy measures are popular methods to investigate the uncertainty of the data. In [21], Shannon discusses that the fundamental problem in a communication system is the reproduction of the message sent from one point to the other point. A communication system includes five elements: (1) the information source, which generates one or more messages to be delivered to the destination; (2) the transmitter, which manipulates the messages to pass them through the channel; (3) the channel, which is a medium to transfer the messages to the destination; (4) the receiver, which retrieves the original message by inverting what the transmitter did; and (5) the destination, which is the intended target of the message.
Shannon defines a measure of uncertainty for the outcome of the system known as Shannon entropy. Given a set of possible events ∆ = {δ_1, δ_2, ..., δ_n} with occurrence probability p(δ_i) for i ∈ {1, ..., n}, the Shannon entropy can be defined as follows:

H(∆) = − ∑_{i=1}^{n} p(δ_i) log p(δ_i). (1)

In (1), H(∆) shows the uncertainty in the information that the variable ∆ gives about itself. Joint entropy is a measure that evaluates the uncertainty of the outcome when there is more than one random process. Assuming there is another variable Π with a set of possible events {π_1, π_2, ..., π_n} and occurrence probability p(π_j) for j ∈ {1, ..., n}, the joint entropy can be defined as follows:

H(∆, Π) = − ∑_{i=1}^{n} ∑_{j=1}^{n} p(δ_i, π_j) log p(δ_i, π_j), (2)

where p(δ_i, π_j) is the probability of the joint occurrence of δ_i and π_j [21]. Conditional entropy is a measure to assess the uncertainty of one random process when the other one is known. Given the value of Π, the conditional entropy of ∆ given Π can be defined as the average of the Shannon entropy as follows [22]:

H(∆|Π) = − ∑_{i=1}^{n} ∑_{j=1}^{n} p(δ_i, π_j) log p(δ_i|π_j). (3)

The aforementioned Equations (1)–(3) do not consider the time of occurrence, and hence they are known as static. However, these definitions in information theory play significant roles in the analysis of dynamical systems [23][24][25]. In dynamic processes, the entropy measures can be useful to express the information content of a process over time, e.g., the information that the process contains at a specific state or the information that is received from the previous states. These methods have been used in a wide range of real-world applications such as climatology, physiology, finance and biology [2][3][4][8][26].
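As a concrete illustration of the static definitions (1)–(3), the entropies can be computed directly from a probability table (a minimal sketch; the probabilities below are arbitrary examples, not from the paper):

```python
import numpy as np

def shannon_entropy(p):
    """H = -sum_i p_i log p_i (natural log); zero-probability events contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def conditional_entropy(p_joint):
    """H(Delta | Pi) = H(Delta, Pi) - H(Pi), computed from a joint
    probability table with rows indexed by delta and columns by pi."""
    p_joint = np.asarray(p_joint, dtype=float)
    return shannon_entropy(p_joint) - shannon_entropy(p_joint.sum(axis=0))

# A fair coin carries ln(2) nats of uncertainty; if Delta and Pi are
# independent, knowing Pi does not reduce the uncertainty about Delta,
# while a deterministic relation reduces it to zero.
print(shannon_entropy([0.5, 0.5]))
print(conditional_entropy([[0.25, 0.25], [0.25, 0.25]]))
print(conditional_entropy([[0.5, 0.0], [0.0, 0.5]]))
```

The natural logarithm is used throughout, so entropies are expressed in nats rather than bits.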
The definition of entropy measures for dynamical processes is similar to the static case. To express the uncertainty in these systems, assume X_i indicates a random variable sampled from a dynamic process X at time i, and X_i^− = {X_1, X_2, ..., X_{i−1}} shows its past states. Given that p(x_i) is the probability that X_i holds the value x_i and S_i is the set of possible values of x_i, the Shannon entropy explains the information content at the current state of the process:

H(X_i) = − ∑_{x_i ∈ S_i} p(x_i) log p(x_i). (4)

Furthermore, the joint entropy expresses the information content of the current and the past states of the random variable X as follows:

H(X_i, X_i^−) = − ∑ p(x_1, ..., x_{i−1}, x_i) log p(x_1, ..., x_{i−1}, x_i). (5)

Moreover, the conditional entropy is equal to the amount of information that the current state of the random process contains in addition to its past states and can be written as follows:

H(X_i | X_i^−) = − ∑ p(x_1, ..., x_{i−1}, x_i) log p(x_i | x_1, ..., x_{i−1}), (6)

where p(x_i | x_1, ..., x_{i−1}) is the probability that X_i holds the value x_i given that X_1 to X_{i−1} are x_1 to x_{i−1}, respectively. In [27], the entropy of dynamical processes is introduced as a predictability measure. Considering that the conditional entropy of a dynamical process can be interpreted as the new information that is gained at the current state and is unknown from the previous ones, one may say that if the value of the current state is completely predictable from the previous ones, then there is no new information in the current state; hence, the conditional entropy is equal to zero. Conversely, a large conditional entropy shows that the amount of information generated by the current state is large; thus, there is a lack of information to predict the current state based on its history.
In order to use these measures on real-world datasets, the experiments rely on time series prediction. Assume X = {x_1, x_2, ..., x_N} is a time series of length N. To deploy the entropy measures, the probability density function can be approximated as follows [28]:

p(x_i) ≈ (1/N) ∑_{j=1}^{N} K(x_j, x_i), (7)

where K(·, ·) is a kernel function that measures the similarity of x_j and x_i. Given the probability distribution, the Shannon entropy of the time series can be written as H(X) ≈ − log⟨p(x_i)⟩, where ⟨p(x_i)⟩ indicates the average of p(x_i) over all possible values x_i. Furthermore, substituting (7) into (5), the conditional entropy can be expressed in terms of joint probabilities as follows [23]:

H(x_i | x_{i−m}, ..., x_{i−1}) ≈ − log ( ⟨p(x_{i−m}, ..., x_{i−1}, x_i)⟩ / ⟨p(x_{i−m}, ..., x_{i−1})⟩ ). (8)

In this study, in order to measure entropy values, we use a special case of the dynamical entropy known as sample entropy [3]. In the sample entropy method, the kernel function K(·, ·) is taken to be the Heaviside kernel. The Heaviside kernel sets a threshold r on the distance of x_i to each sample; i.e., it indicates how many samples are within distance r of x_i. One of the popular approaches is to measure the distance based on the maximum norm, which is the maximum of the absolute differences between the features of two samples. Considering θ(x_i, x_j) = max_{1≤q≤d} |x_{jq} − x_{iq}|, where x_{iq} is the q-th component (feature) of the i-th data point and d is the number of features, the Heaviside kernel is expressed as follows:

K(x_i, x_j) = 1 if θ(x_i, x_j) ≤ r, and 0 otherwise. (9)

Measuring entropy in time series has its own challenges. One important issue in computing entropy criteria is the curse of dimensionality [29]: as the length of the conditioning history grows, the conditional entropy estimate gets closer to zero. Thus, in practice, a limited length of history is taken into account. Considering only m previous values for the joint probability (to avoid the curse of dimensionality) and excluding the self-match, Equation (8) is equivalent to the sample entropy of the time series [3,23].
Sample entropy can be described as follows. Assuming X is a realization of a time series {x_1, x_2, ..., x_N} of length N, X_i^m is a vector of length m defined as follows:

X_i^m = {x_i, x_{i+1}, ..., x_{i+m−1}}. (10)

Excluding the self-match, for i ranging from 1 to N − m, A_i^m(r) and B_i^m(r) in the m- and (m+1)-dimensional spaces are calculated as follows:

A_i^m(r) = (1/(N − m − 1)) ∑_{j=1, j≠i}^{N−m} K(X_i^m, X_j^m), (11)

B_i^m(r) = (1/(N − m − 1)) ∑_{j=1, j≠i}^{N−m} K(X_i^{m+1}, X_j^{m+1}), (12)

where K(·, ·) is the Heaviside kernel, which indicates how many samples are within distance r of X_i^m. Afterwards, A^m(r) and B^m(r) are defined to be equal to the average of A_i^m(r) and B_i^m(r) over all possible X_i^m:

A^m(r) = (1/(N − m)) ∑_{i=1}^{N−m} A_i^m(r), (13)

B^m(r) = (1/(N − m)) ∑_{i=1}^{N−m} B_i^m(r). (14)

Finally, the sample entropy in the m-dimensional space is calculated as follows:

SampEnt(m, r, N) = − ln ( B^m(r) / A^m(r) ). (15)

Note that B^m(r) is always smaller than or equal to A^m(r); thus, SampEnt(m, r, N) has a non-negative value.
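The computation of SampEnt(m, r, N) can be sketched in a few lines (a minimal illustration of the sample entropy definition with the maximum norm and self-matches excluded; function and variable names are ours):

```python
import numpy as np

def _match_fraction(x, length, n_templates, r):
    """Fraction of ordered template pairs of the given length whose maximum-norm
    distance is at most r, excluding self-matches (the Heaviside kernel count)."""
    t = np.array([x[i:i + length] for i in range(n_templates)])
    d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)  # pairwise Chebyshev distances
    return ((d <= r).sum() - n_templates) / (n_templates * (n_templates - 1))

def sample_entropy(x, m, r):
    """SampEnt(m, r, N) = -ln(B^m(r) / A^m(r)): A from m-dimensional templates,
    B from (m+1)-dimensional ones, both built from the same N - m starting points."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m
    a = _match_fraction(x, m, n, r)
    b = _match_fraction(x, m + 1, n, r)
    return -np.log(b / a)

# A strictly periodic series is perfectly predictable: extending the template
# by one sample loses no matches, so its sample entropy is zero.
print(abs(sample_entropy([1, 2] * 50, m=2, r=0.1)))  # 0.0
```

For an irregular series, extending the templates removes matches, B^m(r) < A^m(r), and the value is strictly positive.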
As previously explained, the entropy measures can be interpreted as a measure of predictability. In this study, we deploy the sample entropy definition to find the relevant delays in a NARX model. In this scheme, the m-dimensional vectors (refer to (11)) contain the predictor time series, and in the (m+1)-dimensional space (refer to (12)) the target time series is appended to them.

Lag-Specific Information Transfer
As previously mentioned, entropy criteria can be utilized to determine how much information is transferred from the previous states of a dynamical process to the current one. This can be extended to investigate more than one dynamical process and to evaluate their relations and influences on each other. The authors in [8] have proposed a lag-specific transfer entropy method to evaluate the information transfer. Given that X_n and Y_n are the values of the time series X and Y at time n, and X_n^− = {X_{n−1}, X_{n−2}, ...} and Y_n^− = {Y_{n−1}, Y_{n−2}, ...} indicate the past values of the time series, based on Granger causality (G-causality), there is a G-causality from X to Y if X_n^− includes information that can improve the prediction of Y_n above and beyond the information that Y_n^− involves [30]. The amount of information contained in X_n^− can be measured using the conditional mutual information:

I(Y_n; X_n^− | Y_n^−) = H(Y_n | Y_n^−) − H(Y_n | X_n^−, Y_n^−). (16)

Based on this definition, there is G-causality from X to Y if and only if I(Y_n; X_n^− | Y_n^−) > 0 [31]. The authors in [8] have discussed the fact that this approach accumulates the G-causal influence of all past values and therefore does not consider lag-specific information; i.e., the amount of information that a specific state X_{n−t} gives to Y_n is unknown. In order to make it lag-specific, an itemized approach is proposed: there is G-causality from X_{n−t} to Y if X_{n−t} includes information that can improve the prediction of Y_n above and beyond the information contained in Y_n^− and in the remaining past values of X. This approach can exploit the amount of information in each lag of different dynamical processes, and therefore the informative lags can be identified. Eventually, the transfer entropy can aggregate only the information contained in the informative past values. The procedure for finding the informative past values can be described as follows.
Assume V is the set of selected influential and informative components and V̄ is the set of candidate components; then V ∩ V̄ = ∅ and V ∪ V̄ = {X_{n−1}, ..., X_{n−L_max}, Y_{n−1}, ..., Y_{n−L_max}}, where L_max is the maximum lag to be taken into account. Note that in this study, X and Y are uni-variate time series. Detecting the influential lags is an iterative procedure, where V_k and V̄_k indicate V and V̄ at iteration k, respectively. The algorithm starts with V_0 as an empty set. In each iteration, for each W ∈ V̄_{k−1}, a candidate set {W, V_{k−1}} is created, and the conditional entropy H(Y_n | W, V_{k−1}) is computed. The component W that results in the minimum conditional entropy (arg min_W H(Y_n | W, V_{k−1})) is selected, and subsequently, V and V̄ are updated as follows:

V_k = {W} ∪ V_{k−1}, V̄_k = V̄_{k−1} \ {W}.

The procedure terminates when an irrelevant component is added to the selected set V.
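The greedy selection loop can be sketched as follows (a simplified illustration: for brevity we plug in a discrete conditional-entropy estimator rather than the sample-entropy estimator used later in the paper; `candidates` maps component names to their value series, and all names are ours):

```python
import numpy as np

def cond_entropy_discrete(y, X):
    """Plug-in estimate of H(y | X) for discrete-valued arrays."""
    states = [tuple(row) for row in X]
    h, n = 0.0, len(y)
    for s in set(states):
        idx = [i for i, t in enumerate(states) if t == s]
        _, counts = np.unique(np.asarray(y)[idx], return_counts=True)
        p = counts / counts.sum()
        h += (len(idx) / n) * -np.sum(p * np.log(p))
    return h

def greedy_select(y, candidates, max_features):
    """At each iteration, add the candidate W minimizing H(y | W, V_{k-1})."""
    selected, remaining = [], list(candidates)
    for _ in range(max_features):
        def h_with(name):
            cols = [candidates[k] for k in selected + [name]]
            return cond_entropy_discrete(y, np.column_stack(cols))
        best = min(remaining, key=h_with)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the actual procedure, the loop additionally stops as soon as the newly added component fails the surrogate-based significance test described below, rather than after a fixed number of features.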
The relevance of the selected component W_k at iteration k is evaluated based on the significance of the reduction in the conditional entropy:

∆H_k = H(Y_n | V_{k−1}) − H(Y_n | W_k, V_{k−1}). (17)

To determine the significance of this reduction, a statistical approach deploying time-shifted surrogate data is used [8]. In this approach, the surrogate data are generated by repeatedly shifting the original series W_k by a randomly-selected lag with respect to Y_n. For example, assuming W_k has N elements and W_k = [W_{k1}, W_{k2}, ..., W_{kN}], then the time series shifted by a lag equal to l is [W_{k(l+1)}, W_{k(l+2)}, ..., W_{kN}, W_{k1}, ..., W_{kl}] [32]. Afterwards, the reduction of the conditional entropy is evaluated for the original series and for the shifted ones. If the reduction of the conditional entropy for the original series is below the 100(1 − a) percentile of its distribution on the surrogate data, W_k is considered to be an irrelevant feature (delay), and the termination condition is fulfilled; otherwise, W_k is a relevant variable and is added to the selected set V. Note that the lag used for shifting W_k has to be large enough (not close to one or to N) to eliminate the causality effect between the shifted time series and the output. If the null hypothesis is rejected, one can be confident that the reduction in the conditional entropy is in fact due to causality and not random.
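The time-shifted surrogate test can be sketched as follows (a minimal illustration; `reduction_fn` stands for the conditional-entropy reduction achieved by a candidate series, and the shift range that keeps the lag away from 1 and N is our choice):

```python
import numpy as np

def is_relevant(reduction_fn, w, n_surrogates=100, alpha=0.05, seed=None):
    """Return True if the reduction achieved by the original series w exceeds
    the 100(1 - alpha) percentile of the reductions achieved by circularly
    time-shifted copies of w (the shift breaks the temporal alignment with
    the target while preserving the dynamics of w itself)."""
    rng = np.random.default_rng(seed)
    n = len(w)
    observed = reduction_fn(w)
    shifts = rng.integers(n // 4, 3 * n // 4, size=n_surrogates)
    surrogate = [reduction_fn(np.roll(w, -int(l))) for l in shifts]
    return observed > np.quantile(surrogate, 1 - alpha)
```

For instance, if the target simply copies `w`, any large circular shift destroys the dependence, the surrogate reductions collapse toward zero, and the candidate is correctly declared relevant.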
The authors in [8] have used the idea of relevant component selection to evaluate the transfer entropy between variables and to measure the amount of information that they transfer to each other. Nevertheless, in this paper, regardless of the amount of transferred information, we use the lag-specific component selection idea as a forward feature selection approach to find relevant features in a NARX model. Therefore, if a lag-specific component contains information according to (17), it is selected as a relevant feature.

Soft Kernel Spectral Clustering
To take advantage of the local structure of the data, one may use clustering. As previously said, using soft clustering can be beneficial to exploit the knowledge of all samples while considering different weights for the samples in each cluster depending on the membership of the test point to that cluster. In this study, we use Soft Kernel Spectral Clustering (SKSC), which is one of the popular non-linear clustering methods [33].
Assume κ is the number of clusters and x_i is a row vector containing d features for i ∈ {1, 2, ..., N}. With l indexing the κ − 1 score variables needed to encode the κ clusters, the projection of the training data in the feature space can be represented by e^(l) = [e_1^(l), ..., e_N^(l)]^T. Let γ_l ∈ R^+ be the regularization parameters. The SKSC primal formulation is expressed as follows [33,34]:

min_{w^(l), e^(l), b_l} (1/2) ∑_{l=1}^{κ−1} w^(l)T w^(l) − (1/(2N)) ∑_{l=1}^{κ−1} γ_l e^(l)T D_Ω^{−1} e^(l)
subject to e^(l) = Φ w^(l) + b_l 1_N, l = 1, ..., κ − 1, (18)

where Φ = [ϕ(x_1)^T; ...; ϕ(x_N)^T] and ϕ(·): R^d → R^h is the feature map that maps the data to a high- or infinite-dimensional space. Note that for a positive definite kernel function K(·, ·), one may exploit Mercer's theorem [35], K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j), to use the feature map implicitly; thus, ϕ(·) does not have to be explicitly defined.
In addition, D_Ω ∈ R^{N×N} is the diagonal degree matrix associated with the kernel matrix Ω, where D_Ω^{(i,i)} = ∑_{j=1}^{N} Ω_{ij} and Ω_{ij} = K(x_i, x_j). Let α^(l) ∈ R^N be the Lagrange multipliers; the Lagrangian is

L(w^(l), b_l, e^(l); α^(l)) = (1/2) ∑_{l} w^(l)T w^(l) − (1/(2N)) ∑_{l} γ_l e^(l)T D_Ω^{−1} e^(l) + ∑_{l} α^(l)T (e^(l) − Φ w^(l) − b_l 1_N), (19)

and the optimality conditions for l = 1, ..., κ − 1 are as follows [34]:

∂L/∂w^(l) = 0 → w^(l) = Φ^T α^(l),
∂L/∂b_l = 0 → 1_N^T α^(l) = 0,
∂L/∂e^(l) = 0 → α^(l) = (γ_l/N) D_Ω^{−1} e^(l),
∂L/∂α^(l) = 0 → e^(l) = Φ w^(l) + b_l 1_N. (20)

After eliminating w^(l), b_l and e^(l), the dual problem is described as follows:

D_Ω^{−1} M_D Ω α^(l) = λ_l α^(l), (21)

where α^(l) is the vector of dual variables, λ_l = N/γ_l and M_D = I_N − (1/(1_N^T D_Ω^{−1} 1_N)) 1_N 1_N^T D_Ω^{−1} is a centering matrix. The clustering model in the dual space for a given sample x_i is as follows:

e_i^(l) = ∑_{j=1}^{N} α_j^(l) K(x_j, x_i) + b_l. (22)

After finding the initial borders of the clusters, the prototypes in the score variables' space are recalculated to improve the clusters' borders, and subsequently, the data points are assigned to a cluster based on their distance to the prototypes. The prototype ψ_c of cluster c for c = 1, ..., κ can be found as follows:

ψ_c = (1/N_c) ∑_{i ∈ cluster c} e_i, (23)

where N_c is the number of samples in cluster c and e_i = [e_i^(1), ..., e_i^(κ−1)] are the score variables of sample i. In this study, we use the Radial Basis Function (RBF) kernel (24) as the kernel function; thus, the kernel bandwidth σ together with the number of clusters κ are the two parameters that have to be tuned. The RBF kernel is:

K(x_i, x_j) = exp(−‖x_i − x_j‖²/σ²). (24)

In this study, Average Membership Strength (AMS) is employed to tune the hyperparameters based on a grid search, and the values that yield the maximum AMS are selected. Thus, for different numbers of clusters (in this study, from 2 to 10) and different values of the kernel bandwidth, the AMS on the validation set is evaluated, and the combination that has the maximum AMS is selected. In AMS, the average membership value of the samples in the validation set to each cluster is calculated based on the cosine similarities between the samples and the prototypes of the corresponding cluster.
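The dual step can be sketched numerically (a simplified illustration of solving the eigenvalue problem D_Ω^{-1} M_D Ω α^(l) = λ_l α^(l) for the leading κ − 1 eigenvectors; function names and hyperparameter values are ours):

```python
import numpy as np

def ksc_dual_eigs(X, sigma, n_clusters):
    """Build the RBF kernel matrix, form D^{-1} M_D Omega, and return the
    leading kappa - 1 eigenvectors as the dual variables alpha^(l); the score
    variables then follow from the clustering model e = Omega alpha + b."""
    n = len(X)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    omega = np.exp(-sq / sigma ** 2)                 # RBF kernel matrix Omega
    d_inv = np.diag(1.0 / omega.sum(axis=1))         # inverse degree matrix D^{-1}
    one = np.ones((n, 1))
    s = np.sum(1.0 / omega.sum(axis=1))              # scalar 1^T D^{-1} 1
    md = np.eye(n) - (one @ one.T @ d_inv) / s       # centering matrix M_D
    vals, vecs = np.linalg.eig(d_inv @ md @ omega)
    order = np.argsort(-vals.real)
    return vecs[:, order[: n_clusters - 1]].real     # alpha^(l), l = 1..kappa-1
```

For well-separated clusters, each leading eigenvector is approximately piecewise constant over the clusters, which is what makes the score variables usable for cluster assignment.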
For a given test point x_test with score variables e_test computed via (22), the membership value to cluster c is expressed as follows [33]:

Memb^(c)_test = ∏_{j≠c} d^cos_{test,j} / ∑_{q=1}^{κ} ∏_{j≠q} d^cos_{test,j},

where κ is the number of clusters and d^cos_{test,j} is the cosine distance between the test sample and the prototype of cluster j in the score variables space.
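A sketch of computing such membership values from cosine distances in the score-variable space (our reading of the SKSC scheme; the exact expression in [33] may differ in details, and all names are ours):

```python
import numpy as np

def soft_membership(e_test, prototypes):
    """Membership of a projected test point to each cluster: the product of the
    cosine distances to all *other* prototypes, normalized to sum to one, so a
    point close to prototype c (and far from the rest) gets membership near 1."""
    e_test = np.asarray(e_test, dtype=float)
    cos_dist = np.array([1.0 - e_test @ p / (np.linalg.norm(e_test) * np.linalg.norm(p))
                         for p in np.asarray(prototypes, dtype=float)])
    prod = np.array([np.prod(np.delete(cos_dist, c)) for c in range(len(cos_dist))])
    return prod / prod.sum()
```

A point lying exactly on a prototype receives membership 1 for that cluster and 0 for the others, while an equidistant point receives uniform memberships.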

Least Squares Support Vector Machines
Least Squares Support Vector Machines (LSSVM) is a well-known machine learning method proposed in [36,37]. The main difference between Support Vector Machines (SVM) and LSSVM is that instead of the quadratic programming problem in SVM, LSSVM solves a set of linear equations. Let x ∈ R^d, y ∈ R and ϕ: R^d → R^h, where ϕ(·) is a feature map and h is the dimension of the feature space. The model in the primal space is formulated as follows:

ŷ = w^T ϕ(x) + b,

where b ∈ R and w ∈ R^h. The optimization problem is written as follows [37]:

min_{w,b,e} (1/2) w^T w + (γ/2) ∑_{j=1}^{N} e_j²
subject to y_j = w^T ϕ(x_j) + b + e_j, j = 1, ..., N,

where {x_j, y_j}_{j=1}^{N} is the training set, γ ∈ R^+ is the regularization parameter and e_j = y_j − ŷ_j is the error between the actual and predicted output for data point j.
Let α_j ∈ R be the Lagrange multipliers. Then, based on the Lagrangian

L(w, b, e; α) = (1/2) w^T w + (γ/2) ∑_{j=1}^{N} e_j² − ∑_{j=1}^{N} α_j (w^T ϕ(x_j) + b + e_j − y_j),

the optimality conditions are as follows:

∂L/∂w = 0 → w = ∑_{j=1}^{N} α_j ϕ(x_j),
∂L/∂b = 0 → ∑_{j=1}^{N} α_j = 0,
∂L/∂e_j = 0 → α_j = γ e_j, j = 1, ..., N,
∂L/∂α_j = 0 → y_j = w^T ϕ(x_j) + b + e_j, j = 1, ..., N.

After eliminating w and e, the dual problem can be formulated as the linear system

[ 0    1_N^T        ] [ b ]   [ 0 ]
[ 1_N  Ω + I_N/γ    ] [ α ] = [ y ],

where Ω is the kernel matrix with Ω_{ij} = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j). In this study, we deploy the RBF kernel as formulated in (24). Thus, the regularization parameter γ and the kernel bandwidth σ are the tuning parameters.
With α_j and b the solution of the linear system, the LSSVM model as a function estimator is expressed as follows:

ŷ(x) = ∑_{j=1}^{N} α_j K(x, x_j) + b.

In this study, we use LSSVM to learn the model based on the selected features; thus, a good performance can be an indication that relevant features have been selected.
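The training and prediction steps can be sketched as follows (a minimal illustration of solving the dual linear system with an RBF kernel; function names and hyperparameter values are ours):

```python
import numpy as np

def lssvm_train(X, y, gamma, sigma):
    """Solve the LSSVM dual linear system
        [ 0      1^T         ] [b]   [0]
        [ 1  Omega + I/gamma ] [a] = [y]
    with an RBF kernel Omega_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    n = len(y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    omega = np.exp(-sq / sigma ** 2)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]          # bias b, dual variables alpha

def lssvm_predict(Xtest, Xtrain, b, alpha, sigma):
    """yhat(x) = sum_j alpha_j K(x, x_j) + b."""
    sq = np.sum((Xtest[:, None, :] - Xtrain[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / sigma ** 2) @ alpha + b
```

Note that the condition ∑_j α_j = 0 from the optimality conditions is enforced by the first row of the linear system.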

Transductive Feature Selection Using Clustering-Based Sample Entropy
In this study, we propose a methodology for transductive feature selection based on the clustering-based sample entropy. The seasonal behavior of weather conditions is the intuition behind investigating transductive feature selection. Most feature selection methods take into account the relevance of the features for prediction over the whole dataset. In transductive feature selection, however, we assume that some features can be relevant in some part of the data while being irrelevant when all samples are taken into account, and vice versa. In [12], a clustering-based feature selection is deployed in weather forecasting, and it is shown that selecting features based on the clustering information can result in a better performance for weather prediction.
Given that weather forecasting can be seen as a Nonlinear AutoRegressive eXogenous (NARX) model [12,38], and assuming Y_t and X_{p,t} for p = 1, ..., d are the output and the p-th exogenous input of the system at time t and s is a positive integer denoting the number of steps ahead in the future to predict, the NARX model can be written as follows:

Y_{t+s} = f(Y_t, ..., Y_{t−L_max+1}, X_{1,t}, ..., X_{1,t−L_max+1}, ..., X_{d,t}, ..., X_{d,t−L_max+1}),

where f(·) is an unknown nonlinear function and L_max is the maximum lag taken into account. Having X ∈ R^{N×d} and Y ∈ R^N, where X_{j,i} is the value of the exogenous input j at time i and Y_i is the value of the uni-variate time series Y at time i, we define X_train = [X_{1:d,1:N−L_max}, X_{1:d,2:N−L_max+1}, ..., X_{1:d,L_max:N−1}, Y_{1:N−L_max}, ..., Y_{L_max:N−1}] and X_test = [X_{1:d,N−L_max+1}, X_{1:d,N−L_max+2}, ..., X_{1:d,N}, Y_{N−L_max+1:N}]. Note that X_train ∈ R^{(N−L_max)×((d+1)×L_max)} and X_test ∈ R^{1×((d+1)×L_max)}.
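Constructing the lagged design matrix and targets amounts to stacking delayed copies of the inputs and the output (a minimal sketch; the helper name and the exact column ordering are ours):

```python
import numpy as np

def build_narx_features(X, y, l_max, s):
    """Build a lagged NARX design matrix: each row for time t stacks
    X[t - l, :] and y[t - l] for l = 0..l_max-1, paired with the target
    y[t + s] located s steps ahead. X has shape (N, d), y has shape (N,);
    the result has (N - l_max - s + 1) rows and (d + 1) * l_max columns."""
    N, d = X.shape
    rows, targets = [], []
    for t in range(l_max - 1, N - s):
        lagged = [np.concatenate([X[t - l], [y[t - l]]]) for l in range(l_max)]
        rows.append(np.concatenate(lagged))
        targets.append(y[t + s])
    return np.array(rows), np.array(targets)
```

Each feature column then corresponds to one candidate component (one variable at one lag), which is exactly the granularity at which the selection procedure below operates.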
The diagram of the proposed method is depicted in Figure 1. As shown, the algorithm consists of three main steps. In the first block, a clustering algorithm is applied to the data. The output of this block includes the clustering information of the training samples and the membership of the test point to each cluster. Depending on the membership values of the test point, some parts of the dataset are considered for the feature selection. Using hard clustering, only a subset of the data, namely the samples of the cluster that the test point belongs to, is passed to the next block to find relevant features. Using soft clustering, however, all data points can be used in the next block depending on the membership values of the test point to the clusters. In this study, we deploy SKSC in the clustering block. Thus, we exploit the information of all data: the data in each cluster affect the feature selection procedure based on the membership of the test point to the corresponding cluster. In the second block, the feature selection procedure is defined by finding the informative lags or delays of the input time series. Knowing the samples in each cluster and the membership values of the test point to each cluster, the feature selection using the clustering-based sample entropy is an iterative procedure, which is shown in Figure 2 and can be expressed as follows. Let V be the set of selected informative lags of the time series and V̄ be the set of candidate components; thus, similar to lag-specific transfer entropy, V ∩ V̄ = ∅ and V ∪ V̄ = {X_{1,1:N−L_max}, ..., X_{1,L_max:N−1}, ..., X_{d,1:N−L_max}, ..., X_{d,L_max:N−1}, Y_{1:N−L_max}, ..., Y_{L_max:N−1}}, where X_{i,t1:t2} is a column vector including the values of the time series X_i in the time period t_1 to t_2. The feature selection is done in an iterative procedure where V_k and V̄_k indicate V and V̄ at iteration k, respectively. The algorithm starts with V_0 as an empty set.
Each iteration consists of three steps:
1. For each W ∈ V̄_{k−1}, a candidate set {W, V_{k−1}} is created, and the conditional entropy H(Y_n | W, V_{k−1}) is computed based on the clustering-based sample entropy.
2. The component W that minimizes the conditional entropy (arg min_W H(Y_n | W, V_{k−1})) is selected to be added to the selected set V.
3. V and V̄ are updated as V_k = {W} ∪ V_{k−1} and V̄_k = V̄_{k−1} \ {W}, and the termination condition is checked.
The procedure terminates when an irrelevant component is added to the selected set V. In this study, we utilize the surrogate data to evaluate the relevance of the selected feature as explained in Section 2.1.2.
The clustering-based sample entropy at iteration k can be expressed as follows:
1. Assuming S_k ∈ R^{N×k} is the concatenation of the selected set of features V_k for all samples, the samples can be partitioned into separate groups based on the clustering information such that S_k^c ∈ R^{N_c×k} represents the selected features of the samples in cluster c.
2. In each cluster, for i ranging from 1 to N_c, A_i^{k,c}(r) and B_i^{k,c}(r) in the k- and (k+1)-dimensional spaces are calculated analogously to (11) and (12): K(·, ·) is the Heaviside kernel, which indicates how many samples of cluster c are within distance r of S_{i,k}^c (the i-th row of S_k^c), and in the (k+1)-dimensional case the target value is appended to the selected features. In this step, the probability density function is thus estimated for two cases: using only the selected features, and using the selected features together with the target value; therefore, the conditional entropy of the target value given the selected features can be calculated.
3. Similar to sample entropy, A^{k,c}(r) and B^{k,c}(r) in the k- and (k+1)-dimensional spaces are defined to be equal to the average of A_i^{k,c}(r) and B_i^{k,c}(r) over all possible S_{i,k}^c. Note that ∑_c A^{k,c}(r) = A^k(r) and ∑_c B^{k,c}(r) = B^k(r), where A^k(r) and B^k(r) are equivalent to (13) and (14) in the sample entropy definition.
4. Finally, the clustering-based sample entropy (CluSampEnt), which represents the conditional entropy in the k-dimensional space, is calculated as follows:

CluSampEnt(k, r, N) = − ln ( ∑_{c=1}^{κ} Memb^(c)_test B^{k,c}(r) / ∑_{c=1}^{κ} Memb^(c)_test A^{k,c}(r) ),

where Memb^(c)_test is the membership value of the test point to cluster c. Note that B^{k,c}(r) is always smaller than or equal to A^{k,c}(r); thus, CluSampEnt(k, r, N) has a non-negative value.
Clustering-based sample entropy can be considered as a transductive entropy measure as it gives us more information about the samples that are more similar to the given test point. Note that if the membership values of the test point to all clusters are equal, the clustering-based sample entropy is equivalent to the sample entropy; thus, all training data points have the same influence on the conditional entropy.
Finally, in the last block in Figure 1, a learner is used to model the data using the selected features. In this study, we use LSSVM for learning the data based on the selected features. A better performance on prediction indicates that more relevant features have been selected.

Experiments on the Simulated Dataset
In this section, we have deployed the proposed methods to find the relevant delays of variables in linear and nonlinear synthetic datasets. In addition, we have compared the proposed method with three other methodologies. The first one is Automatic Relevance Determination (ARD), which is a popular feature selection approach in a Bayesian framework. We have used the implementation of ARD in the framework of LSSVM (LSSVM Toolbox Version 1.8, KU Leuven, Leuven, Belgium) [39]. This method involves three levels of inference: in the first level, the model parameters (the primal weights and bias) are estimated based on the prior, which corresponds to the sum of the squared error and the regularization parameters; in the second level, the hyperparameters, which are utilized to avoid over-fitting and under-fitting, are estimated; and in the third level, the kernel parameter estimation and the model comparison are done [37]. The second approach deploys partial conditioning based on Mutual Information (MI) as an entropic measure. In this method, to select the first feature, the mutual information of each feature with the target value is evaluated, and the one that leads to the maximum value is added to the selected set. In the next iterations, the feature that jointly with the previously selected features has the maximum mutual information with the target value is added to the selected set. The procedure continues until a pre-defined number of features is selected [18]. Finally, we utilized Least Absolute Shrinkage and Selection Operator (LASSO), which is a popular feature selection approach proposed by [40]. LASSO is a regularization method that produces sparse models by imposing an L1-norm penalty on the regression coefficients. Note that both proposed methods and the MI-based method are model-free approaches, while ARD and LASSO are model-based methods.
We have created 10 realizations of all systems for 1000 time steps with random initialization. To evaluate the performance of the different methods on a linear system, consider the global linear system (37), in which e_{u,t} and e_{y,t} are independent white noise with zero mean and variance 0.5. As was previously mentioned, in this paper we assume that the data in different clusters are a function of different variables. Therefore, we define the localized linear system as follows:

Cluster 1 (1 ≤ t ≤ 500):
y_t = 0.7 y_{t−1} + 0.8 u_{t−3} + e_{y,t}

Cluster 2 (501 ≤ t ≤ 1000):
u_t = 0.9 u_{t−1} − 0.6 u_{t−2} + e_{u,t}
y_t = 0.81 y_{t−2} + 0.95 u_{t−4} + e_{y,t}.   (38)

Note that in the first 500 points, y_t is a function of y_{t−1} and u_{t−3}, while in the next 500 points, it is related to y_{t−2} and u_{t−4}.
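The localized system can be simulated directly. The sketch below follows Equation (38), except that the input process for cluster 1, which is not reproduced above, is assumed for illustration to follow the same AR(2) form as in cluster 2:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1000
u, y = np.zeros(T), np.zeros(T)
e_u = rng.normal(0.0, np.sqrt(0.5), T)  # white noise, variance 0.5
e_y = rng.normal(0.0, np.sqrt(0.5), T)

for t in range(4, T):
    # Input process: the text only gives the AR(2) form for cluster 2;
    # reusing it in cluster 1 is an assumption made for illustration.
    u[t] = 0.9 * u[t - 1] - 0.6 * u[t - 2] + e_u[t]
    if t < 500:  # cluster 1: y_t depends on y_{t-1} and u_{t-3}
        y[t] = 0.7 * y[t - 1] + 0.8 * u[t - 3] + e_y[t]
    else:        # cluster 2: y_t depends on y_{t-2} and u_{t-4}
        y[t] = 0.81 * y[t - 2] + 0.95 * u[t - 4] + e_y[t]
```

Both autoregressions are stable (all characteristic roots lie inside the unit circle), so the realizations remain bounded over the 1000 steps.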
In the rest of the paper, we refer to Systems (37) and (38) as the global and the localized linear system, respectively. Similarly to the linear case, consider the nonlinear global system (39). We then modify the global system so that the underlying processes for the first and the second 500 time steps differ, which defines the localized nonlinear system (40). Note that in the first 500 points, y_t is a function of y_{t−1} and u_{t−3}, while in the next 500 points, it is related to y_{t−2} and u_{t−1}.
Considering L_max = 5 and r = 0.1, the occurrences of the first two relevant features in both global and transductive feature selection, together with the ARD, MI-based and LASSO approaches, are shown in Table 1. As is shown, all methods perform equally well for the global linear system, and they are competitive in the case of the global nonlinear system. Note that in all cases, the most relevant features for the prediction of y_t are selected.

Table 1. Number of times for which the corresponding feature is selected using 10 different initial values (global systems (37), (39)).

[Table 1 layout: rows are the methods; columns are the selection counts for u_{t−1}, y_{t−1}, …, u_{t−5}, y_{t−5} in the linear and nonlinear systems.]

In Tables 2 and 3, the occurrences of the first two selected features for two test samples with different membership values to the clusters in the localized systems are presented. Table 2 shows the occurrences when the test point membership values to the first and second cluster are 0.8 and 0.2, respectively; thus, the dependency structure of the test point is closer to the underlying model of the first cluster. In Table 3, the membership values to the clusters are 0.2 and 0.8, so the pattern of the test point is closer to the underlying model of the second cluster. The results reveal that in this scenario, the proposed transductive feature selection approach outperforms the other approaches, as it selects the relevant features more often. This is expected, since the other methods select features under the assumption that all data points have the same effect.
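The clustering-based weighting underlying the transductive approach can be sketched as follows. This is a simplified, assumed form of the method: each pair of training templates contributes to the sample entropy match counts in proportion to weights w derived from the test point's cluster memberships; the exact weighting scheme and the conditional, lag-specific variant are omitted here.

```python
import numpy as np

def weighted_sample_entropy(x, m, r, w):
    """Sample entropy of x with per-sample weights w (a sketch: in the
    transductive setting, w would come from the test point's membership
    to the cluster of each training sample -- an assumed scheme)."""
    n = len(x)

    def weighted_matches(mm):
        total = 0.0
        for i in range(n - mm):
            for j in range(i + 1, n - mm):
                # Heaviside match: templates agree within tolerance r
                if np.max(np.abs(x[i:i + mm] - x[j:j + mm])) <= r:
                    total += w[i] * w[j]
        return total

    B, A = weighted_matches(m), weighted_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

# Uniform weights reduce the estimator to the plain sample entropy.
x, w = np.zeros(20), np.ones(20)
val = weighted_sample_entropy(x, 2, 0.1, w)
```

With uniform weights and a constant series, every template pair matches, so the value is just the log-ratio of the pair counts for lengths m and m+1.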

Table 2. Number of times for which the corresponding feature is selected using 10 different initial values (test point in the localized systems (38), (40): membership values to clusters [0.8, 0.2]).

Table 3. Number of times for which the corresponding feature is selected using 10 different initial values (test point in the localized systems (38), (40): membership values to clusters [0.2, 0.8]).

In the rest of the experimental part, we evaluate the performance of the proposed global and transductive feature selection approaches in the application of weather forecasting.

Weather Dataset
In this study, data have been gathered from the Weather Underground website and include real measurements of weather variables such as minimum and maximum temperature, dew point, humidity and wind speed for 10 cities including Brussels, Antwerp, Liege, Amsterdam, Eindhoven, Dortmund, London, Frankfurt, Groningen and Dublin, as shown in Figure 3.
In order to assess the performance of the proposed methods in different weather conditions, the performance is reported on two test sets covering different time periods: (i) mid-November to mid-December 2013 (November/December) and (ii) mid-April to mid-May 2014 (April/May).
The data cover the period from the beginning of 2007 to mid-2014 and contain 180 measured weather variables for each day. Note that the number of training samples is different for each test data point: it is based on the number of days from the beginning of 2007 until the day before the test date.
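The resulting training scheme is an expanding window: all days before the test date form the training set. A minimal sketch with toy dates and placeholder measurements:

```python
import numpy as np

# Toy daily index; the real setup spans early 2007 to mid-2014.
dates = np.arange("2007-01-01", "2007-01-11", dtype="datetime64[D]")
data = np.arange(len(dates))  # placeholder daily measurements

def expanding_window(test_idx):
    """All samples strictly before the test day form the training set."""
    return data[:test_idx], data[test_idx]

train, test = expanding_window(7)
print(len(train))  # 7: every day before the 8th day is used for training
```

Each later test date therefore trains on a strictly larger sample than each earlier one.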

Weather Forecasting Experiments
In this study, the experiments are done for minimum and maximum temperature prediction in Brussels for 1-6 days ahead. For both SKSC and LSSVM, the dual problem is implemented, and several parameters need to be tuned: the number of clusters and the kernel bandwidth parameter in SKSC are tuned using AMS, with 60% of the data for training and 40% for validation; furthermore, the regularization parameter and the kernel bandwidth in LSSVM are tuned using 10-fold cross-validation. We consider L_max to be 10 and generate 50 realizations of the shifted surrogate data. Three different values [0.7, 1, 1.6] are considered for the r value in the Heaviside kernel, and the results are reported for all of them. Figure 4 shows the AMS value for different numbers of clusters; the maximum AMS is obtained when the data are divided into two clusters. The clusters are also depicted in Figure 4, and one may say that they indicate different patterns in different periods of the year. Note that having one cluster means that the same weights are considered for all data points, and hence, it is equivalent to the global feature selection approach.
In order to compare the accuracy of the data-driven approaches with the state-of-the-art methods in weather forecasting, the performances of the black-box methods are compared with the one of the Weather Underground company. In addition, we utilize the global and transductive feature selection to identify relevant delays of each weather variable. Note that there are 180 weather variables in the dataset, and for each of them, we consider L max to be 10. In this case, there are 1800 features, which is equal to 180 × L max . Furthermore, we compare the results with the one when there is no feature selection.
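The 1800-dimensional candidate set can be built by stacking delays 1 to L_max of every variable; the sketch below is our own illustration (the function name and row layout are assumptions):

```python
import numpy as np

def build_lag_matrix(series, l_max):
    """Stack delays 1..l_max of every variable into one row per day.

    series: (T, V) array of daily measurements for V weather variables.
    Returns a (T - l_max, V * l_max) design matrix."""
    T, V = series.shape
    rows = []
    for t in range(l_max, T):
        # row for day t: all V variables at lags 1, 2, ..., l_max
        rows.append(series[t - l_max:t][::-1].ravel())
    return np.asarray(rows)

# 180 variables with l_max = 10 gives 1800 candidate features,
# matching the dimensionality described in the text.
X = build_lag_matrix(np.zeros((100, 180)), 10)
print(X.shape)  # (90, 1800)
```

The feature selection procedures then operate on the columns of this matrix.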
In Tables 4 and 5, the average Mean Absolute Error (MAE) of the prediction using the global and transductive feature selection for the minimum and maximum temperature in the two test sets is compared. For each value of r, the method with the better performance is shown in bold. In most cases, transductive feature selection results in a lower MAE, which can be an indication that selecting features transductively identifies relevant features better than the global approach.

Table 4. MAE for minimum and maximum temperature prediction in November/December.

Table 5. MAE for minimum and maximum temperature prediction in April/May.
[Tables 4 and 5 layout: columns are Step Ahead, Temp. and the MAE for r = 0.7, r = 1 and r = 1.6.]

In Figures 5 and 6, the performance of the Weather Underground prediction is compared with those of the black-box methods for both test sets. The black-box approaches utilize LSSVM when: (1) there is no feature selection; (2) global feature selection is deployed; and (3) transductive feature selection is used. Note that in the cases utilizing feature selection, the r value is tuned using cross-validation. As is shown, the data-driven approaches are competitive with the state-of-the-art methods in weather forecasting. In the November/December test set, the black-box methods outperform Weather Underground, while Weather Underground shows more reliable forecasts in April/May. This may be due to the lower variance of the observations in the November/December test set. In addition, the performances of the black-box methods with feature selection are competitive with the case without feature selection.

In order to analyze the reduction in the complexity of the methods in terms of the number of features, Figure 7 depicts the average number of selected features for different values of r. As expected, increasing the value of r results in a larger number of selected features: with a larger r value, it takes more iterations for the conditional entropy to decay to zero, and thus more features are selected by the end of the feature selection procedure.
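The effect of the tolerance r can be checked directly: a larger r admits more template matches under the Heaviside kernel, which is why entropy estimates decay more slowly and more features are retained. A small sketch on synthetic data (match_fraction is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)

def match_fraction(x, m, r):
    """Fraction of length-m template pairs matching within tolerance r
    under the Heaviside (Chebyshev) kernel."""
    n = len(x) - m
    hits, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            hits += np.max(np.abs(x[i:i + m] - x[j:j + m])) <= r
    return hits / total

# A larger tolerance admits more matches, so entropy estimates decay
# more slowly and more features survive the selection procedure.
fracs = [match_fraction(x, 2, r) for r in (0.7, 1.0, 1.6)]
print(fracs[0] < fracs[1] < fracs[2])  # True
```

The three tolerances used here are the same [0.7, 1, 1.6] values considered in the experiments.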

As was mentioned, there are 180 weather variables, and with L_max equal to 10, the total number of features is 1800. Figure 7 suggests that in all cases, there is more than a 97% reduction in the number of features. Taking Figures 5 and 6 into account, it can be seen that despite this large reduction, the results are competitive with the case of using all features. We have also investigated the prediction intervals as described in [41] and observed that, even though the number of features is reduced significantly, the prediction intervals remain competitive, indicating that the level of uncertainty is not increased. In addition to LSSVM as a nonlinear (NARX) machine learning approach, we have investigated the impact of feature selection on a linear approach, the AutoRegressive with eXogenous input (ARX) model. The overall prediction performance on both test sets using the ARX model in three cases, (1) without feature selection, (2) with features selected by the global approach and (3) with features selected by the transductive approach, is presented in Figure 8. As is shown, the feature selection methods can improve the performance of the linear model significantly, even though very few features are selected.
The comparison between the performance of the linear model (ARX) and the nonlinear model (LSSVM), while the proposed feature selection methods are used, is depicted in Figure 9. As is shown, the performances of both models are competitive, and this may be due to the efficient feature selection approach.
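An ARX model on the selected lags amounts to ordinary least squares on the lagged features; a minimal sketch with a synthetic system and assumed lags y_{t−1} and u_{t−2}:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 400
u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(2, T):  # synthetic system with assumed dynamics
    y[t] = 0.5 * y[t - 1] + 0.3 * u[t - 2] + 0.05 * rng.standard_normal()

# ARX: ordinary least squares of y_t on the selected lagged features,
# here assumed to be y_{t-1} and u_{t-2}.
X = np.column_stack([y[1:-1], u[:-2]])  # rows are [y_{t-1}, u_{t-2}]
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
print(np.round(coef, 2))  # estimates land close to the true 0.5 and 0.3
```

When the selected lags capture the true dependencies, such a linear fit can come close to the nonlinear model, consistent with Figure 9.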
Moreover, since the proposed method benefits from the clustering information to find informative features, we have compared the results on the weather dataset with the method proposed in [12], which also deploys clustering information to improve the feature selection performance. The mean absolute error of the minimum and maximum temperature prediction is shown in Figure 10. The experimental results reveal that while the number of features selected by the method in [12] is three to four times larger than the number selected by the proposed transductive approach, the performances of both methods are competitive. Note that there are some major differences between the proposed method and the one in [12]. While the transductive feature selection method proposed in this study is model-free, the one in [12] is a linear feature selection approach. In addition, in the proposed method, the selected features can differ depending on the membership values of each test point, while in [12], the features are selected per cluster, and the membership values of the test point only affect the prediction.
In order to investigate the influence of each neighboring city on the temperature of Brussels, Figures 11 and 12 show the percentage of selected features from different cities and different delays. As is depicted, in both global and local feature selection, for short-term prediction (one day ahead), closer cities such as Brussels itself, Dortmund and Amsterdam seem more relevant, while for long-term prediction (six days ahead), farther cities such as Dublin, London and Groningen are more important. Moreover, in short-term prediction, short histories (smaller delays) are more relevant, while for long-term prediction, larger delays should also be taken into account.

Discussion
In this study, we proposed a feature selection approach based on entropy measures in an application of weather forecasting. Sample entropy was used to measure the conditional entropy of the target value, the minimum and maximum temperature in Brussels, given a set of features consisting of weather variables. This set was formed by a forward selection over the time series that affect the target time series. The influence of one time series on another is measured using the conditional entropy; i.e., a smaller conditional entropy indicates a higher power of predictability.
The results suggest that selecting the informative time series that affect the target value can significantly reduce the complexity of the model in terms of the number of features. The performance of the model remains good even with this smaller number of variables; surprisingly, in many cases, the performance is even improved. In addition, a smaller number of features can be beneficial for visualization tasks in complex networks such as climate data.
The results of these experiments support the conclusion of [12], in which it is shown that taking the local structure of the data into account can result in better performance in weather forecasting. In addition, it is shown that in different periods of the year, i.e., in different weather conditions, the influence of the neighboring cities on the weather variables of the target city can differ. For example, in Figures 11 and 12, in the case of six-day-ahead prediction, London seems more influential in the April/May test set, while in the November/December test set, Frankfurt is more informative.
A major drawback of the proposed transductive feature selection method, which uses the clustering-based sample entropy, is that the whole procedure has to be repeated independently for each test point. In daily weather forecasting, there is only one test point per day, for which the weather conditions for 1-6 days ahead should be predicted. Considering that in this application the trained model has to be updated on a daily basis anyway, the transductive approach does not have a higher complexity than the global one. However, for datasets with more than one test point, transductive feature selection becomes time consuming. One possible solution is to cluster the test points: since the test points in each cluster are similar to each other, their membership values to the clusters in the training data should be similar as well. Therefore, transductive feature selection can be done once for each cluster of test points. Note that the proposed method is applicable to any time series prediction task, such as climate, financial or medical systems, since it investigates the impact of regressor time series on the target one.
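The suggested remedy can be sketched with a generic clusterer such as k-means: group the test points, then run the transductive procedure once per group (run_transductive_fs is a hypothetical placeholder, and the test-point data are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Hypothetical test points drawn from two distinct weather regimes.
test_pts = np.vstack([rng.normal(0.0, 0.3, (20, 5)),
                      rng.normal(3.0, 0.3, (20, 5))])

# Cluster the test points once; transductive feature selection then
# runs per test cluster instead of per individual test point.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(test_pts)
for c in np.unique(labels):
    group = test_pts[labels == c]
    # run_transductive_fs(group)  # hypothetical per-cluster call
    print(c, len(group))
```

The cost of feature selection then scales with the number of test clusters rather than the number of test points.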

Conclusions
In this study, we investigated a feature selection approach based on entropy measures in an application of weather forecasting. We deployed the sample entropy to evaluate the conditional entropy of the target value given a set of selected features. A forward selection approach was followed; thus, in each iteration, the variable that minimizes the conditional entropy was added to the set of selected features. In addition, considering the local structure of the data, we proposed the clustering-based sample entropy, which follows the sample entropy definition except that the clustering information of the training data and the membership of the test point to the clusters are taken into account when performing the feature selection.
The performances of the black-box methods were compared with that of the Weather Underground company, and the experiments show that data-driven weather forecasting is competitive with the state-of-the-art methods in this field. The results reveal that utilizing the proposed feature selection methodologies leads to a significant decrease in the number of features, while the performance remains adequate. Moreover, the experiments suggest that transductive feature selection can improve the identification of relevant variables.