Localized convolutional neural networks for geospatial wind forecasting

Convolutional Neural Networks (CNNs) possess many positive qualities when it comes to spatial raster data. Translation invariance enables CNNs to detect features regardless of their position in the scene. But in some domains, like geospatial, not all locations are exactly equal. In this work we propose localized convolutional neural networks that enable convolutional architectures to learn local features in addition to the global ones. We investigate their instantiations in the form of learnable inputs, local weights, and a more general form. They can be added to any convolutional layers, are easily trained end-to-end, introduce minimal additional complexity, and let CNNs retain most of their benefits to the extent that they are needed. In this work we address spatio-temporal prediction: we test the effectiveness of our methods on a synthetic benchmark dataset and tackle three real-world wind prediction datasets. For one of them we propose a method to spatially order the unordered data. We compare against recent state-of-the-art spatio-temporal prediction models on the same data. Models that use convolutional layers can be, and are, extended with our localizations. In all these cases our extensions improve the results, and thus often the state-of-the-art. We share all the code in a public repository.


Introduction
With climate change becoming an urgent concern, energy production increasingly relies on clean renewable energy sources like solar and wind. One fundamental constraint with these sources, however, is that their energy production is stochastic and directly depends on the ever-changing weather conditions. This is especially true for wind. Consequently, matching energy demand and supply in time is becoming more acute. This is being addressed by emerging smart grid technologies where nodes can produce, consume, or store energy and communicate with other nodes, and where energy is priced in real time. The ability to forecast wind (or solar) energy production is an essential part of this solution. For wind, this boils down to forecasting wind speeds at the locations of the wind turbines.
In addition to classical Numerical Weather Prediction (NWP) models, machine learning approaches are increasingly applied, especially for particular locations and short time spans ("nowcasting") where the longer-term NWP models are not available or are expensive to update and rerun. The wind prediction at multiple locations (often spatial grids) given the historical data and possibly other types of signals, is a spatio-temporal (or geo-temporal) task, as it involves both spatial and temporal components. Many deep learning architectures have been recently proposed for this type of task [1].
When predicting a regular grid, the spatial component of the task is typically handled by the convolutional layers in the deep architectures, the eponymous building blocks of convolutional neural networks (CNNs). Convolutional layers have a very nice property, that they treat each location equally and learn, share the same weights at each. This is very helpful, as the laws of thermodynamics (or meteorology) are the same at each location, but only to a point, as locations may also have their special intrinsic features.
In this contribution we propose to enhance convolutional layers with several flavors of localized learnable features to strike a balance and benefit from both location-invariant and location-specific learning. We test our proposed approaches first on a synthetic benchmark dataset and then on three wind forecasting datasets, rigorously comparing to the state-of-the-art. Our localized convolutions consistently increase the performance of the corresponding non-localized architectures and improve the state-of-the-art.
This article is organized in the following way. We provide our motivation and conceptual idea of the localized CNNs in Section 2. We systematically review related previous work in Section 3, both state-of-the-art geo-temporal prediction architectures organized by the way they integrate the spatial and temporal aspects (Section 3.1) and previous attempts at localizing CNNs (Section 3.2). We introduce our proposed methods of localizing CNNs in Section 4, moving from more specific to more general implementations. We provide detailed descriptions of the state-of-the-art deep neural network architectures that we use in our numerical experiments, together with our localized CNN modifications, in Section 5. One synthetic and three real-world datasets, their experiment specifics, and the results with the different models are presented in subsections of Section 6. There we also propose a mutual-information-optimized embedding of the data that originally were not on a regular grid in Section 6.3.1. The article is concluded with the discussion in Section 7.

Localized CNN Motivation
Convolutional neural networks (CNNs) [2] are applied to spatially-ordered data, like images. They are defined by the convolutional layers where each neuron has a fixed local receptive field and shares weights with all the other (repeated) neurons arranged in a lattice of the same filter. Convolution in 2D is defined as

(f ∗ g)(i, j) = Σ_u Σ_v f(u, v) g(i − u, j − v),   (1)

where f(·, ·) is some k × k filter applied to every k × k patch centered at (i, j) on the image g(·, ·). Typically, an element-wise nonlinear function is applied to the results of the convolution, which gives a set of identical weighted-sum-and-nonlinearity neurons, each looking at a different part of the input image. Convolutions give CNNs some unique benefits compared to regular MultiLayer Perceptrons (MLPs):
• The total amount of different trainable weights is drastically reduced, thus the model is:
  - Much less prone to overfitting;
  - Much smaller to store and often faster to train and run;
• Every filter is trained on every k × k patch of every input image g(·, ·), which utilizes the training data well;
• The architecture and learned weights of the convolutional layer do not depend on the size of the input image, making them easier to reuse;
• Convolutions give translation invariance: the features are detected the same way, no matter where they are in the image.
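As a concrete illustration (not from the article's code), equation (1) can be written directly in NumPy; we follow the cross-correlation convention common in deep learning libraries, since flipping the filter only relabels the learned weights:

```python
import numpy as np

def conv2d(g, f):
    """Apply a k x k filter f to every k x k patch of image g ("valid" padding).

    Deep-learning libraries typically implement cross-correlation rather than
    a flipped-kernel convolution; we do the same here for clarity.
    """
    k = f.shape[0]
    H, W = g.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum over the k x k patch anchored at (i, j).
            out[i, j] = np.sum(f * g[i:i + k, j:j + k])
    return out
```

A nonlinearity applied element-wise to `out` then yields the layer's activations; every output position shares the same weights `f`, which is exactly the translation invariance discussed above.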
The last feature is very important for good generalization when objects to be detected are randomly framed in the images, which is common.
However, complete translation invariance might not always be optimal even for images as not all objects or features are equally likely to appear in every part of them. This is especially evident when the images have a constant static framing, like passport or mugshot photos of centered faces (even more so if they are specifically aligned [3]), or frames from a stationary security or road camera. Probabilistically this is also true for images in general: for example, eyes are likely to appear above a mouth and a nose, or sky is more likely to appear in the upper parts of the image than the lower. Models with an appropriate bias could result in a better recognition or a more frugal implementation.
There are more applications where the convolutional neurons would benefit from "knowing where they are" on the lattice / image, as pointed out in [4].
Complete translation invariance is also usually sub-optimal in geospatial CNN applications, like meteorological forecasting, where the locations of the lattice points are fixed. While the same laws of thermodynamics apply in every location, each location typically also has its own unique features like altitude, terrain, land use, large objects, sun absorption, heat capacity of the ground, heat sources, etc. that affect the dynamics. This becomes even more relevant, when the covered area increases, and so do the differences and variety of the locations, including local climate factors.
The complete location agnosticism in CNNs can be remedied in several ways by supplying different additional static location-dependent features:
1. The coordinates of each location on the lattice;
2. A combination of local random static inputs, that could potentially as well allow to uniquely "identify" each location;
3. The above-mentioned real-world-relevant unique local features, if they are explicitly available (typically not all of them are).
Options 1 and 2 only give a chance for the convolutional neurons to "orient themselves" on the lattice, but not much more. Option 3 provides some local features relevant to the task and should probably be used in any case whenever they are available. The convolutional neurons have to learn how to incorporate all this information in a meaningful way. We could, however (or in addition), allow the model to learn the location-based differences more directly by introducing additional:
1. Learnable local inputs / latent variables;
2. Learnable local transformations of the inputs.
These enhancements allow the model to learn some useful features or dynamics of each location and store them locally. They allow us to have "Localized CNNs": models that learn to treat the different locations/pixels in the input similarly, but not identically. This is the main theoretical (or methodological) contribution of this article and is described in detail in Section 4.
There may be other ways of producing localized CNNs in addition to the ones described above. We do not aim to eliminate translation invariance and other mentioned benefits of CNNs completely, which could easily be done by reverting back to regular MLPs, but to strike a balance by making translation invariance non-absolute and retaining the other useful features of CNNs to the extent possible.

Related Work
In this section we review related previous work. It is split into two parts. In Section 3.1 we provide the formalism and a systematic review of state-of-the-art architectures that combine spatial and temporal aspects of the predictions in this domain. The spatial components of the architectures involve convolutional layers where our localizations are applicable. In Section 3.2 we review the few past efforts of localizing CNNs that are most relevant to our approaches and in Section 3.3 mention several previous cases on input learning in deep networks.

Geo-temporal prediction
In this research we tackle wind speed prediction at many stations, given their history. For instance, the data could be sensor readings collected from wind turbines, where forecasting at each site is especially important [5].
We want to forecast wind speeds at a given regular rectangular spatial grid with dimensions W × H, where W and H are the width and the height (or rather length) of the grid, respectively. We assume that the observations are made with a uniform sampling rate in time, and there are T time steps in total. This problem is referred to as spatio-temporal forecasting on a regular grid [1]. Here we only consider the wind speed, so the observations at the time instance t can be expressed as a matrix X_t ∈ R^{W×H}, t ≤ T, t ∈ N. At the time step t we are interested in predicting X_{t+h} ∈ R^{W×H}, where h is the forecasting horizon, given an l-window of observations from the past. This can be expressed as

X̂_{t+h} = arg max p(X_{t+h} | X_{t−l+1}, X_{t−l+2}, . . . , X_t),   (2)

where p(·|·) denotes the probability of a particular state, given the history. Multi-step forecasting could be defined in a similar way, as seen in [6].

From an application perspective, there is a vast body of research focused on spatio-temporal problems. For instance, spatio-temporal relations are modeled in traffic forecasting [7], [8], precipitation nowcasting [6], air-quality prediction [9], and other fields.
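The windowing in (2) translates directly into training-sample construction; a minimal sketch (the function name and array layout are our own choices, not the article's code):

```python
import numpy as np

def make_samples(X, l, h):
    """Slice a (T, W, H) series into (input window, target) training pairs.

    Each input is the l-window X[t-l+1 .. t]; the matching target is the
    single frame X[t+h], mirroring the problem statement above.
    """
    T = X.shape[0]
    inputs, targets = [], []
    for t in range(l - 1, T - h):
        inputs.append(X[t - l + 1 : t + 1])   # shape (l, W, H)
        targets.append(X[t + h])              # shape (W, H)
    return np.stack(inputs), np.stack(targets)
```

For example, a series of T = 100 frames with l = 12 and h = 5 yields 84 pairs, each pairing a 12-frame window with the frame 5 steps past its end.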
Many of the applications have been inspired by the progress in video prediction and classification, since raster spatio-temporal data can be viewed as a video with an arbitrary number of color channels (in our case a single one). For instance, one of the first large-scale applications of CNNs in video classification was in [10], while the combination of CNNs and Recurrent Neural Networks (RNNs) has been introduced in [11].
In most of the contemporary research on wind characteristics forecasting, the spatio-temporal problem is usually solved by connected spatial and temporal components of the architecture. A time window of input frames is usually fed to the spatial component, and the representations it produces are then fed to the temporal component. This can be interpreted as a spatial encoder and a temporal decoder, especially when several prediction horizons are produced simultaneously.
One way to implement this is to apply an identical spatial model to each frame and then feed a time window of the resulting representations to a temporal model that combines these features in a meaningful way over time and produces the prediction matrix:

X̂_{t+h} = f_temporal(f_spatial(X_{t−l+1}), f_spatial(X_{t−l+2}), . . . , f_spatial(X_t)),   (3)

where f_spatial(·) is the model extracting spatial features, usually a standard CNN, and f_temporal(·) is the temporal model with spatial features as input, typically an RNN model or a simpler memory-less feedforward neural network where the representations from the spatial encoder are concatenated. Here the spatial encoding is not time-specific, which is probably adequate, considering that we use a sliding temporal window in the input and a capable temporal model above this layer. This approach has been used in a model named Predictive Deep Convolutional Neural Network (PDCNN) [12], where a CNN followed by an MLP was applied to forecast wind speeds in a farm of 100 wind turbines aligned in a rectangular grid and demonstrated superior results to classical machine learning techniques. Shortly after, the researchers introduced a follow-up model consisting of a CNN followed by a Long Short-Term Memory (LSTM) [13] RNN, called the Predictive Spatio-Temporal Network (PSTN) [14], demonstrating increased performance compared to both the PDCNN and the LSTM alone.
Alternatively, time can be taken as an additional dimension of the grid and the spatial encoding done on the entire time window:

X̂_{t+h} = f_predictor(f_spatial([X_{t−l+1}, X_{t−l+2}, . . . , X_t])),   (4)

where [·, ·, . . . , ·] denotes concatenation, f_spatial(·) is a model that treats the inputs concatenated over time as spatial but having l times more input channels, typically a classical CNN, and f_predictor(·) is a model interpreting the spatial representation, usually a feed-forward neural network, to make the predictions. This approach empowers the spatial component to combine inputs along time. Note that f_spatial here does no convolution over the time dimension; instead, all its filters see all the time steps. Thus it produces and passes to f_predictor a representation with no temporal dimension. Conversely, if every filter in f_spatial would only see a single time step, (4) would fall back to (3).

Not all the architectures used for this type of task have both the spatial and the temporal component. For example, the f_predictor in (4), which is a somewhat degenerate case of f_temporal in (3), can be dropped, resulting in a fully convolutional model

X̂_{t+h} = f_spatial([X_{t−l+1}, X_{t−l+2}, . . . , X_t]).   (5)

The authors of [15] have applied the DenseNet [16] architecture as f_spatial to predict wind power. The data were not spatially ordered, but to use them with CNNs, the authors embedded the wind turbines into a rectangular grid based on their relative positions. The authors examine two model variations: FC-CNN, which is a model of type (4) with f_predictor being a fully-connected layer, and E2E, which remains fully convolutional as in (5).
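The compositions in (3)-(5) can be sketched as plain function compositions; `f_spatial`, `f_temporal`, and `f_predictor` below are stand-in callables for illustration, not the concrete networks used in the article:

```python
import numpy as np

def forecast_frame_wise(frames, f_spatial, f_temporal):
    """Eq. (3): encode each frame separately, then combine over time."""
    return f_temporal([f_spatial(x) for x in frames])

def forecast_stacked(frames, f_spatial, f_predictor):
    """Eq. (4): stack the l frames as channels and encode them jointly."""
    return f_predictor(f_spatial(np.stack(frames, axis=-1)))

def forecast_fully_conv(frames, f_spatial):
    """Eq. (5): drop the predictor; the spatial model outputs the frame."""
    return f_spatial(np.stack(frames, axis=-1))
```

The difference is only in where the time dimension is collapsed: by a dedicated temporal model in (3), by filters that see all time channels at once in (4), and entirely inside the convolutional stack in (5).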
On the other hand, the problem can be treated as a purely temporal one, ignoring the spatial nature of the data,

X̂_{t+h} = f_temporal(X_{t−l+1}, X_{t−l+2}, . . . , X_t),   (6)

by using, e.g., a purely RNN or MLP model.

Combining the spatial and temporal components in an architecture is not a trivial task. In the approaches discussed so far, even though the spatial component (CNN) takes into account the spatial information, the usual choices of the temporal component (MLP or RNN) do not. This seems somewhat sub-optimal, especially here, since the output of the whole model is still spatial, just as its inputs.
This problem was first addressed by introducing the Convolutional LSTM (ConvLSTM) in [6], where not only the CNN but also the LSTM supports spatial-order-preserving operations. To achieve this, the temporal portion of the LSTM's state is a 2D matrix whose entries correspond to the topology of the spatial input. This architecture has been applied successfully to wind power forecasting [17].
A slightly different approach was taken in [18], where the authors had non-uniformly embedded spatial data and, instead of explicitly embedding it onto a grid and using a CNN to model spatial relations, used a CNN to extract the feature relations between multiple weather factors.

Previous CNN localizations
There are several recent contributions that propose appending the coordinates to the spatial representations to aid CNN localization. A CNN with two additional input channels indicating the x and y coordinates of each input in the 2D plane, called CoordConv, was proposed in [4]. The authors demonstrated that this substantially improved the performance compared to a standard CNN on tasks that involve a need for localization.
Similarly, adding location coordinates to the CNN representations, called a semi-convolutional operator, was shown to help separate different instances of the same type of object in an image pixel-wise segmentation task in [19]. A spatial broadcast decoder has been successfully used as the decoder part of a deep variational autoencoder in [20]: it tiles the encoding with the appended x and y coordinates on a 2D lattice and uses a CNN instead of the usual deconvolutions.
CNN localization ideas similar to our learnable local transformations, motivated in Section 2 and detailed in Section 4, have been successfully applied to face recognition in [3]. The authors align the faces before recognition and use locally connected neural layers [21] after the CNN layers in an architecture to aid the recognition of specific parts of the faces. Contrary to the geo-temporal applications, the faces are not pixel-perfectly aligned, thus localizations make sense only in the higher levels of the CNN. But in that approach the convolutions and localizations do not mix: the latter follow the former.

Deep input learning
Input or latent variable learning in deep networks is not that uncommon and is often related to transfer learning.
There are some examples of specialization in deep MLP neural networks for speech recognition by learning "conditional codes" as an additional input unique to every speaker in a mixed MLP / Hidden Markov Model [22], a mixed MLP / CNN model [23], or as direct inputs to all the layers of an MLP [24]. The authors note, however, that this method does not work well for adapting CNNs.
Input optimizations of CNNs have also been explored in neural style transfer [25] or Deep Dream applications, but this is done for a different purpose and is not an additional input.
Meanwhile, learning additional / intermediate inputs to (a part of) a model that are transformations of the external inputs from data is abundant in deep learning, and in fact is its fundamental principle.

Proposed Methods
We propose a few simple building blocks compatible with any kind of CNN architecture to help it better capture location-specific trends in the data.

Learnable Inputs
As mentioned before, CNNs by default are not able to identify the absolute position of the kernel. One way to amend that is to introduce learnable inputs (LIs) of the same spatial dimension that can be concatenated to the original inputs going into any kind of CNN. These static LIs are free parameters that can themselves be trained by backpropagation together with the rest of the network weights and do not require any kind of prior knowledge of the task. More precisely, let the input array have a dimension of w × h × n, where w is the width, h is the height, and n is the number of features (channels) of the array. We add c LIs of the dimension w × h each. Next we concatenate the input array with the learned maps and get a resulting array shaped w × h × (n + c). This is depicted in Figure 1.
In the training process LIs evolve simultaneously with the weights of the model and, likewise, are fixed after the learning phase; the fixed learned maps are then used in the other phases. The benefit of this is that the LIs can come to correspond to some local information, extracted from the training data, about every location of the input, and so more powerful models can be trained. For instance, [4] have shown that adding coordinates as additional inputs helps the CNN to generalize on certain tasks where localization is needed, and where it is not needed, the kernels responsible for interpreting the coordinate information learn to ignore them. We note that LIs can learn such a representation that would help the CNN navigate better.
The c input-sized LIs add extra w · h · c learnable parameters to the model, plus the additional corresponding input weights of the next layer.
One way to implement LIs in modern deep learning frameworks is to add a constant unitary input and connect it with learnable local weights to get the LIs.
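A minimal forward-pass sketch of an LI block in NumPy (the class, the initialization scale, and the seed are our own illustration; in practice `self.L` would be updated by backpropagation like any other weight):

```python
import numpy as np

class LearnableInputs:
    """Concatenate c trainable, input-independent maps to a (w, h, n) input."""

    def __init__(self, w, h, c, seed=0):
        rng = np.random.default_rng(seed)
        # Free parameters: trained jointly with the network weights,
        # then frozen and reused at inference time.
        self.L = rng.normal(scale=0.01, size=(w, h, c))

    def forward(self, X):
        # X: (w, h, n)  ->  (w, h, n + c)
        return np.concatenate([X, self.L], axis=-1)
```

The subsequent convolutional layer then sees n + c input channels, of which the last c carry purely location-specific information.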

Local Weights
LIs might not be enough to enable CNNs to treat inputs at different locations differently. For this we introduce Local Weights (LWs): a locally-connected layer of weights [21] that are not shared as in convolution.
One particular interpretation of LWs could be as importance maps that weight locations of the input that are always more important to the task higher and others lower. This could be useful in cases where not all the regions of the input image are statistically equally important, as discussed previously. In an extreme case of this LWs could become input masks.
LWs are learned in the training phase jointly with the rest of the network weights in the same way as LIs. In our proposed layer we concatenate the original inputs to the locally-weighted ones before passing them to any kind of further CNN layers ( Figure 2), but this is not necessary.
More precisely, we define a squashing ⊗_(1,1) operation for an incoming input X ∈ R^{w×h×n} and local weights M ∈ R^{w×h×n×d}, which produces the locally-weighted input I ∈ R^{w×h×d}:

I_{i,j,o} = f( Σ_{m=1..n} X_{i,j,m} · M_{i,j,m,o} ),   (7)

and, more generally, ⊗_(A,B) is defined on an input weights matrix (tensor) M ∈ R^{w×h×n×A×B×d} as

I_{i,j,o} = f( Σ_{m=1..n} Σ_{a=1..A} Σ_{b=1..B} X_{i+a−⌈A/2⌉, j+b−⌈B/2⌉, m} · M_{i,j,m,a,b,o} ),   (8)

where i, j, o are the row, column, and depth of the resulting matrix I, respectively, and f(·) is the element-wise activation function. In our experiments we used the identity function f(x) = x.

We make a distinction between local weights LWs and learnable inputs LIs. Since LIs are static, they cannot directly locally control the influence strength of the external inputs, nor, more generally, learn the best local weighted combination of the external inputs, which LWs can. On the other hand, LIs convey special features of the location irrespective of the current inputs. Alternatively, LIs can be thought of as bias weights in LWs.
d LWs with 1×1×n receptive fields (7) add w · h · d · n learnable parameters to the model, which is n times more compared to d LIs.
In addition to the described LWs connected to 1×1×n receptive fields of the input (7) and the more general, bigger A × B × n ones (8), there could be an even more direct element-wise weighting of the inputs with receptive fields of 1 × 1 × 1. We explore such a variation by defining an operation on a weight matrix (tensor) M ∈ R^{w×h×n} as the standard element-wise multiplication

I_{i,j,m} = f( X_{i,j,m} · M_{i,j,m} ).   (9)
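With the identity activation used in the experiments, the 1×1×n squashing (7) and the element-wise variant (9) reduce to short NumPy expressions (a sketch of the forward pass only; the function names are ours):

```python
import numpy as np

def local_weights_1x1(X, M):
    """Eq. (7) with identity activation: per-location 1x1xn weighting.

    X: (w, h, n) input; M: (w, h, n, d) unshared weights -> I: (w, h, d).
    Unlike a 1x1 convolution, M differs at every (i, j) location.
    """
    return np.einsum('whn,whnd->whd', X, M)

def local_weights_elementwise(X, M):
    """Eq. (9): direct element-wise (1x1x1) weighting; X and M are both (w, h, n)."""
    return X * M
```

The einsum contracts only the channel axis, so each spatial position keeps its own n × d weight matrix, which is exactly what distinguishes LWs from shared-weight convolution.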

Combined Approach
Finally, we investigate merging both of the described CNN localizations, LIs and LWs, into a combined block illustrated in Figure 3. This is in essence the LW block followed by the LI block. We can vary the number of LIs and LWs depending on the situation.
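A sketch of the combined block's forward pass, assuming the identity activation and the concatenating variants of both blocks (the function name and channel ordering are our own illustration):

```python
import numpy as np

def combined_block(X, M, L):
    """LW block followed by LI block: weight locally, then append learnable maps.

    X: (w, h, n) input; M: (w, h, n, d) local weights; L: (w, h, c) learnable
    inputs. Output: (w, h, n + d + c), keeping the original channels as well.
    """
    weighted = np.einsum('whn,whnd->whd', X, M)      # LW output, d channels
    return np.concatenate([X, weighted, L], axis=-1)  # originals + LW + LI
```

Setting d = 0 or c = 0 recovers the pure LI or pure LW block, which is how the number of LIs and LWs can be varied per situation.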

Implementation by a Locally-Connected Layer
Figure 4: Implementation with a locally-connected layer. The input goes through a locally connected layer with an (a × b) kernel and produces local features that are concatenated with the input.
The combination of LIs and LWs can also be compactly implemented by a locally connected layer with biases, as illustrated in Figure 4. This layer is similar to the convolutional one, except that the weights here are not shared.
The LIs here are somewhat enabled by the local bias weights. If the bias weights are not readily available, they can be implemented by appending constant unitary input channels to the input, before feeding them to the locally connected layer.
This approach, while potentially leading to cleaner code in some current deep learning frameworks, somewhat sacrifices the fine control of the LIs and LWs, which are meshed together here, and may have more (redundant) trainable weights for a similar effect.
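The compact variant can be sketched as follows, assuming a 1 × 1 kernel and identity activation: an unshared-weight layer whose per-location biases play the role of the LIs, with its output concatenated back to the input (names are ours):

```python
import numpy as np

def locally_connected_block(X, W, b):
    """Unshared locally connected layer with biases, as in Figure 4.

    X: (w, h, n) input; W: (w, h, n, d) per-location weights; b: (w, h, d)
    per-location biases acting as learnable inputs. Returns the input with
    the d local features appended: (w, h, n + d).
    """
    local = np.einsum('whn,whnd->whd', X, W) + b
    return np.concatenate([X, local], axis=-1)
```

With a zero input only the biases survive, which makes the LI-like role of `b` explicit; the LW and LI contributions are no longer separable, matching the loss of fine control noted above.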

Baselines
In this section we introduce baseline models and their configurations that we use for comparison, together with our concrete localized CNN architectures. The objective is both to find the best model for a given dataset and also to show the compatibility and effectiveness of our proposed layers with these models. Note that we test these baselines only on real-world datasets.
The architectures that were explored are detailed in Table 1. We use a special notation for this. Functions denote layers here, and arrows denote the tensors passed between them with indication of their dimensions above. For more complex architectures, we define the parts of the model before defining the final model using them.
Here [·, ·, ..., ·] denotes concatenation; C(w × h × n) denotes a 2D convolutional layer having n kernels, each of size w × h; F(·) is a reshaping function, whose output shape can be inferred from the outgoing arrow; σ(·) denotes a sigmoid activation function; FC(x) denotes a fully-connected layer with x neurons; Pool_max(w × h) denotes 2D max-pooling with a pool size of w × h; similarly, Pool_avg(w × h) denotes average pooling; ReLU(·) denotes a rectified linear unit activation function; BN(·) denotes batch normalization [26] followed by a ReLU activation; LSTM(x) denotes an LSTM layer with x neurons; C^T(w × h × n) denotes a 2D transposed convolutional layer with n filters, each of size w × h; ConvLSTM(x) denotes a ConvLSTM layer [6] with x neurons.
The Persistent model variation introduced in Table 1 concatenates the output of each layer with the initial LI and/or LW combination. This enables the model to localize in all of the convolutional layers, rather than only in the first one. In this example the same identical LI/LW elements are incorporated in the subsequent layers. This is not necessary, and unique elements could be used instead.

Benchmarks
We begin the exploration of learnable static components by first applying LIs to a synthetic dataset. Then we compare the effectiveness of these blocks in real-world settings.

Proof of Concept: a Bouncing Ball Task
We first test our ideas in a controlled environment on the classical synthetic bouncing balls dataset [30]. This dataset is interesting because it has both global and local dynamics. While the ball movement "physics" is the same in all places of the frame, it changes at the boundaries off which the balls bounce. Thus a model that can capture both the global and local dynamics could be beneficial here, as opposed to a classical CNN, which only models dynamics globally.
The data consist of 30 × 30 monochrome image (frame) sequences of a white ball on a black background. The balls have a 7.2 pixel radius. They start at a random location inside the frame, moving in a random direction at a constant velocity. We generate 1 600 samples for training, 500 samples for the first testing set, and 500 samples for the second one. Each sample consists of 25 consecutive ball movement frames as input, and the task is to predict the frame 5 time steps after the last input frame as output. Figure 5 illustrates the dataset.

Figure 5: A few samples from the second testing set of the bouncing balls task. The sequence of frames is superimposed here into a single one. The red circles indicate the input to the CNN (only every fourth frame is represented; more transparent red circles correspond to earlier frames in the sequence) and the orange one indicates the expected ground truth prediction. Note that every orange circle here is a result of hitting a wall or a corner.
To have a better insight into the results, we introduce two types of testing data. In the first one all the ball trajectories are represented and in the second one only the predicted trajectories of balls bouncing off the walls and corners are selected, like in Figure 5.
We compare the following models:
• CNN: a two-layer CNN with filter sizes of (20 × 20) and (10 × 10), respectively. There are 22 filters in the first layer and 10 filters in the second. This network has 242 043 learnable parameters.
• LI CNN: a similar setup to CNN, except that it has 20 filters in the first layer but two (30 × 30) learnable inputs are added to learn location-based features. This network has 237 841 learnable parameters.
• CoordConv: a similar setup to LI CNN, except that it has 21 filters in the first layer and additional x and y coordinate inputs instead of the learnable ones [4]. This network has 247 842 learnable parameters.
• RandomConv: the same setup as CoordConv, except that the two additional inputs were randomly initialized in the [−0.5, 0.5] range. The number of parameters remained the same as in the CoordConv setup.
We designed the networks in such a way that the comparison would be fair with respect to the number of learnable parameters. As a result, our proposed architecture had the smallest number of learnable parameters. Each model was trained with 5 random initializations, and each training session consisted of 100 epochs. We plot the averaged validation and testing errors at each epoch in Figure 6. We can see that CoordConv, RandomConv, and LI CNN consistently outperform the CNN, which can be attributed to these models' ability to localize. CoordConv and RandomConv perform very similarly, as they allow the model to localize, but their local parameters are not learned to be task-specific. LI CNN always outperforms them both, even though they have more trainable parameters. We also see that the bounced-off trajectories are harder to predict, and the models capable of localization have a bigger relative advantage here compared to the CNN, which confirms our hypothesis.

Case Study: Wind Integration National Dataset
We next evaluate the proposed methods on a real-world dataset from a wind farm in Indiana, United States. We use the Wind Integration National Dataset (WIND) [31] toolkit by The National Renewable Energy Laboratory (NREL), which contains weather characteristics data from 126 000 stations covering the years 2007-2013 with a 5 min temporal sampling frequency. We select a (10 × 10) sized rotated rectangular grid of wind turbines, where the bottom left point has GPS coordinates of (85.215W, 40.4093N) and the top right point has the coordinates of (84.9684W, 40.2212N), as depicted in Figure 7. We used only the wind speed data of the first three months of 2012, which included 25 920 dataframes. We chose this dataset, location, and time interval to make our results comparable to [12]. We confirm the match of the dataset both visually (see Figure 8) and by the characteristics reported in [12]: maximum wind speed at 27.228 m/s and minimum at 0.048 m/s. We use the same data preparation setup as in [12]: 60 % of the data was used for training, 20 % for validation, and the remaining 20 % for testing. Although not mentioned in [12], we also normalize the data to the [0; 1] interval and de-normalize the prediction results before calculating the error. We carry out the experiments for forecast horizons of 5, 10, 15, 20, 30, and 60 minutes.

We can observe in Figure 8 a spatial pattern shifting through the grid, indicating that for short-term predictions simpler temporal models might suffice, given that the spatial model is powerful enough. But for longer forecasting horizons the spatial model might not be as relevant. This is in part due to the fact that our grid covers a relatively small area.
We used classical neural network methods, special architectures for such spatio-temporal prediction proposed in the recent literature, and our variations of localized CNNs, as discussed in Sections 4 and 5. The results are presented in Table 2. Just as expected, we can see that for short-term forecasts the spatial modeling is more important, and temporal modeling becomes more important as the forecast horizon increases. For 1-step (5-minute) prediction, simple models without a dedicated temporal component are sufficient and in fact give the best results, while for longer prediction horizons the LSTM, which has a strong temporal component and no spatial one at all, often performed best. For shorter forecast horizons, where spatial modeling is relevant, virtually every model that we augment with our learnable localized features gave a better performance compared to the vanilla versions.
Interestingly, most of the models that performed best on the validation set did not perform as well on the testing set. This is probably due to the fact that the models were trained on only one season of the year, winter, validated on the beginning of March, and tested on the rest of the month, sticking to the scheme in [12]. The dominating wind directions in Indiana depend on the time of the year.³

Case Study: Meteorological Terminal Aviation Routine Dataset
We believe that Localized CNNs could additionally be helpful when dealing with "non-orderly" embedded grid data. In this experiment, we use the dataset of the Meteorological Terminal Aviation Routine (METAR) weather reports of 57 stations on the East Coast, including Massachusetts, Connecticut, New York, and New Hampshire (see [32] for details). This dataset consists of 6 300 data points with hourly temporal sampling that are not embedded in a regular grid.
It is assumed that the data come in every 6 hours, and we need to make a prediction for each hour until the new data come. We use 5 700 samples for training, 300 for validation, and 300 for testing to make our experiments comparable to [32]. We also use the same window size of l = 12. It is important to note that in [32] a different LSTM model was trained for every time-step prediction (6 models in total); we instead make the 6 time-step predictions at once as 6 different outputs of the same model to save processing time, although a sliding-window approach could also be used.
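The single-model multi-output idea can be sketched as below. This is a generic PyTorch illustration, not the exact architecture from [32] or our paper; the hidden size and all names are our assumptions:

```python
import torch
import torch.nn as nn

class MultiHorizonLSTM(nn.Module):
    """One model predicting all 6 hourly steps at once,
    instead of training 6 separate single-step models."""
    def __init__(self, n_stations=57, hidden=64, horizons=6):
        super().__init__()
        self.lstm = nn.LSTM(n_stations, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_stations * horizons)
        self.n_stations, self.horizons = n_stations, horizons

    def forward(self, x):            # x: (batch, window=12, stations)
        _, (h, _) = self.lstm(x)     # h: (1, batch, hidden)
        out = self.head(h[-1])       # (batch, stations * horizons)
        return out.view(-1, self.horizons, self.n_stations)

model = MultiHorizonLSTM()
x = torch.randn(4, 12, 57)           # a dummy batch of input windows
y = model(x)                         # predictions for all 6 horizons
```

A single forward pass yields all 6 horizon predictions, which is where the processing-time saving comes from.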
Only temporal data was available; no information about the location of each station was included. To see if we could benefit from a spatial arrangement, we placed the locations on a regular rectangular grid. We first embedded these randomly-ordered 57 stations into an 8 × 8 grid and padded the last row's final 7 entries with zeros (these were not included in calculating the errors). Since the positions of the stations were unknown to us, the embedding technique used in, e.g., [15] cannot be applied.
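A sketch of this embedding with zero padding and an error mask (the placement order is whatever permutation is used; all names are illustrative):

```python
import numpy as np

N_STATIONS, ROWS, COLS = 57, 8, 8

def embed_on_grid(frame, order):
    """frame: (57,) station readings; order: a permutation of range(57)
    giving the row-major placement. The trailing 7 cells stay zero and
    are masked out when computing errors."""
    grid = np.zeros(ROWS * COLS)
    grid[:N_STATIONS] = frame[order]
    mask = np.zeros(ROWS * COLS, dtype=bool)
    mask[:N_STATIONS] = True
    return grid.reshape(ROWS, COLS), mask.reshape(ROWS, COLS)

frame = np.arange(N_STATIONS, dtype=float)
grid, mask = embed_on_grid(frame, np.random.permutation(N_STATIONS))
```

Errors are then averaged only over `mask`, so the dummy cells never contribute.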
Placing the locations randomly on a grid might not be very beneficial, since CNNs make use of local correlations in the data. To get a better embedding, we propose to order the grid based on the mutual information between the location signals.

Mutual Information Based Grid Embedding
We interpret the temporal data of every station i as a random variable X_i and want to embed every station in the grid in such a way that the overall sum of similarity estimates in every neighborhood is the highest. To determine the strength of similarity between the stations, we rely on the mutual information (MI) [33] measure, a popular technique for quantifying the amount of information that two variables share. For instance, a high level of MI between two variables means that knowledge of one of the variables implies low uncertainty about the other. MI is defined as

$$ I(X, Y) = \sum_{x} \sum_{y} p_{(X,Y)}(x, y) \log \frac{p_{(X,Y)}(x, y)}{p_X(x)\, p_Y(y)}, $$

where p_{(X,Y)}(x, y) denotes the joint probability mass function of X and Y, and p_X(x), p_Y(y) denote the marginal probability mass functions of X and Y, respectively. Note that independence of the variables implies that I(X, Y) = 0. Since neither the marginal nor the joint probability mass functions are known, we approximate them by discretizing the time series of each station and calculating 1D histograms for p_X(x) and p_Y(y) and a 2D histogram for p_{(X,Y)}. We then construct the MI weight matrix MI(i, j), which stores the MI between stations i and j.

After that, we use an evolutionary algorithm to embed the 57 values into an 8 × 8 grid. Concretely, the population consisted of 300 individuals, each representing a permutation of the 57 stations and 7 dummy elements. The probability of mutation was set to 0.2 and the probability of crossover between two individuals to 0.3. The fitness F(a) of an individual a was evaluated by summing the MI between all direct neighbors (including diagonal ones) on the 8 × 8 grid of that permutation. We ran this algorithm with the objective of maximizing F(a) for 50 000 generations. This experiment was carried out using the DEAP library [34].
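The histogram-based MI estimate described above can be sketched as follows; the bin count and all names are our assumptions, not values from the paper:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based MI estimate between two station time series."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                  # joint probabilities p_(X,Y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p_X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p_Y, shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_matrix(series):
    """series: (n_stations, T). Returns the symmetric MI weight matrix."""
    n = len(series)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            M[i, j] = M[j, i] = mutual_information(series[i], series[j])
    return M

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)  # independent of a
```

A series shares far more information with itself than with independent noise, which is exactly the signal the embedding optimization exploits.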
The original direct ordering of the grid had F(a_direct) = 21.99, and the optimized ordering a_optimized found by the evolutionary algorithm had F(a_optimized) = 35.96, showing a substantial increase in overall similarity among neighboring elements. A graphical representation of every location's contribution to F(·) is presented in Figure 9.
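The fitness F(a), i.e. the sum of MI over all direct (including diagonal) neighbor pairs on the grid, can be sketched as below. The grid size and real-station count are parameters; counting each neighboring pair once is our assumption:

```python
import numpy as np

def fitness(perm, MI, rows=8, cols=8, n_real=57):
    """Sum of pairwise MI over all direct neighborhoods (incl. diagonals)
    of the grid obtained by placing `perm` row-major. Dummy elements
    (index >= n_real) contribute zero."""
    grid = np.array(perm).reshape(rows, cols)
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # offsets chosen so that each neighboring pair is counted once
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < rows and 0 <= c2 < cols:
                    i, j = grid[r, c], grid[r2, c2]
                    if i < n_real and j < n_real:
                        total += MI[i, j]
    return total
```

An evolutionary loop (e.g. DEAP's permutation operators) then mutates and recombines candidate permutations, keeping those with higher `fitness`.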
Even though such a grid embedding might help to handle non-grid-location data with CNNs in a reasonable way, it is still a rough approximation.

Results
To validate the proposed models, we construct every variation in such a way that each model has a roughly equal number of learnable parameters. We use all of the baselines described in Section 5 and both the optimized and the direct grid embeddings of the data. The results are presented in Table 3.
We can see in Table 3 that the Persistent LI + LW -I CNN model on the optimized embedding gave the best results by the MAE metric, while performing equally well as the Persistent LI + LW222 -I model on the original embedding by the RMSE metric; both showed the best overall results. In terms of RMSE, the optimized embedding decreased the testing error of every model that takes the spatial information into account and does not discard the input, while in terms of MAE every model benefited from the optimized embedding. Models that discard the input showed no obvious benefit from the optimized embedding.
It is interesting that with the direct, sub-optimal embedding, the Persistent LI + LW222 -I CNN model, which disregards the original input and is in principle capable of "swapping" adjacent inputs, performed best. This indicates that the model learns a better input representation than the provided one, and including the latter (as in Persistent LI + LW222 CNN) is even detrimental. This is not the case with the optimized embedding.
Every architecture that did not discard the spatial relations of the data in its temporal component (except for DL-STF, which had a different model for every prediction horizon) performed significantly better than the other baselines marked with "*", even with the non-optimized embedding.
Comparing the performance of CNN, CoordConv, and our localized CNN models, we can see that in this case, when the embedding is not perfect, learning the localized features is much more important, while merely knowing the location (as in CoordConv) does little good, if any.
We see that, generally, the more location-specific features are learned on top of the CNN model (the total number of learnable parameters was kept roughly the same), the better the performance gets on both embeddings. We could argue that the locally-learned features partially compensate for the defects of the embedding.

Table 3: Results on METAR data embedded on a regular grid, for both the direct and the optimized embeddings. Models that disregard spatial relations have "-" reported in the second case, as their results would be identical. Models that have no natural spatially-ordered output are marked with "*". ± denotes standard deviation.

Case Study: Climate Data for the European Energy Sector

Lastly, we test our localized CNN methods on a more widely spatially distributed wind speed dataset. For this we use wind speeds collected at 10 m above the surface level from the "Climate data for the European energy sector from 1979 to 2016 derived from ERA-Interim" dataset.⁴ This dataset covers most of Europe with a spatial resolution of (0.5° × 0.5°) and a temporal sampling frequency of 6 hours. We have selected a 10 × 10 region with the bounding rectangle of [52.75N; 57.25N], [6.75W; 2.25W], which is illustrated in Figure 10. Since the distances between the sites are much larger, so are the differences among the sites. We expect that, while wind patterns have much in common at every location, in this scenario learning location-specific features in addition to convolutional neural networks becomes even more pertinent.

More specifically, the dataset consists of 55 520 data frames spanning a 37-year time window. We use 70 % of the dataset for training, 10 % for validation, and the final 20 % for testing. The optimal window size was determined by cross-validation to be l = 8. We test the models with prediction horizons h ∈ {1, 2}, which correspond to 6- and 12-hour predictions, respectively. The first 12 data frames are illustrated in Figure 11. We can immediately observe that the wind patterns are quite different above the sea and the land. This is confirmed by plotting the global means and standard deviations of wind speeds across the training dataset in all locations in Figure 12. The winds tend to be the strongest and most varied at the most open area of the grid, over the Atlantic Ocean (the top left corner).
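A minimal sketch of the sliding-window sample construction used throughout (window l, horizon h; variable names are illustrative and the actual pipeline may differ):

```python
import numpy as np

def make_windows(frames, l=8, h=1):
    """frames: (T, H, W) sequence of wind speed grids.
    Returns inputs X of shape (N, l, H, W) and targets Y of shape
    (N, H, W), where each target lies h steps after its window's end."""
    X, Y = [], []
    for t in range(len(frames) - l - h + 1):
        X.append(frames[t:t + l])
        Y.append(frames[t + l + h - 1])
    return np.stack(X), np.stack(Y)

frames = np.random.rand(20, 10, 10)      # toy sequence of 10x10 grids
X, Y = make_windows(frames, l=8, h=2)    # 12-hour-ahead targets
```

Splitting `X`, `Y` chronologically (70/10/20) then gives the training, validation, and testing sets.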
The models used with this data are the same as described in the previous section. The results are reported in Table 4. We also include a bigger version of PSTN [12], "PSTN bigger", where we increased the number of kernels in the first layer to 30 and ensured that the total number of learnable parameters is greater than in LI + LW PSTN.
We can see in Table 4 that, in general, the models that have a recurrent LSTM component (LSTM, PSTN) perform best in each group. They are followed by the models that include both convolutional and dense layers (PDCNN, E2E, FC-CNN). Fully convolutional models (CNN variations, CoordConv) perform worst. This indicates that memory and/or longer spatial connections are required, which seems logical since this data is sparse both in space and in time.
Finally, we can once again see that all the models whose convolutions were endowed with our learnable localizations (LI, LW) performed better than their non-localized counterparts. Consequently, LI + LW PSTN showed the best overall results. Simply knowing the location (CoordConv vs. CNN) did not help in this case either.