Geographically Weighted Machine Learning and Downscaling for High-Resolution Spatiotemporal Estimations of Wind Speed

Abstract: High-resolution spatiotemporal wind speed mapping is useful for atmospheric environmental monitoring, air quality evaluation and wind power siting. Although modern reanalysis techniques can obtain reliable interpolated surfaces of meteorology at a high temporal resolution, their spatial resolutions are coarse. Local variability of wind speed is difficult to capture due to its volatility. Here, a two-stage approach was developed for robust spatiotemporal estimations of wind speed at a high resolution. The proposed approach consists of geographically weighted ensemble machine learning (Stage 1) and downscaling based on meteorological reanalysis data (Stage 2). The geographically weighted machine learning method is based on three base learners, which are an autoencoder-based deep residual network, XGBoost and random forest, and it incorporates spatial autocorrelation and heterogeneity to boost the ensemble predictions. With reanalysis data, downscaling was introduced in Stage 2 to reduce bias and abrupt (non-natural) spatial variation in the predictions inferred from Stage 1. The autoencoder-based residual network was used in Stage 2 to adjust the difference between the averages of the fine-resolution predicted values and the coarse-resolution reanalysis data to ensure consistency. Using mainland China as a case study, the geographically weighted regression (GWR) ensemble predictions were shown to perform better than individual learners' predictions (with an approximately 12–16% improvement in R² and a decrease of 0.14–0.19 m/s in root mean square error). Downscaling further improved the predictions by reducing inconsistency and obtaining better spatial variation (smoothing). The proposed approach can also be applied for the high-resolution spatiotemporal estimation of other meteorological parameters or surface variables involving remote sensing images (i.e., reliable coarsely resolved data), ground monitoring data and other relevant factors.


Introduction
High-resolution spatiotemporal mapping of surface variables can present specific variation at fine spatial and temporal scales, which provides good knowledge of the spatiotemporal distribution of these variables. For wind speed, this technique is particularly useful for atmospheric environmental monitoring [1][2][3], air quality evaluation [4][5][6][7], wind power siting [8], and so on. Recently, meteorological reanalysis [9][10][11] has been employed in numeric weather prediction models with historic weather observations from multiple sources, including satellites and surface stations, to generate images with more reliable estimates at high temporal resolutions, which are a relatively new source of meteorological data.

Theoretically, models that have no correlations or weak correlations can better improve their ensemble predictions, and strong learners can prevent ensemble predictions from falling below their individual performance levels. To ensure high-resolution spatiotemporal mapping of meteorological factors with reasonable spatial variation at a fine local scale, the proposed approach used relevant covariates, including coordinates, time indices, elevations and reanalysis data (e.g., wind speed and planetary boundary layer height (PBLH)) from multiple sources. Coarse spatial resolution meteorological reanalysis data were used as a priori knowledge to capture spatiotemporal variations in the target variables at a regional scale [10] and were used in both the individual learners and downscaling in the proposed approach. Resampling was also used to adjust and align the images at different origins and spatial resolutions.
In the proposed approach, geographically weighted regression (GWR) was conducted over the three learners' predictions to detect the spatial variability of their performances. GWR can take full advantage of the abilities of the three modern regression models by considering spatial heterogeneity for robust prediction. A case study of mainland China was conducted for wind speeds that are hard to map at a high resolution due to their implicit volatility. In the wind speed case study, the approach proposed in this paper was demonstrated to be applicable for the high-resolution spatiotemporal mapping of meteorological parameters and other surface variables that involve remote sensing (including reliable coarse-resolution data), ground monitoring data and other factors.

Study Region
The study region (Figure 1) is mainland China with geographical coverage of 73°27′ to 135°06′ east longitude and 18°11′ to 53°33′ north latitude and an area of 9,457,770 square kilometers. In mainland China, the climate varies from region to region due to its massive geographical coverage and diversity of landforms and elevations. In the northeast, it is hot and dry in the summer and freezing cold in the winter. In the southwest, there is plenty of rainfall in the semitropical summers and cool winters. The major influential factors include the geographic latitude, solar radiation, distribution of land and sea, ocean currents, topography and atmospheric circulation. Thus, it is challenging to map the high-resolution spatiotemporal surfaces of meteorological factors such as wind speed for such a large region with considerable heterogeneity between regions.

Measurement Data
The ground monitoring station measurement data originated from the daily ground observation datasets of individual years for mainland China from the China Meteorological Data Service Center (http://data.cma.cn). The dataset was collected from 824 national base meteorological monitoring stations. The 2015 daily wind speed data (unit: meter/second, abbreviated as m/s) measured at a height of 10–12 m above the ground were collected. Quality assurance was conducted to remove noisy samples. The final number of monitoring stations was 770. The target variable to be predicted was wind speed. The output mapping surfaces were at a spatial resolution of 1 km (geographic coordinate system: Beijing 1954 with the Krassowsky 1940 ellipsoid, European Petroleum Survey Group (EPSG): 4214; https://epsg.io/4214) and a temporal resolution of 1 day. Figure 1 also shows the spatial distribution of the wind speed monitoring stations.

Covariates
According to influential factors and data accessibility, the following covariates were selected.
(1) Coordinates Latitude and longitude were used to capture differences in locations and the related geographical environment for meteorological factors. The quadratic transformations of the coordinates and their products (to reflect the interaction of latitude and longitude) were derived to reflect the diversity and complexity of the landforms. While many machine learning algorithms, such as neural networks, XGBoost and random forest, cannot directly model spatial autocorrelation, the coordinates are used as proxies in these algorithms to partially account for spatial autocorrelation.
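The derivation of the coordinate covariates can be sketched as follows. This is an illustrative snippet, not the paper's code; the function name `coordinate_features` and the column ordering are assumptions.

```python
# Sketch: derive the quadratic transformations of latitude/longitude and
# their product, used as proxies for spatial autocorrelation in the learners.
import numpy as np

def coordinate_features(lat, lon):
    """Stack lat, lon, their squares and their product into a feature matrix."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # columns: lat, lon, lat^2, lon^2, lat*lon (interaction term)
    return np.column_stack([lat, lon, lat**2, lon**2, lat * lon])

# two example station locations (illustrative coordinates)
X = coordinate_features([39.9, 31.2], [116.4, 121.5])
```

Together with elevation, the two reanalysis variables and the day of the year, these five coordinate-derived columns form the nine covariates used later by the base learners.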
(2) Elevation The diversity of elevation is also partially responsible for the considerable differences in the meteorology of mainland China. This study used elevation data with a 500 m spatial resolution from the Shuttle Radar Topography Mission (SRTM; https://www2.jpl.nasa.gov/srtm/). SRTM was published in 2003 and covers more than 80% of the Earth's land surface.
(3) Reanalysis data As mentioned, meteorological reanalysis data were used to provide reliable coarse spatial resolution estimates. The reanalysis data were from the newest Goddard Earth Observing System-Forward Processing (GEOS-FP) dataset, which is based on the data assimilation system (DAS). GEOS-FP covers all of mainland China at a spatial resolution of 0.25° (latitude) × 0.3125° (longitude) and a temporal resolution of 3 h (ftp://rain.ucis.dal.ca/ctm/GEOS_0.25x0.3125_CH.d/GEOS_FP). The corresponding coarse-resolution wind speed data were used. Furthermore, the PBLH was also extracted, since it is an important factor for the surface wind gradient and is closely related to wind speed [32,33].
Since the reanalysis data were at a coarse resolution, projection transformation and resampling were conducted to align the data in terms of origin and resolution for use as the inputs to individual learners.
(4) Day of the year The day of the year was used to capture the temporal variation in the wind speed to be estimated.
(5) Regional separation Given the diverse geographical, atmospheric and land-use settings across mainland China, a map (Figure 1) of the six regions of mainland China (northeast, north central, northwest, southwest, south central, and southeast) from the Resources and Environmental Data Cloud Platform (http://www.resdc.cn/) was employed to identify the regional qualitative factors in the models and account for the spatial heterogeneity of mainland China at the regional level.

Methods
The systematic framework was based on two stages ( Figure 2): (1) training and (2) inference and downscaling. In Stage 1, for the initial high-resolution prediction, three machine learning models, which were an autoencoder-based deep residual network, XGBoost and random forest, were trained, and GWR was used to fuse the outcomes from the three learners for ensemble predictions (Figure 3). In Stage 2, for high-resolution mapping of the new dataset, a deep residual network was iteratively used in downscaling to match the wind speed at a coarse resolution and the average wind speed at a fine resolution, which were initially inferred from Stage 1 to reduce the bias and obtain better spatial variation (smoothing) of the predictions.


Stage 1: Geographically Weighted Learning
Stage 1 aims to train three representative base models to improve the ensemble estimates using GWR, which presents reliable fine-resolution spatiotemporal contrasts or variability.


Base Learners
In ensemble learning, models with no correlations or weak correlations can theoretically generate better ensemble predictions with less error [23]. Assume m models with errors ε_i (i = 1, . . . , m, denoting the model indices) drawn from a zero-mean multivariate normal distribution with variances E[ε_i²] = v and covariances E[ε_i ε_j] = c (i ≠ j). Then, the error made by their average prediction is (1/m) Σ_i ε_i. The expected squared error of the ensemble prediction is:

\[ \mathbb{E}\Big[\Big(\frac{1}{m}\sum_{i}\varepsilon_i\Big)^2\Big] = \frac{1}{m^2}\,\mathbb{E}\Big[\sum_{i}\Big(\varepsilon_i^2 + \sum_{j\neq i}\varepsilon_i\varepsilon_j\Big)\Big] = \frac{v}{m} + \frac{m-1}{m}\,c, \quad (1) \]

where c represents the covariance between the errors of different models. If c is equal to 0, which indicates no correlation between the errors of the models, then the expected squared error of the ensemble average is 1/m of the error variance, v. However, if c is equal to v, indicating a perfect correlation between the models' errors, the expected squared error is equal to the error variance, v, suggesting no change in the ensemble prediction errors. Therefore, the selection of models that have no correlations or weak correlations is crucial for improving ensemble predictions.
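The variance-reduction argument above can be checked numerically. The following sketch (not from the paper; m, v and c are arbitrary illustrative values) simulates correlated model errors and compares the empirical ensemble-mean error variance with v/m + ((m − 1)/m)·c:

```python
# Monte Carlo check: for m models with error variance v and pairwise
# covariance c, the variance of the averaged error is v/m + (m-1)/m * c.
import numpy as np

rng = np.random.default_rng(0)
m, v, c, n = 3, 1.0, 0.2, 200_000
# m x m error covariance matrix: v on the diagonal, c off-diagonal
cov = np.full((m, m), c) + np.eye(m) * (v - c)
errors = rng.multivariate_normal(np.zeros(m), cov, size=n)
ensemble_err = errors.mean(axis=1)          # error of the averaged prediction

empirical = ensemble_err.var()
theoretical = v / m + (m - 1) / m * c
```

With c = 0 the theoretical value collapses to v/m, and with c = v it stays at v, matching the two limiting cases discussed in the text.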
Furthermore, if a base model is robust, indicating small errors, the expected squared error of the ensemble predictions can be reduced to a value that may be lower than ν according to (1). Thus, three typical models (an autoencoder-based deep residual network [29], XGBoost [34] and random forest [35]) were selected. The deep residual network has a completely different structure from the other two models (XGBoost and random forests), which are based on a decision tree. However, the optimization approaches of XGBoost and random forest are different, with the former using gradient boost and the latter using bootstrap aggregating (bagging). Thus, the three models are quite different and robust in practical applications [29,34,36,37]. Other learners, such as AdaBoost or Gaussian process regression, can also be considered. To simplify application and illustrate a geographically weighted machine learning method, these three typical learners were used as the robust base learners in the geographically weighted modeling, as their ensemble predictions have sufficiently competitive performance for this paper's case study.
(1) Autoencoder-based deep residual network In this approach, the autoencoder provides the basic infrastructure for the network so that residual mapping can be implemented by an identity connection from the shallow layers in the encoding component to the deep layers in the decoding component [29]. Residual (shortcut) connections in neural networks have been demonstrated to address vanishing/exploding gradients [38] and accuracy degradation in CNNs [39,40]. Based on similar ideas, residual connections were added into the autoencoder-based deep network to improve learning accuracy and efficiency, as demonstrated in practical applications [29]. The Keras-based packages of this approach for Python (https://pypi.org/project/resautonet) and R (https://cran.r-project.org/web/packages/resautonet) have been published, and examples can be retrieved online (https://github.com/lspatial/resautonet). Figure 4 shows the network topology for the prediction of wind speed. This network has 9 input nodes to represent 9 covariates (latitude, longitude, the squares of latitude and longitude, the product of latitude and longitude, elevation, GEOS-FP wind speed, GEOS-FP PBLH, and the day of the year). For the internal autoencoder, the encoding component consists of 4 hidden layers (number of nodes per layer in sequence: [256, 128, 64, 32]); the middle coding layer has 16 nodes; correspondingly, the decoding component consists of 4 hidden layers whose numbers of nodes mirror those of the encoding component in reverse order. The residual connections are added from the shallow to the deep layers. The final output is the target variable (high-resolution wind speed, m/s) to be predicted.
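The wiring of the identity shortcuts can be illustrated with a bare numpy forward pass. This is a conceptual sketch with untrained random weights, not the published resautonet implementation: each decoding layer adds the activations of its mirrored encoding layer before continuing.

```python
# Forward-pass sketch of an autoencoder MLP with residual (identity)
# connections from each encoding layer to its mirrored decoding layer.
import numpy as np

rng = np.random.default_rng(42)
widths = [9, 256, 128, 64, 32, 16]           # input -> encoder -> coding layer
relu = lambda z: np.maximum(z, 0.0)

def layer(x, n_out):
    """One dense layer with fresh random weights (illustration only)."""
    W = rng.normal(0, 0.05, (x.shape[1], n_out))
    return relu(x @ W)

x = rng.normal(size=(4, 9))                  # 4 samples, 9 covariates
# encoding pass, keeping every layer's activations for the shortcuts
acts = [x]
for n in widths[1:]:
    acts.append(layer(acts[-1], n))
# decoding pass: mirror the widths and add the encoder activations
h = acts[-1]
for n, shortcut in zip(widths[-2:0:-1], acts[-2:0:-1]):
    h = layer(h, n) + shortcut               # residual add from the encoder
y = h @ rng.normal(0, 0.05, (widths[1], 1))  # final output node (wind speed)
```

The shortcut additions are what distinguish this topology from a plain feed-forward autoencoder of the same depth and width.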
The following loss function was used for optimization:

\[ L(W, b) = \sum_{i}\big(y_i - f_{W,b}(x_i)\big)^2 + \Omega(W, b), \]

where y is the target variable (wind speed), x denotes the input covariates, f_{W,b} is the mapping function with parameters W and b, and Ω(W, b) represents the elastic net regularizer [41] that linearly combines the L1 and L2 penalties. Throughout this paper, the autoencoder-based deep residual network is also referred to as a (deep) residual network.
(2) XGBoost XGBoost is a scalable end-to-end tree boosting learning system that is widely used to achieve state-of-the-art results in many domains [34]. XGBoost uses a sparsity-aware algorithm and a cache-aware block structure for efficient tree learning.

Assume n examples and d features, D = {(x_i, y_i)} (|D| = n, x_i ∈ R^d, y_i ∈ R). Additive functions are used to make the final predictions [34]:

\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \]

where K is the number of functions corresponding to each tree (K trees in total), f(x) = w_{q(x)} represents the space of the regression trees (CART), and q represents the structure of each tree that maps an instance to a leaf.

Based on gradient tree boosting, XGBoost was trained in an additive manner. Assume the regularized loss function of step k is

\[ L^{(k)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\big) + \Omega(f_k), \]

where l is a differentiable convex loss function and Ω is the regularizer. To derive the optimal addition, f_k(x_i), the second-order approximation of the Taylor series can be employed to optimize the objective more efficiently than the first-order approximation:

\[ L^{(k)} \approx \sum_{i=1}^{n}\Big[l\big(y_i, \hat{y}_i^{(k-1)}\big) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i)\Big] + \Omega(f_k), \]

where g_i = ∂_{ŷ^(k−1)} l(y_i, ŷ^(k−1)) and h_i = ∂²_{ŷ^(k−1)} l(y_i, ŷ^(k−1)) are the first- and second-order gradient derivatives of the loss function in terms of the last prediction (ŷ^(k−1)). Then, the optimal weights w can be obtained for a fixed tree structure, q. A greedy heuristic algorithm or approximate algorithms can be used to construct optimal trees according to the split score, that is, the loss reduction after the split, based on the optimal weights and loss. For details, please refer to [34]. There is an open-source library of XGBoost that supports R and Python interfaces (https://xgboost.readthedocs.io).
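The second-order update above implies a closed-form optimal weight for each leaf, w* = −G/(H + λ), where G and H sum g_i and h_i over the instances in the leaf. The following sketch (illustrative numbers, not the xgboost library internals) computes one such update for the squared-error loss:

```python
# One second-order boosting update for a single leaf under squared-error
# loss l = (y - y_hat)^2, where g_i = 2(y_hat - y) and h_i = 2.
import numpy as np

y = np.array([2.0, 3.0, 5.0, 6.0])   # observations falling in one leaf
y_hat = np.zeros_like(y)             # predictions from the previous round
lam = 1.0                            # L2 regularization on leaf weights

g = 2.0 * (y_hat - y)                # first-order gradients
h = np.full_like(y, 2.0)             # second-order gradients
w_star = -g.sum() / (h.sum() + lam)  # optimal leaf weight
y_hat = y_hat + w_star               # additive update for this round
```

With λ = 0 this reduces to the mean residual of the leaf; the regularizer shrinks the step toward zero, which is the role Ω plays in the objective above.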
(3) Random forest Random forest [42] is an improved version of bootstrap aggregating (bagging) with a decision tree as its base model. Bagging starts with sampling (with replacement) n (the sample size) training samples from the original samples; then, the trees are trained with the bootstrapped subsamples. The final predictions are made by averaging the predictions from the individual regression trees, as follows:

\[ \hat{y} = \frac{1}{K}\sum_{k=1}^{K} f_k(x), \]

where K is the number of trees, x represents the d-dimensional input and f_k(x) is the output of the kth tree for x. This bootstrapping procedure can achieve good performance because it can decrease the variance of the model without increasing the bias: while the predictions of a single tree are highly sensitive to noise in the training set, the average of many trees is not, as long as the trees are not correlated. In random forest, in addition to sampling the data examples, a random subset of the input features is also sampled to decrease the correlation between the trees. Python's scikit-learn machine learning library provides support for random forest (https://scikit-learn.org/stable/modules/ensemble.html).
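A brief usage sketch of the scikit-learn implementation mentioned above; the covariates and response here are random stand-ins rather than the paper's wind speed data:

```python
# Random forest regression with bootstrap sampling of examples and random
# feature subsets at each split (max_features), averaged over the trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))                    # 9 covariates, as in the text
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                           random_state=0)
rf.fit(X, y)
pred = rf.predict(X)                             # average over the 100 trees
```

`max_features="sqrt"` is the feature-subsampling step that decorrelates the trees, which is what lets the averaged prediction reduce variance per Equation (1).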

Geographically Weighted Learning
For the three robust base models (the autoencoder-based deep residual network, XGBoost and random forest) and their averages, spatial autocorrelation cannot be directly embedded within the models and their predictions. Thus, GWR was proposed to obtain the optimal configuration (weights) of the three robust models and their spatial variations considering spatial autocorrelation and heterogeneity for a robust fused prediction ( Figure 3).
GWR is a local regression method with a moving window or spatial kernel to constrain the domain of the samples for regression [43]. In GWR, spatial dependence is considered according to Tobler's first law of geography (i.e., "everything is related to everything else, but near things are more related than distant things") [44]. Assume that there is a sample of features, x (three features for this paper's case study: the predictions of the three base learners), within the local domain, D. GWR considers location-specific regression coefficients:

\[ y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{3} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i, \]

where (u_i, v_i) are the coordinates of the ith sample, β_k(u_i, v_i) is the regression coefficient for the kth base prediction, and ε_i is random noise (ε_i ~ N(0, 1)).
With the weighted least squares method, the following solution can be obtained:

\[ \hat{\beta}(u_i, v_i) = \big(X^{T} W(u_i, v_i)\, X\big)^{-1} X^{T} W(u_i, v_i)\, y, \]

where X is the input matrix of all examples, W(u_i, v_i) is the spatial weight matrix of the examples for the target location (u_i, v_i), and y is the output vector. GWR is provided in the R spgwr package (https://cran.r-project.org/web/packages/spgwr/index.html). For this study, the Gaussian kernel was used to quantify the spatial weight matrix, W:

\[ w_{ij} = \exp\Big(-\frac{1}{2}\big(d_{ij}/b\big)^{2}\Big), \]

where b is the bandwidth indicating the sampling domain and d_ij is the distance between locations i and j. GWR can output the predictions and their variance by fusing the predictions from the three learners (the autoencoder-based deep residual network, XGBoost and random forest). Furthermore, by using spatially varying coefficients for each learner in GWR, each learner's contribution to the integrated prediction and its spatial heterogeneity can also be presented.
For each of the three individual learners, all sample data from 2015 were used to train an integral model. Considering the implicit character of local regression for GWR and the considerable difference between different days for wind speed, daily GWR was conducted over daily predictions of individual learners to obtain the corresponding predictions.
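A single location-specific GWR fit can be sketched in a few lines of numpy. The kernel and weighted least squares follow the formulas above, while the synthetic data, bandwidth and function name `gwr_fit` are illustrative assumptions:

```python
# One GWR fit at a target location: Gaussian kernel weights
# w_ij = exp(-0.5 (d_ij / b)^2), then beta = (X'WX)^{-1} X'Wy.
import numpy as np

rng = np.random.default_rng(1)
n = 200
coords = rng.uniform(0, 100, size=(n, 2))            # sample locations (km)
# intercept column plus three "base learner prediction" features
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 0.5, 0.3, 0.2])
y = X @ beta_true + rng.normal(0, 0.1, n)

def gwr_fit(u, v, b=300.0):
    d = np.hypot(coords[:, 0] - u, coords[:, 1] - v)
    w = np.exp(-0.5 * (d / b) ** 2)                  # Gaussian kernel weights
    XtW = X.T * w                                    # X' W (W is diagonal)
    return np.linalg.solve(XtW @ X, XtW @ y)         # (X'WX)^{-1} X'Wy

beta_hat = gwr_fit(50.0, 50.0)                       # coefficients at (50, 50)
```

Repeating `gwr_fit` over a grid of target locations yields the spatially varying coefficient surfaces for the three learners described above.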

Stage 2: Downscaling with A Deep Residual Network
For new datasets, Stage 2 aims to use reanalysis data or other reliable data at coarse resolution to adjust the output initially inferred from Stage 1, making the averages of the output at fine resolution and the value at the corresponding coarse resolution consistent (Figure 2). Reliable coarse-resolution data were used as a priori knowledge so that the ensemble predictions from Stage 1 were reasonable and conformed with the presumed trend at the background (coarse-resolution) scale. Downscaling began with the ensemble predictions from Stage 1 that captured spatiotemporal contrast or variability well at a fine local scale. Then, a deep residual network was iteratively used in downscaling to match the output at fine resolution and the values at coarse resolution until the threshold for their difference was attained. Except for the coarse-resolution target variable, all other covariates of the new dataset were used to make predictions at fine resolution in downscaling. Assume the target variable value at a coarse-resolution grid cell is G_i, i = 1, . . . , C (C is the number of coarsely resolved grid cells), and at a fine-resolution grid cell is g_i, i = 1, . . . , F (F is the number of finely resolved grid cells). Due to its good ability to capture nonlinear associations between the covariates and the target variable [29] and spatial continuity, an autoencoder-based deep residual network was used to model the relationship between the aforementioned covariates and g_i at fine resolution with the following regularizer:

\[ \Omega_{d} = \sum_{l=1}^{C}\Big(\frac{1}{|F_l|}\sum_{i \in F_l} g_i - G_l\Big)^{2}, \quad (10) \]

where F_l represents the set of finely resolved grid cells that overlay the lth coarsely resolved grid cell. The regularizer in Equation (10) indicates that the average of the finely resolved grid cells within each coarsely resolved grid cell should equal the grid value of the latter. Given the reliability of the coarsely resolved dataset (reanalysis data), this regularizer is reasonable as the constraint.
For implementation, the predicted values were adjusted for each finely resolved cell using the following formula to ensure equality in (10), and a deep residual network was used to update the regression in each iteration:

\[ \hat{g}_i^{(t)} = g_i^{(t)} \cdot G_l \Big/ \Big(\frac{1}{|F_l|}\sum_{j \in F_l} g_j^{(t)}\Big), \]

where G_l is assumed to be overlaid with g_i, t represents the iteration index, and ĝ_i^(t) denotes the adjusted value for iteration t.
Iteration proceeded until the average over the absolute differences in the finely resolved grid cells between two consecutive iterations, that is, \( \frac{1}{F}\sum_{i=1}^{F} \lvert \hat{g}_i^{(t)} - \hat{g}_i^{(t-1)} \rvert \), was equal to or below a stopping criterion value or the maximum number of iterations was attained. The complete algorithm is given in Figure 5. Additionally, an R package of this downscaling algorithm was published (https://github.com/lspatial/autoresnetR).
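The cell-wise adjustment step can be illustrated on a toy grid. This sketch (assumed data; the deep-network refit between iterations is omitted) scales each fine-resolution cell so that the mean over the fine cells inside a coarse cell matches that coarse cell's value:

```python
# Toy consistency adjustment: rescale fine cells so their block means
# match the coarse-resolution grid values.
import numpy as np

coarse = np.array([[3.0, 5.0],
                   [2.0, 4.0]])                 # 2 x 2 coarse grid (m/s)
fine = np.abs(np.random.default_rng(0).normal(3.0, 1.0, size=(4, 4)))
k = 2                                           # 2 x 2 fine cells per coarse cell

# block means of the fine grid, one value per coarse cell
means = fine.reshape(2, k, 2, k).mean(axis=(1, 3))
ratio = coarse / means
# broadcast each coarse-cell ratio over its k x k block of fine cells
fine = fine * np.kron(ratio, np.ones((k, k)))

means_after = fine.reshape(2, k, 2, k).mean(axis=(1, 3))
```

A single rescaling already enforces the equality in (10); in the full algorithm, the residual network is then refit to the adjusted values and the adjustment repeated until the stopping criterion is met.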


Optimization of Hyperparameters and Validation
To obtain robust wind speed predictions, empirical knowledge and a machine learning grid search were used to find the optimal hyperparameter values. For the autoencoder-based residual network, initial networks were constructed according to previous empirical knowledge and then refined using sensitivity analysis. A cross-validation grid search for mini-batch size, network depth, output type and activation functions was conducted to find the optimal solutions for these hyperparameters. For XGBoost, the grid consisted of maximum boosting iterations (100, 200, 300, 400), maximum tree depths (6 to 12), learning rates (0.05, 0.5, 1), and so on. For GWR, the grid consisted of different bandwidths (100 km, 200 km, 300 km, 400 km, 500 km).
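A cross-validated grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. Here `GradientBoostingRegressor` is a stand-in for XGBoost (the parameter names differ slightly from the xgboost library), the data are synthetic, and the grid is a reduced version of the one described above:

```python
# Cross-validated grid search over boosting hyperparameters (sketch).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

grid = {
    "n_estimators": [100, 200],      # cf. maximum boosting iterations
    "max_depth": [3, 6],             # cf. maximum tree depths
    "learning_rate": [0.05, 0.5],    # cf. learning rates
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=3)
search.fit(X, y)
best = search.best_params_           # best combination under 3-fold CV
```

The GWR bandwidth search works analogously: fit the daily GWR for each candidate bandwidth and keep the one with the best cross-validated score.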
For an independent test of Stage 1, 30% of the complete samples were sampled (stratified by region (Figure 1) and month) to validate each of the three individual models. This stratified sampling ensured an even distribution of the samples across space and time to mitigate overestimation of the spatial results. Given the need for more samples for local regression, leave-one-site-out cross-validation (LOOCV) was conducted to validate GWR. The training (for all the models), independent test, and LOOCV (for GWR) coefficient of determination (R-squared, i.e., R²), adjusted R², root mean square error (RMSE), and mean absolute error (MAE) were reported and compared in the results. For Stage 2, similar metrics (independent test R² and RMSE) were reported for the autoencoder-based deep residual network used to map the covariates to the finely resolved grids, and for the coarse-resolution grid cells and the averages of the overlaid fine-resolution grid cells in downscaling.
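For reference, the four validation metrics named above can be computed directly with numpy. The observed/predicted vectors and the predictor count `p` below are placeholders, not values from the paper:

```python
# R^2, adjusted R^2, RMSE and MAE on placeholder observed/predicted values.
import numpy as np

obs = np.array([2.0, 3.0, 1.5, 4.0, 2.5])
pred = np.array([2.2, 2.8, 1.6, 3.7, 2.6])
p = 3                                    # number of predictors (illustrative)

ss_res = np.sum((obs - pred) ** 2)
ss_tot = np.sum((obs - obs.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
n = obs.size
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
rmse = np.sqrt(np.mean((obs - pred) ** 2))
mae = np.mean(np.abs(obs - pred))
```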
For comparison with a nonlinear generalized additive model (GAM) and a feed-forward neural network, the same samples used to train the base models in Stage 1 were used for training. Training and test R², RMSE and MAE were also reported for both. To ensure fairness in the comparison, except for the residual connections, the feed-forward neural network had the same hidden layers and the same number (100,959) of parameters as the deep residual network. To diagnose potential spatial correlation in the residuals of the ensemble predictions by GWR, Moran's I was calculated for each day's residuals [45]; a variogram was also fitted for each day's residuals and consequently used in universal kriging to estimate the corresponding day's residuals. LOOCV R² and RMSE were evaluated between the original and estimated residuals of each day.

Data Summary and Preprocessing
In total, 255,209 measurement samples with their covariates were collected from 770 wind speed monitoring stations across mainland China (see Figure 1 for the study region and the spatial distribution of these monitoring stations). For 2015, the average daily wind speed was 2.1 m/s. Table 1 shows the statistics for the measurement samples and their partial covariates. A priori knowledge and the outer fences technique [46] were also used to filter out several invalid measurement samples. The histograms (Supplementary Materials Figure S1) showed less skewness for the log-transformed wind speed measurements than for the original measurements (skewness: −0.56 vs. 2.45). Thus, for GWR, the wind speed measurements were log-transformed in the training samples.

Training of the Models in Stage 1
In Stage 1, each of the three individual models was trained, and then their predictions were geographically weighted to obtain the ensemble predictions. Table 2 presents the performances of the three base models, GAM and the feed-forward neural network in training and independent tests. In total, 76,563 independent test samples were obtained. The autoencoder-based deep network had a training R² of 0.68 (training RMSE: 0.76 m/s), which was lower than the training R² (0.76) of XGBoost (RMSE: 0.60 m/s) and slightly lower than the training R² (0.69) of random forest (RMSE: 0.76 m/s). However, the three models had similar independent test R² and RMSE values with very slight differences, and Table 2 shows better performance for the autoencoder-based deep residual network, XGBoost and random forest than for GAM and the feed-forward neural network. Figure 6 shows faster convergence with lower loss and a higher validation R² for the deep residual network than for the feed-forward neural network. The results also show a smaller difference in R² and RMSE between training and testing for the autoencoder-based deep residual network, indicating less overfitting in generalization. Figure 7 shows the plots of the predicted values/residuals against the observed values for the three individual wind speed models. In terms of their generalizations in the independent tests, the three individual models had similar performance with very slight differences.
See Supplementary Materials Table S1 for the optimal parameters of the variogram models and the LOOCV R 2 and RMSE between the original residuals and the residuals estimated by universal kriging, as well as Supplementary Materials Figures S3 and S4 for the variogram plots and the scatter plots between the original and estimated residuals, respectively. The very small (negative) LOOCV R 2 values and the almost random patterns of the scatter plots showed little contribution of variogram-based universal kriging to the estimation of the residuals.
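The LOOCV check reported in Table S1 can be sketched as follows. For brevity, the kriging estimator is replaced here by inverse-distance weighting (an illustrative stand-in, not the paper's universal kriging); the sketch shows how a low LOOCV R 2 signals that spatial interpolation of the residuals adds little:

```python
import numpy as np

def idw_estimate(coords, values, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` (a simple stand-in
    for a kriging estimator)."""
    d = np.linalg.norm(coords - target, axis=1)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return np.sum(w * values) / np.sum(w)

def loocv_r2(coords, residuals):
    """Leave-one-out CV R^2: near-zero or negative values mean the
    spatial interpolator explains little beyond the residual mean."""
    est = np.array([
        idw_estimate(np.delete(coords, i, axis=0),
                     np.delete(residuals, i), coords[i])
        for i in range(len(residuals))
    ])
    ss_res = np.sum((residuals - est) ** 2)
    ss_tot = np.sum((residuals - residuals.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A spatially smooth residual field yields a high LOOCV R 2, while residuals without spatial structure yield values near zero or negative, as observed for the ensemble residuals here.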

Predictions and Downscaling in Stage 2
With the models trained in Stage 1, the daily wind speed was predicted (spatial resolution: 1 km) for 2015 in mainland China. Figure 9 shows the prediction grids of a typical day (1 January 2015) in winter for the three individual learners and GWR (a for the autoencoder-based deep network, b for XGBoost, c for random forest and d for ensemble predictions by GWR). For the purpose of comparison, this paper also shows the prediction grids of a typical day (1 July 2015) in summer in Supplementary Materials Figure S5. For ensemble predictions, spatially varying coefficients for the intercept and the coefficients for the predictors were obtained as the GWR output for 1 January 2015 (Supplementary Materials Figure S6). The results show positive effects for XGBoost in northwestern and southern China and for random forest in central and eastern China and negative effects for the deep residual network in midwestern and northeastern China, for XGBoost in central and northeastern China, and for random forest in northwestern China. The variance grids of the ensemble predictions for 1 January 2015 by GWR were also obtained (Supplementary Materials Figure S7) as an indicator of the uncertainty. The results showed higher variance at several locations in western and northeastern China than that in other regions.
As shown in b and c of Figure 9 and Supplementary Materials Figure S5, even with similar test performance, the prediction grids produced by XGBoost and random forest are fine overall but show some abrupt (non-natural) spatial variation at the local scale, possibly due to the discretization of the features in the decision/regression trees used as base models in XGBoost and random forest. Comparatively, the prediction grids produced by the autoencoder-based deep residual network appear naturally spatially smooth. Furthermore, the ensemble predictions generated from the three learners by GWR reduced the abrupt spatial variation, as shown in d of the two figures.
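The GWR fusion of the three learners' predictions can be illustrated by a locally weighted least-squares fit at one target location with a Gaussian distance kernel (a minimal sketch under assumed function names and bandwidth, not the paper's implementation):

```python
import numpy as np

def gwr_coefficients(coords, preds, y, target, bandwidth):
    """Local weighted least squares at one location. `preds` holds the
    three base learners' predictions as columns; returns the local
    intercept plus one coefficient per learner."""
    d = np.linalg.norm(coords - target, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)        # Gaussian distance kernel
    X = np.column_stack([np.ones(len(y)), preds])  # intercept + learner columns
    XtW = X.T * w                                  # equivalent to X.T @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

def gwr_predict(coords, preds, y, target, target_preds, bandwidth):
    """Ensemble prediction at `target` using that location's coefficients."""
    beta = gwr_coefficients(coords, preds, y, target, bandwidth)
    return beta[0] + target_preds @ beta[1:]
```

Evaluating the coefficients over a grid of target locations produces spatially varying intercept and coefficient surfaces analogous to those in Supplementary Materials Figure S6.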

The scatter plots (Supplementary Materials Figure S8) of the averages of the finely resolved wind speed within each coarsely resolved grid cell, and of their residuals, against the coarsely resolved wind speed (reanalysis data) for 1 January 2015 and 1 July 2015 show a close match with few outliers. The final finely resolved and original coarsely resolved images for the two dates are shown in Figure 12.
Time series of daily finely resolved grids of wind speed for 2015 were made across mainland China. Figure 13 shows four typical days of 2015, as aforementioned. Overall, high wind speeds were more evenly distributed across mainland China in spring and summer than in autumn and winter. On average, spring, summer and autumn had higher wind speeds than winter. In winter, the locations in and close to the Tibet region of western China had higher wind speeds than those in other regions. The low wind speed aggravated air pollution in the Beijing-Tianjin-Hebei region [47].

Discussion
This paper proposes a two-stage approach for the robust estimation of wind speed, which is known to be challenging due to complex influential factors and the volatility of wind speed. In the proposed approach, Stage 1 aims to produce reliable estimates of the target variable that capture contrast or spatiotemporal variability at a fine resolution using geographically weighted machine learning, and Stage 2 aims to adjust the finely resolved prediction grids initially inferred in Stage 1 to make them consistent with the coarsely resolved reanalysis data. Stage 2 reduced overfitting and improved spatial variation (smoothing). Together, the two stages yield reliable high-resolution wind speed estimations.
High-resolution mapping of meteorological parameters is challenging given the limited number of available covariates (e.g., just nine covariates in the case of this paper), as multiple factors affect spatiotemporal variability and they present complex interactions. Although meteorological reanalysis employs comprehensive data from a variety of sources, including ground-based stations, ships, airplanes, satellites and forecasts from numerical weather prediction models, to estimate the meteorological state of the system as accurately as possible [24], its spatial resolution is very coarse, thus constraining its applicability at local levels. Considering this challenge, in Stage 1, a geographically weighted machine learning method was adopted based on three representative state-of-the-art learners: an autoencoder-based deep residual network, XGBoost and random forest. In comparison with GAM and the feed-forward neural network, the test showed that the three base learners achieved much better generalization and more efficient learning. Compared with the support vector machine (SVM) and the fuzzy neural system, which presented very slow convergence in the test using the wind speed samples, the three learners were convenient to use, with high generalization and fewer feature engineering operations.
Although these learners achieved good performance, with spatial autocorrelation partially captured by the coordinates and their derivatives, spatial autocorrelation was not embedded directly within the models, and thus their residuals might present spatial autocorrelation. Furthermore, for large regions such as mainland China, there is considerable diversity, and many differences exist between regions. Therefore, geographically weighted machine learning was leveraged to integrate the predictions made by the three types of learners. As a local regression technique, GWR was used to account for spatial autocorrelation and heterogeneity [48][49][50], compensating for the shortcomings of modern machine learners to improve the ensemble predictions. In the test of the 2015 wind speed for mainland China, the individual learners achieved an R 2 of 0.63-0.67 (RMSE: 0.72-0.77 m/s) in independent tests, and GWR further improved the R 2 to 0.79 (RMSE: 0.58 m/s) in LOOCV. The results showed that GWR effectively improved the predictions of the individual learners by spatially varying fusion. The very small Moran's I of the residuals of the ensemble predictions, together with the little contribution of variogram-based universal kriging to the estimation of the residuals, illustrated that most of the spatial autocorrelation was accounted for by the proposed approach.
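The Moran's I diagnostic applied to the ensemble residuals can be computed as follows (a standard global Moran's I with a user-supplied spatial weights matrix; a sketch, not the paper's code):

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I. `weights` is an n-by-n spatial weights matrix
    (e.g., adjacency); values near 0 suggest little remaining spatial
    autocorrelation in the residuals."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                 # deviations from the mean
    W = np.asarray(weights, dtype=float)
    n, s0 = len(z), W.sum()          # s0: sum of all spatial weights
    return (n / s0) * (z @ W @ z) / (z @ z)
```

Positive values indicate clustering of like residuals, negative values indicate alternation, and values near zero indicate spatial randomness, which is the desired outcome for the residuals of the ensemble predictions.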
Due to the discretization of the quantitative covariates used in the decision (regression) trees, massive grid predictions by the decision tree-based methods [51] (XGBoost and random forest) might present abrupt (non-natural) spatial variation at the local spatial scale, as shown in Figure 9. Comparatively, the autoencoder-based deep residual network did not discretize the covariates, whose fully continuous quantitative information was kept in the model, and could thus generate prediction grids with spatially smoother surfaces than XGBoost and random forest. The ensemble predictions from the three base models were then fused using GWR, which mitigated the abrupt spatial variation. Therefore, using three individually robust learners and GWR, the ensemble estimates can effectively capture the contrast or variability of the target variable at a fine spatial scale with improved R 2 and lower RMSE.
To further reduce potential bias and improve the spatial variation (smoothing) of the predicted values (affected by the decision tree-based algorithms), coarsely resolved reanalysis data, if reliable, can be used as a regularizer to adjust the ensemble estimates and make them consistent with the coarse-resolution grids at the background scale. To achieve reliable generalization with complete removal of abrupt (non-natural) spatial variation, an autoencoder-based deep residual network was used to regress the adjusted or regularized ensemble outputs over the selected covariates. Therefore, the specific contrast or variability at fine resolution captured in Stage 1 could be preserved in downscaling with the reliable coarsely resolved reanalysis data to obtain reasonable grids. The example of wind speed illustrated effective simulation in downscaling to reduce bias and improve spatial variation (smoothing). The downscaled time series of the finely resolved grids presented reasonable seasonal patterns of wind speed across mainland China.
For downscaling, compared with kriging-based ATPP interpolation [25], the autoencoder-based deep residual network provides a flexible network architecture with a large parameter space. Although spatial autocorrelation was not directly embedded in the network, as it is in kriging, GWR was used to capture spatial autocorrelation and heterogeneity at a local fine scale in Stage 1. Furthermore, coordinates and their interaction derivatives were used as covariates within the model to represent spatial variation in Stage 2 downscaling. In contrast to the kriging method, downscaling with the deep residual network did not require variogram simulation, which might introduce uncertainty for the volatile wind speed. Sensitivity analysis showed that universal kriging accounted for only 14% of the variance explained for wind speed (but 72% for relative humidity), compared with 66% by the autoencoder-based deep residual network in the independent test. This illustrated the inapplicability of kriging interpolation for capturing the variability of mutable wind speed and showed that the proposed approach can better predict wind speed.
For application to other meteorological or surface variables and other regions, the proposed approach is divided into two stages (Figure 2). Stage 1 aims to train the three base learners (deep residual network, XGBoost and random forest) and GWR using X and y from the training samples. Stage 2 involves inference (prediction) and downscaling using the new and reanalysis datasets to obtain reliable high-resolution grid predictions. The new dataset first supplied X to the trained base learners and GWR in Stage 1 to get the initial finely resolved predictions, ŷ. Then, the coarse-resolution reanalysis data were used to adjust ŷ (or, in later iterations, the predictions inferred by the downscaling model). Next, X and the adjusted predictions were used to train or retrain the downscaling model. This process was repeated until the preselected stopping criterion value (SCV) was attained, as shown in Figure 5.
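One adjustment step of Stage 2, shifting each coarse cell's overlaid fine predictions so that their average matches the coarse reanalysis value, can be sketched as follows (the retraining of the downscaling network over X and the SCV check are omitted; `factor`, the assumed ratio of coarse to fine cell size, and the function name are illustrative):

```python
import numpy as np

def adjust_to_coarse(fine, factor, coarse):
    """Shift each coarse cell's block of fine-resolution predictions so
    that the block mean matches the coarse reanalysis value; the
    within-block (fine-scale) contrast is preserved."""
    adjusted = fine.copy()
    for i in range(coarse.shape[0]):
        for j in range(coarse.shape[1]):
            block = np.s_[i * factor:(i + 1) * factor,
                          j * factor:(j + 1) * factor]
            adjusted[block] += coarse[i, j] - fine[block].mean()
    return adjusted
```

In the full procedure, the adjusted grid is regressed over the covariates with the autoencoder-based deep residual network, the new predictions are adjusted again, and the loop stops once the SCV is reached.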
This study has the following limitations. First, although downscaling was introduced in Stage 2 with coarsely resolved reanalysis data to adjust the ensemble predictions, the reanalysis data were assumed to be reliable so that the adjusted surfaces in downscaling could also be reliable. Otherwise, downscaling may distort the adjusted outcomes and even introduce bias into the results. Second, the proposed approach did not embed the mechanistic knowledge of wind speed generation within the models, but instead used only a limited number of available covariates to capture spatiotemporal variability in Stage 1. However, the coarse-resolution reanalysis data were also used as covariates within the model; these represent hybrid results based on climate models, numerical predictions, satellite data and monitoring data. In particular, downscaling with the coarse-resolution reanalysis data was introduced to regularize the ensemble results. Third, the wind speed data used to train the models were primarily measured at a height of 10-12 m above the ground, and thus the trained models predicted wind speed at a similar height. This may limit the application of the results for the recovery of wind potential for energy purposes, which involves estimating wind speed at heights of 50-100 m above the ground. However, this paper focused on machine learning methods for high-resolution spatiotemporal mapping of wind speed rather than practical applications of wind potential recovery. For the latter, new wind speed measurement data may be gathered to retrain the models for an appropriate evaluation of wind potential.

Conclusions
In this study, a two-stage approach was developed to make robust high-resolution wind speed predictions across a large region, such as mainland China. In the proposed approach, Stage 1 obtained geographically weighted ensemble predictions based on three different types of robust learners, which were an autoencoder-based deep residual network, XGBoost and random forest, to capture spatiotemporal contrast or variability at fine resolutions with improved performance. Stage 2 introduced downscaling using the coarsely resolved reanalysis data as a regularizer to make the coarsely resolved grids and the averages of the overlaid finely resolved grids consistent. In the case study of wind speed prediction for mainland China, the proposed approach achieved a CV R 2 of 0.79 (RMSE: 0.58 m/s) in GWR ensemble predictions and a mean R 2 of 0.91 for the 2015 daily match of coarsely and finely resolved grids in downscaling. This approach provides reliable and robust predictions for wind speed and can be applied for high-resolution spatiotemporal estimation of other meteorological parameters and surface variables that involve multiple influential factors at different scales.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/11/11/1378/s1: Table S1: Variogram fitting and cross validation for estimation of the residuals of predicted wind speed for four typical seasonal days; Figure S1: Histograms for wind speed (a) and its log transformations (b); Figure S2: Boxplots of the observed values and ensemble predictions by GWR for wind speed; Figure S3: Plots of variogram fitting; Figure S4: Plots of the original residuals from ensemble predicted wind speed vs. estimated residuals using fitted variogram-based universal kriging (UK) for four typical seasonal days; Figure S5: Predictions by individual models (a for autoencoder-based deep network, b for XGBoost and c for random forest) and their ensemble predictions by GWR for wind speed of 1 July 2015; Figure S6: Spatially varying intercept (a) and coefficients (b, c and d) for the predictions of the three individual learners for wind speed; Figure S7: Spatially varying variance for the ensemble predictions by GWR for wind speed; Figure S8: Plots of averages of matched finely resolved wind speed (a and c) and their residuals (b and d) vs. the coarsely resolved wind speed (reanalysis data) for 1 January 2015 (a and b) and 1 July 2015 (c and d).
Author Contributions: L.L. was responsible for conceptualization, methodology, software, validation, formal analysis and writing.