Interpolation of Instantaneous Air Temperature Using Geographical and MODIS Derived Variables with Machine Learning Techniques

Abstract: Several methods have been tried to estimate air temperature using satellite imagery. In this paper, the results of two machine learning algorithms, Support Vector Machines and Random Forest, are compared with Multiple Linear Regression and Ordinary kriging. Several geographic, remote sensing and time variables are used as predictors. The validation is carried out using two different approaches, a leave-one-out cross validation in the spatial domain and a spatio-temporal k-block cross-validation, and four different statistics on a daily basis, allowing the use of ANOVA to compare the results. The main conclusion is that Random Forest produces the best results (R2 = 0.888 ± 0.026, Root mean square error = 3.01 ± 0.325 using k-block cross-validation). Regression methods (Support Vector Machine, Random Forest and Multiple Linear Regression) are calibrated with MODIS data and several predictors easily calculated from a Digital Elevation Model. The most important variables in the Random Forest model were satellite temperature, potential irradiation and cdayt, a cosine transformation of the julian day.


Introduction
Air temperature (T a ) is a very relevant climatic variable that controls several environmental processes, particularly evapotranspiration [1]; it is also a key feature in global change studies, and reflects the surface energy balance [2]. So, accurate estimations of T a and its spatio-temporal variability are important in several Earth and environmental sciences and in land surface process modelling [1,3]. Air temperature is usually measured at weather stations at a standard height of 1.5-2 m with diverse temporal resolutions. However, because weather stations provide limited information about spatial patterns at regional or global scales [3], several methods have been used to estimate the spatial distribution of T a [4,5]: • Vertical lapse methods [4] use height as the main variable to explain temperature spatial distributions. The vertical lapse rate is evaluated from the sampling data and then applied to the whole study area. A more sophisticated approach uses daily atmospheric profiles provided by the Moderate Resolution Imaging Spectroradiometer (MODIS) product MOD07_L2 to locally estimate the adiabatic lapse rate [6]. The main drawback of this approach is the coarse spatial resolution (5 km) of those products.

•
Geostatistical techniques (kriging) [17,18] estimate Ta as a weighting average of the sampling points with the weighting coefficients obtained after a statistical analysis of the spatial variability of the variable (semivariogram functions).The main drawbacks of such interpolation methods are that they do not use covariates and that they may be sensitive to a clustered distribution of weather stations [19].

•
The Temperature-Vegetation Index (TVX), proposed by [20,21], is based on the correlation between NDVI and LST, assuming that T a is approximately equal to LST in fully vegetated areas. Significant uncertainties appear in sparse vegetation areas [22].

•
Methods based on the surface energy balance such as ADEBAT [4,13]. The objective is to estimate T a using a more physical approach. It has two main drawbacks: several variables that can only be measured in weather stations are needed and, as the Bowen ratio (the ratio between the sensible heat flux and the latent heat flux) is one of them, it is necessary to know the latent heat flux (LE) to use ADEBAT. However, T a is usually, as in this case, estimated in order to estimate LE from it, so the use of a surface energy balance is not suitable.
Remote sensing methods are constrained by the time of the day when images are taken. During the night, LST is a very accurate proxy for T a as solar radiation has no effect, simplifying the ground surface energy balance. During the day, it is necessary to take into account several variables, such as cloud cover, wind speed, soil moisture and surface roughness, which remote sensors cannot retrieve [6]. However, as remote sensing provides LST spatially distributed estimations, they may be used as highly correlated predictors to estimate air temperature. Tao et al. [23] showed that correlation between LST at night and minimum T a is higher (R 2 = 0.93) than between their daytime equivalents (R 2 = 0.79).
Several approaches have been recently used to estimate air temperature from satellite data; one of the objectives being to use just remote sensing data adding, at most, easy to obtain environmental variables. Jin and Dickinson [24] estimated the diurnal cycle of LST through temporal interpolation of NOAA-AVHRR. Sun et al. [25] used the same method for modeling Geostationary Operational Environmental Satellite (GOES) measurements. Gholamnia et al. [26] attempted to introduce methods that estimate air temperature based on LST data by direct approach and without auxiliary environmental information. Golkar et al. [27] estimate instanteneous air temperature using standard remote sensing methods with minimum data. Mira et al. [28] tried to introduce LST and elevation as predictors to a multiple linear regression model (MLRM). Recently, machine learning methods have been compared with traditional MLRM; for instance, Xu et al. [14] stated that linear regression cannot predict the air temperature based on LST in all conditions and that RF models obtain better accuracy than the MLRM.
As a result of these attempts, several global weather products have been produced that estimate variables using satellite imagery. For instance, Hooker et al., [29] estimated a monthly temperature dataset at a spatial resolution of 0.05 • for the interval 2003-2016.
We think that more research is needed to investigate how the use of satellite information and different geographic predictors to calibrate machine learning algorithms might outperform traditional interpolation methods. Thus, the objective of this work is to use two well known machine learning algorithms, Support Vector Machines (SVM) and RF, to estimate T a at the AQUA passage time (between 12:00 and 14:20) in a medium size basin in SE Spain. We used as predictors a set of environmental variables included in some of the previous works that we think would influence the temperature spatial patterns. The results are compared to those obtained with more traditional approaches: MLRM and Ordinary Kriging (OK). Finally, regression-kriging will be tested using OK to interpolate the residuals of the three regression models, provided that such residuals show spatial autocorrelation. Two cross-validation approaches were carried out to deal with accuracy over-estimation induced by spatio-temporal autocorrelation: a simple leave-one-out cross-validation (LOO-CV) in the spatial domain and a spatio-temporal k-block cross-validation (k-block-CV).
The use of both approaches allows to evaluate accuracy in two different situations: spatio-temporal k-block-CV provides an accuracy estimation for models calibrated with a large observations set but avoiding the overfitting due to spatio-temporal dependence. On the other hand, LOO-CV in the spatial domain produces an accuracy estimation when only spatial but not temporal dimensions are considered (models are generated using observations for just one day, an observation is extracted in each iteration to estimate the model error on it). Figure 1 shows a flow chart summarising the methodology, which will be detailed below. All the research was carried out using free and open source software. For step 1, that includes downloading and preprocessing of information, GRASS [30], a multi-purpose Geographical Information System, was used.

Methodology
Step 2, that includes model calibration, validation and the generation of raster predictions, was carried out using the R-CRAN program [31].

Study Area
This research was carried-out in the area controlled by the River Segura Water Authority (DHS in its Spanish abbreviation) (Figure 2), which includes the Segura river basin (19,000 km 2 ) and several minor coastal basins in SE Spain. It is a semiarid area with scarce and irregular precipitation, high temperatures, and a large number of hours of sun that cause high potential evapotranspiration. Despite the scarcity of water, agriculture is an important economic sector using both groundwater (available because the predominance of carbonate rocks) and water transferred from other basins. Population density and intensive irrigated agriculture represent a significant water demand. The study area is also characterised by substantial height differences over short distances, which together with the semiarid climate and the use (limited in space) of groundwater and transferred water, create a strong environmental variability.
The spatial distribution of temperature is strongly influenced by height distribution, the annual average temperature increases gradually from the Northwest mountain ranges (10 • C) to the coast (18 • C). The annual temperature regime has a winter low in December and January, while the highest values occur in July and August [32]. The proximity to the sea softens summer and winter temperatures in the coast, while an increase in continental features is observed towards the northern area, both because of the distance to the sea and because of the presence of successive mountain alignments [33].
The cloudiness in the DHS is more abundant in the equinoctial seasons, since in winter and summer long periods with anticyclonic situations are common. The northwestern sector has the highest cloud cover of the DHS, as it is exposed to the Atlantic fronts and suffers frequent convective storms in the May-September period, while coastal areas experience less cloudiness, especially the southwestern coast, although is not infrequent the presence of coastal mist. Thus, the average cloud fraction varied over the 2012-2014 period in much of the DHS between 20-30% in summer, between 30-40% in winter and between 40-50% in spring and autumn. In the north-west, such numbers are increased by 10 %. With respect to clear days (those with an average cloud fraction of less than 10%), in summer they range from 30 in the north-western extreme to 60 in coastal areas. In spring and autumn the number of clear days presents a greater homogeneity in the DHS, oscillating between 15 and 25 days. Finally, in winter, clear days range from 20 to 30 in the study area, with slightly more than 30 days in coastal areas.

Data Set
Different variables were used as predictors: (1) geographical variables: longitude, latitude, altitude, Topographic Wetness Index (TWI) [34] (used to describe the potential accumulation of cold air), monthly potential irradiation (Wh m −2 month −1 ) obtained from heights using the method proposed by [35] and distance from the coast; (2) time variables: day duration, cdayt (a cosine transformation of julian day following [36]), and the satellite passage time; and (3) variables provided by the MODIS sensor: EVI, albedo and the terrestrial surface temperature (LST) at the passage time of AQUA satellite.
The objective was to have a set of easy to obtain predictors that are, at the same time related with processes that explain air temperature. Latitude and altitude relation with temperature is well known; longitude was included to complete the 3 spatial axes. TWI estimates the accumulation of water in the terrain contributing to a cooling of the air. Monthly potential irradiation accounts for the differences in the amount of solar radiation reaching land surface due to differences in slope and aspect. LST is surface temperature as estimated by the satellite. As LST will vary in relation with the time when the satellite takes the image, this time needs to be taken into account. Cdayt is a transformation of the julian day to guarantee smooth transition between December and January; it oscillates between −1 (summer) and 1 (winter). Albedo measures the proportion of incident radiation reflected to the atmosphere. Distance to the coast is a proxy for continentality. EVI is a vegetation index useful as vegetation moderates extreme temperatures. Finally, the duration of the day increase the daily heat accumulation contributing to increase air temperature.
The classical formulation of TWI [37] is: where a is the upslope contributing area per unit contour and b is the slope. A more detailed description of methods to calculate slope and contributing area from a DEM can be found in [38]. The equation to calculate cdayt is: where t D is the julian day and φ is a time delay from the coldest day.

Weather Data
Air temperature data every 30 min is recorded by 53 weather stations belonging to the SIAM and SIAR networks (Sistema de Información Agrometeorológico de la Región de Murcia, Agro-meteorological information System in Murcia Region) and Sistema de Información Agroclimática para el Regadio, Agro-meteorological Information System for Irrigation) ( Figure 3). In this work, only data for 2012 were analysed to test which model produces the most accurate results. We assumed that the mean air temperature data (obtained from the average of 3 measurements taken every 10 min) provided by the SIAR and SIAM networks are equal to the air temperature at the time of satellite passage. Therefore, these data were used to calibrate and validate the different models used in this work for the prediction of the instantaneous air temperature.
Only data from clear days were used. Since there is no consensus in the scientific literature on the definition of a clear day, we considered as such those days with an average cloud cover lower than 10%.

Remote Sensing Data
The three environmental variables used as predictors in this work (albedo, vegetation index and surface temperature) were obtained from different MODIS products (Table 1). The R packages MODIStsp [39] package was used to download and process the images. The 8 day delay between the MOD13Q1 and MYD13Q1 products allows to obtain layers of both EVI combining AQUA and TERRA satellites with a temporal resolution of 8 days. The process involves 3 steps: (1) Elimination of pixels with very low quality or with large observation errors. To this end, according to [40] all pixels with a value greater than 5 in the VI usefulness index (VIUI) [41,42] were removed. (2) Combination of the EVI layers obtained by the MOD13Q1 and MYD13Q1 in those pixels whose values of both products were available. (3) Filling the gaps after step 2. If EVI values one week before and one week after were available, the average of both layers was used. Otherwise, the values estimated one week earlier or later were used. If no EVI values were available for the prior or the posterior weeks, the same process were carried out with the estimated values for 2 weeks before and after.
With regard to the LST layers, we decided to use those provided by the product MYD11A1, since AQUA passage time is closer than TERRA's to the temperature recording time. The LST layers were corrected taking into account the quality of the different pixels. In this case, according to [43] the LST pixels with an error estimation of more than 3 degrees Kelvin were removed. The GRASS module i.modis.qc was used to obtain the error layers. We also used the MODIS White-Sky Albedo layer as albedo estimation.
The hourly cloud fraction layers available in the database CMSAF-COMET Edition 1 [44] were used to obtain layers of the mean cloud fraction during the daytime period. This raster database (resolution of 0.05 • ) collects information on the fraction of cloudiness obtained from METEOSAT satellites, with an hourly time resolution. With the objective of giving greater importance to the central hours of the day, when the irradiation is concentrated, the daily layers were obtained as a weighting average of the hourly layers. The cosine of the angle of the solar zenith (β), proportional to the radiation received at any given time, was used; for the nocturnal layers w = 0 and for the diurnal layers w = cos β/ ∑(cosβ).

Geographical Variables
Some of the geographical variables (elevation, distance to the sea, potential insolation and TWI) used as predictors with the regression methods were calculated from the official 25 m resolution Digital Elevation Model (DEM) downloaded from the Spanish National Geographical Institute (Instituto Geográfico Nacional, IGN) website (http://centrodedescargas.cnig.es/CentroDescargas/index.jsp).

Estimation Methods
Four methods were used to estimate air temperature at the satellite passage time: MLRM, SVM and RF are regression (global interpolation) methods that rely on predictor variables to estimate the dependent variable; in this paper, only satellite data and other predictors that can be easily obtained were used, the first based on classical statistics and the last two on machine learning; finally OK, a local interpolation method that does not take into account predictors, was used for comparison. Finally, the residuals of the three regression models were interpolated with OK to obtain regression-kriging models (GLSRK, SVMRK and RFRK).

Multiple Linear Regression
Simple and multiple linear regression models (MLRM) are the most popular models for estimating air temperature. MLRM is a global interpolation method in which a functional relationship is defined between the dependent variable (in this case maximum and minimum temperature estimated in different observatories in the territory), and a set of spatially distributed environmental and geographical variables. The parameters may be estimated using ordinary least squares (OLS) or generalised least squares (GLS), a modification of OLS that takes into account the heterocedasticity and the spatial correlation in the observations. In this work, we used the implementation of GLS in the R package nlme [45].
The assumptions of the model were assessed by hypothesis contrasts: the Kolmogorov-Smirnov test to assess the normality of residuals (usually met) and the Breush-Pagan test to assess homocedasticity (usually not met).
With all the regression methods, a first step of variable selection was carried out using the Variance Inflation Factor (VIF) methodology proposed by [46]. In the case of MLRM a subsequent stepwise procedure was carried out to minimise the number of variables to provide a more parsimonious model.

Support Vector Machines
SVM [47] was originally developed for classification but it was adapted to regression as a robust method that tries to minimise the effect of outliers. SVM for regression is described in [48]. Instead of trying to minimise the sum of squared errors, data points whose residual absolute values are lower than a user defined threshold ( ) do not contribute to the fit, whereas points with |e| > , the so called support vectors, contribute linearly rather than quadratically to the error objective function to be minimised. This somewhat counterintuitive approach (the more accurately predicted points are not used to fit the line) has proven effective. A detailed description of this method can be found in [48] or [49].
SVM has been reported to obtain similar accuracy as RF and better accuracy than other machine learning methods such as neural networks [50][51][52]; however, its main drawback is that there is no way to know in advance which kernel and parameter values will give the best results. In this case, we reduced the problem by using a radial basis function (RBF) kernel. We optimised parameters C and σ simultaneously using a grid search method, 10-fold cross validation with the training data and mean squared error as performance measure. The implementation used was that of the R package e1071 [53].

Random Forest
RF [54] is an ensemble of decision trees. Decision trees are characterised by a small bias but a high variance. RF tries to solve this issue by training several (500 as default) decision trees. Each tree is trained with a bootstrapped subsample of the available training cases. In addition, every time a variable has to be selected to split a node, only a subset of the independent variables is considered (by default the integer part of √ p where p is the number of variables). In this way, although each tree provides a high variance estimation, averaging the results of all of them will result in a low bias and low variance estimation.
RF has 4 main advantages over other machine learning methods: (1) It provides an internal cross-validation procedure, (2) the default values for the parameters provide optimal estimations most of the times, (3) the decrease in heterogeneity provided by each variable along the calibration process of each tree, when aggregated, provides a measure of the importance of each variable, (4) it is possible to obtain an estimation of the effects of the different predictors on the model, allowing the operator to decide if such effects are physically sound or not. Points 3 and 4 mean that RF is not really a black box model, as other ML techniques; rather it might be considered a grey box model.
In this work, we used the version in the R ranger package [55], a very fast and memory efficient implementation of RF. We used the default values for the ntree (500) and mtry ( f loor( √ p)) where p is the number of predictors. Previous research [56,57] has shown that the accuracy achieved with such parameters is usually near the optimum.

Ordinary Kriging
OK is a local interpolation method based on the regionalized variable theory [58]. It uses solely the values measured in the observation points and their location. Its main advantage over other local interpolation methods (such as Inverse Distance Weighting) is that a statistical analysis of the spatial variability of the values is previously performed and summarised in the semivariogram function. Finally, OK performs, at each pixel, a weighted average of the values in the surrounding observation points with the weights calculated as a function of the semivariogram. The assumptions of OK are normality and first and second order stationarity, that is, the mean and the variance are constant in the area. These assumptions are rarely met, so several variations have been proposed to deal with trends in the data: e.g. Universal Kriging (taking into account spatial trends in the values) and regression-kriging (taking into account other predictors).
In this work, we used the R package automap [59], its main advantage being the automatization of a weighted least squares optimal estimation of semivariogram parameters using Gauss-Newton [60]. The variable was log-transformed to obtain a normal variable.

Regression-Kriging
A different approach is to perform a global (regression) interpolation, and interpolate its residuals with a local (e.g. kriging) method. In this way the local variability not explained by the global model and accounted in their residuals is dealt with by the local method.
Universal kriging is a well known method that firstly fits with a linear model a global surface using coordinates as predictors and secondly interpolates the residuals with OK [61]. Regression-kriging [62,63] is the same approach but using any environmental variables as predictors. Obviously any global method (GLS, RF, SVM, etc.) can be used in the regression part and any local interpolation method can be used instead of kriging to deal with the residuals. In this work we used OK to interpolate the residuals of MLRM, SVM and RF.

Validation
As the aim of this work is to evaluate the predictive performance of different models, cross validation was carried out for the seven estimation methods. Two types of problem have been raised about the use of cross-validation with spatio-temporal data when the objective is to extrapolate outside the spatio-temporal structure. Firstly, error estimates in a random cross validation may be too optimistic [64]. Secondly, time-static spatially distributed predictors (such as elevation) are prone to over-fit models [65].
In order to solve the spatial autocorrelation problem, it is advisable to validate with cases that are far enough from the training data to decrease spatial dependence and avoid too optimistic error estimates [64,66]. Something similar occurs with time-correlated data [64].
To carry out cross-validation, we have considered two approaches. Firstly a simple leave-one-out CV (LOO-CV) in the spatial domain, in which each case is used to validate a model calibrated with temperature data from the other observatories in the same day. The next pseudocode shows how it works: Secondly, a block cross-validation strategy [67]. Blocks are similar to folds in k-fold cross-validation, but, instead of the cases in each block being randomly selected (as in folds), all cases in a block correspond to a spatio-temporal subset of the 3D space formed by the 2D space and time. Spatial blocks forces testing on more spatially distant records, thus decreasing spatial dependence and reducing optimism in error estimates [66]; temporal autocorrelation is functionally similar, and blocks may be drawn in the same manner to better ensure independence between cross-validation folds [64]. The size of such blocks is determined after an analysis of the spatio-temporal autocorrelation identified in the data studied.
In order to implement k-block-CV, a prior study of the temporal and spatial autocorrelation of the daily instantaneous temperature was carried out. The temporal autocorrelation was analysed using autocorrelation functions (ACF) and partial autocorelation functions (PACF), and the spatial autocorrelation was analysed using semivariograms. In both cases the objective was to identify the distance (meters) and time lag (days) beyond which autocorrelation ceases to affect. The lag values of the ACF in each station and the ranges of the semivariograms in each day were obtained, allowing to estimate the adequate size for the k-spatial and k-temporal blocks inside each k-spatial block. The number of resulting blocks was the number of spatial multiplied by the number of temporal blocks.
In order to solve the second problem, concerning the predictor selection, a forward feature selection algorithm (FFS) [65] was applied to the variables selected in the VIF analysis. The algorithm is implemented in the R CAST package [65]. It, firstly, trains models of all possible 2-variable combinations, keeping the best model. Then, the number of predictors is iteratively increased and the improvement of the model tested for each additional predictor. The process stops when none of the remaining variables decreases the error of the currently best model [65]. The k-block-CV framework is used in FFS to obtain performance statistics.
Bennett et al., [68] state that Goodness of fit statistics measure different performance aspects, so several statistics should be used to decide which is the most accurate model. Four statistics, whose detailed description and interpretation criteria can be consulted in [69] or [68], were used in this research: Root mean square deviation (RMSE): Coefficient of determination (R-Squared): The modified Nash-Sutcliffe efficiency (NSE) measures the relative magnitude of the residual variance compared to the observed data variance, it is less sensitive than R 2 to outliers [69]: NSE ranges from -Inf to 1, this last value indicating a perfect match. If NSE = 0, the model is as accurate as using the mean of the observed data as predictor. If NSE < 0, the observed mean is a better predictor than the model. Percent bias (PBIAS) measures the relative tendency of the estimated values to differ from the observed values. It is less sensitive than RMSE to outliers and to the magnitude of the data: The optimal value of PBIAS is 0, with positive values indicating overestimation and negative values indicating underestimation.
In Equations (3) to (6), O i are observed values and E i estimated values. The four statistics were calculated for both validation methodologies. For LOO-CV in the spatial domain, they were calculated each day, so statistical distributions of their values along the year can be obtained and compared. An alternative option would have been to calculate statistics for each observatory; however, the temperature variations along the year would produce high R 2 and NSE values even if the accuracy of the different observatories were small (Simpson paradox).
As normality and homocedasticity could not be assumed, a robust version of one-way ANOVA using a heteroscedasticity-consistent covariance matrix of the parameters (HC3) method [70] to test whether differences among the methods were significant. If that was the case, a post-hoc contrast between pairs of models, based on Tukey-Kramer contrast and using HC3 method to correct p-values among classes, was performed to discover groups of non-significantly different methods.

Results
Two different validation approaches were used in this work, but the main objective was to compare different interpolation methods. For this reason we will present the results comparing the accuracy of the different methods, first with the LOO-CV in the spatial domain approach, and, later, with k-block Cross-Validation.

Leave-One-Out Cross-Validation in the Spatial Domain
For the regression models (GLS, RF, SVM) a VIF analysis was previously performed to recursively eliminate those variables with a high linear correlation with the rest. Recommended threshold values for VIF range from 5 [71] to 10 [72]. In this work it was firstly established in 5, but it was relaxed to allow the inclusion of LST (VIF = 6.43) as we considered LST, a priori, the most significant predictor, it is near to the lowest recommended value than to the higher recommended value, and the finally eliminated predictors have VIF values way larger: day duration (VIF = 39.04) and latitude (VIF = 24.43).
Both latitude and day duration were filtered out for their high correlation with other variables. Day duration is highly correlated with cdayt, and, due to the NW-SE altitude descent, altitude and longitude make latitude redundant in the study area (see Figure 3). Table 2 and Figure 4 show the results of the cross-validation of the eight methods. The number of days were 121 with RF, RFRK, SVM, SVMRK and OK, and 118 with GLS and GLSRK. The cross validation was limited to those days in which the number of observations (weather stations without clouds and with data available for that day) was greater than the number of predictors (10). That means 121 days including 3995 predictions. However, in the case of GLS and GLSRK, the final set of predictions was 3928 and 118 days due to errors derived from the estimation of the variance-covariance matrix.
On average, validation statistics indicate that RFRK obtains the most accurate results, followed by RF. In addition, GLS and GLSRK show very high prediction errors, although predictions with absolute residuals larger than 20 • C were filtered out. When this filtering was carried out, the final number was 3886 in the GLS and GLSRK models. In the case of GLS and GLSRK, errors occurred in areas outside the domain and/or range of the data and in areas with local effects that implied situations not considered in the linear relationship estimated in the general model. In these cases, as can be seen in Figure 4, machine learning methods are more accurate. This difference is due to the fact that the linear models predict poorly outside the boundaries of the training data set, whereas the machine learning methods have a better generalisation capability and predict well outside the boundaries of the training data set [73].    Table 2 and Figure 4 show detailed results of such an analysis. The letters above the plots in Figure 4 indicate to which groups of non-significantly different values belong each method. According to NSE and RMSE, RF with kriging of the residuals (RFRK) is significantly better than the other methods (group a in Figure 4a,c). The next more accurate method is RF (NSE = 0.504 ± 0.022 and RMSE = 1.181 ± 0.025 o C), increasing NSE and worsening the RMSE results of the previous method (NSE = 0.578 ± 0.025 and RMSE = 1.068 ± 0.027 o C). According to R 2 RFRK is the most accurate method with the exception of RF without residual kriging. Several methods (all except GLS and GLSRK) obtain PBIAS values near to zero and do not show significant differences, so those methods do not overestimate or underestimate temperature values. Figure 4d represents the percentage of bias obtained, but the groups in the a posteriori contrast are obtained using bias in absolute terms. RF and RFRK (group a and ab respectively) are the most accurate cases. RF underestimates the temperature by 0.04%, while RFRK overestimates it by 0.17%. It is noteworthy that GLS and GLSRK, whose mean PBIAS are practically 0, obtain biases in absolute terms greater than the rest. The R 2 values obtained by the regression models were very high when measured in calibration (GLS: 0.9165, RF: 0.9939 and SVM: 0.9627).
From these results we conclude that, in general, RF-based methods are more accurate and their performance increases when spatial component are included (RFRK). It is also noteworthy the greater variance observed in the GLS method versus machine learning and OK based methods (Figure 4). This results show that the accuracy is very different among observatories and, therefore the robustness of the methods is lower. Figure 5 shows the importance of the predictors and their effects on the RF model. Effects in RF are computed using partial plots. In such plots, the prediction for a variable X, evaluated at X = x, is: where xi, o represents the value for predictors other than X for individual i andf is the predicted value. For continuous variables, red points indicate partial values and dashed red lines indicate a smoothed error bar of +/− two standard errors. Black dashed line are the partial values. We used the function partial.plot in the R package randomForestSRC [74].
The most important predictors are LST, cdayt and radiation. The effect of all of them are clearly as expected, except the slight reduction in estimated temperature for the largest radiation values. We think the real behaviour might be a stabilisation of temperature for high radiation, although this was altered because of the three points with larger temperature estimation, which may represent a local effect. The least important predictors also have sound effects, although in two of them (albedo and TWI) confidence intervals are too large to draw any clear conclusion. On the other hand, the effects of longitude and TWI are very small, less than 1 • C.  Figure 6a shows the semivariogram of the mean of the daily instantaneous temperature values; its range is about 49 km. Figure 6b shows a boxplot of the range values estimated in the different daily semivariograms. In every case the range is higher than 80 km, so this value was used as the size of the spatial blocks. The blocks were generated with the R blockCV package [67], obtaining finally 7 cells distributed in k = 6 blocks as the number of cases in the two cells forming block 1 is quite low. The cells, and block identification, are included in Figure 3.  Figure 7a represents the autocorrelation function (ACF) of the daily average of instantaneous temperature of all weather stations. As expected, this variable shows a clear intra-annual cyclic behaviour. Figure 7b shows the distribution of the first non-significant lag time for all weather stations. According to the figure, 25 days may be considered a safe threshold beyond which no autocorrelation occurs. So, blocks with temporal size larger than 25 days guarantee that temporal autocorrelation is not affecting cross-validation. However, to reduce computation time and to avoid problems due to some missing data, just two temporal blocks were used. An additional problem of this k-block-CV approach is that cases in the border of a spatio-temporal block are nearer to cases in the neighbour blocks than the advisable distance, so it is safer to reduce the number of blocks to reduce the percentage of such cases. To avoid this issue, to reduce computation time and to avoid problems due to some missing data, the number of temporal blocks was reduced to 2. Therefore, in this study a spatio-temporal k-block-CV with K = 36 (6 spatial blocks and 2 temporal blocks) was implemented. Table 3 shows the accuracy statistics for the three main predictive models. Neither local nor regression-kriging were validated with this approach because the computing time was excessive and because the semivariograms computed in a blocks are not necessarily representative for other blocks. Figure 8 shows these values and also the results of the comparisons of the values of the statistics. RF is the most accurate method according with all statistics, only PBIAS gives not very different results for all methods.   Figure 9 shows the RF importance and the effects of the variables that overcome the fast forward feature selection method. The most important variables are satellite land surface temperature, cdayt and radiation. Considerably less important variables, but still in the model, are time passage of the satellite, distance to the coast, and elevation.

K-block Cross-Validation
All the effects are quite significant as their confidence interval are quite narrow, both the confidence intervals and the shape of the effects plots are quite similar to those shown in Figure 5 for the same variables.  Figure 10 shows the resulting temperature maps at the satellite passage time for the seven methods analysed. The spatial pattern produced by OK is clearly poorer as it does not use any ancillary data. The patterns produced by the regression methods are quite similar, reproducing the influence of the topographical variables that are relevant for the modelling of the spatial variability of temperature. The main differences among the regression methods involves the prediction of maxima and minima. The regression trees on which RF is based prevent the prediction of abnormally high or low values when the values of the predictors exceed the values used in calibration. This does not happen with GLS or SVM, whose maps show higher extremes, especially in the case of the GLS model. These extreme values are, however, smoothed by the kriging of the residuals. Although the lack of anomalous values when predicting beyond the range of values used to calibrate the model is probably a point in favour of RF, it is difficult to ascertain to what extent it may result in an overestimation of minima and underestimation of maxima. Finally, all regression methods produce an artifact in the cells nearest the coast, which show temperature values probably lower than the real values. As was mentioned in the introduction, SVM has been reported to obtain similar accuracy than RF and better accuracy than other machine learning methods such as neural networks; however, its main drawback is that there is no way to know in advance which kernel and parameter values will give the best results. It is noteworthy that, in this study, RF obtained more accurate results than SVM, even when the parameters of the latter were optimised (C = 8, σ = 0.072) but not those of the former.

Discussion
Some of the research in which instantaneous temperature values are obtained at the time of passage of the TERRA and AQUA satellites on clear days, are based on the adiabatic lapse rate [6,27]. In these research, RMSE values of 3.3 • C and 2.6 • C, respectively, were obtained. Other studies are based on the assumption of a hydrostatic equilibrium in the atmosphere [27,75], obtaining an RMSE of 2.9 • C in both works. Most of the work related to the estimation of the air temperature from remote sensing data and using regression models are focused on obtaining the maximum and minimum temperatures, but not on the estimation of T a at the time of satellite passage. In this sense, ref. [15] propose a multiple linear regression using as predictors some of the variables used in the present work and in which they obtain an RMSE of 1.9 • C, although in this case the results do not correspond only to clear days, so their results are not extrapolable to those obtained in this work. The results obtained in the present work, especially for RF and RFRK, represent an important improvement with respect to the models based on the adiabatic lapse rate or on the hydrostatic equilibrium of the atmosphere and can be of great help for the improvement of the estimation of the real evapotranspiration, on a regional scale. Traditionally, the temperature at the time of the passage of the satellites, required by most methods, has been obtained using mechanical interpolation techniques, as the Inverse Distance Weighting [76] or models based on the hydrostatic equilibrium of the atmosphere from the values provided by the MODIS MOD07 product [77], which in addition to presenting a lower accuracy than that obtained for RFRK in this work, have the disadvantage of having a spatial resolution of 5 km.
In this paper, two different approaches to cross validation have been used: LOO-CV in the spatial domain and spatio-temporal k-block-CV. The latter approach arises from concerns about the problems of doing random cross-validation with spatio-temporal data due to a lack of independence among samples. In our case, quite larger R 2 and NSE values when doing k-block-CV, whereas RMSE and PBIAS values were also larger. The reason of the difference in RMSE and PBIAS when comparing the two validation methods is that the range of air temperature values in LOO-CV in the spatial domain models (using just one day) is lower than the corresponding range in the k-block-CV models (using 182 days). For this reason, comparison of RMSE and PBIAS among different validation approaches could be misleading; however, comparisons of R 2 and NSE are quite safer. The fact that LOO-CV in the spatial domain produces lower accuracy statistics is probably due to the use of a quite lower number of data when calibrating the models.On the other hand the results on which method is more accurate are also similar, with k-block-CV highlighting RF as the most accurate method.
It is difficult to establish the generalisation capability of this approach. The fact that in this case RFRK were the best model does not mean that it is going to be the best model in other areas. In general, it is not easy to pointing out a machine learning method as the best, in any research field. We think that similar investigations are needed in different areas, so that a meta-analysis of their results might be carried out to obtain some insight on which methods work best and under what circumstances.

Conclusions
The results obtained in the present work, especially for RF and RFRK, represent an important improvement with respect to the most widely used models. The accuracy of different machine learning algorithms was tested and compared with those of more traditional methods, such as MLRM and OK. The framework built for that goal is based on free and open source software and would be easily reproducible in other study areas. Easy to obtain environmental variables were used as predictors. Finally, this methodology produces daily temperature distributed surfaces with an accurate fit to the observations. It is noteworthy that the spatial variability of both temperature and environmental predictors in Mediterranean semiarid environments and the heterogeneous landscape due to intense human pressure difficult the estimation of environmental variables.
Both validation approaches show that machine learning algorithms, specially RF, reach significantly higher accuracy values. The use of both approaches allows a better knowledge of model accuracy in two different situations: Spatio-temporal k-block-CV allows to evaluate in a realistic way the accuracy of models that include all observations in time and space, considering the effect of spatio-temporal dependence. On the other hand, LOO-CV in spatial domain is an accuracy estimation considering a very small number of observations: n − 1 where n is the number of observatories, and all the data corresponding to the same day.
RF with OK of the residuals obtained the most accurate results with R 2 = 0.888, NSE = 0.805, RMSE = 3.009 and PBIAS = 6.258. The last result indicates that it is the only method that produces a slight temperature underestimation, whereas the rest of the methods overestimate temperature to a greater or lesser degree. The maps obtained with RF do not show the extreme values usually present in other regression methods, and that also appear in this study.
The introduction of environmental information related with temperature and easy to obtain from a DEM improves the prediction capability of the model. In particular, potential radiation, distance to the coast and elevation turned to be selected variables in both feature selection strategies.
K-block-CV produced a higher accuracy estimation than LOO-CV in the spatial domain. Probably because in k-block-CV calibrated models used quite more data than LOO-CV in the spatial domain, so the space of variables are more thoroughly sampled. The purpose of LOO-CV in the spatial domain is not really to obtain a a global accuracy measurement, but to obtain the accuracy in specific days. However, the ranking of methods was similar.
The variable selection based on k-block-CV produced a smaller set of predictors (6) than the variable selection based on LOO-CV in the spatial domain (10). All the predictors not included by the first one turned to be of scarce importance in the RF model. The most important predictors are LST, cdayt and potential radiation. This three predictors are physically quite correlated with air temperature. The satellite passage time, distance to the coast and elevation are less important. The last two predictors have a quite clear and physically sound effect on air temperature time. It is evident in the case of elevation, and in the case of distance to the coast, due to the high temperatures in the study area, the sea acts as a cooling factor more than a warmer factor. It is less clear the effect of the satellite passage time that should be analysed in further research.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: