Air Temperature Forecasting Using Machine Learning Techniques: A Review

Efforts to understand the influence of historical climate change, at global and regional levels, have been increasing over the past decade. In particular, estimates of air temperature are considered a key factor in climate impact studies on the agricultural, ecological, environmental, and industrial sectors. Accurate temperature prediction helps to safeguard life and property and plays an important role in planning for government, industry, and the public. The primary aim of this study is to review the machine learning strategies for temperature forecasting available in the literature, presenting their advantages and disadvantages and identifying research gaps. This survey shows that Machine Learning techniques can help to accurately predict temperatures based on a set of input features, which can include previous values of temperature, relative humidity, solar radiation, rain, and wind speed measurements, among others. The review reveals that Deep Learning strategies report smaller errors (Mean Square Error = 0.0017 °K) than traditional Artificial Neural Network architectures for 1-step-ahead forecasts at the regional scale. At the global scale, Support Vector Machines are preferred for their good compromise between simplicity and accuracy. In addition, the accuracy of the methods described in this work is found to depend on the combination of inputs, the architecture, and the learning algorithm. Finally, further research areas in temperature forecasting are outlined.


Introduction
Mitigating climate change is one of the biggest challenges facing humankind. Despite the complexity of predicting the effects of climate change on Earth, there is a scientific consensus about its negative impacts. Among them, disruption of ecosystems, loss of biodiversity, soil erosion, extreme changes in temperature, sea level rise, and global warming have been identified. Likewise, impacts on the economy, human health, food security, and energy consumption are expected [1,2].
Specifically, air temperature forecasting has been a crucial climatic factor required for many different applications in areas such as agriculture, industry, energy, environment, tourism, etc. [3]. Some of these applications include short-term load forecasting for power utilities [4], air conditioning and solar energy systems development [5,6], adaptive temperature control in greenhouses [7], prediction and assessment of natural hazards [8], and prediction of cooling and energy consumption in residential buildings [9,10]. Therefore, there is a need to accurately predict temperature values because, in combination with the analysis of additional features in the subject of interest, they would help to establish a planning horizon for infrastructure upgrades, insurance, energy policy, and business development [11].
Along with other atmospheric parameters, air temperature values are measured near the surface of earth by trained observers and automatic weather stations. In particular, the World Meteorological Organization facilitates the creation of worldwide standards for instrumentation, observing practices and measurements timing in order to ascertain the homogeneity of data and statistics [12]. Empirical strategies have been developed for temperature forecasting, obtaining accurate results. Their high accuracy and reliability have been very dependent on the acquired data, where most of them follow data quality standards and quality measures [13][14][15][16][17].
This area has become a significant field of application for Machine Learning (ML) techniques because of the difficulty of achieving high accuracy in temperature prediction. In particular, it has been shown that the volatility of temperature time series obeys nontrivial long-range correlation and exhibits nonlinear behavior [18]. In addition, these time series have considerable spatial, temporal, and seasonal variability [19].
In the literature, many ML-based approaches have been explored in forecasting applications. Specifically, in air temperature time series analysis, Artificial Neural Networks (ANN) and Support Vector Machines (SVM) are the most widely implemented strategies. Most of the ANN models developed to predict temperature values are MultiLayer Perceptron Neural Networks (MLPNN) and Radial Basis Function Neural Networks (RBFNN) [20][21][22][23][24][25][26][27][28][29][30][31][32], with Levenberg-Marquardt and Gradient Descent being the most used optimization algorithms. With regard to SVM models, most of the works in the field use Radial Basis Function kernels [33][34][35][36][37][38]. In terms of performance, at a global scale, SVM has reported better performance metrics than classical ANNs [39] from 1 to 20 steps ahead. At a regional scale, in contrast, recent Deep Learning (DL) approaches have been proposed, reporting high accuracy. Specifically, Convolutional and Long Short Term Memory (LSTM) Recurrent Neural Networks (RNN) have been used to forecast hourly air temperature with significantly small errors for 1 step ahead [40]. In turn, a similar approach was proposed by Roesch and Günther [41] to examine annual, monthly, and daily patterns associated with air temperature time series.
The primary aim of this study is to review the ML techniques proposed in the literature for air temperature forecasting and to identify research gaps. To the best of our knowledge, this is the first review in the literature of ML-based techniques focused specifically on the problem of air temperature prediction, taking into account global and regional points of view. This paper is organized as follows. In Section 2, the most used ML-based strategies are described and the relevant associated concepts are introduced. In Sections 3 and 4, the comparison of the temperature forecasting ML-based strategies, at global and regional levels, is presented. Finally, conclusions and research gaps in temperature forecasting are discussed in Section 5.

Overview of Machine Learning Based Strategies and Forecast Performance Factors
ML is defined as a branch of the Artificial Intelligence field. The main objective of the algorithms developed in this area is to obtain a mathematical model that fits the data. Once this model represents accurately known data, it is used to perform the prediction using new data. In this way, the learning process involves two steps: the estimation of unknown parameters in the model, based on a given data-set, and the output prediction, based on new data and the parameters obtained previously.
In this way, ML strategies find models that relate inputs to outputs, even when the system dynamics and their relations are difficult to represent. For this reason, this approach has been widely implemented in a great variety of domains, such as pattern recognition, classification, and forecasting problems. There are four common learning methods implemented in ML:

• Supervised Learning, which uses information about the desired outputs to label the training set used for model training.

• Unsupervised Learning, which does not have information about the desired output to label the training data. Consequently, the learning algorithm must find patterns to cluster the input data.

• Semi-supervised Learning, which uses labeled and unlabeled data in the training process.

• Reinforcement Learning, which performs the learning process by maximizing a scalar reward or reinforcement signal that is positive or negative depending on the system goal. Positive signals are known as "rewards", while negative ones are known as "punishments".
Considering the large number of ML-based approaches developed for forecasting applications, this work focuses on the ML strategies most widely implemented in temperature prediction: ANN and SVM. Although these methods are trained in a supervised way, neural network algorithms capable of unsupervised training could be included as well [42].
It is important to note that the most used input features in this field include previous values of temperature as well as relative humidity, solar radiation, rain, and wind speed measurements. The evaluation measures most frequently used in these works to assess the performance of the algorithms include the Mean Absolute Percentage Error (MAPE), the Mean Absolute Error (MAE), the Median Absolute Error (MdAE), the Root Mean Squared Error (RMSE), and the Mean Squared Error (MSE). Other indices proposed in the literature can also be used, such as the correlation coefficient R (Pearson Coefficient) or the index of agreement d, which are usually normalized to the (0-1) range [43]. These algorithms and their particularities are discussed in the following subsections.

Artificial Neural Networks
Specifically, ANNs have been widely used for classification and forecasting applications in meteorology due to their accurate results solving pattern recognition, nonlinear function estimation, and optimization problems [44]. The accuracy of their results is based on the ANN's capability to characterize nonlinear relationships and the availability of historical data of meteorological variables, making them an attractive analysis tool for researchers around the world.
The perceptron is the basic structural element of an ANN. Its inputs x_i are scaled by weights (W_i), summed over the n inputs, translated by a bias (b), and passed through an activation function f. The perceptron transfer function can be written as:

$$y = f\left(\sum_{i=1}^{n} W_i x_i + b\right) \qquad (1)$$

Perceptrons can be combined to form a MultiLayer Perceptron Neural Network (MLPNN). In general, the inner structure of these ANNs in prediction problems is composed of n inputs, m_k units in each of the k hidden layers (a single layer or several), and a single output unit. The input layer receives the data-set through units that characterize the input features. The value of each hidden unit is the sum of the products between the units of the previous layer and the weights of the links connected to that node. Finally, the output layer performs the final processing, and its units represent the classes to be recognized or the variable to be predicted. An example of the input x to output y mapping for an ANN with one hidden layer is defined by Equation (2):

$$y = g\left(\sum_{j=1}^{m} W_j \, f\left(\sum_{i=1}^{n} W_{ji} x_i + b_j\right) + c\right) \qquad (2)$$

where W_{ji} and b_j are the weights and bias of hidden unit j, W_j are the weights of the output layer, and g and c represent the activation function and bias for the output layer, respectively. With this rationale in mind, more complex architectures that include multiple hidden layers could be considered.

The weight vector W characterizes the nonlinear mapping and is defined during the learning process to match the desired outputs by minimizing a defined error function; this stage is commonly called training. Each of the m hidden neurons is defined by an activation function, for example the logistic sigmoid or the hyperbolic tangent. For prediction applications, the output activation function is generally considered linear. During the generalization stage, called "recalling", validation on a different data-set is performed in order to evaluate the ANN performance with the weights calculated during the learning process.
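As a minimal sketch of the single-hidden-layer mapping described above, the following Python code evaluates a perceptron and a small network with sigmoid hidden units and a linear output. The function names and all weight values are hypothetical, chosen only for illustration; no training is performed.

```python
import math

def sigmoid(z):
    """Logistic activation, one common choice for hidden units."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, weights, bias, activation):
    """Single perceptron: weighted sum of the inputs plus bias, then activation."""
    return activation(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer of sigmoid units followed by a linear output unit."""
    hidden = [perceptron(x, w, b, sigmoid)
              for w, b in zip(hidden_weights, hidden_biases)]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias

# Hypothetical weights for a network with 2 inputs and 2 hidden units.
y = mlp_forward(
    x=[0.5, -1.0],
    hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, -1.0],
    output_bias=0.1,
)
```

In a trained network the weights and biases would come from the learning stage rather than being set by hand.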
For temperature prediction, an example of the input x to output y relationship, considering the previous values of temperature as the input features, is presented in Equation (3):

$$\hat{T}_{t+1} = F\left(T_t, T_{t-1}, \ldots, T_{t-l+1}\right) \qquad (3)$$

where F is the nonlinear mapping implemented by the network and l is the number of previous temperature values used as inputs. As can be seen in Equation (3), the ANN representation is equivalent to the classic nonlinear auto-regressive (AR) model for prediction purposes. In this way, l can be calculated using the auto-mutual information factor, as in the case of AR models. Careful attention must be given to the size of the training data in order to obtain the best performance from the neural network. Too few training samples may not be sufficient to compute weights that generalize, while too many could cause over-fitting and require much more time for learning. A more detailed description of this method may be found in [45,46].
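To make the auto-regressive input structure concrete, the sketch below turns a temperature series into input-target pairs for 1-step-ahead prediction. The helper name `make_lagged_samples` and the sample values are hypothetical; the number of lags is passed in directly rather than estimated from the data.

```python
def make_lagged_samples(series, n_lags):
    """Turn a time series into (inputs, target) pairs for 1-step-ahead prediction.

    Each input holds the n_lags most recent values, oldest first, and the
    target is the value that immediately follows them.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    samples = []
    for t in range(n_lags, len(series)):
        samples.append((series[t - n_lags:t], series[t]))
    return samples

# Hypothetical hourly temperatures in degrees Celsius.
temps = [20.1, 20.4, 21.0, 21.8, 22.5, 22.9]
pairs = make_lagged_samples(temps, n_lags=3)
```

Each pair can then be fed to any supervised model, with the list of lagged values as the network inputs.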
A broad variety of ANN architectures have been proposed for forecasting tasks. As an alternative to the MLPNN, the RBFNN has been widely explored in air temperature forecasting. This architecture differs from the MLPNN in that the input layer is not weighted, so the nodes of the first hidden layer receive each full input value with no modifications. Additionally, generally only the activation function is adjusted, which in most cases is a Gaussian function. In general, RBFNNs involve a simpler training process because they contain fewer weights than classical MLPNNs, which leads to good generalization and high noise tolerance.
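The structure described above can be sketched as follows, assuming Gaussian hidden units parameterized by a center and a width. The parameter names (`centers`, `widths`) and the specific Gaussian form are illustrative assumptions, not taken from any of the reviewed papers.

```python
import math

def gaussian_rbf(x, center, width):
    """Gaussian activation: 1 at the center, decaying with squared distance."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def rbfnn_forward(x, centers, widths, output_weights, output_bias):
    """Unweighted inputs go straight to the Gaussian hidden units;
    only the output layer applies weights."""
    hidden = [gaussian_rbf(x, c, s) for c, s in zip(centers, widths)]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
```

Because the inputs are not weighted, the only trainable quantities in this sketch are the centers, the widths, and the output-layer weights.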
In the ANN research line, an ML area called Deep Learning (DL) has been widely implemented in many fields and applications. A DL-based ANN is an approach that includes at least two nonlinear transformations (hidden layers). Its advantages lie in the ability to handle big data and to automatically extract relevant features [47]. Different DL-based architectures have been implemented in forecasting applications; however, they have not been widely explored for the analysis and prediction of air temperature time series. Some examples of the architectures used in prediction tasks are Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). An RNN uses the internal state of the network at the previous step as an input to the model, following a chained module structure, so that the information is analyzed recurrently. Figure 1 shows this dependence structure, where A is a repeating module and x_t and h_t are the input and output at time t, respectively. In traditional RNNs, this module consists only of a single ANN. CNNs, on the other hand, are a kind of ANN developed for feature extraction. Originally developed for two-dimensional data and image recognition, these networks perform a series of operations on the data matrix to reduce its size. One-dimensional CNNs are widely used in time series forecasting problems to identify patterns in time-sequenced data.
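The chained structure of a traditional RNN can be illustrated with a scalar recurrence in which the same module is applied at every time step. This is a simplified sketch under stated assumptions: a single tanh unit, scalar inputs and states, and hypothetical parameter names `w_x`, `w_h`, and `b`.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrence step: new state from current input and previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def rnn_run(inputs, w_x, w_h, b, h0=0.0):
    """Apply the same module along the sequence and collect the states h_t."""
    states = []
    h = h0
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states
```

Gated variants such as LSTM replace the single tanh unit with a more elaborate module but keep the same chained dependence on the previous state.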

Support Vector Machines
The SVM algorithm, on the other hand, is considered one of the most robust and accurate strategies among the ML-based approaches. It is a kernel-based technique developed by [48] and has been used in forecasting, classification, and regression applications. The main objective of SVM is to map the input data x into a high-dimensional feature space by means of a nonlinear mapping and to generate an optimal hyper-plane w · x + b = 0 in this new space. In contrast to the ANN strategy, which uses the training error in the optimization process, SVM seeks to minimize an upper bound of the generalization error. In order to obtain the optimal hyper-plane {x ∈ S : (w · x) + b = 0}, the norm of the vector w must be minimized, which maximizes the margin 1/||w|| defined between the two classes:

$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}$$

The SVM regression estimating function that gives the predicted output y* from the input data-set x is:

$$y^{*} = \sum_{i=1}^{N} \left(\alpha_i - \alpha_i^{*}\right) K(x_i, x) + b \qquad (5)$$

where K(x_i, x_j) is the kernel function, commonly defined as one of:

$$K(x_i, x_j) = \begin{cases} x_i \cdot x_j & \text{(linear)} \\ \left(\gamma \, x_i \cdot x_j + r\right)^{d} & \text{(polynomial)} \\ \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) & \text{(radial basis function)} \\ \tanh\left(\gamma \, x_i \cdot x_j + r\right) & \text{(sigmoid)} \end{cases}$$

where d, r ∈ N and γ ∈ R+ are constants. α_i and α_i* are Lagrange multipliers, which are solutions of a quadratic programming problem and satisfy the Karush-Kuhn-Tucker conditions. These coefficients are calculated by maximizing the following form:

$$-\frac{1}{2}\sum_{i,j=1}^{N}\left(\alpha_i - \alpha_i^{*}\right)\left(\alpha_j - \alpha_j^{*}\right)K(x_i, x_j) - \varepsilon\sum_{i=1}^{N}\left(\alpha_i + \alpha_i^{*}\right) + \sum_{i=1}^{N} y_i\left(\alpha_i - \alpha_i^{*}\right)$$

subject to $\sum_{i=1}^{N}(\alpha_i - \alpha_i^{*}) = 0$ and $0 \le \alpha_i, \alpha_i^{*} \le C$, where ε is the width of the insensitive error zone. The parameter C defines the smoothness of the approximating function and determines the error margin to be tolerated. The Lagrange multipliers α_i and α_i* act as forces pushing the estimated values towards the desired output value y. The parameter b (or bias parameter) in Equation (5) is obtained directly from the Karush-Kuhn-Tucker conditions of the quadratic programming problem described. More details of this approach can be found in [49,50].
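The estimating function described above is a kernel-weighted sum over training points plus a bias. The sketch below evaluates it with a radial basis function kernel, assuming the dual coefficients have already been obtained by solving the quadratic programming problem; the function names and the `dual_coefs` argument, which stands for the differences between the two Lagrange multipliers of each support vector, are hypothetical.

```python
import math

def rbf_kernel(a, b, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion over the support vectors plus the bias term.

    dual_coefs[i] is assumed to hold (alpha_i - alpha_i*) for support vector i.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias
```

Only training points with a nonzero coefficient contribute to the sum, which is why the prediction cost depends on the number of support vectors rather than on the size of the full training set.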
These ML-based solutions have become an alternative approach to conventional techniques and have been used in a number of meteorological forecasting applications [51][52][53]. It should be noted that coupling these strategies with other tools, such as principal component analysis, the Kalman filter, and fuzzy logic, among others, has been studied as a way to improve the performance of the estimation process [54][55][56].

Evaluation Measures
Because no standard evaluation measure has been defined for prediction, comparing the different forecasting strategies has become difficult. This is mainly due to the different time horizons and scales of the estimated data and the variability of the meteorological time series for diverse locations. However, some measures have been proposed to compare the predicted output ŷ with the observed data y. The most widely used measures to evaluate the ML strategies implemented for forecasting tasks are:

• Mean Absolute Error (MAE): This measure is an error statistic that averages the distances between the estimated and the observed data for N samples:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \qquad (7)$$

• Median Absolute Error (MdAE): This measure is defined as the median of the absolute differences |y − ŷ| for any N pairs of forecasts and measurements:

$$\mathrm{MdAE} = \operatorname*{median}_{i=1,\ldots,N}\left|y_i - \hat{y}_i\right| \qquad (8)$$

• Mean Square Error (MSE): This measure is defined as the average squared difference between the predicted and the observed temperature data, for N samples:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2} \qquad (9)$$

• Root Mean Square Error (RMSE): This measure is the standard deviation of the difference between the estimation and the true observed data (see Equation (10)). This measure is more sensitive to large prediction errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}} \qquad (10)$$

Although these measures have been widely used in forecasting tasks due to their simplicity, the main drawbacks reported are their scale dependency [57], the strong influence of outliers on the prediction evaluation [58], and their low reliability [59], evidenced by the variability of the results when a different fraction of the data is evaluated.
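The four error measures above can be computed directly from paired observations and forecasts. The following is a minimal sketch using only the Python standard library; the function names and the sample values are hypothetical.

```python
import math
import statistics

def mae(y, y_hat):
    """Mean of the absolute differences between observed and predicted values."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mdae(y, y_hat):
    """Median of the absolute differences; less affected by a few large errors."""
    return statistics.median(abs(a - b) for a, b in zip(y, y_hat))

def mse(y, y_hat):
    """Mean of the squared differences."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Square root of the MSE, expressed in the units of the data."""
    return math.sqrt(mse(y, y_hat))

# Hypothetical observed and predicted temperatures.
observed = [20.0, 22.0, 25.0, 19.0]
predicted = [21.0, 22.0, 23.0, 19.0]
```

With these values the absolute errors are 1, 0, 2, and 0, so the single 2-degree miss raises the MSE and RMSE more than it raises the MAE, while the MdAE stays at the midpoint of the two central errors.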
In addition to these error measures, percentage errors have been calculated as well during the evaluation in the forecasting domain. This group of measures includes:

• Mean Absolute Percentage Error (MAPE): This measure expresses the error in proportion to the observed data. It is defined as:

$$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

• Root Mean Square Percentage Error (RMSPE): RMSPE is calculated according to:

$$\mathrm{RMSPE} = 100\,\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^{2}}$$

These kinds of measures are unit-free, have good sensitivity when small changes are present in the data, and do not show data asymmetry [60]. However, these measures involve divisions by numbers that may be equal or close to zero, yielding errors that can be indeterminate or excessively large, and they have very low outlier protection compared to other measures that bound the errors [59,61].
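The percentage measures can be sketched in the same style. The factor of 100 that expresses the result in percent is a common convention assumed here, and the function names are hypothetical. The explicit check for zero observations makes the division problem visible; it matters in practice for temperatures in degrees Celsius, which can be at or near zero.

```python
import math

def _percentage_errors(y, y_hat):
    """Relative errors (y - y_hat) / y; undefined if any observation is zero."""
    if any(obs == 0 for obs in y):
        raise ValueError("percentage errors are undefined when an observation is zero")
    return [(obs - pred) / obs for obs, pred in zip(y, y_hat)]

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    errors = _percentage_errors(y, y_hat)
    return 100.0 * sum(abs(e) for e in errors) / len(errors)

def rmspe(y, y_hat):
    """Root mean square percentage error, in percent."""
    errors = _percentage_errors(y, y_hat)
    return 100.0 * math.sqrt(sum(e * e for e in errors) / len(errors))
```

Raising an error on zero observations is one design choice; another is to drop or mask those samples, at the cost of changing which data the score describes.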
An additional group, based on relative measures, contains functions calculated as the ratio of one of the error measures mentioned above, evaluated for the forecasting model, to the same measure evaluated for a reference model. In this group, it is possible to find:

• Relative Mean Absolute Error (RMAE): This measure is computed as:

$$\mathrm{RMAE} = \frac{\mathrm{MAE}}{\mathrm{MAE}^{*}}$$

where MAE and MAE* are calculated by using Equation (7) for the forecasting model and the reference model, respectively.

• Relative Root Mean Square Error (RRMSE): This measure is calculated in a similar way to the RMAE, but in this case using the error defined in Equation (10):

$$\mathrm{RRMSE} = \frac{\mathrm{RMSE}}{\mathrm{RMSE}^{*}}$$

This approach, in general, establishes the number of cases in which the evaluated forecasting model is superior to the reference but does not give an assessment of the size of the difference [62].
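A relative measure needs forecasts from both the evaluated model and a reference model on the same observations. The sketch below assumes that the reference forecasts are supplied by the caller (for example, a persistence forecast that repeats the previous observation, which is an illustrative choice and not one prescribed by the text). All function names are hypothetical.

```python
import math

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def rmae(y, y_model, y_reference):
    """MAE of the model divided by MAE of the reference model."""
    return mae(y, y_model) / mae(y, y_reference)

def rrmse(y, y_model, y_reference):
    """RMSE of the model divided by RMSE of the reference model."""
    return rmse(y, y_model) / rmse(y, y_reference)
```

A value below 1 indicates that the evaluated model has a smaller error than the reference on the given data, and a value above 1 indicates the opposite.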
Likewise, additional indices have been used in the evaluation of forecasting systems. Among them, the correlation coefficient R (Pearson Coefficient) has been defined as the covariance between the estimated ŷ and the observed y data over the product of their standard deviations (S_ŷ, S_y):

$$R = \frac{\operatorname{cov}(\hat{y}, y)}{S_{\hat{y}}\, S_{y}}$$

The Index of Agreement d, on the other hand, is calculated by the expression:

$$d = 1 - \frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}{\sum_{i=1}^{N}\left(\left|\hat{y}_i - \bar{y}\right| + \left|y_i - \bar{y}\right|\right)^{2}}$$

where ȳ is the mean of the observed data. Based on the d-statistic, the closer the index value is to one, the better the agreement between the observed and the predicted data.
Because each evaluation measure has disadvantages that could lead to an inaccurate evaluation of the prediction process, it has been shown that it is not possible to rely on a single measure. Shcherbakov et al. [62] recommend using the error measures and correlation coefficients when the analyzed time series have the same scale and when data pre-processing has been performed. In addition, although percentage measures have been widely used in forecasting tasks, they do not recommend them because of their non-symmetry.
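Both indices can be computed from paired observations and forecasts. The sketch below assumes the standard definitions: the Pearson coefficient as covariance over the product of standard deviations, and the index of agreement as one minus the ratio of the squared error to the potential error around the observed mean. The function names are hypothetical.

```python
import math

def pearson_r(y, y_hat):
    """Covariance of y and y_hat divided by the product of their standard deviations."""
    n = len(y)
    mean_y = sum(y) / n
    mean_h = sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_h) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_h = sum((b - mean_h) ** 2 for b in y_hat)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / math.sqrt(var_y * var_h)

def index_of_agreement(y, y_hat):
    """1 - sum of squared errors / sum of squared potential errors."""
    mean_y = sum(y) / len(y)
    num = sum((b - a) ** 2 for a, b in zip(y, y_hat))
    den = sum((abs(b - mean_y) + abs(a - mean_y)) ** 2 for a, b in zip(y, y_hat))
    return 1.0 - num / den
```

A forecast that is exactly twice the observation still gives R = 1, because correlation ignores scale and offset, whereas d drops below 1; this is one reason the two indices are reported together.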
An additional topic that has been widely considered when evaluating the results of ML models is Statistical Significance Analysis (SSA). While, in most applications, it has been used to select the best ML model, this tool also supports the interpretation of prediction results. In general, ML-based strategies are validated using re-sampling approaches such as k-fold cross-validation, from which mean skill scores are directly computed and compared. Although this approach is very simple, it can be misleading because it is difficult to know whether the difference between mean skill scores is real or the result of a statistical fluke. In this context, SSA is proposed to overcome this limitation by quantifying the likelihood of observing the measured samples of skill scores under the assumption that they were drawn from the same distribution. If this hypothesis (often called the null hypothesis) is rejected, the difference in skill scores is statistically significant. As such, SSA is considered very useful for improving model reliability as well as the interpretation and presentation of results during the model selection process. Forecasting applications have included, for instance, normality tests to confirm that data-sets are (or are not) normally distributed, parametric statistical significance tests for normally distributed results, and non-parametric statistical significance tests for more complex distributions of results.
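As one illustration of a non-parametric significance test on cross-validation results, the sketch below implements an exact paired sign-flip permutation test on fold-wise skill scores. The choice of this particular test is an assumption made for illustration; the text does not prescribe a specific procedure, and the function name is hypothetical. It enumerates all sign assignments, so it is only practical for a small number of folds.

```python
from itertools import product

def paired_sign_flip_p_value(scores_a, scores_b):
    """Two-sided exact p-value for the mean paired difference being zero.

    Under the null hypothesis the two models are interchangeable in each
    fold, so the sign of every fold-wise difference can be flipped freely.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so that ties are counted despite rounding.
        if stat >= observed - 1e-12:
            count += 1
    return count / total
```

With five folds the smallest attainable two-sided p-value is 2/32, so more folds or repeated cross-validation are needed before a conventional 0.05 threshold can be reached.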

Input Features, Time Horizon, and Spatial Scale
ML strategies have become an alternative approach to conventional techniques and are used in a number of different applications for modeling, prediction, and forecasting of temperature values. For the particular case of forecasting applications, three considerations on the input features can be envisioned to characterize the model:

• The model is based on other meteorological or geographical variables (e.g., solar radiation, rain, relative humidity measurements, among others).

• The model only takes into account the historically observed data of temperature as system input.

• The model takes a combination of both temperature values and other parameters to perform the prediction.
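The three input configurations listed above differ only in which groups of features are assembled into the model input. A minimal sketch, with a hypothetical helper `build_inputs` and made-up variable names and values:

```python
def build_inputs(past_temps=None, exogenous=None):
    """Concatenate the chosen feature groups into one input vector.

    past_temps: list of previous temperature values, or None.
    exogenous: dict mapping variable names to values, or None.
    Exogenous values are ordered by name so the layout is reproducible.
    """
    features = []
    if past_temps:
        features.extend(past_temps)
    if exogenous:
        features.extend(exogenous[name] for name in sorted(exogenous))
    if not features:
        raise ValueError("at least one group of input features is required")
    return features

# Hypothetical values for the three configurations.
other_only = build_inputs(exogenous={"humidity": 0.6, "radiation": 450.0})
temps_only = build_inputs(past_temps=[20.0, 21.0])
combined = build_inputs(past_temps=[20.0, 21.0],
                        exogenous={"humidity": 0.6, "radiation": 450.0})
```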
Likewise, it is important to underline that one of the most intuitive criteria affecting prediction performance is the forecast time horizon (also known as the look-ahead or lead time). The forecast time horizon in temperature prediction, defined as the length of time into the future for which the prediction is performed, is characterized in terms of long-term and short-term estimation. Forecasts n months ahead are designated as long-term forecasts, whereas the short time horizon is defined as a prediction n hours or n days ahead.
An additional factor that has a significant impact on forecast performance is the spatial scale. Due to the well-known aggregation effect, forecasts for geographically diverse stations, which aggregate in a global scale prediction, usually have smaller errors than the forecasts for individual meteorological stations in a regional scale. Local effects, which are more random, are often more difficult to predict in the temperature trends. In this way, this paper reports the air temperature values forecasting, performed at a global and regional scale. Specifically, at a regional level, hourly, daily and monthly predictions have been envisioned, based on the particular applications of the different forecasting systems.

Long-Term Global Temperature Forecasting
Different research papers have reported that the climate will warm over the coming century in response to changes in anthropogenic CO2 emissions [63]. Consequently, science and scientists are increasingly involved in characterizing the impacts of global climate change on decadal [64] or longer time scales [65], in order to inform global policy actions. This variability has been studied in response to the rise in global mean temperature that the Earth has experienced since pre-industrial times. Therefore, this section details the application of ML-based strategies in global temperature forecasting using a variety of meteorological variables.
Miyano and Girosi [20] applied an MLPNN using back-propagation and the generalized delta rule, a stochastic gradient descent algorithm, to predict Global Temperature (GT) variations. They used 45 data points (1861-1909) for training and tested the approach by means of three data-sets (1910-1944, 1910-1964, and 1910-1984), obtaining RMSEs of 0.12 °C, 0.13 °C, and 0.15 °C, respectively.
Knutti et al. [66] propose a neural network based climate model to predict ranges for climate sensitivity. Data used for the estimation process include the observed surface warming over the industrial period and estimates of global ocean heat uptake. The neural network structure implemented includes 10 neurons, Sigmoid and Linear activation functions for the input and the hidden layer, respectively, and the Levenberg-Marquardt algorithm as optimizer strategy. Although the surface warming calculated from the climate model fits well with observations, some features like the almost constant temperatures found in 1940-1970 and the strong warming after 1980 are not well simulated.
Pasini et al. [67] used an ANN for the estimation of GT anomalies based on global physical-chemical forcing and circulation patterns. The GT is estimated as a function of a combination of parameters obtained from natural/anthropogenic forcings and an inter-connected ocean-atmosphere circulation pattern (the El Niño Southern Oscillation, ENSO) since 1866 (see Figure 2). Solar Irradiance (SI) and Stratospheric Optical Depth (SOD) are considered as indices of natural forcings on the climate system, while CO2 concentration and sulfate emissions are characterized as anthropogenic forcings. The MLPNN includes a single hidden layer with few (four or five) hidden neurons. It is trained by means of the generalized Widrow-Hoff rule based on gradient descent and momentum terms, and the activation function is a normalized sigmoid proposed in [68]. The authors explain the physical relationship between inputs and targets by excluding some input-target pairs from the training set. Once the network is trained, they use the excluded pairs as a validation/test set in order to assess the modeling performance on new cases that are unknown to the network. The performance of the GT estimation strategy is presented in terms of a linear correlation coefficient R of 0.877, the highest value obtained among the four proposed input scenarios: Natural, Anthropogenic, Natural + Anthropogenic, and Natural + Anthropogenic + ENSO.
Fildes and Kourentzes [69] presented an empirical evaluation of univariate and multivariate forecasting methods used to predict GT. In particular, they assessed the inclusion of CO2 emissions in a nonlinear multivariate neural network, by means of data obtained from the annualised HadCrut3v (a data-set of land and ocean temperatures) and total carbon emissions from fossil fuels, between 1850 and the forecast origin. The authors developed an ANN model with a single hidden layer and evaluated the suitable number of hidden nodes (between 1 and 30). They identified 11 and 8 nodes to be convenient for the univariate and multivariate ANN, respectively. The nonlinearities were modeled using the hyperbolic tangent activation function and the optimizer implemented was [...]
In contrast to the ANN models, Abubakar et al. [39] proposed an SVM model to forecast the global land-ocean temperature (GLOT). The data analyzed, including rain, pressure, GT, wind speed, and relative humidity, were collected from NASA's GLOT index for the period between 1880 and 2013. The SVM model used a Radial Basis Function kernel, and the optimal values applied for C, ε, γ, and the learning rate η were 68, 0.001829, 65, and 0.06, respectively. Finally, a support vector of 7613 was chosen based on its accuracy. The performance of the model was compared with an MLPNN with one hidden layer and 11 hidden neurons, trained by means of a Levenberg-Marquardt learning algorithm. In the hidden and output layers, they included sigmoid and linear activation functions. Experimental results show MSE and RMSE of 0.004519 and 0.00121 for SVM and 0.08912 and 1.657110 for MLPNN, respectively.
Hassani et al. [70], on the other hand, predict Global Temperatures by means of 12 parametric and non-parametric univariate (only GT) and multivariate (GT and global CO2 emission) models. Among the multi-regressive and the nonparametric spectral estimation algorithms commonly used in time-series forecasting, they analyze the neural network performance using the GT data obtained from the Goddard Institute for Space Studies (GISS) and the CO2 data from the Carbon Dioxide Information Analysis Centre. The ANN implemented for the analysis is a feed-forward neural network with a single hidden layer and one hidden node. The algorithm used for training is rprop+ and the activation function is a sigmoid. The RRMSE obtained in this paper averages 0.67 for 1 to 10 steps ahead, showing higher error values compared with other competing models.
Table 1 shows the results of the ANN-based methods used in GT prediction. In this list, the papers propose different architectures, changing input definitions, structure, and training algorithms to improve the forecasting accuracy. Although a lot of work has been done on regional estimation of temperature based on SVM and ANN, most ML-based GT forecasting strategies are focused on ANN. However, the comparison between SVM and ANN in GT prediction developed by Abubakar et al. [39] shows better performance for the SVM model, which reports the lowest MSE and RMSE values.

Regional Temperature Forecasting
Considering the strong potential impacts on climate in response to the increase of CO2 emissions, global temperature forecasting models (e.g., General Circulation Models) have been proposed in order to find strategies to mitigate the possible environmental and economic damages [78].
However, the resolution of these models is not high enough to characterize climate at a regional scale. In this case, historical measurements from individual meteorological stations have been used to study climate change in specific areas. In this section, research on air temperature forecasting at a regional scale, with different time horizons, is described.

Hourly Temperature Forecasting
Accurate forecasting of Hourly Temperature (HT) has many applications, ranging from electricity load forecasting to crop loss prevention. Inaccurate or missing HT measurements hinder any measure to mitigate the damage caused by extreme temperature events. HT prediction has been studied in different research papers [21,22].
One of the initial ANN-based schemes applied in this field was developed by Hippert et al. [21]. In this work, a hybrid forecasting system combining a simple autoregressive model and an MLPNN was structured to predict hourly temperatures using past observed temperatures, forecasts obtained from the linear model, extreme temperature forecasts provided by the Weather Service, and the hour (codified as a sinusoid in order to stress its cyclical nature). The analyzed data were collected from a weather station in Rio de Janeiro, Brazil, in 1997. In the experiments, AR and ARMA models were tested in conjunction with the MLPNN. For the period February 20-24 and the combinations AR + MLPNN and ARMA + MLPNN, MAPE values were 2.82 and 2.66, respectively. These results are considerably lower than those obtained with only linear models.
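Coding the hour as a sinusoid keeps hour 23 and hour 0 close together in the input space. The sketch below uses a sine and cosine pair so that every hour maps to a distinct point; the pair and the function name `encode_hour` are assumptions for illustration, and the exact coding used by Hippert et al. [21] may differ.

```python
import math

def encode_hour(hour):
    """Map an hour of the day onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)
```

A single sine alone would assign the same value to two different hours of the day (for example 3 and 9), which is why the cosine component is included in this sketch.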
Tasadduq et al. [22] implemented an MLPNN for the estimation of hourly mean values of temperature 24 h in advance. Full year hourly values of temperature are used during the MLPNN training for a coastal location-Jeddah, Saudi Arabia. The MLPNN includes only one input node, associated with the temperature of the previous day at the same hour, and is validated with the data from three different years, excluding the one used for training. The MPD calculated for every experiment is 3.16%, 4.17%, and 2.83%, respectively.
Lanza and Cosme [23] proposed a hybrid strategy for HT prediction based on an RBFNN, initialized by means of a Regression Tree. In this approach, each terminal node of the tree is connected to one hidden unit of the RBFNN. The system inputs are the coded current hour and the current temperature, which are used to predict the next HT. Data used during the validation process were recorded in the Great Energy Predictor Shootout II in Texas during the period 20 May to 20 August. The proposed model is compared with a linear AutoRegressive with eXogenous inputs (ARX) model, showing a better performance with an MAE equal to 0.4466 °C in contrast to 0.5247 °C. It is important to note that an MAE below or around 0.5 °C is considered a good target, at least for load prediction.
Abdel-Aal [3], on the other hand, estimates next-hour and next-day HT by training an Abductive Artificial Neural Network (AANN) on 5 years of data (1 January 1985-12 October 1989) and validating on data for the 6th year (1990). The data-set used includes the measured HT data from the Puget power utility in Seattle. For the next-day hourly forecasting model, 24 models, one for each hour of the day, were implemented to estimate the following day HT in one step. Every model has the same set of inputs: 24 hourly temperatures on (d − 1)-day (T1, T2, ..., T24), the measured minimum (Tmin) and maximum (Tmax) temperatures on (d − 1)-day, and the estimated minimum (ETmin) and maximum (ETmax) temperatures for d-day. In the same way, for the next-hour HT estimation, 24 models were implemented based on the full HT data on (d − 1)-day (T1, T2, ..., T24). In addition, every available HT on d-day up to the preceding hour (NT1, NT2, ..., NT(h−1)) is used together with the measured minimum (Tmin) and maximum (Tmax) values for (d − 1)-day and the estimated minimum (ETmin) and maximum (ETmax) temperatures for d-day. Next-hour and next-day hourly models obtained an overall MAE of 1.68 and 1.05 °F, respectively. These results were compared with those of an MLPNN with a 28-6-1 node configuration and a sigmoid transfer function, which showed inferior performance in contrast to the abductive model.
Maqsood et al. [24] used an ensemble of MLPNN, RBFN, Elman Recurrent Neural Network (ERNN), and Hopfield model (HFM), obtained by means of a constructive algorithm, to predict the 24-h-ahead weather parameters for winter, spring, summer, and fall seasons. The input and output parameters used for this analysis were related to HT, Wind Speed, and Relative Humidity values, collected at the Regina Airport by the Meteorological Department in Canada in 2001. The performance of this approach was contrasted with that of every strategy separately, and results showed that the ensemble of neural networks produced the most accurate forecasts. The proposed strategy can be easily implemented to address HT forecasting applications without increasing the computational complexity.
The research reported by Smith et al. [25] included the evaluation of 30 MLPNN models to forecast HT values up to 12 hours ahead. The input data comprise five weather variables: HT, Relative Humidity, Wind Speed, Solar Radiation, and rainfall, acquired from stations located in the southern and central growing regions of Georgia. The MLPNN architectures analyzed in this work are Ward-style networks, which have multiple node types and activation functions. These models had a linear input layer, three equally-sized hidden slabs, and a single logistic output node, which represents the HT at some prediction horizon. In this case, they carried out an analysis based on the training set sizes, obtaining six models (instantiated by 30 networks) with different training patterns. The most accurate network was trained over 50,000 samples and obtained an MAE of 1.51 °C for a 4-h model. In addition, they performed a comparison for the same model with and without seasonal input terms. The most accurate model was found to be the one with seasonal inputs. Based on the same architecture, the authors proposed an automated year-round temperature prediction [26] using training sets of 1.25 million patterns. In this case, they also evaluated the effect on accuracy of adding rainfall input terms, concluding that these additional inputs did not increase the prediction accuracy. The MAE for the year-round forecasting system varied from 0.516 °C for the 1-h horizon to 1.873 °C for the 12-h horizon. Recently, Jallal et al. [27] proposed an autoregressive MLPNN-based model with a delayed exogenous input sequence that uses the global solar radiation to predict the air temperature on a half-hour scale. The analyzed dataset contains the measurements for the year 2014 from the weather station installed in the Agdal garden, Marrakesh, Morocco, and the model reports an MSE value of 0.272.
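The input sets described for the two abductive models can be assembled as in the sketch below. It is only an illustration of the variable lists given above; the function names, argument names, and ordering of the inputs are assumptions and are not taken from [3].

```python
def next_day_inputs(prev_day_hourly, t_min, t_max, et_min, et_max):
    """Inputs for a next-day hourly model (hypothetical helper).

    prev_day_hourly: 24 hourly temperatures of day d-1 (T1..T24)
    t_min, t_max:    measured extremes of day d-1
    et_min, et_max:  estimated extremes for day d
    """
    if len(prev_day_hourly) != 24:
        raise ValueError("expected 24 hourly temperatures for day d-1")
    return list(prev_day_hourly) + [t_min, t_max, et_min, et_max]


def next_hour_inputs(prev_day_hourly, today_so_far,
                     t_min, t_max, et_min, et_max):
    """Inputs for a next-hour model at hour h (hypothetical helper).

    today_so_far holds the h-1 temperatures already observed on day d
    (NT1..NT(h-1)); the remaining arguments are as in next_day_inputs.
    """
    if len(prev_day_hourly) != 24:
        raise ValueError("expected 24 hourly temperatures for day d-1")
    return (list(prev_day_hourly) + list(today_so_far)
            + [t_min, t_max, et_min, et_max])
```

Under these assumptions the next-day vector has 24 + 4 = 28 entries, which agrees with the 28 input nodes of the MLPNN used for comparison.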
In contrast, SVM regression was introduced in HT prediction by Chevalier et al. [33] in 2011. In this study, identical inputs and subsets of the historical data described in [26] were included in the analysis. For the SVM regression algorithm, the penalty factor C was set to 25 and the kernel used during the experiments was a radial basis function. In this study, the kernel was arbitrarily selected because it has been shown to be a good general purpose kernel [79]. Results showed that, for a reduced training set with 300,000 patterns, the SVM strategy was slightly more accurate than the MLPNN-based method. However, the MLPNN model predicted more accurately when the number of training patterns increased to 1.25 million (see Table 2). In the same line of thought, Ortiz-García et al. [34] present an HT prediction system (up to 6 h ahead) based on SVM regression banks, constructed using synoptic information of the data by means of the Hess-Brezowsky classification (HBC) algorithm. For this study, seven meteorological variables were acquired from the Barcelona-El Prat International Airport automatic station (1 January 2009 to 31 December 2009), on a mean hourly scale. The authors grouped the SVM bank in terms of four synoptic variables, which characterize the atmospheric flow and weather patterns: three main groups of circulation types (zonal, mixed, and meridional), and one group, called the transition situation, to cover unclassified situations. Then, the samples are divided and different SVMs are trained for each group. The next value is predicted by checking the current synoptic situation and then applying the suitable SVMs. The authors show that this solution performs better than an alternative prediction method based on the Extreme Learning Machine (ELM) algorithm.
Mellit et al. [35] proposed Although most of the Deep Learning applications have been focused on classification problems, some research has successfully applied this approach in solving prediction problems. Recently, Hossain et al. [80] applied Stacked Denoising Auto-Encoders (SDAE) to predict HT based on the prior 24 h of HT meteorological data in northwestern Nevada. The results show a significant improvement in the HT prediction domain, as the SDAE achieves 97.94% accuracy compared to a simple ANN, which achieves 94.92% accuracy. In addition, Hewage et al. [40] proposed a temporal modeling approach to perform the prediction based on convolutional and Long Short Term Memory (LSTM) recurrent neural networks. The validation is carried out with weather parameters obtained from GRIB data using the weather research and forecasting model. In particular, the surface temperature data from January 2018 to May 2018 and from June 2018 are used for training and testing, respectively. A lower MSE is obtained for the LSTM network in comparison with the convolutional ANN-based approach. Table 2 shows the results of ML methods used in hourly temperature forecasting. In this summary, it can be seen that research involving ANN and SVM gives similar results in terms of prediction, but it can be deduced that SVM approaches are easier to use than ANN, considering the number of parameters to adjust. In addition, the optimization process for SVM can be automated, while tuning an ANN for its best performance is more complex. However, although only a few research papers have been developed using Deep Learning strategies, the latest advances have considerably improved the accuracy rates in this particular application.
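For reference, the radial basis function kernel mentioned for the SVM regression above can be written in a few lines. This is a generic sketch of the standard kernel, not the implementation used in [33]; the width parameter `gamma` and its default value are assumptions, since the source reports only the penalty factor C, which acts in the training objective and does not appear in the kernel itself.

```python
import math


def rbf_kernel(x, y, gamma=0.1):
    """Standard RBF kernel: exp(-gamma * ||x - y||^2).

    gamma is a hypothetical default chosen for illustration; it is
    not a value reported in the reviewed papers.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The kernel equals 1 for identical input vectors and decays toward 0 as the inputs move apart, so training patterns with similar weather conditions have the largest influence on a prediction.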

Daily Temperature Forecasting
In particular, Daily Temperature (DT) forecasting is a relevant issue in the energy field, since this specific variable can be used for load forecasting [81] or to estimate solar radiation [82], which is an important factor for photovoltaic farms and devices. In this case, when the predicted loads are not accurate, the power market participants are forced to buy higher-priced electricity or to sell lower-priced electricity [36]. In that context, short-term load forecasting is an important topic for the power system risk management. In literature, a relevant amount of research has addressed the study of DT forecasting by means of ML strategies. In this sense, Pal et al. [28] proposed to use a Self-Organizing Feature Map (SOFM) to find clusters in the data, and, based on these results, the training of an MLPNN for each cluster was carried out. The authors collected nine meteorological variables from the Regional Meteorological Centre in Calcutta, India, for the period 1983-1995. In this case, input features vector contains the information of the previous three days for the daily temperature prediction. Finally, a comparison with a single RBFNN and MLPNN was developed, showing that the proposed hybrid SOFM-MLP network consistently performs better than conventional networks.
Likewise, Maqsood and Abraham [29] presented a comparative analysis of different ANN architectures (MLPNN, RBFN, and ERNN) and a proposed ensemble of these models. These strategies are trained and tested using daily weather data of temperature, wind speed, and relative humidity in southern Saskatchewan, Canada for the year 2001. According to the authors, the proposed ensemble approach produced the most accurate forecast, while the MLPNN was the architecture that obtained relatively less accurate results during the temperature forecasting. A similar analysis was proposed by Ustaoglu et al. [30] to forecast daily mean, maximum, and minimum temperature in Turkey. In this study, the authors implemented three different ANN-based strategies: MLPNN, RBFNN, and a Generalized Regression Neural Network (GRNN). For most of the experiments involved in this work, RBFNN performances were quite satisfactory, providing close estimates compared with GRNN and MLPNN.
In the same line of research, Hayati and Mohebi [31] proposed an alternative configuration for the MLPNN architecture to predict the one-day-ahead temperature for Kermanshah city, west of Iran. In this study, a three-layer MLPNN with six hidden neurons was trained and tested using ten years (1996-2006) of meteorological measurements. Based on the fact that Back Propagation training algorithms are generally quite slow for practical problems, they improved the convergence times by implementing the scaled conjugate gradient.
The same architecture was proposed by Dombaycı and Gölcü [7] to predict mean ambient temperatures in Denizli, southwestern Turkey in the period 2003-2006. Final configuration differences with the previous work lie in the optimization algorithm used for the implementation and the inputs selected for the forecasting system. In order to define the optimal parameters associated with the MLPNN architecture, Abhishek et al. [32] developed a performance analysis of the maximum DT forecasting system while varying the number of hidden layers, neurons, and transfer functions. The data analyzed in this work were collected from the station Toronto Lester in Ontario, Canada from period 1999-2009. Experimental results showed the best performance for a configuration defined by a 5 hidden-layer network with 10 or 16 neurons and a tan-sigmoid transfer function. An alternative Elman ANN approach was proposed by Afzali et al. [83] to predict mean, minimum, and maximum temperature during the years 1961-2004 in Kerman city, located in the south east of Iran. The one-day and one-month ahead air temperature is predicted slightly more precisely with this approach compared to the traditional MLPNN. Furthermore, Husaini et al. [84] proposed a Recurrent Higher Order Neural Network (RHONN) called Jordan Pi-Sigma Network (JPSN) to predict next-day temperature using measurements of five years (2005-2009) from the Malaysian Meteorological Department. More accurate results are found using this strategy, in comparison with classical MLPNNs.
In addition, a combination of classical MLPNN with the Wavelet Neural Network (WNN) has been presented for DT forecasting by Rastogi et al. [85] and Sharma and Agarwal [86]. In the research developed by Rastogi et al. [85], the input is associated exclusively with DT values, while Sharma and Agarwal [86] considered the cloud density as well. Both works analyzed the data obtained in Taipei during the years 1995-1996. Experimental results reported MAE values in the range of 0.7-0.9 and 0.25-0.62 in June, July, August, and September, for [85] and [86], respectively. These values represent better results in comparison to different time-variant fuzzy time series models. Mori and Kanaoka [36], on the other hand, introduced SVM regression to predict daily maximum air temperature. The proposed method was applied to nine input variables for real data acquired from AMEDAS (Automated Meteorological Data Acquisition System of the Japan Meteorological Agency) in Tokyo, for summer time from 1999 to 2001.
The authors showed that, by using the SVM-based approach, the average error of 1-day ahead maximum air temperature is reduced by 0.8% and 0.1% in comparison with an MLPNN and an RBFNN. However, this conclusion was drawn with models that were trained on a relatively small data-set containing 366 patterns and validated with 122 patterns. In a similar way, Radhika and Shashi [37] proposed an SVM to predict the maximum DT based on the daily maximum temperatures for a span of previous n days (2 to 10), measured by the University of Cambridge for the period from 2003 to 2008. Results were compared with an MLPNN, showing that, based on a proper selection of configuration parameters, SVM performs better than classical approximations of ANN. An analogous proposal has been put forward by Paniagua-Tineo et al. [38], which employed an SVM approach to model and predict maximum DT in several European countries. Weather related features, in this case, included a 10-year period of data for temperature, precipitation, relative humidity, and air pressure, specifically synoptic situation of the day and monthly cycle. The authors showed that this approach performed well when compared with MLPNNs. In this line, Wang et al. [87] improved the SVM-based temperature prediction model through the implementation of a heuristic global optimization method called Particle Swarm Optimization (PSO). The resulting PSVM approach was validated on daily minimum temperature values from 2005 to 2009 in Beijing. The experimental results showed that the proposed strategy performs better than some other SVM model such as Generalized Support Vector Machine (GSVM) and basic SVM using a considerably small sample size.
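Several of the daily models above take the temperatures of the previous n days as inputs. A minimal sketch of how such input/target pairs can be built from a single series is given below; the function name `make_lagged_samples` is hypothetical and the sketch is not tied to any particular paper.

```python
def make_lagged_samples(series, n_lags):
    """Build (inputs, targets) pairs from a time-ordered series.

    Each input holds the n_lags values preceding its target, so the
    target is the value one step ahead of the window.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(list(series[t - n_lags:t]))
        targets.append(series[t])
    return inputs, targets
```

For example, a series of five daily maxima with `n_lags=3` yields two training patterns; varying `n_lags` from 2 to 10 reproduces the range of window lengths examined in [37].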
In order to enhance the performance of the SVM models for this particular task, some previous DT values have been included in the prediction system. However, taking into account several weather variables for some locations and for several days generates a large feature vector, which makes it necessary to establish a feature selection strategy to decrease the model complexity. In this way, Karevan et al. [88] presented a combination of k-Nearest Neighbor and Elastic Net (EN) to reduce the number of features. This study carries out the minimum and maximum temperature forecasting from one to up to six days ahead for Brussels, considering data from 70 stations, most of which are located in North America, Europe, and East Asia, during a period from the beginning of 2007 until mid 2014. Results are compared with an LS-SVM algorithm to show the accuracy improvement of the proposed approach.
In more recent research, Karevan and Suykens [89] take into account the spatio-temporal properties of the same data-set to carry out the feature selection, by means of an algorithm called Least Absolute Shrinkage and Selection Operator (LASSO). A similar analysis to that described above was developed in this work for one to up to three days ahead in DT prediction for Brussels, based on meteorological data from 10 cities. The experimental results show that Spatio-Temporal LASSO improves, in most cases, the performance in comparison with the LS-SVM approach. However, results are not compared with the strategy proposed in [88].
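As a reminder of why LASSO acts as a feature selector, the standard LASSO estimate can be written as below. This is the generic formulation only; the spatio-temporal variant used in [89] may add further structure. The symbols are assumptions introduced here: N training samples, p candidate features, inputs x_i, targets y_i, coefficients β, and a regularization weight λ.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}} \;
  \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^{\top} \beta \right)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty drives some coefficients exactly to zero as λ grows, and the features with non-zero coefficients are the ones retained.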
A few research papers focused on Deep Learning have been developed in this field. Recently, Roesch and Günther [41] presented a Recurrent Convolutional Neural Network (RCNN), trained and tested on 25 years of climate data, to forecast meteorological attributes, such as temperature, pressure, and wind velocity. The authors used the ERA-Interim re-analysis of the European Centre for Medium-Range Weather Forecast (ECMWF) to get the data for training and evaluation.
In particular, around Zurich (Switzerland), they extracted a time series in a 7 × 7 grid, based on spatial features. The application developed in this work allowed for overviewing annual, monthly, and daily patterns associated with the time series. Based on the previously described research, Table 3 summarizes the ML methods used in daily temperature forecasting.

Monthly Temperature Forecasting
Climate change impact assessment requires a data analysis based on the temporal resolution at which impacts occur [90]. In this way, the evaluation of the current status and the future integrity of diverse environmental features (fauna and flora), required to assess the climate change, involve the construction of monthly and annual mean temperature models.
For this purpose, Bilgili and Sahin [91] predicted long-term monthly air temperature using an MLPNN in Turkey. Inputs in this model were associated with geographical variables (latitude, longitude, and altitude) from 76 measuring stations and time. During the validation, the values determined by the ANN model were compared with the actual data, obtaining a minimum MAE of 0.508 °C. These geographical inputs were also analyzed by Kisi and Shiri [92] to predict long-term monthly air temperature in Iran. In the study, they evaluated the performance of a classical ANN and an Adaptive Neuro-Fuzzy Inference System (ANFIS) model, which is a combination of an adaptive ANN and a Fuzzy Inference System (FIS). Through the evaluation process, they illustrated that the ANN strategy performed better than ANFIS in the test period based on the values of RMSE, MAE, and other coefficient statistics. In the same way, De and Debnath [93] implemented an MLPNN to predict the mean monthly surface temperature in the monsoon months (June, July, and August) over India. In this case, three models were developed, one for each monsoon month, for both maximum and minimum temperature for the period 1901-2003. In the majority of the cases, prediction error was below 5%.
In the same line, Ashrafi et al. [90] used the MLPNN approach to predict mean temperature values in Iran. However, in this case, input values were associated with the mean temperature, dew point temperature, relative humidity, wind speed, solar radiation, cloudiness, rainfall, station level pressure, and greenhouse gases of nine different climatic regions. In order to predict monthly mean temperature, the system analyzed the data recorded one month, six months, 12 months, and 24 months before. In addition, the authors implemented three optimization methods: back-propagation (BP), Genetic Algorithm (GA), and combined GA-Particle Swarm Optimization (PSO), with BP showing the best performance. Research developed by Afzali et al. [83], described in the previous section, addressed the monthly temperature prediction as well. In this case, an Elman ANN was proposed as a suitable solution, in comparison with the MLPNN.
On the other hand, Liu et al. [94] introduced the application of Wavelet coefficients (WT), based on SVM, to predict the monthly air temperature in Tangshan. During the experiments, the authors analyzed the monthly temperature data from 1960 to 2010, indicating that the accuracy obtained by means of an SVM method based on wavelet transform is significantly higher than that of plain SVM and MLPNN-based models. In this context, Salcedo-Sanz et al. [95] examined the performance of SVM and MLPNN in the problem of monthly mean air temperature prediction in Australia and New Zealand. In this work, the authors analyzed data from a total of eight stations in Australia, three urban stations (1900 to 2010) and five rural stations (1910 to 2010), and two other stations in New Zealand (1930 to 2010). A performance comparison with MLPNN was carried out to show the accuracy improvement of using SVM. A similar study was presented more recently by Papacharalampous et al. [96]. In this work, the authors evaluated SVM and MLPNN techniques to forecast mean monthly temperature observed in Greece. During the evaluation, the authors assessed the one and twelve-step ahead forecasting performance of the algorithms. Based on the findings, they suggest that neural network algorithms can produce forecasts of widely varying quality for a particular individual case, in comparison with the SVM algorithm. This is evidenced by the RMSE values, which range from 0.63 °C to 6.05 °C for the MLPNN case and from 0.73 °C to 2.30 °C for the SVM approach.

Discussion and Research Gaps Identification
The comparative evaluations developed in the papers reported in this work show different factors that affect the ML strategies performance. Among them, the input features, the optimization algorithms, the configuration parameters, and the corresponding evaluation measures are of the utmost importance. Air temperature forecasting systems have used meteorological and geographical variables as input parameters.
These include maximum, minimum, and average temperature, precipitation, pressure, Mean Sea Level, Wind Speed and Direction, Relative Humidity, Sunshine, Evaporation, Daylight, Time (Hour, day or Month), Solar Radiation, cloudiness, CO2 emissions, latitude, longitude, and altitude. However, the maximum, minimum, and mean values of temperature are found to be the common parameters across all the research. In fact, a considerable number of works use only these features as model inputs.
Taking into account that prediction accuracy is strongly dependent on the time period, the time horizon, the location of the weather stations analyzed during the validation, and other criteria, it is difficult to draw conclusions about the quality of the estimations based only on the accuracy metrics (RMSE, MSE, MAE, etc.). Therefore, in order to compare the accuracy of different prediction systems, it is better to use a common data-set in the validation stage. In this context, Tables 1-4 show, where the papers reported such evaluations, the comparative results between SVM and ANN-based strategies for the same data-set.
Most of the research developed in this area (monthly, daily, and hourly) is focused on ANN strategies (57%), compared with the other widely used strategy, SVM (43%). However, when SVM and ANN were compared, SVM reported, in most cases, a better performance than classical ANN-based strategies.
Diverse ANN models (i.e., MLPNN, RBFNN, ERNN, GRNN, JPSN, RCNN, SDAE) have been proposed for air temperature forecasting, with the MLPNN and the RBFNN being the most used architectures among the ANN-based approaches. Levenberg-Marquardt and Gradient Descent are the most used optimization algorithms, with Levenberg-Marquardt showing a better performance due to its learning rate and the smaller prediction errors. Likewise, the most used combination of activation functions reported is the Hyperbolic Tangent or the Sigmoid for the hidden layer and the Pure Linear for the output layer. For SVM-based approaches, Radial Basis Function kernels are the most implemented. In addition, a considerable number of works use Grid Search or Cross-validation as a strategy to set the hyper-parameters involved.
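The accuracy metrics named above follow their usual definitions, sketched here for reference. The function names are illustrative; individual papers may normalize or report these quantities differently.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Square Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Undefined when an actual value is zero, and its value depends on
    the temperature unit (Celsius, Fahrenheit, or Kelvin).
    """
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)
```

MAE, MSE, and RMSE are expressed in the unit of the data (or its square), so values reported in °C and °F are not directly comparable, which is one more reason to compare methods on a common data-set.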
Issues related to time series modeling are addressed in these research works during the corresponding algorithm's implementation. As a particular case, the data-set size is limited by the amount of measurements acquired for the analysis, unless underlying physical models or alternative simulations systems are used for data generation [40]. As such, DL-based approaches, for instance, require the acquisition of long time series or complementary simulation systems to generate enough samples to perform the training-validation process. In addition, in the research works reviewed in this paper, authors have analyzed one, two, three, or more years of temperature data as available to build ML-based models, and the training and testing data sets have a size of minimum three years and one year, respectively, in order to predict air temperature accurately.
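Because the samples form a time series, the training and testing sets mentioned above are normally separated chronologically rather than at random. The sketch below shows one simple way to do this by calendar year; the function name `split_by_year` and the `(year, value)` record layout are assumptions made for illustration.

```python
def split_by_year(records, first_test_year):
    """Split time-ordered (year, value) records chronologically.

    Records before first_test_year go to the training set and the
    rest to the test set, so no future data leaks into training.
    """
    train = [r for r in records if r[0] < first_test_year]
    test = [r for r in records if r[0] >= first_test_year]
    return train, test
```

For instance, with data from 2001 to 2004 and `first_test_year=2004`, the split gives three years for training and one for testing, matching the minimum sizes noted above.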
Based on the published literature, many parameters affect the forecasting, so it could be problematic to adopt the parameter settings reported in other research directly. The reported parameters give an idea of the methodology developed, but, in order to draw reliable conclusions, they should be reassessed for a data-set obtained from a new location.
In this review, some of the proposed approaches also use variations or combinations of strategies, as it can be seen in Tables 1-4. Based on the evaluation results, the ensemble of strategies or the significant variations offers a better accuracy than single classical algorithms but again the best combined strategy is difficult to define, due to the data-set changes. A considerable amount of work is required in order to determine the best ANN or SVM based methodology among those available, or the possible equivalence. In any case, this task is very difficult based only on the limited cases reported in the literature. Considering these preliminaries, the research gaps that can be identified in this review, to continue with the research in this field, are summarized as:

• Most of the research presented in this review is focused on the local analysis of the air temperature. However, there is no extensive study of temperature anomaly prediction at a global level by means of these ML-based approaches. Taking into account the robust data currently available on diverse web sites, different ML strategies and input features could be used to accurately predict temperature anomalies at the global level.
• Research reported at the regional level has not deeply analyzed how the temperature estimation depends on the temperature values of the surrounding area. A study oriented to analyze the impact of using temperature values of surrounding stations as inputs, based on the distance between stations, could be of particular interest.
• A large number of the works described in this review do not include a time horizon analysis. The lack of these results makes it difficult to judge the accuracy of the method proposed. Likewise, a set of evaluation measures must be calculated in order to facilitate the comparison with other methods which may use the same data-set.
• Taking into account that accuracy results strongly depend on the data-set analyzed, a comprehensive study of the influence of the data-set size for training and testing should be done to offer a fairer comparison between strategies.
• A comparative analysis with all the available ANN-based techniques (MLPNN, RBFNN, ERNN, GRNN, JPSN, RCNN, and SDAE) and SVM variations (LS-SVM, PSVM, WT+SVM) should be carried out in order to determine the best strategy and algorithms to forecast air temperature for different time horizons. In this sense, as has been done in other areas, a competition using a complete standard data-set could help in this objective.
• The effect of each variable used in the prediction, such as maximum, minimum, and average temperature, precipitation, pressure, Mean Sea Level, Wind Speed and Direction, Relative Humidity, Sunshine, Evaporation, Daylight, Time (Hour, day or Month), Solar Radiation, geographical variables (latitude, longitude, and altitude), cloudiness, and CO2 emissions, needs to be analyzed in order to increase the temperature prediction accuracy.
• A further study about the feature selection, based on their relevance, should be performed. Different strategies, such as Automatic Relevance Determination, closely-related sparse Bayesian learning, or Niching genetic algorithm have not been taken into account.
• Recently, Deep Learning strategies have shown a great performance for classification tasks [97]. However, only a few studies have shown, with promising results, that prediction could be accurately done by means of these techniques. Further analysis should be developed in this area.
• For the evaluation of RNN, the size of the time series required to accurately predict a single value of temperature should be studied. Likewise, a comprehensive study about the structure of the recurrent unit should be included.
• In-depth analysis using statistical significance tests is required in order to assess the forecasting model's performance in terms of its ability to generate both unbiased and accurate forecasts. In these cases, the respective accuracy is evaluated by using both error magnitude and directional change error criteria.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.