Basic Statistical Estimation Outperforms Machine Learning in Monthly Prediction of Seasonal Climatic Parameters

: Machine learning (ML) has been utilized to predict climatic parameters, and many successes have been reported in the literature. In this paper, we scrutinize the effectiveness of ﬁve widely used ML algorithms in the monthly prediction of seasonal climatic parameters using monthly image data. Speciﬁcally, we quantify the predictive performance of these algorithms applied to ﬁve climatic parameters using various combinations of features. We compare the predictive accuracy of the resulting trained ML models to that of basic statistical estimators that are computed directly from the training data. Our results show that ML never signiﬁcantly outperforms the statistical baseline, and underperforms for most feature sets. Unlike previous similar studies, we provide error bars for the relative performance of different predictors based on jackknife estimates applied to differences in predictive error magnitudes. We also show that the practice of shufﬂing data sequences which was employed in some previous references leads to data leakage, resulting in over-estimated performance. Ultimately, the paper demonstrates the importance of using well-grounded statistical techniques when producing and analyzing the results of ML predictive models. (e) calculate error bars on the relative prediction accuracies of different methods using jackknife estimation applied to pairwise differences between prediction errors; and (f) demonstrating the effect of data leakage on the reported performance.


Introduction
Recent advances in computing have shifted the focus of scientific communities from a data-scarce to a data-rich research environment [1]. This paradigm shift, known as the fourth paradigm of science, and often referred to as the era of "big data" [2], has emerged from the move of big data and AI into our daily lives and the pervasiveness of these two technologies, which are (i) leading to an explosion in innovation, competition, and productivity [3], (ii) causing a dramatic shift to data-driven research [4], and (iii) unleashing the benefits of data-intensive applications.
Climate science is a research field where data-driven models based on machine learning (ML) have become popular [5]. A major focus of climate science is the understanding and prediction of climate parameters such as rainfall and temperature [6] and many others. For many practical climate-influenced decisions where prediction times of months to a decade are likely to be the most important [7], providing accurate models to predict climatic parameters on these time scales is critical. The remarkable successes of ML and deep learning in a variety of fields such as computer vision and natural language processing suggests that this success may be extended to climate science as well.
However, there is a concern regarding how effective and legitimate these ML models are to address real world applications in climate science. This is reason for enthusiasm, but also for skepticism, as it is all too common to make excessive claims for new techniques, which turn out not to live up to their initial promise, as exemplified by Gartner's hype cycle model [8]. There are already several examples in the literature that show that ML does not always live up to its hype. A recent overview study reviewed several papers that used recurrent neural networks for top-n recommendation tasks, and found that a simple model using K-nearest neighbors outperformed most of the more sophisticated models [9]. One major deficiency identified by the study was the use of defective or weak baselines when quantifying the performance of newer proposed models. Other papers that also reached the conclusion that sophisticated ML models do not necessarily outperform simpler models include [10][11][12][13].
One key feature of ML methods is that they make no assumptions about the underlying distribution of inputs. This can be both an advantage and a disadvantage. The advantage is that ML methods can be applied to a wide variety of datasets without having detailed knowledge of the statistics of the individual datasets. The disadvantage is that ML may miss important characteristics of particular datasets. For this reason, if the user has some knowledge of the dataset's distribution, it is important to compare ML predictors with statistical estimates based on the presumed distribution. Such statistical estimates have the advantage that they are simple to calculate, require no training, and are easy to interpret [14].
One deep flaw in most papers in the literature is that accuracy estimates for ML methods are given (such as R 2 or root mean squared error) without providing error bars on these estimates. Hence, it is impossible to tell whether or not differences between methods are statistically significant. This may be one reason why different investigators often reach different conclusions about the relative effectiveness of different ML methods. For example, Armstrong et al. [15] concluded from an analysis in the context of ad hoc retrieval tasks that numerous published papers report mutually contradictory conclusions concerning ML model performance.
Another concern is that some common pre-processing practices produce data leakage, so that ML algorithm accuracies are over-reported. Some examples of such practices are: data shuffling, whereby researchers randomly shuffle the data [11,[16][17][18][19][20][21][22][23]; data imputation methods that use statistics (such as averaging) calculated on the entire data set, including both training and testing [24][25][26]; and data transformations such as de-seasonalization that also use statistics calculated on the whole dataset [27,28].
It is necessary to investigate the robustness of ML models in different fields of application. The current study is aimed at investigating the above mentioned deficiencies in the area of climatic seasonal parameters. This paper is organized as follows: Section 3 describes the data used; Section 4 discusses the methodology used for climatic parameter prediction; Section 5 shows the results obtained; Section 6 discusses the results; and Section 7 furnishes the conclusions.

Literature Review and Scope of the Research
ML is widely used in climatology to construct predictive models based on sequential data [11]. A variety of types of input data are used, including satellite images or periodic samples from gauges or weather stations.
The studies in the literature can be largely divided into two categories in terms of the predicted output: those that predict one or more entire images which provide a visual representation of a given predicted climate parameter on spatial maps of a specific geographical area under review ("whole-image prediction"); and those that predict only a single output representing a given predicted climate parameter at a fixed location ("singleoutput prediction").
From a practical point of view, usually the most important policy decisions involving climate require monthly predictions [7]. Relatively few studies exist which use image data to make monthly predictions [57,58]. When time scales on the order of months or longer are involved, datasets are typically much smaller than those involving shorter time scales. A broad range of ML methods are applied, from simple methods like multilinear regression (MLR) up to advanced neural networks models [13,[16][17][18]20,21,24,25,46,47,49,[59][60][61][62].
Because of the small data sets used, researchers often perform feature selection/reduction to avoid overfitting. Most often, the selected features in the literature are combinations of features derived from previous time steps in the data, for example, a parameter at month n may be predicted based one or more parameter values taken from months previous to n [ [25][26][27]46,63].
Because of the rotation of the earth around the sun, monthly time series data like rainfall exhibit a seasonal behavior on a yearly basis (exhibit a yearly periodicity) [64,65]. This is critical to address because traditional time series models tends to rely on the time series being stationary [64,66]. Hence, the authors in [64] saw it as necessary to remove the periodicity in a monthly time series data. They described three ways of going about this: (a) previous lag differencing, (b) seasonal referencing; and (c) monthly mean subtraction, where (c) was identified as the most suitable method for monthly time series data. However, we found that many papers dealing with monthly prediction of climate parameters did not transform the input data to remove seasonality. Some papers accommodate seasonality by including data from month n − 12 to predict parameters at month n [13,17,19,20,24,25,27,28,44,49,51,58,60,62,67]. Month n's time stamp (defined as n mod 12) was used as a feature in [19,49], but is not common in the literature.
In a few papers, the authors subtracted the monthly mean averages computed from the whole data set [25,28], with the inclusion of data from month n − 12. This procedure disrupts the integrity of the data by causing data leakage, whereby information from the testing set is introduced into the training set. Other papers make no attempt to account for seasonality [18,22,23,46,48,61,68,69]. Evidently, there is no consistent procedure for dealing with the seasonality aspect of the data; this is one point that we address in this paper.
In the previous section we emphasized the importance of using simple baselines to provide benchmarks to compare with more complicated methods. According to [66], the simplest baseline for predicting time series is to use the previous lag. For short-term image data, the previous image is used as a naive predictor for the next image [36,37,40,70]. As for monthly data, using previous lags as a baseline is not a common practice. Instead, a variety of baselines are used. Some papers use MLR based on previous lags [13,16,17,25], while the authors in [45] used same-month averages. Some papers do not use simple baselines, but rather compare several variations or architectures of more advanced ML methods such as SVR or MLP [26,27,46,63]. In summary, simple baselines are not consistently used in the literature.
The Objectives of the paper are as follows: (a) perform seasonal grid prediction on multiple climatic parameters; (b) investigate multiple untrained baselines, and in particular using a statistical estimator derived from a simple statistical model of the image pixel distributions; (c) analyze the effectiveness of subtracting the seasonality using the monthly average calculated only on the training data; (d) investigate the common feature sets used in the literature; (e) calculate error bars on the relative prediction accuracies of different methods using jackknife estimation applied to pairwise differences between prediction errors; and (f) demonstrating the effect of data leakage on the reported performance.
Our results show that across all climatic parameters studied, a very limited feature set (time stamp with spatial information) without seasonal subtraction outperforms feature sets that use previous lags, with or without seasonal subtraction. Furthermore, an untrained baseline based on a simple statistical model can out perform more sophisticated ML tools. Furthermore, handling data inappropriately so that data leakage occurs (as has been done in some previous papers) can lead to significant overestimation of predictive performance.

Data and Area of Interest
The climatic data were obtained from the NASA GESDISC data archive, which is accessible to users registered with NASA Earthdata [71]. The dataset used is obtained from the Famine Early Warning Systems Network (FEWS NET) Land Data Assimilation System (FLDAS). FLDAS contains monthly image data for 28 fields such as rainfall flux, evaporation, and temperature [72] with a spatial resolution of 0.1 • × 0.1 • . The data are archived in netCDF format, where it can be manipulated and displayed using freely available software packages within python and R. NASA also supplies a cross-platform application called Panoply that can be used to plot the data [71].
The downloaded data set for each parameter used contains 228 satellite frames on a monthly basis, between January 1982 and December 2000. Images depict the entire globe at a resolution of 1500 × 3600 pixels. Figure 1 shows a sample image of rainfall. In general, the images are color coded to provide information about the relevant parameter. In the current study, the climatic parameters used are rainfall, evaporation, humidity, temperature, and wind speed.  [72]. Color scale indicates normalized rainfall intensities.
To limit the computational load, we focused our prediction on Madagascar. Madagascar is the world's fourth largest island with an area of about 592,000 km 2 [73], and is separated from Mozambique on the main African continent by about 400 km [73]. The climate on the island is subtropical and is characterized by a dry season from May to October and rainy season from November to April [74,75]. Table 1 summarizes the characteristics of the Madagascar image data used in our study, which was extracted from the original FLDAS data. Madagascar is currently facing several challenges due to the potential impact of climate change on the agricultural sector, which can threaten food security [76][77][78][79], especially since farmers in the country are estimated to be 70% of the population [74]. Example images of the five climatic parameters used at a specific arbitrary timestamp are provided in Figure 2. The figure shows normalized values of five climatic parameters, namely rainfall, evaporation, humidity, temperature, and wind.  Figure 3 shows a flowchart of the system created and used to make predictions in this research. The end goal of the system is to predict monthly rainfall, evaporation, humidity, temperature, and wind speed images on a pixel level, using a sequence of previous images as an input. The rest of this section describes the progression through the flowchart in the figure in detail: first we discuss the pre-processing of the images and the preparation of the data set; then we describe feature selection; and finally, we indicate the tools used. The code together with the results are available on GitHub at https://github.com/EslamHussein55/Climatic-parameters (accessed on 16 April 2021).

Image Pre-Processing
All images in the parameter datasets were cropped to a rectangle of size 140 × 80 that includes the Madagascar land area. We transformed the image pixels to greyscale (0-255) and re-sized the images to 70 × 40 to further reduce their complexity. In view of the fact that extreme values are a common occurrence in geophysical parameter data, pixel values were regularized by replacing them with their square roots, following the example of [57,80,81]. Since our study is concerned with relative performance of different algorithms rather than absolute performance, for simplicity we did not remove over-ocean pixels, which are constant in all images and hence perfectly predicted.
We mentioned previously that some authors recommend transforming time series data to remove seasonality, while many authors do not follow this recommendation. To evaluate the effectiveness of transforming time series, we created two input data sets (denoted as 'raw' and 'de-seasonalized') for each of the five parameters. The raw dataset contains the original data, while the de-seasonalized data is transformed by subtracting same-month averages. Care was taken to compute monthly averages based only on the training data to prevent data leakage. For illustrative purposes, Figure 4 shows example raw and de-seasonalized images for rainfall.

Data Preparation
We prepared the data in a sliding window fashion, similar to the following studies [35,[82][83][84][85]. Figure 5 shows how pixels at a given location in a 12 month window are used to predict the corresponding pixel at the same location in the 13th month. In the figure, the symbols {f[0], . . . , f [11]} refer, respectively, to the frames {12, . . . , 1} months prior to the predicted frame, respectively. As shown in Figure 5, the datasets were used to produce sequences consisting of 12 consecutive months. All datasets were divided into training and testing sets, where the training set was made up of sequences occurring earlier in the dataset, and the testing set made up of sequences that followed those in the training set. This technique of maintaining chronological order when dividing the datasets into training and testing sets helps avoid the problem of information leakage into the trained model from the future [66]. Applying the sliding window generated 216 sequences with the first 156 used for training, and the rest as testing. Although the number of images appears relatively small, the training task is nonetheless computationally expensive since the training process utilizes 156 × 70 × 40 input vectors. This explains why previous similar studies also use relatively few images; for example [86] trains on only 47 images.

Feature Selection
Feature selection is critical to increasing training efficiency and model accuracy. Based on the reviewed literature for monthly prediction, we tested a variety of feature sets to understand the system mechanism. We also added in features systematically and assessed whether or not added features gave clearly better performances to ensure model parsimony and avoid overfitting [87]. The feature sets are described in the following subsections.
We first created a list of 12 candidate features for image prediction consisting of pixel values at the same location for the 12 prior months. To select a variety of these features, we prepared the data using the sliding window algorithm, where each 12-month window was used to predict the 13th month. Based on previous literature [13,[17][18][19][20][22][23][24][25][26][27][28][44][45][46][47][48][49]58 Given the geographical variation and seasonal nature of the dataset used, the following spatio-temporal features are also used in this study: The five past-pixel feature sets and the two spatio-temporal features were combined to form the following feature set variants: • Past-pixel features only (five variants, as listed above); Past-pixel features (five variants) plus (i, j, t). These 12 feature set variants were applied to both the raw and de-seasonalized training data.

Machine Learning Algorithms
A total of five ML techniques are used for image prediction: (a) multivariate linear regression (MLR); (b) k-nearest neighbor (KNN); (c) random forest (RF); (d) extreme gradient boosting (XGB); (e) multilayer perceptron (MLP). Since the training set consisted of less than 200 sequences, we did not use deep learning, which typically requires much larger training sets [89][90][91][92]. For all ML tools except for MLR, parameters were optimized via grid search with three fold validation, using the time series cross-validator implemented in scikit-learn [93]. The purpose of cross-validation is to avoid overfitting by making sure that the model is not overly dependent on the particular training data used to construct the model. Additionally, for MLP, a regularization parameter was used as an additional measure to counteract overfitting. Grid search optimizations to optimize ML parameters were performed separately for each feature set applied to each climatic parameter used on the raw data and separately again on the de-seasonalized data. All optimized parameters for all ML tools can be found in the GitHub link provided above.

Performance Metrics
One commonly used measure of the accuracy of a predictor's error is the mean absolute error (MAE). The MAE is calculated as the mean of the absolute values of prediction errors for all predicted pixel values: where M is the number of observations, and y obs m and y pre m refer to the observed and predicted value of the mth output, respectively.
There is a long-running debate over whether or not MAE is superior to root mean squared error (RMSE) in geophysical studies [94][95][96][97]. It is generally acknowledged that MAE is more robust, since it puts less weight on outliers. In view of the number of comparisons made in the current research, we settled on MAE as our principal measure of forecasting error, rather than reporting both MAE and RMSE.
In order to obtain error bars for differences between estimated MAE values for different ML estimates, we used the jackknife variance estimator [98]. The jackknife was implemented by obtaining M − 1 different MAE values by omitting successively the first, second, third, . . . image in the testing set. It is important to note that entire images were omitted and not single pixels, because pixel errors in the images are highly correlated: a jackknife estimator based on omitting single pixels will greatly underestimate the variance. Since we are interested in relative performance of the ML method compared to a selected baseline, we applied jackknife to the difference between the MAEs for the ML estimate and the baseline. This is another critically important point, because the variance for the MAE for individual ML methods is much larger than the variance of difference between ML and baseline MAEs because the MAEs for ML and baseline are highly correlated. A pseudocode for the procedure is given in Algorithm 1.

Baselines and Statistical Estimators
For this study, we employed four different untrained predictors as baselines: (1) previous month (denoted 'base-11'); (2) same month previous year (denoted 'base-0'); and (3) average of all training set images for the same month (referred to as 'seasonal baseline' or 'base-Se'); (4) the squared mean square root for training set images of the same month, rounded down to the nearest integer (denoted as 'base-Se(sqrt)'). When evaluating the effectiveness of different ML algorithms in parameter prediction, we compared these baselines against the trained ML models.
The first three baselines have precedents in the literature. The authors in [66] suggested the use of base-1) as the simplest baseline. Base-0 is suggested by the seasonality of the data. As for base-Se, the authors in [45] implemented the use of the monthly averages as a baseline.
The final baseline is justified by an inferred statistical model of the image pixel distributions, which is motivated as follows. It is clear that the distribution of seasonal climatic parameters for any pixel (i, j, n) must depend on the location (i, j) and the time stamp t = mod (n, 12). It is also clear that neighboring pixels at the same month index n are correlated. Allowing for these correlations, we posit the simplest possible statistical model for the pixel distributions: namely, that all pixels at month n are statistically independent of all pixels at month n as long as n = n; and further, that the probability distribution for the pixel value (i, j, n) depends only on the values of (i, j, t).
Given this assumed model for the pixel distributions, we may design an estimator for future pixel values as follows. It is a well-known result in theoretical statistics that the true median of the distribution of a random variable minimizes the expected MAE of a random sample. For a nearly symmetric distribution, the median is approximately equal to the mean. To reduce the influence of high outliers and make the distribution more nearly symmetric, we first take the square root of the data before taking the mean: the result will approximate the median of the square-rooted data, which is equal to the square root of the median of the original data. Consequently, the median may be estimated as the square of the mean of the square-rooted data, which is rounded down to reduce the bias produced by high outliers.

Results
In this section, we first present performance results for the different predictors, including baselines and ML methods with and without deseasonalization. Then we give error bars on the relative performance of predictors compared to the base-Se(sqrt) baseline prediction. Finally, we describe the effect of data shuffling on predictor accuracy estimates.
In the following discussion, the data is presented graphically for brevity. Data in tabular format is available at https://github.com/EslamHussein55/Climatic-parameters (accessed on 16 April 2021). Figure 6 gives residual plots and R 2 values for the three baselines base-11, base-0, and base-Se(sqrt) for the five climatic parameters (base-Se is not shown, but strongly resembles base-Se(sqrt)). As seen in the figure, Base-Se(sqrt) gives the most accurate estimations across all parameters (predictions lie closer to the 45 • line), as well as giving larger R 2 values. Indeed, the R 2 performance for base-Se(sqrt) is almost perfect, with all values over 0.96.  7-11 summarize the MAE results for models trained using different feature sets for each of the climatic parameters. The corresponding RMSE values were also generated, but since they closely resemble the MAE results, they are omitted here. Each figure contains two line graphs for raw and de-seasonalized data sets, respectively. For the raw data, the [i,j] feature set performed very badly, so we omitted these results from the figures to avoid stretching the vertical scale. In addition, the base-11 baseline was above the vertical scale for all parameters except rainfall, and is not shown.     Of the four baselines, base-Se(sqrt) is always the best, followed by base-Se, base-0, and base-11, in that order. In fact, Base-Se(sqrt) is also better than all ML tools for all parameters and feature sets, except evaporation for a few feature sets.

Performance Comparisons for Different Baselines, Feature Sets, and Preprocessing Methods
Next, comparison between raw-based and de-seasonalized-based predictions shows that de-seasonalizing tends to stabilize the performance, so that it is less dependent on the feature set used. If the feature set contains [0], then de-seasonalizing makes little difference. De-seasonalizing does not always improve the feature sets' performances, as will be discussed in more detail below.
A comparison of feature sets shows that the feature sets [i,j,t], [11,i,j,t], and [0,11,i,j,t] are consistently the best performers, both for raw-based and de-seasonalized-based predictions. In our detailed performance analysis below, we focus on these three feature sets.
It is significant that the above observations apply consistently to all five parameters, which suggests that the same observations can generalize to other climatic parameters. Figures 12 and 13 show the percentage error reductions for different ML algorithms for the 5 climatic parameters, using raw and de-seasonalized data, respectively. Only the three best feature sets are represented, namely i,j,t, [11]+i,j,t, and [0,11]+i,j,t. In the figures, the 100% level corresponds to the Base-Se(sqrt) MAE error: so, for example, the MLR value of 120% for rain (raw) with feature set [0,11] + i,j,t indicates that the MAE error for MLR is 1.2 times the corresponding Base-Se(sqrt) error. Error bars in the figures correspond to ±2 standard deviations, and were computed using the jackknife procedure described in Section 4.4.2, using the different ML methods and the Base-Se(sqrt) baseline. Figure 12. MAE of all trained models with features [i,j,t, i,j,t+ [11], i,j,t+[0, 11]], compared to base-Se(sqrt) on the raw climate datasets. On the vertical scales, 100 corresponds to the MAE for the base-Se(sqrt) estimator. Figure 13. MAE of all trained models with features [i,j,t, i,j,t+ [11], i,j,t+[0, 11]], compared to base-Se(sqrt) on the deseasonalized climate datasets. On the vertical scales, 100 corresponds to the MAE for the base-Se(sqrt) estimator.

Detailed Comparison of ML Tools and Feature Sets
For raw-based predictions, Figure 12 shows that the KNN, XGB, and RF algorithms typically attain between 100 and 110% of base-Se(sqrt) across all parameters, while MLR and MLP exceed 125% in several cases. For evaporation with the [11]+i,j,t feature set and for wind with the i,j,t feature set, the KNN XGB and RF algorithms are slightly better than base-Se(sqrt), but the error bars show that this relative improvement is not statistically significant.
For deseasonalized-based predictions, the accuracy of XGB and RF is nearly the same as for raw-based, but KNN performance is degraded by up to 10%. The errors for MLR and MLP are reduced to below 115%, but still tend to be 5-10% higher than errors for XGB and RF.
For all parameters except evaporation, the ML methods of KNN, XGB, and RF applied to the feature set i,j,t give the best performance on both raw and de-seasonalized data. This implies that (surprisingly) including lag-based features actually worsens prediction accuracy for these parameters. It is also surprising that the most and least sophisticated methods (MLR and MLP) have similar (and sub-optimal) performance in most cases.

Data Shuffling
In Section 1, we mentioned that several references shuffle the image sequences. In order to gauge the effects of this shuffling, we used RF with features set [11]+i,j,t to predict all climatic parameters with both shuffled and unshuffled data. For both shuffled and unshuffled data, 156 of the 216 total 12-month sequences were used for training and the rest for testing. The unshuffled data used the first 156 sequences for training and rest for testing, as described in Section 4.2, while the shuffled data took 156 sequences randomly from the entire dataset, thus producing overlap between training and testing sequences. Results showed that MAE obtained from shuffled data was 2-10% lower than from unshuffled data, due to data leakage.

Discussion
The results demonstrate that when doing seasonal parameter prediction on monthly time scales, it is important to use a well-motivated simple baseline, e.g., a statistical estimator computed from the source data. This finding is consistent with the points made in [9]. Baselines that depend on lags do not perform as well. Furthermore, a simple same-month average baseline which does not take into account the statistical properties of MAE cannot match the performance of baseline that is designed to estimate median values, which in theory will minimize MAE. For the seasonal parameters we tested, a carefully designed statistical estimator outperforms even highly sophisticated ML models. This finding raises concerns about positive results reported in previous papers that fail to supply statistical baselines.
The results also show that care must be taken in selecting seasonal features as inputs. In the literature, same month previous year (corresponding to our feature [0]) is commonly used [13,17,19,20,24,25,27,28,44,49,51,58,60,62,67]. However, we found that using [0] scarcely outperforms base-0, and is much worse than base-Se(sqrt). Indeed, we found that time stamp t (where t runs from 0 to 11) gave much better results, although it is rarely used in the literature. In addition, using both features typically gave worse performance than using t only.
Aside from using seasonal features, another way to account for seasonality is to de-seasonalize the input data by subtracting monthly averages. The results show that de-seasonalization tends to reduce model complexity: for example, when data is deseasonalized, then feature [0] becomes unnecessary. However, whether or not de-seasonalization lowers the error depends which algorithm and which features set is used. For example, the best-performing feature-algorithm combinations in our study used i,j,t with RF, or XGB, and for these combinations de-seasonalization of inputs made no difference. We conclude that appropriate feature and algorithm selection has more of an effect on performance than de-seasonalization.
A study similar to ours may be found in [45], in which base-Se is used to standardize the performance of different ML models in predicting 1-6 months ahead rainfall using past rainfall, temperature, and climate index. Compared to base-Se, the following ML algorithms had worse performance: MLR, RF, support vector machine (SVM), artificial neural network (ANN), long short term memory neural networks (LSTM), and convolutional LSTM (ConvLSTM). It follows that including additional climatic parameters as features and doing joint prediction may yield no benefits. Only when the authors used wavelets during pre-processing did their accuracy improve. Even with wavelets, the basic MLR model gave results that nearly matched a sophisticated LSTM model (no error bars for the difference are given, so it is impossible to tell whether there is a significant difference).
For the climatic parameters that were examined in this paper, using previous month (denoted as feature [11]) was not effective, and could even degrade predictive performance when added. However, this conclusion is not applicable to other parameters such as groundwater [7,57], which involves conditions that last over multiple months. The slight improvements seen when adding [11] to evaporation may be due to this effect.
Unlike most prior research in this area, we established the significance of differences in predictive performance between ML methods using error bars that were calculated using statistically rigorous jackknife estimates. The error bars for differences between MAE values for different estimation methods were much smaller than error bars on the MAE values themselves (such as those calculated in [45]). The jackknife methods employed are quite general, and can be used for other ML applications.
Finally, we established that images used for training and testing must be strictly separated and timed. Shuffling of image sequences, which has been employed in some prior research, leads to data leakage, which produces artificial reductions of prediction errors.

Conclusions
In this paper, we studied the application of machine learning to the prediction of seasonal climatic parameters on a monthly basis. Our conclusions may be briefly summarized as follows. First, a well-thought out baseline based on a simple statistical estimator will often outperform all ML models. Hence, studies of ML prediction algorithms that do not provide a baseline comparison are not sufficiently demonstrating the effectiveness of the algorithms. Second, the use of time stamp (i.e., month index) as a feature can replace de-seasonalization, and often yields better results than lags (i.e., previous month, or same month previous year). Third, we have demonstrated that jackknife estimation can be used to calculate error bars on algorithms' relative performance, which until now have not been generally reported in the literature. Fourth, we have shown that the practice of data shuffling produces error estimates that are artificially lowered. The methods we have used are quite general, and can be readily applied to other situations. The fact that our results are consistent over five widely different climatic parameters suggests that similar results may be expected for other climatic parameters measured on other regions. This conclusion is reinforced by the fact that similar results have been observed in another study of rainfall conducted in China [45].
In the current research, we have considered only single parameter prediction, using local spatio-temporal based features. For future work, we may apply similar methods to predictions based on other features. Reference [45] for instance shows that using wavelets can lead to better predictions-the question remains whether ML applied to these features can bring significant improvements, or whether simple statistics are sufficient.
Another possibility for future research is the application of deep learning. However, since most monthly datasets available are not large, deep learning may be of limited applicability for monthly prediction. Furthermore, the authors of [45] found that deep learning did not significantly improve on multi-linear regression for monthly rainfall prediction. Nonetheless, since the field of deep learning is developing rapidly, future techniques may produce algorithms that perform well even on datasets of limited size.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ML
Machine learning ANNs Artificial neural networks Base-0,. . . ,base-11 Baseline estimators based on previous lags (see Section