Multiple Linear Regression Models with Limited Data for the Prediction of Reference Evapotranspiration of the Peloponnese, Greece

Abstract: The aim of this study was to investigate the utility of multiple linear regression (MLR) for the estimation of reference evapotranspiration (ETo) of the Peloponnese, Greece, for two representative months of winter and summer during 2016–2019. Another objective was to test the number of inputs needed for satisfactorily accurate estimates via MLR. Datasets from sixty-two meteorological stations were exploited. The available independent variables were sunshine hours (N), mean temperature (Tmean), solar radiation (Rs), net radiation (Rn), wind speed (u₂), vapour pressure deficit (es − ea), and altitude (Z). Sixteen MLR models were tested and compared to the corresponding ETo estimates computed by FAO-56 Penman–Monteith (FAO PM) in a previous study, via statistical indices of error and agreement. The MLR5 model with five input variables outperformed the other models (RMSE = 0.28 mm d⁻¹, adj. R² = 98.1%). Half of the tested models (two to six inputs) exhibited very satisfactory predictions. Models of one input (e.g., N, Rn) were also promising. However, the MLR with u₂ as the sole input variable presented the worst performance, probably because its relationship with ETo cannot be linearly described. The results indicate that MLR has the potential to produce very good predictive models of ETo for the Peloponnese, based on the literature standards.


Introduction
Reference evapotranspiration (ETo) as a climate parameter plays a pivotal role in climate crisis research and in water resources management. There are complex interactions between evapotranspiration and key environmental components, such as groundwater and streamflow, as well as with anthropogenic and climatic impacts, such as wildfires, air pollution, land use/land cover (LULC) change, crop intensification, and construction [1][2][3][4]. Moreover, ETo is directly related to the productive sector of agriculture, since crop water needs are usually estimated as a function of ETo with a crop-specific coefficient [5]. Irrigation system design and precision irrigation techniques demand accurate determination of ETo [6,7]. Since the management of finite resources such as water is challenging, the exploitation of accurate predictive tools of ETo, which are still easy to apply, is of major importance in an interdisciplinary context [8][9][10]. Another important contribution of ETo is its use in computing actual evapotranspiration, which is difficult to acquire [11,12].
The measurement of ETo is demanding. Therefore, several methods for estimating ETo have been developed, ranging from simple empirical or physically based models [13,14] to complex algorithms and techniques, such as fuzzy logic and machine learning (ML) [15][16][17][18][19][20][21]. These methods employ data from meteorological stations, or data retrieved via remote sensors [22][23][24][25][26][27][28][29][30][31][32]. The FAO-56 Penman-Monteith (FAO PM) equation (Equation (1)) is the most established method used to compute ETo worldwide. How-Taylor, and Stephens and Stewart models in predicting daily pan evaporation, although the best values of the evaluating measures were R² = 61% and RMSE = 1.597 mm d⁻¹ [53]. These MLR models were slightly less accurate than the corresponding comparative ANN models. Sanford et al. developed an MLR equation with climate parameters (minimum and maximum temperature and precipitation) at an annual scale for Virginia watersheds with R² = 84.4% [54]. The MLR equation was modified in a following study using these parameters to estimate the ET/P ratio, yielding an R² value of 86.74% and an RMSE = 0.067 mm d⁻¹ for the best-fit parameters for the conterminous US [55]. The obtained values were considered satisfactory for explaining most of the variation in long-term average ETo across the conterminous US [55]. Although most recent articles do not include MLR among the assessed models, Niaghi et al. applied MLR to datasets of six stations at Red River Valley [56]. They grouped the input combinations into air temperature-based (Tmax, Tmin), mass transfer-based (Tmax, Tmin, wind speed), and radiation-based (solar radiation, Tmax, Tmin) measurements. The best performance for MLR models was for the latter inputs (RMSE = 0.68 mm d⁻¹, MAE = 0.51 mm d⁻¹, and R² = 88%).
The MLR models were not sensitive to differences between local and spatial applications compared to the heuristic models, due to their inability to depict complex relationships [56]. Ohana-Levi et al. found that the six linear regression models had much higher RMSE values than the corresponding non-linear multivariate adaptive regression spline (MARS) models [57]. The correlation coefficients during training and testing were 0.84 and 0.85, respectively, for the linear models [57]. In MLR models where leaf area index (LAI) was added as an explanatory variable, the testing RMSE profoundly improved when using the Kc approach (the RMSE of 1.05 mm d −1 without LAI reduced to 0.81 mm d −1 ) [57]. Sharafi and Ghaleni compared the performance of six MLR models to empirical methods. They found that all the MLR models outperformed the latter [58]. The obtained RMSE values, ranging between 0.36 mm d −1 and 0.50 mm d −1 , were considered excellent, so that MLR was recommended for all climate types of Iran, from arid desert to humid [58]. An MLR model of precipitation, temperature, soil moisture, and NDVI as explanatory variables was applied to four basins in India [59]. The ETo variance was explained by 82−91% (adj. R 2 ) via the utilised MLR model [59]. The MLR stepwise modelling applied at the Inner Mongolia Autonomous Region  demonstrated that the ETo values in different regions were affected by different climatic parameters [60]. It is noteworthy that ETo was found insensitive to mean air temperature for the whole area [60]. For the Megecha catchment in Ethiopia, sunshine hours, wind speed, maximum temperature, and relative humidity were proven to be the best explanatory variables of ETo in MLR modelling [61].
This study examines sixteen MLR models utilizing data from sixty-two meteorological stations in the Peloponnese, a southwestern region of Greece (Table A1). The Peloponnese constitutes a challenging testbed for ETo research, since it presents considerable inhomogeneities regarding relief, LULC, altitude, etc. The empirical methods of a previous study exhibited inconsistency in terms of accuracy, with larger errors found for August (summertime) [36]. As described, MLR is not very popular in recent literature of ETo. This is probably because of the complex nature of ETo. However, the added value of MLR is the simplicity and the comprehensibility of the method and the equations produced. Furthermore, it is of importance to examine whether limited data (even one sole input variable) can predict ETo satisfactorily. The latter would be useful for cases with sparse data availability. Therefore, MLR for ETo estimation needs to be further investigated since it would be a promising alternative, being easily applicable and interpretable for interdisciplinary research and management purposes.

The Study Area
The Peloponnese peninsula of southwest Greece occupies about 1/6th of the Greek territory (21,439 km²), with a population of 1,086,935 (census 2011; https://www.statistics.gr/el/statistics/-/publication/SAM03/2011 (accessed on 19 April 2022)). The area is mostly hilly and mountainous. Moving from coastline to the mainland, the altitude reaches 2407 m. Its lithology, along with tectonic activity and climatic conditions, has resulted in the formation of relief. The hydrographic network is well-developed, though with few large rivers [62]. The most populated urban area is located at the northernmost edge, while the broadest plain covers the northwest. Except for urban areas, which are sparsely distributed primarily along the coastline, the main LULC types are forests, transitional vegetation, and various crop plots (Figure 1) [63]. According to the Köppen-Geiger classification, the climate of the Peloponnese is Mediterranean, warm, with dry summers and mild winters (classified as Csa) [64]. The annual average precipitation, air temperature, and sunshine hours range from 400 to over 2000 mm, 8-20 °C, and 1900-3100 h, respectively (http://climatlas.hnms.gr/sdi/?lang=EN (accessed on 3 May 2022)).


Methods
Ground-based datasets of daily scale from sixty-two meteorological stations running under the National Observatory of Athens, for the months August and December of 2016-2019, were utilised (Figure 1, Table A1). The selected period is of interest in the context of the climate crisis, since it includes the two warmest years (2016 and 2019) since the pre-industrial era (https://climate.copernicus.eu/copernicus-2019-was-second-warmest-year-and-last-five-years-were-warmest-record (accessed on 23 June 2022)). August and December were selected as representative months of summertime and wintertime, respectively (http://climatlas.hnms.gr/sdi/?lang=EN (accessed on 3 May 2022)), in methodological consistency with our previous study [36], so that the results are directly comparable. In the former study, ETo was computed for the Peloponnese by FAO PM, which serves as our reference method. In this study, multiple linear stepwise regression has been applied with seven explanatory (independent) variables. Six of them are climate variables, namely mean air temperature (Tmean), wind speed at 2 m above the surface (u₂), solar radiation (Rs), net radiation (Rn), sunshine hours (N), and vapour pressure deficit (es − ea). In addition to the climate variables, altitude (Z) was also used as an input variable, since it indirectly affects ETo [65]. Rs, Rn, N, and es − ea were previously calculated as in Zanetti et al. [40], based on FAO guidelines for missing climate data (functions of Julian date and station latitude) [33]. Since the need to limit the required variables in ETo modelling is underlined in the recent literature [66], we investigated whether only a few variables, or even only the most easy-to-acquire variable, Tmean, could produce satisfactorily accurate predictions of ETo for the Peloponnese. MLR as a statistical technique employs several explanatory variables and one response (dependent) variable, which is ETo in the present study.
The aim of MLR is to model the linear relationship between the explanatory variables and the response variable, as an extension of simple ordinary least squares regression to more than one explanatory variable. In other words, MLR aims to find the linear function that minimizes the sum of the squares of errors (SSE) between the observed and the predicted data. An advantage of this method is the easy interpretation of the coefficients, which are generated in the model with low computational effort, in comparison to more complex techniques, such as energy balance methods and artificial intelligence algorithms [13][14][15][16][17][18][19][20][21][24][25][26][27][28][29][30][37][38][39][40][41][42][43][67][68][69][70][71][72][73][74][75]. For the MLR model, the response (dependent) variable y is assumed to be a function of k independent variables. The general form of the equation is as follows (Equation (2)):

$$y_i = b_0 + \sum_{j=1}^{k} b_j x_{ij} + e_i \qquad (2)$$

where $b_0$ and $b_j$ (j = 1, ..., k) stand for the fitting constants; $x_{ij}$ represents the ith observation of the jth explanatory variable, $y_i$ stands for the ith value of the response variable (ETo), and $e_i$ is a random error term representing the remaining effects of variables on y, which are not covered by the model (residuals). The least squares criterion for the minimum sum of squares of error terms is usually applied to determine the fitting constants [61]. Stepwise MLR has the advantage of presenting the MLR models, beginning from the most influential parameter of the input combination (which explains the largest percentage of ETo variability, expressed by adj. R²) [60]. It presents the models in ascending order of adj. R², and it may omit several input variables in case they do not meaningfully contribute to the variability explanation. This meaningful contribution of one variable is determined via the adj. R², as opposed to R², which generally increases with the number of inputs. Therefore, adj. R² and R² are not directly comparable, since the latter would generally have a greater value than the former (Table 1).
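The least-squares fit and the adj. R² criterion described above can be illustrated with a minimal Python sketch. It is not the software or procedure used in the study: the function names (`fit_mlr`, `predict_mlr`, `adjusted_r2`) are hypothetical, NumPy is an assumed dependency, the data in the demo are synthetic, and the adj. R² expression is the standard textbook form, which is assumed to match the one in Table 1.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares estimate of [b0, b1, ..., bk] for y = b0 + sum_j bj*xj + e."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:              # a single input: simple linear regression
        X = X[:, None]
    # Prepend a column of ones so that the intercept b0 is estimated too.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predicted response for the inputs X, given fitted coefficients."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    return coef[0] + X @ coef[1:]


def adjusted_r2(y, y_hat, k):
    """Adjusted R^2 (standard form, assumed) for a model with k inputs."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)       # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - sse / sst
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    # Synthetic demo only: two made-up inputs and a made-up linear response.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(290, 2))
    y = 1.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=290)
    b = fit_mlr(X, y)
    print("coefficients:", b)
    print("adj. R2:", adjusted_r2(y, predict_mlr(b, X), k=2))
```

Because the penalty factor (n − 1)/(n − k − 1) grows with k, an added input raises adj. R² only if it reduces the SSE by more than the penalty costs, which is the sense in which a stepwise procedure can judge a variable's contribution as meaningful.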

Formulae of the Indices
The predicted ETo values from the regression models were then compared against the values of FAO PM via statistical indices (Table 1). Specifically, the error levels between predicted and reference values were computed via measures such as root mean square error (RMSE, mm d⁻¹), normalised root mean square error (NRMSE, %), mean absolute error (MAE, mm d⁻¹), and standard error of the predicted value (SE, mm d⁻¹) (Table 1, Equations (2)-(4)). The error values follow the magnitude of the computed ETo values, except for NRMSE, which is expressed as a percentage (Table 1, Equation (3)). Moreover, two measures that express the percentage of ETo variability explained by the independent variables of the model and the agreement between predicted and reference values were utilised, namely the adjusted coefficient of determination (adj. R²) and the index of agreement (IoA) (Table 1, Equations (5) and (7)). These measures are suitable for statistical analyses of evapotranspiration [57,76,77]. Finally, residual analysis was performed using the Durbin-Watson index (D-W), Cook's Distance (Cook's D), and Centred Leverage (C. Leverage). D-W is used to detect the presence of autocorrelation in the residuals. Values around 2.0 (usually between 1.50 and 2.50) indicate negligible correlation among residuals. Lower (higher) values express positive (negative) correlation. Cook's D values greater than 0.5 indicate a potential outlier, and C. Leverage (between 0 and (k − 1)/k) is used to detect particularly influential points.
In Table 1, pᵢ stands for the ith value predicted by the regression model, rᵢ stands for the ith reference value computed by FAO PM, r̄ is the mean reference value, k is the number of the independent variables, and the sample size (n) is 290.
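As a hedged illustration of the indices named above, the following Python sketch computes RMSE, NRMSE, MAE, IoA, and the Durbin-Watson statistic from arrays of predicted (p) and reference (r) values. The exact formulae of Table 1 are not reproduced in this text, so the forms used here are assumptions: NRMSE is taken as RMSE divided by the mean reference value, and IoA is taken as Willmott's index of agreement. The function names are hypothetical and NumPy is an assumed dependency.

```python
import numpy as np


def rmse(p, r):
    """Root mean square error between predicted (p) and reference (r) values."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    return np.sqrt(np.mean((p - r) ** 2))


def nrmse(p, r):
    """RMSE as a percentage of the mean reference value (assumed normalisation)."""
    return 100.0 * rmse(p, r) / np.mean(np.asarray(r, dtype=float))


def mae(p, r):
    """Mean absolute error."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    return np.mean(np.abs(p - r))


def index_of_agreement(p, r):
    """Willmott's index of agreement (assumed form); 1 means perfect agreement."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    r_bar = r.mean()
    num = np.sum((p - r) ** 2)
    den = np.sum((np.abs(p - r_bar) + np.abs(r - r_bar)) ** 2)
    return 1.0 - num / den


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

For example, `rmse(p, r)` and `mae(p, r)` return values in the units of the inputs (mm d⁻¹ for ETo), whereas `nrmse` and `index_of_agreement` are dimensionless; `durbin_watson` expects the residuals pᵢ − rᵢ in their original observation order.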

Results
MLR models with all the potential input combinations, from one to six explanatory variables, were tested. In total, sixteen models were examined. Models with altitude or latitude as sole input variables were not applicable, whereas models with seven inputs exhibited statistically insignificant results (p > 0.05). The obtained results are presented in Table 2.
As portrayed in Table 2, half of the tested models (eight) yielded RMSE values below 0.35 mm d⁻¹ and adj. R² values greater than 97.1%. Among these models, MLR4, MLR5, and MLR6 obtained an RMSE below 0.29 mm d⁻¹, and an adj. R² between 98% and 98.1%. The results for the majority of the models are statistically significant at the 99% confidence level. Only four models (MLR10, MLR11, MLR13, MLR15) yielded one coefficient that was statistically significant at the 95% confidence level. Regarding the statistical analysis of the residuals performed for each model, C. Leverage and Cook's D were generally low, as were the SEs of the predicted values (Table 2).
MLR5 and MLR6, with five and six independent variables, respectively, share the highest adj. R² (98.1%). As anticipated, MLR6, which includes one extra independent variable, obtained slightly better RMSE/NRMSE values than MLR5 (0.276 mm d⁻¹/8.2% vs. 0.280 mm d⁻¹/8.3%). The two models have the same MAE (0.206 mm d⁻¹) and IoA (99.5%). The model with the highest adj. R² (98.1%) and the fewest inputs is therefore MLR5, which is a function of N, Tmean, u₂, es − ea, and Rn. This means that these five variables play a significant role in the prediction of ETo for the Peloponnese. The addition of altitude (Z) as an extra independent variable of the model (MLR6) changed the aforementioned indices only marginally, without increasing the explanation of the variance (adj. R² = 98.1%). The contribution of Z to the adj. R² change was about 0.03%. The equation produced by the model with the best performance (MLR5) is presented in Table 3.
Models MLR1, MLR7, MLR12, MLR14, MLR15, and MLR16 were one-input models, that is, simple linear regression models. The climate variables N, Rn, Tmean, Rs, u₂, and es − ea were the inputs of the aforementioned models, respectively. Among these models, MLR1 (N input) produced the best results (adj. R² = 96.0% and RMSE = 0.409 mm d⁻¹), followed by MLR7 (Rn input) (Table 2). The Tmean input model (MLR12) exhibited inferior performance. The produced MLR equations for the two best one-input models are displayed in Table 3. The model with the poorest performance was MLR15, with u₂ as the sole explanatory variable, with an adj. R² equal to 3%. The model with es − ea as the sole input (MLR16) also showed poor performance (Table 2).

Discussion
The ETo plays a critical role in the hydrological cycle, the climate crisis, as well as in water resource management and irrigation design. The established methods of estimation require the availability of a wide range of climate parameters, which is a serious drawback. MLR has been examined in search of a simple, affordable, and satisfactorily accurate modelling technique of ETo. Therefore, datasets of Tmean, Rn, Rs, es − ea, and u 2 , retrieved from sixty-two meteorological stations for the Peloponnese (Table A1) have been utilised. The altitude of the stations was also employed, since it affects ETo [65]. The months August and December of 2016-2019 were selected as typical months of summertime and wintertime, respectively, with considerable differences in climate variables between them (http://climatlas.hnms.gr/sdi/?lang=EN (accessed on 3 May 2022)). Furthermore, these two months were selected in methodological consistency with a previous study [36], the FAO PM estimates of which served as reference values for this study. MLR models with several input combinations of the seven available input parameters were tested. In total, sixteen models were examined, ranging from one to six explanatory variables, since models with seven inputs exhibited statistically insignificant results (p > 0.05), and were therefore omitted. As displayed in Table 2, the results of the majority of the tested models are significant at the 99% level of confidence. Models with altitude or latitude as sole input variables were not applicable. As a general rule, the values of error and agreement improve with an ascending number of input parameters. However, the magnitude of change differentiates based on the influence of each added parameter on ETo. As anticipated, the best performance was exhibited by the model with the most (six) inputs (MLR6). 
However, the MLR5 model is most recommended, since the extra parameter of MLR6 (Z) contributes to the interpretation of the variance only by 0.03% (adj. R 2 change). The variance of the ETo observations is explained by 98.1% in both cases. For MLR5, RMSE is only 0.280 mm d −1 and MAE is 0.206 mm d −1 , with an agreement between the FAO PM and the MLR5 ETo values of 99.5% (Table 2). Consequently, MLR5 is a satisfactorily accurate tool for predicting ETo for August and December of 2016-2019, with a simple linear formula ( Table 3). The major contribution of sunshine hours to the explanation of the data in MLR is in line with the results of Yirga for Ethiopia [61].
An interesting group of MLR models demonstrated in Table 2 are those with one sole independent variable as an input. In this case, MLR reduces to simple linear regression. From a parsimonious perspective, the potential of only one variable being able to satisfactorily represent the ETo values would be useful in terms of (low) complexity, computational load and cost, as well as of increased applicability. The usefulness is amplified due to the rather inhomogeneous characteristics of the region, and the selected months that belong to different seasons. The only prerequisite, in this case, is that input data are available or easily retrieved. Among one-input models, MLR1 with N as the input variable exhibited the best performance, with 96% of the ETo variance explained and an IoA of 99%. The RMSE/MAE and NRMSE are equal to 0.409/0.320 mm d⁻¹ and 12.1%, respectively (Table 2). According to the literature, these values are considered very good to excellent in terms of accuracy [58]. Sunshine hours is a parameter that was computed according to FAO guidelines for missing data, utilizing Julian day and station latitude [33], which might be a limitation in terms of accuracy compared to direct measurements. The methodological choice to include parameters computed based on measured parameters and the FAO procedure in ETo modelling is a common practice (e.g., in ANNs for ETo [40]). The second-best one-input model is MLR7, which is radiation-based (Rn). It explains data variability by 95.5%, with an IoA equal to 98.9%. The RMSE/MAE and NRMSE are 0.429/0.323 mm d⁻¹ and 12.7%, respectively. Residual statistical values (i.e., Cook's D, C. Leverage) were almost the same as those of MLR1, except for the D-W value, which was better (2.09) (Table 2). The latter value indicates negligible autocorrelation among the residuals.
However, the most easy-to-acquire datasets globally, by either ground-based data or remote sensing, are those of Tmean. MLR12 is a temperature-based, one-input model (Tmean). The variability explained (91.8%) is lower than those explained by N input and Rn input models, but is the same as the corresponding of the Rs input model (MLR14). The RMSE/MAE and NRMSE are equal to 0.582/0.461 mm d −1 and 17.2%. The poor performance of the Tmean input MLR model is also reported by Tabari et al. for Iran [52]. MLR14 (Rs) produced the same or slightly better values (MAE = 0.441 mm d −1 ) than the Tmean input model (Table 2). However, these values are almost identical to the results of Tabari et al. for the best MLR (four-input) model [52]. This accuracy is not satisfactory for applications such as precise irrigation design, but it remains a useful approach. It is noteworthy that half of the tested models (two to six inputs) yielded an RMSE below 0.35 mm d −1 , three of which were even below 0.29 mm d −1 , with adj. R 2 ≥ 97.1%. The produced linear equations for MLR5, and MLR1 and MLR7, which are, respectively, the model with the best performance, and the two best one-input models, are simple and demand low computational effort and a short time to apply (Table 3). These findings provide flexible potential choices regarding the available datasets. The worst performance was exhibited by MLR15, with u 2 as the sole explanatory variable. This model explains only 3% of the ETo observations. The error indices are not acceptable either (RMSE > 2.00 mm d −1 , MAE = 1.953, NRMSE = 59.2%, and the SE of the predicted values was greater than 15%) ( Table 2). The literature reports that near-surface wind speed is an influential parameter of ETo variability [78]. The influential role of u 2 in our study area was also noted in our previous work on ETo variability across the Peloponnese [36]. 
It was found that, mostly in August (when wind speed is generally very low), ETo was directly affected in cases where increased u₂ values occurred. This deduction is confirmed by the latest study on ETo across the Peloponnese, in which u₂ was proven to be the most influential parameter after Tmean [73]. In conclusion, it is probable that the relationship between u₂ and ETo is non-linear, so the MLR model cannot depict it. This inability of MLR models to portray complex relationships is anticipated due to their linear character, and has been noted in the relevant literature [56].

Conclusions
MLR was employed to predict ETo with seven available parameters as inputs. Sixteen models, with all the potential input combinations, displayed statistically significant results (p ≤ 0.05). Among the tested models, MLR5 with five explanatory variables exhibited the best performance with fewer inputs than the available parameters (p ≤ 0.001). Therefore, MLR5, which constitutes a linear function of N, Tmean, Rn, es − ea, and u₂, is recommended as a potential tool to predict ETo for the Peloponnese, after further investigation. Eight out of the sixteen tested models, with two to six inputs, displayed an RMSE below 0.35 mm d⁻¹, and explained at least 97.1% of the variance of ETo. Moreover, three of these models yielded an RMSE below 0.29 mm d⁻¹, and explained 98−98.1% of the variance.
Another noteworthy group of MLR models examined is the one-input models, which are simple linear equations. The N input model (MLR1) outperformed the rest (RMSE = 0.409 mm d⁻¹), while it explained 96% of the variance of ETo. The Rn model (MLR7) followed closely, with satisfactory results and data variance representation (RMSE = 0.429 mm d⁻¹, adj. R² = 95.5%). Sunshine hours and net radiation therefore probably affect the ETo of the Peloponnese significantly. The Tmean input model exhibited poorer performance. On the other hand, the model with the worst performance was MLR15, with u₂ as the sole input. Based on previous research, which reported the influential role of u₂ on the ETo for the Peloponnese, it seems that MLR could not capture the complex, non-linear relationship between u₂ and ETo. Given that the Peloponnese is an area with distinct differences over short distances, and that the regimes of the examined months vary considerably, these models have the potential to provide useful tools with flexibility regarding input data, applicable for interdisciplinary purposes. It is, hence, suggested that MLR models should be tested for longer time periods and larger areas in Greece, aiming at generalisation.

Acknowledgments: The authors acknowledge the National Observatory of Athens (https://meteosearch.meteo.gr (accessed on 15 April 2022)) for ground-based data availability of sixty-two meteorological stations.

Conflicts of Interest:
The authors declare no conflict of interest.