Calibration for an Ensemble of Grapevine Phenology Models under Different Optimization Algorithms

: Vine phenology modelling is increasingly important for winegrowers and viticulturists. Model calibration is often required before practical applications. However, when multiple models and optimization methods are applied for different varieties, it is rarely known which factor tends to mostly affect the calibration results. We mainly aim to investigate the main source of the variability in the modelling errors for the ﬂowering timings of two important varieties of vine in the Douro Demarcated Region (DDR) of Portugal; this is based on ﬁve phenology model simulations that use optimal parameters and that are estimated by three optimization algorithms (MLE, SA and SCE-UA). Our results indicate that the main source of the variability in calibration can be affected by the initially assumed parameter boundary. Restricting the initial parameter distribution to a narrow range impedes the algorithm from exploring the full parameter space and searching for optimal parameters. This can lead to the largest variation in different models. At an identiﬁed appropriate boundary, the difference between the two varieties represents the largest source of uncertainty, while the choice of algorithm for calibration contributes least to the overall uncertainty. The smaller variability among different models or algorithms (tools for analysis) compared to between different varieties could indicate the overall reliability of the calibration. All optimization algorithms show similar results in terms of the obtained goodness-of-ﬁt: the RMSE (MAE) is 5–6 (4–5) days with a negligible mean bias and moderately good R 2 (0.5–0.6) for the ensemble median predictor. Nevertheless, a similar predictive performance can result from differently estimated parameter values, due to the equiﬁnality or multi-modal issue in which different parameter combinations give similar results. This mainly occurs for models with a non-linear structure compared to those with a near-linear one. Yet, the former models are found to outperform the latter ones in predicting the ﬂowering timing of the two varieties in the DDR. Overall, our ﬁndings highlight the importance of carefully deﬁning the initial parameter boundary and decomposing the total variance of prediction errors. This study is expected to bring new insights that will help to better inform users about the importance of choice when these factors are involved in calibration. Nonetheless, the importance of each factor can change depending on the speciﬁc situation. Details of how the optimization methods are applied and of the continuous model improvement are important.


Introduction
Plant phenology, the annually recurring pattern of plant development, has been identified as one key biological indicator of climate change [1].The relationship between the timing of phenological events and climatic factors, such as temperature, is fundamental for understanding the ecosystem's response to global warming [2].Phenology prediction is primarily driven by temperature, which is a pivotal part of process-based crop modelling.It is typically implemented as a phenological module within crop models [3,4].The accurate simulation of phenology development is a prerequisite for modelling other crucial physiological processes, such as transpiration, photosynthesis and yield formation, etc. [3,4].Therefore, many crop modelling studies focus specifically on phenology modelling [5][6][7][8].
Viticulture phenology modelling has received much attention in the last decade, as many studies have shown the potential of using simple temperature-based statistical or mechanistic models to successfully predict and monitor the vine phenology of commercial grape varieties [9][10][11][12].For instance, a simple Grapevine Flowering Veraison (GFV) model using the concept of growing degree days (GDD), was shown to be able to accurately (generally RMSE ≤ 7 days) reproduce the observed flowering and veraison timings of 95 and 104 common varieties, respectively, which spanned across 123 locations in Europe with a wide range of environmental conditions [9].Similarly, Leolini et al. [12] demonstrate the applicability of six common phenology models for predicting the budbreak of main grapevine varieties.Although many of these models were satisfactorily calibrated and evaluated for simulating vine phenology, deciding which model to use in a new environment is often difficult, as no single individual model could consistently outperform the others [13].Each model has its strengths and weaknesses, depending on the environmental conditions (e.g., climate and soil) and the type of cultivar being simulated [9][10][11][12].
A multi-model ensemble could be used to overcome the limitations of individual models, where the ensemble mean or median of the model outputs can be used to represent robust predictions [6,14].The multi-model ensemble can be directly integrated into decision support systems.For example, they could be conveniently coupled with existing real-time climate forecasts to provide early-season phenology predictions [15,16].This can support monitoring vine development, which is particularly necessary in the context of anticipating more frequent extreme weather conditions with the projected climate change [17][18][19][20][21]. Multimodel ensembles are also an integral part of the Agricultural Model Intercomparison and Improvement Project (AgMIP) [6,22,23].
Before any practical applications, phenology models need to be calibrated [24].During calibration, model-specific parameters are adjusted (optimized), in order to simulate the observed conditions as close as possible.Different models have different parameters, but most of these parameters are considered to reflect the genetic differences between cultivars, i.e., cultivar-dependent characteristics of phenology development [5,7,9,25].To estimate the optimal values of these parameters (i.e., parameter vector), a large diversity of calibration approaches are available, each having their own statistical assumptions, calibration steps, criteria of best-fit parameters and optimization algorithms [6][7][8]13,26].Variations in calibration approaches, model structures and input data represent three main sources of uncertainty in crop model (including phenology model) simulations [5,26,27].In particular, different optimization algorithms are available, but the algorithm selection, according to the calibration method, could considerably affect parameter estimations (including the associated uncertainties) and calibration results [7,8,13,26].
The best algorithm is often dependent on a specific context.For instance, Gao et al. [7] suggest that the importance of algorithms can vary depending on the priority of the study: if the priority is on the goodness-of-fit between the observations and simulations, the Ordinary Least Square (OLS) method is the most effective choice; meanwhile, in cases where an estimation of the parameter uncertainty and model prediction is more important, the Markov Chain Monte Carlo (MCMC)-based algorithms are more appropriate.Kawakita et al. [8] further report that uncertainties in the parameter estimations and the associated calibration performance are affected not only by parameter estimation methods or optimization techniques, but also by the model structure.However, these studies do not yet address, when multiple models and algorithms are available, whether the application of different algorithms for a given model or the assessment of different models for a certain algorithm is more important.Available information on this matter is important.For instance, if different optimization algorithms tend to give similar results, users may just choose the simplest one for their models.Some well-known examples of applied optimization algorithms include the Maximum Likelihood Estimation (MLE) [28], the Simulated Annealing (SA) [29] and the Shuffled Complex Evolution-University of Arizona (SCE-UA) [30].These algorithms have different strategies to search along the parameter space.Nevertheless, it is essential to specify the parameter search space, e.g., by defining the lower and upper boundary in an optimal way.Yet, this is often defined empirically by researchers [7,8,31,32], and it is rarely investigated whether the pre-defined boundary of parameter distribution is sufficient to warrant finding the optimal parameters by the algorithm.
In this study, we utilized five phenology models with varying degrees of complexity, which were all previously shown to be able to adequately reproduce vine phenology (see Section 2.2).These models were calibrated for simulating the flowering timing of two main grapevine varieties in Portugal, i.e., Touriga Franca (TF) and Touriga Nacional (TN), using three optimization algorithms (MLE, SA and SCE-UA).In general, the main objective represents an attempt to provide one plausible answer to the research question: when common factors of parameter estimations are explicitly taken into account, including models, algorithms and varieties (predicting events), which one tends to have the greatest effect on the final calibrations?This is carried out using Analysis of Variance (ANOVA) to identify the main source of variability (a larger magnitude of variations) among 5 models × 3 algorithms × 2 varieties.Note that a preliminary analysis with different assumptions for the initial parameter boundary is performed to determine the most appropriate constraint for the prior parameter distribution during parameter optimization.Based on the estimated parameter values from each optimization algorithm, the calibrated multi-model ensemble is constructed.For each algorithm, we then evaluate the goodness-of-fit between the observations and individual model simulations using a number of conventional statistical measures.Consequently, the ensemble median of the multi-model simulations is used to represent our best predictions against the observations.The variability in the resulting estimated parameter values among the different algorithms is also presented.

Study Region
The Douro Demarcated Region (DDR), known as the Douro/Port Wine Region, located in northeast Portugal, is one of the oldest winemaking regions in the world [33][34][35][36][37].It is recognized and included in the world heritage list by UNESCO for its unique mountainous landscape and long history (nearly 2000 years) of a wine-producing culture [36].DDR spreads over an area of ~250,000 ha and is divided into three subregions (Baixo Corgo, Cima Corgo and Douro Superior) that differ greatly from each other in their heterogeneous topography (Figure 1a).Viticulture is the main socioeconomic activity of the region and vineyards are traditionally grown on terraces over steep slopes along the Douro valley [36].It comprises an area of about 45,000 ha, accounting for approximately 22% of the total area of Portugal [34][35][36].DDR features a typical Mediterranean climate with warm dry summers and a mild wet autumn and winter.The low precipitation, high temperature and radiation amount in the summer months (June-August) frequently gives rise to situations of intense plant water stress [33].
22% of the total area of Portugal [34][35][36].DDR features a typical Mediterranean climate with warm dry summers and a mild wet autumn and winter.The low precipitation, high temperature and radiation amount in the summer months (June-August) frequently gives rise to situations of intense plant water stress [33].

Phenology Measurements
In the present study, we focused on the grapevine phenology stage of full flowering (BBCH65), i.e., the date when 50% flowering occurred, following the definition of the growth stages of mono-and dicotyledonous plants [38].Data for two main Portuguese grapevine varieties (Touriga Franca: TF; Touriga Nacional: TN) were collected over 2014-2021 in eight experimental vineyard plots within the DDR (Table 1), via a viticultural observatory network coordinated by the Association for the Development of Viticulture in the Douro Region (ADVID).The geographic locations of these plots are shown in Figure 1a.Across the study sites, there was a greater inter-annual variability in the observed flowering timing for TN (8)(9)(10) than for TF (4-6) (Table 1).To better represent the phenological characteristics of the variety, we pooled all site data (blend all) for a given variety;  In the present study, we focused on the grapevine phenology stage of full flowering (BBCH65), i.e., the date when 50% flowering occurred, following the definition of the growth stages of mono-and dicotyledonous plants [38].Data for two main Portuguese grapevine varieties (Touriga Franca: TF; Touriga Nacional: TN) were collected over 2014-2021 in eight experimental vineyard plots within the DDR (Table 1), via a viticultural observatory network coordinated by the Association for the Development of Viticulture in the Douro Region (ADVID).The geographic locations of these plots are shown in Figure 1a.Across the study sites, there was a greater inter-annual variability in the observed flowering timing for TN (8)(9)(10) than for TF (4-6) (Table 1).To better represent the phenological characteristics of the variety, we pooled all site data (blend all) for a given variety; thus, each variety had 32 observations from 4 sites × 8 years for calibration.The amount of data was comparable to a similar study that used 27 observations of rice phenology from different year-site-season combinations in order to estimate the optimal model parameters [5].

Climate Observations
The recent E-OBS (v24.0e)observational gridded meteorological dataset, with a spatial resolution of 0.1 • [39], was employed to extract daily temperature series for phenology modelling.The advantage of utilizing a gridded dataset, instead of data from site-specific weather stations, was its potential for large-scale applications in agro-environmental studies, e.g., producing regional maps [40,41], and its wide spatial coverage, which was especially important for data-scarce regions [42].Moreover, the applicability and reliability of the E-OBS dataset, when coupled with phenology models, has already been successfully demonstrated to reproduce the observed vine phenology in many Portuguese wine regions (including the DDR) well [31,43,44].The monthly temperature statistics for T min , T avg and T max , deriving from the E-OBS data, are shown in Figure 1b for the respective study sites.For the flowering stage, the critical months were mostly from April to June (end of spring to the beginning of summer), during which T avg could increase from 13 to 20 • C (T max from 18 to 28 • C), which was accompanied by a prolonged high-temperature event (T max ≥ 25 • C for at least 5 days) (Figure 1b).Such high-temperature events could extend through to half of May and all of June (Figure 1b).

Phenology Models
The five models selected were GDD [45], GDD-Richardson (Richardson for short) [46], Sigmoid [47], Triangular [48] and Wang [49].From an onset/starting date (t 0 ), these thermal forcing/driving models simulated the occurrence of a given phenology stage at the day (t s ), when the state of the forcing temperatures (S f ) reached a threshold value of F* (generally thought to be variety-dependent) [9,10]: where R f (x t ) was calculated as a function of the daily mean temperature (x t ).The phe- nology models differ depending on how R f (x t ) are applied.2).All models were initialized on January 1st, which was the initialization date adopted in other modelling studies for predicting the flowering timing of important varieties of vine in major Portuguese wine regions (including DDR) [10,50,51].Suboptimal photoperiod and vernalization effects are not considered.* GDD and Richardson models compute the F* with the degree.days−1 , while the other models compute F* (explained in Equation ( 1)) with temperature ratios (between 0 and 1).Accordingly, the initial F* search boundary is defined as 1100-1300 degree.days−1 for GDD and Richardson, and 40-80 for other models.For the GDD model, the base temperature (T b ) is constantly set at 0 • C.

GDD
The GDD model (Figure S1) is based on the classical thermal time concept [45].It includes two parameters (T b and F*) (Table 2), which generally assume a linear approach for the effective accumulation of daily thermal forcing when x t > T b (base temperature): We set T b at 0 • C (not calibrated) since this was successfully applied for the GDD in a simulation of the flowering timing of up to 95 grapevine varieties (over 123 locations) [9,11].Hence, only a single parameter, F*, was adjusted during parameter optimization (Table 2).

Richardson
The Richardson model (Figure S1) is a modified version of the GDD model.It includes three parameters (T b, T max and F*) (Table 2), which additionally consider a plateau above the high threshold temperature (T max ) [46]: As shown in previous studies, the Richardson model was able to accurately predict the flowering timing of both TF and TN for important wine regions in Portugal [50].

Sigmoid
The Sigmoid model (Figure S1) adopts the logistic function/sigmoid curve as the temperature-response curve [47].It includes three parameters (d, e and F*) (Table 2), which have a similar structure to the UniFORC model [10][11][12]: The sigmoid model has been calibrated and evaluated for simulating the flowering stage of numerous different grapevine varieties over multiple Portuguese wine regions [51].

Triangular
The Triangular model (Figure S1) uses the piecewise linear function with three temperature thresholds (T b, T opt, T max ) to define the temperature-response function [48].It includes four parameters (T b, T opt, T max and F*) (Table 2): This model has been well adapted to simulate the flowering timing of many white and red varieties in Portugal [10].

Wang
The Wang model (Figure S1) uses similar temperature thresholds (T b, T opt, T max ) to the Triangular model, but integrates the temperature effects with a non-linear response function [49].It includes four parameters (T b, T opt, T max and F*) (Table 2): This model has proven to be able to accurately represent the observed flowering timing of TF and TN in important Portuguese wine regions [50].

Maximum Likelihood Estimation (MLE)
The maximum likelihood estimation (MLE) is a simple optimization method intended to find the local maximum of the objective function [28].Therefore, the objective function is regarded as a likelihood function with the parameter vector representing function variables.Hence, the optimum parameter vector maximizes the likelihood.For the present study, MLE is implemented as a Markov chain Monte Carlo process based on the Metropolis-Hastings algorithm [52].In each iteration step, a random parameter vector is drawn from a Gaussian kernel in the neighborhood of the current best vector x old .If the new parameter vector x new outperforms the current one, the latter is exchanged and the next iteration cycle begins.MLE often only finds the local maxima of the objective function.Therefore, to improve the efficiency, the parameter space is scanned randomly to find the best initial parameter vector prior to the MLE iterations, i.e., burn-in samples [53].

Simulated Annealing (SA)
Simulated annealing is a probabilistic optimization method that is able to find the global optimum of the objective function [29].Its design resembles the annealing process in metallurgy.Simulated annealing utilizes the Metropolis algorithm [54].The optimum parameter is found iteratively by randomly drawing a new parameter set x new in the vicinity of the current optimum x old .The new parameter set is accepted if the objective function f (x new ) calculated from this set has increased from the previous optimum f (x old ).Otherwise, the non-optimal parameter set can still be accepted based on the Boltzmann distribution: Here, T represents a numerical temperature, intended to be reduced during iteration.Reducing the temperature will effectively reduce the probability of accepting non-optimal parameter sets.Simulated annealing has three parameters (initial temperature, number of draws at a fixed temperature level and rate of temperature reduction) that need to be carefully tuned for each problem [29].

Shuffled Complex Evolution-University of Arizona (SCE-UA)
The Shuffled Complex Evolution method, developed at The University of Arizona (SCE-UA).is an efficient, flexible and robust algorithm that is able to find the global optimum of the calibrated model [30].Following Duan et al. [30], SCE-UA embodies the properties that an optimization algorithm must possess: (1) global convergence when multiple local optimums can be found; (2) ability to avoid being trapped in non-optimum regions on the objective function surface; (3) robustness when the parameter sensitivities differ and parameter interdependencies exist; (4) non-reliance on the availability of an explicit expression for the objective function or the derivatives; and (5) high-parameter dimensionality handling.The method is based on four main concepts [30,55]: (1) a combination of deterministic and probabilistic methods; (2) systematic evolution of a 'complex' in a systematic search by drawing points spanning the parameter space, which is the global optimum; (3) competitive evolution; and (4) shuffling of the complex.Indeed, SCE-UA combines the algorithms of the simplex procedure [56], random search [57], competitive evolution [58] and complex shuffling [59].The parametrization of SCE-UA needs to be carefully tuned for each problem, and has been one of the most widely used algorithms in hydrological applications [53].

Application of Different Optimization Algorithms for Parameter Estimations
A brief summary of the important features and case-specific settings for the studied algorithms are presented in Table 3.The three algorithms differ in their approaches to deriving the initial best-guess value (and other initial settings), numeric methods and objective functions (Table 3).We have presented a diagram showing the general procedures to estimate the parameters for each model using any of these optimization algorithms (Figure 2).In brief, this could be described as follows: 1.
Define the initial distribution of the parameters.Here, we choose the uniform probability distribution function (Table 3); thus, parameter values were drawn with an equal probability between a given lower and upper boundary.

2.
Draw the initial parameter vector and choose the appropriate algorithm settings.In this step, we mostly adopt the default settings (Table 3) using the implementation software (Section 2.6) [53].Note that the step size parameter was used for the following parameter sampling.

3.
Sample the parameters around the current values, which vary mainly depending on the numeric method of the algorithm (Table 3).

4.
Perform model simulations with the sampled parameter vector and obtain the objective function by comparing the model output to the observed values.5.
Evaluate whether the new parameter vector should be kept or not, which differs in the criterion for different algorithms (Table 3).In general, a better objective function could result in a new parameter position [53].6.
Repeat steps 3-5 in an iterative process unless the finishing criterion of the algorithm has been reached (Figure 2).The details of how the calibration approach was applied could have important impacts on the estimated parameters, e.g., the initial lower and upper boundary of parameter distribution [26].Therefore, we have empirically defined the parameter boundary (baseline setting), shown in Table 2, according to the information collected from the literature data, modelling experts and local winegrowers or viticulturists.However, it could not be determined for certain that the specified boundary settings would guarantee that the algorithm would find optimal parameters.Therefore, we imposed in our experiment a variable range for the parameter boundary (Figure 2): we gradually expanded the baseline setting on each parameter range by a fixed percentage of 20%, 40%, 60%, 80% and 100%, respectively (i.e., 120-200% relative to baseline range).For each of these expanded range settings, we performed a full optimization run and computed the RMSE (modelling errors) between observations and simulations using the obtained optimized parameter vectors.Note that MLE and SA did not choose to minimize the RMSE as the objective function (Table 3).It was expected that a plateau/saturation in the modelling errors (RMSE) would be found, which would possibly imply that the appropriate parameter search space was found.
iable range for the parameter boundary (Figure 2): we gradually expanded the baseline setting on each parameter range by a fixed percentage of 20%, 40%, 60%, 80% and 100%, respectively (i.e., 120-200% relative to baseline range).For each of these expanded range settings, we performed a full optimization run and computed the RMSE (modelling errors) between observations and simulations using the obtained optimized parameter vectors.Note that MLE and SA did not choose to minimize the RMSE as the objective function (Table 3).It was expected that a plateau/saturation in the modelling errors (RMSE) would be found, which would possibly imply that the appropriate parameter search space was found.

Identification of the Main Source of Variability by ANOVA
ANOVA is a standard approach to quantifying different sources of uncertainty in crop modelling and climate impact assessment studies [5,21].To quantify the contribution of the models (5 levels), algorithms (3 levels) and varieties (2 levels) to the total variance in the prediction performance (fit to calibration data), a 3-factor ANOVA with a fixed experiment design was conducted.To obtain a uniform structure for the ANOVA, the predictive performance of the calibrated model was quantified, as the RMSE explained in Section 2.4.1.Note that the ANOVA was conducted per boundary setting.This could be useful in investigating whether the main source of variability (i.e., uncertainty) during optimization/calibration could vary with different boundary settings.

Parameter Variability between Algorithms
The parameter variability for each model was quantified as the variation (i.e., the coefficient of variation, CV) in the obtained optimal parameter values across the different optimization algorithms.The dimensionless nature of CV allows the comparison of parameters in different models.It also permits an assessment of whether a model has more unstable parameters than others when different algorithms are applied.Note that this was different from studies that computed the variability (uncertainty) in calibrated parameter values from either multiple folds of cross-validation [8], or from the obtained posterior distribution of parameters [7].

Evaluations of Goodness-of-Fit
A number of conventional goodness-of-fit measures were adopted to evaluate the agreement between the observation and simulation of the optimal parameters [5,7,8], including Mean Bias Error (MBE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and the coefficient of determination (R 2 ).These metrics mainly evaluated how well the model simulations, using the obtained parameters for each algorithm, agreed with the observations.Note that the evaluation was based on the same data that were used for calibration, meaning that the results are generally not indicative of the predictive performance for new and different situations.This could be useful for comparison with other studies [5,13].The ensemble median among different models was adopted as the best predictor, in accordance with a study with 27 different phenology modelling groups that showed a slightly better prediction performance for the ensemble median than that of the ensemble mean [6].Moreover, since the selected models had varying degrees of complexity, Akaike's Information Criterion (AIC) [60] was used to rate the prediction accuracy over the model structures.It aimed to evaluate the trade-off between model parsimony and efficiency, as it allowed the combination of information on the model complexity in terms of the number of parameters and the prediction accuracy in terms of the mean squared error [60].It has recently been recommended that its application is integrated into standard calibration guidelines [13].The lower the AIC score is, the better trade-off the model has: where O i and S i were the observed and simulated values using the obtained best-fit parameters, respectively, and n and k corresponded to the sample size of the observation and the number of parameters, respectively.

Overview of Optimization Performance
It is evident that all three algorithms tend to follow a very similar pattern, which consistently shows that the different boundaries of the initial parameter distribution have an impact on the optimization performance (Figure 3).However, it also depends on the phenology model in use.When the search space is expanded by 100% from the initial baseline setting (Table 2), the optimized final RMSE is considerably reduced from approximately 10-12 days (depending on the variety) to 6-7 days for the GDD model, while the remaining models have a relatively smaller variation of about 5-8 days (Figure 3).The plateau/saturation of the optimization performance for all the models and algorithms was when the expanded parameter range was between 80% and 100% (Figure 3).
A sufficient setting for the initial parameter distribution is important for the parameter optimization.The assumed uniform distribution is generally considered appropriate [7,32].Therefore, our focus turns to the lower and upper parameter limits.The results indicate that the initially defined parameter boundary (baseline settings) (Table 2) is insufficient, as smaller optimized RMSE values are found when it is expanded.An appropriate boundary is found when the initial range is expanded by 100% (80% is not chosen to avoid random variability between 60-80%, which is detected in MLE) (Figure 3).The corresponding updated parameter boundary is presented in Table S1, which should set the standard for the subsequent analysis.With this setting, a better fit to the calibration data is consistently found in TF rather in TN (Figure 3), which could be possibly due to the smaller temporal variability in the phenology observations for TF than for TN (Table 1).

Variations in the Main Source of Uncertainty
The ANOVA results quantify the different sources of variance to the total prediction errors (quantified using RMSE) of the calibrated models, which are presented in Table 4.The results indicate that the dominant source of variability can depend on the initial parameter boundary (Table 4).Variations among models explain most of the total variance when the initial boundary is expanded by up to 40% (Table 4).At a larger expanded boundary between 60% and 100%, the difference between the two varieties appears to account for most of the variance (Table 4).In comparison, the variations among the algorithms or the algorithm-related combinations (Algorithm × Model or Algorithm × Variety) contribute only a small fraction to the total variance throughout (Table 4).All factors, in- Many parameter estimation studies choose to define the initial parameter boundary using their empirical knowledge and information in the literature [7,8,31,32].For instance, in a case study from arid Northwest China, Ran et al. [32] suggest that the initial lower and upper parameter boundary can be directly set as ±30%, which is around the nominal value defined by the data in the literature; however, they found two key parameters with incorrect posterior distributions, resulting from their inaccurate initial nominal values [32].This implies that the empirical choice could sometimes be inaccurate.Hence, it is important to explicitly examine different boundaries.For all the algorithms, different boundary settings cause a larger variation in the optimized RMSE for the GDD model (for both varieties) than for the other models (Figure 3).GDD is a simple linear model and has only one parameter (F*); this is calibrated in our case (Table 2).Accordingly, more precise information on the initial parameter distribution is required.However, it also seems that the appropriate value of the thermal threshold (F*) for the flowering stage of the two varieties of vine is not covered by our initial baseline setting (Table 2).

Variations in the Main Source of Uncertainty
The ANOVA results quantify the different sources of variance to the total prediction errors (quantified using RMSE) of the calibrated models, which are presented in Table 4.The results indicate that the dominant source of variability can depend on the initial parameter boundary (Table 4).Variations among models explain most of the total variance when the initial boundary is expanded by up to 40% (Table 4).At a larger expanded boundary between 60% and 100%, the difference between the two varieties appears to account for most of the variance (Table 4).In comparison, the variations among the algorithms or the algorithm-related combinations (Algorithm × Model or Algorithm × Variety) contribute only a small fraction to the total variance throughout (Table 4).All factors, including their interactions, show statistical significance (p < 0.01), except for the interaction between algorithm and variety (Table 4).The ANOVA design is analogous to that of a study estimating the variance in model errors based on simulations with optimized parameters using three different calibration methods [7].It is important to decompose the total variance in model errors to better understand the contributions of different sources.Similarly, Kawakita et al. [8] present a study involving 3 optimization methods × 3 phenology models × 3 Japanese wheat cultivars, concluding that all three factors impact the predictive performance (quantified with RMSE).However, they do not reveal which factor tends to play a major role [8].In an attempt to fill this research gap, we demonstrate that the main source of uncertainty for the optimization results can vary with the assumed initial parameter boundary.The choice of model tends to play a major role when the initial parameter boundary is expanded by up to 40%, which can relate to the relatively high prediction errors of the GDD model (Figure 3).It appears that expanding the initial range by 40% (i.e., 140% relative to baseline setting) is still insufficient to optimize every model.
Given that the standard setting is to expand the initial baseline boundary by 100%, the "true" dominant source of uncertainty is identified as variety (Table 4).For a given variety, either the choice of the model or optimization method can be treated as a tool for analysis.Ideally, smaller uncertainties should be expected from them in order to better approximate observations of different varieties (predicting events), which can be considered as the external factor.This is especially important when there is a natural variability between observations, i.e., different variability in observed phenology between TF and TN (Table 1).Therefore, an overall reasonable calibration result is achieved in our study.Indeed, the difference between varieties represents the second largest source of uncertainty even when there is a still remarkable variability between models (boundary expansion ≤ 40%) (Table 4).In contrast, the selection of the optimization algorithm is identified as the smallest contribution to the overall uncertainty (Table 4), which can also be evidenced by a very similar pattern of evolution for the predictive performance shown in Figure 3. Gao et al. [7] come to similar findings using three different calibration approaches, namely Generalized Likelihood Uncertainty Estimation (GLUE), MCMC and OLS for the flowering timings of four different rice varieties in a multi-environmental trial in Madagascar and Senegal.
However, for another study simulating the heading date of Japanese winter wheat, the simulation errors of the calibrated models for new and different environments tend to vary considerably among different algorithms [8].Therefore, we assume that whether the results among different algorithms agree or not may depend on the subject, region and focus of the study.In our case, for predicting the flowering timing of two vine varieties, the focus is on the predictive performance of the same data used for calibration.The identified small variability among the algorithms could prompt users to just choose the simplest one, i.e., MLE (relatively simple in terms of code implementation), to calibrate their models.However, it could be of high interest to evaluate how well the calibrated models that use these algorithms predict new and different situations using independent data sets [6].Note that the spatial variability in the prediction performance can also contribute to the total prediction uncertainties; however, this is not separately quantified, as it is already included at the variety level.

Estimations of Parameter Variability
At the standard initial boundary setting (Figure 3), the estimated parameter values are considered optimal by a given algorithm, which is presented in Table S2.The resulting parameter variability among algorithms is shown to depend on the model structure (Figure 4).For both varieties, a relatively lower CV (≤20%) is found for models with near-linear functions (linear plateau for GDD and Richardson, and piecewise linear for Triangular), while the non-linear models (Sigmoid and Wang) present a much higher CV (up to 274%) (Figure 4).In particular, all the obtained optimal parameters diverge substantially among the algorithms for the Sigmoid model (CV is 25-75% for TF and 42-95% for TN), while for the Wang model, the unstable parameter among the algorithms mainly occurs for the T b parameter (CV is 30% for TF and 274% for TN) (Figure 4).In contrast, stable (CV ≤ 6%) parameters are obtained for the GDD and Richardson models, with a near-linear structure and a growing degree-day-based concept (Figure 4 and Table S2).S2.Note that the CV value for Tb parameter of the Wang model in TN is denoted as it exceeds 100%.

Similarity among Algorithms
For individual models, the evaluation results generally show satisfactory agreement between the observations and simulations using optimized parameters, regardless of the algorithm (Figures S2-S4).Differences in the MBE, MAE and RMSE among the algorithms are negligible (≤2 days), while no changes are detected for R 2 except for the Sigmoid model (Figures S2-S4).Specifically, the Sigmoid model optimized with SA (Figure S3) provides Selected optimization algorithms should ideally converge into the same or nearly the same parameter values for a given model.The variation in the final parameter values, if any, could be small and possibly result from inherently different numeric accuracies between the algorithms or the intrinsic randomness of each algorithm [13,53].In our case, similar parameter values occur for near-linear models (GDD, Richardson and Triangular).However, a larger parameter variability is found for Sigmoid and Wang.Parameter variability could be associated with the issue of equifinality or being multi-modal [61]: different parameter combinations (Figure 4 and Table S2) result in similar prediction errors among algorithms (Figure 3).This is a common phenomenon in phenology modelling [13,26,31].The underlying cause can be the issue of parameter correlation [8,25].Liu et al. [25] show that parameters with a higher uncertainty can have a stronger relationship with others, using data collected across major winter wheat production provinces in China.In most of the temperature-response functions of our models (Figure S1), a negative correlation is found between T b and F*, where a lower T b value can be offset by a higher F*.This means that an increased temperature accumulation rate (per day) due to a lower base temperature can be eventually offset by setting a higher thermal threshold (we confirm this point, but it is not shown).Therefore, calibration does not always lead to unique parameter values [8,13,26,27,31].
Moreover, our results illustrate that the variability in the estimated parameter values, using different optimization algorithms, may depend on the model structure.It was previously reported that models with fewer parameters can have smaller parameter uncertainties: the largest uncertainty mainly occurs for a sigmoidal and exponential function model (similar to our sigmoid model) with many mutually correlated parameters [8].Besides the number of parameters, our study further reveals that models with a more complex (e.g., non-linear) structure tend to have more unstable parameters than those with a simpler structure (e.g., near-linear) (Figure 4).In other words, equifinality or being multi-modal is more prone to occur during the optimization of non-linear models than for near-linear ones.It is possible that complex models tend to have a more complex surface of objective functions, leading to multiple local optimums with different parameter vectors.Nevertheless, higher parameter variability is generally found for TN than for TF (Figure 4), which can be associated with the higher inter-annual variability in the observed phenology of TN than for TF (Table 1), i.e., it is more difficult to find best-fit parameters given the observed data.

Similarity among Algorithms
For individual models, the evaluation results generally show satisfactory agreement between the observations and simulations using optimized parameters, regardless of the algorithm (Figures S2-S4).Differences in the MBE, MAE and RMSE among the algorithms are negligible (≤2 days), while no changes are detected for R 2 except for the Sigmoid model (Figures S2-S4).Specifically, the Sigmoid model optimized with SA (Figure S3) provides slightly less satisfactory predictions (R 2 = 0.5) than those (R 2 is 0.6-0.7)by coupling it with either MLE (Figure S2) or SCE-UA (Figure S4).Consequently, the ensemble median predictor results in a similar predictive performance over the algorithms, where the corresponding RMSE (MAE) is 5-6 (4-5) days, with minimized MBE (0-1 days) and moderately good R 2 (0.5-0.6) (Figure 5).
The evaluation of how well the calibrated model (with the obtained optimized parameters) can reproduce the data is an integral part of model calibration studies [5,7,8,25].Although these algorithms diverge in their obtained optimal parameter values (mostly for non-linear models) (Figure 4), they nearly converge in their prediction performance (Figures 5 and S2-S4).This is also consistent with the ANOVA results (Table 4).The algorithms are all demonstrated to be useful and effective for parameter estimations/calibrations for the multi-model ensemble, as evaluated by a range of goodness-of-fit metrics (Figures 5 and S2-S4).The results are generally comparable to those within the grapevine phenology modelling community [9][10][11]51].For instance, the RMSE is 5.9-6.3days for a differently parameterized Spring Warming model when predicting the flowering stage of 43 grapevine varieties based on 1092 observations across a wide range of European wine regions [11].RMSE ≤ 7 days is generally regarded as the key criterion for an acceptable predictive performance [9][10][11]51].This is obtained in the current study, together with almost zero mean bias and a small MAE (≤6 days); this is the case for either the individual models or for the ensemble median predictor.The multi-model ensemble median predictions are comparable to or nearly as good as those of the best model (Figures 5 and S2-S4).This aspect is consistent with other studies [6,14].Slightly poorer predictions for the Sigmoid model optimized by SA than those of the other two algorithms could be mainly due to the unrealistically small F* value (7.7 for TN) obtained by the SA, accompanied by a high e value (Table S2), i.e., more difficulties in obtaining effective temperature accumulations.SA, as a hyper-parameter algorithm, should be carefully tuned for each specific optimization task.Thus, it is advisable that the different parameter settings of SA are tested and explored in future studies.Moreover, the obtained R 2 (0.4-0.7) is less satisfactory (Figures S2-S4), and the ensemble median predictor can only explain up to 60% of the total spatial-temporal variations in the observations  S2) from (a) MLE, (b) SA and (c) SCE-UA.Touriga Franca (TF) and Touriga Nacional (TN) are denoted with orange and green symbols, respectively.Mean Bias Error (MBE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and the coefficient of determination (R 2 ) are conventional statistics for evaluating the goodnessof-fit.The geographic locations of the study sites for each variety are presented in Figure 1 and Table 1.Evaluations for individual model simulations are shown in the supplementary material (Figures S2-S4).
Slightly poorer predictions for the Sigmoid model optimized by SA than those of the other two algorithms could be mainly due to the unrealistically small F* value (7.7 for TN) obtained by the SA, accompanied by a high e value (Table S2), i.e., more difficulties in obtaining effective temperature accumulations.SA, as a hyper-parameter algorithm, should be carefully tuned for each specific optimization task.Thus, it is advisable that the different parameter settings of SA are tested and explored in future studies.Moreover, the obtained R 2 (0.4-0.7) is less satisfactory (Figures S2-S4), and the ensemble median predictor can only explain up to 60% of the total spatial-temporal variations in the observations (Figure 5).It should be noted that the current study utilized a gridded meteorological dataset (0.1 • ) [39] for phenology modelling.It is expected to have input errors during calibration.The overall moderate performance suggests the of the gridded dataset for modelling actual observed phenology, hinting at the potential for large-scale applications.Yet, to better capture the underlying spatial-temporal variability, models need to see continuous improvement (e.g., to incorporate the photoperiod and dormancy effects), along with better and high-quality field measurements (e.g., data from a target population) [6].

Difference between Modelling Types
AIC shows a similar pattern to that of previous evaluation results (Figures S2-S4), where a better performance is found for TF than for TN, and different algorithms result in very similar results but with variations among models (Figure 6).For TF, the AIC values are, on average, about 123 for degree-day (GDD, Richardson) and 118 for non-degreeday (Sigmoid, Triangular and Wang) models, respectively (Figure 6a).For TN, they are, respectively, 134 and 126 for degree-day and non-degree-day models (Figure 6b).Small variations are generally found within each model type (Figure 6).

Difference between Modelling Types
AIC shows a similar pattern to that of previous evaluation results (Figures S2-S4), where a better performance is found for TF than for TN, and different algorithms result in very similar results but with variations among models (Figure 6).For TF, the AIC values are, on average, about 123 for degree-day (GDD, Richardson) and 118 for non-degree-day (Sigmoid, Triangular and Wang) models, respectively (Figure 6a).For TN, they are, respectively, 134 and 126 for degree-day and non-degree-day models (Figure 6b).Small variations are generally found within each model type (Figure 6).
It is of particular interest to quantitatively evaluate individual model performances to understand the ensemble variability, which could indicate the ensemble quality, e.g., if some models outperform others.One of the somewhat commonly applied evaluation metrics for phenology modelling is AIC, which considers both the predictive performance and model complexity [11,12,50].To predict the flowering stage in the DDR, Costa et al. [50] report AIC values of 159-172 for TF and 170-180 for TN, using four phenology models optimized by SA.In comparison, our AIC values are considerably smaller for both varieties (<135) (Figure 6), suggesting that a better trade-off (parsimony and efficiency) is obtained for the currently applied model and algorithm combinations.Moreover, our results indicate that the non-degree-day-based models mostly outperform the degree-day-based models (Figure 6).Although GDD is preferentially selected over many other models as the best choice for simulating the flowering timing of 95 different grapevine varieties [9], it is not the favorite choice for TF and TN within the DDR, where high-temperature events are frequent (Figure 1b) [33].It is known that models based on the concept of "degreedays" have limited applicability in phenology predictions under extreme climate conditions [2].Meanwhile, for non-degree-day models, better performance is achieved with a non-linear structure (Sigmoid and Wang) (Figure 6).Although these models tend to have more unstable parameters when different algorithms are used (Figure 4), they seem to have more appropriate temperature-response functions for predicting the flowering stage of the two varieties in the DDR.Nevertheless, the best model and/or algorithm should be dependent on a specific situation, e.g., environment and/or variety.It is of particular interest to quantitatively evaluate individual model performances to understand the ensemble variability, which could indicate the ensemble quality, e.g., if some models outperform others.One of the somewhat commonly applied evaluation metrics for phenology modelling is AIC, which considers both the predictive performance and model complexity [11,12,50].To predict the flowering stage in the DDR, Costa et al. [50] report AIC values of 159-172 for TF and 170-180 for TN, using four phenology models optimized by SA.In comparison, our AIC values are considerably smaller for both varieties (<135) (Figure 6), suggesting that a better trade-off (parsimony and efficiency) is obtained for the currently applied model and algorithm combinations.Moreover, our results indicate that the non-degree-day-based models mostly outperform the degree-day-based models (Figure 6).Although GDD is preferentially selected over many other models as the best choice for simulating the flowering timing of 95 different grapevine varieties [9], it is not the favorite choice for TF and TN within the DDR, where high-temperature events are frequent (Figure 1b) [33].It is known that models based on the concept of "degree-days" have limited applicability in phenology predictions under extreme climate conditions [2].Meanwhile, for non-degree-day models, better performance is achieved with a non-linear structure and Wang) (Figure 6).Although these models tend to have more unstable parameters when different algorithms are used (Figure 4), they seem to have more appropriate temperature-response functions for predicting the flowering stage of the two varieties in the DDR.Nevertheless, the best model and/or algorithm should be dependent on a specific situation, e.g., environment and/or variety.

Conclusions
We have presented a study illustrating the influencing of common factors (model, optimization algorithm and variety) on the predictive performance of calibrated phenology models.The parameters of five models (GDD, Richardson, Sigmoid, Triangular and Wang) with varying degrees of complexity are optimized for simulating the flowering stage of two important grapevine varieties in a mountainous wine region (DDR), using three different optimization algorithms (MLE, SA and SCE-UA).We demonstrate that it is important to first define the appropriate boundary of the initial parameter distribution.A narrow range in the distribution could hinder the optimization algorithm from finding out the best-fit parameters, leading to a contrasting prediction performance between the model groups.Using the appropriate parameter boundaries, the variety in the simulation represents the largest contribution to the overall variability.Ideally, the variability among the models or optimization algorithms (considered as internal factors) should be smaller than that between varieties (considered as the external factor), in order to provide indicative results for a reasonable calibration.This is particularly necessary when there is natural variability between two varieties, as reflected by the different inter-annual variability in the observed phenology data.In contrast, the lowest contribution to the overall variability is the choice of the applied algorithm.All algorithms result in a similar magnitude of goodness-offit, with the calibrated multi-model ensemble median providing fairly good predictions.Within the multi-model ensemble, non-linear models (Sigmoid and Wang) outperform nearlinear ones, especially for those with the degree-day-based concept (GDD and Richardson).However, the former models tend to have more variable parameters than the latter ones when different algorithms are used, suggesting the link between parameter variability and model structure.
Overall, our study could provide new insights in order to better inform users about the importance of their choice when different optimization methods are available and when estimating cultivar-dependent parameters for multiple phenology models.Consequently, this study might contribute to formulating guidelines on the appropriate calibration practices for phenology models or crop models in general.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/agronomy13030679/s1, Figure S1 S1: Selected parameters of phenology models for calibration with the specified initial lower and upper boundary of parameter distribution corresponds to the expanded baseline setting of each parameter range in Table 2 by 100%; Table S2: The optimized parameter values of applied phenology models under different algorithms given the observed flowering data for the variety of Touriga Franca (TF) and Touriga Nacional (TN).These are

Figure 1 .
Figure 1.The spatial distribution of (a) study sites (numbered per Table 1) where the phenology data of two grapevine varieties (TF: Touriga Franca, TN: Touriga Nacional) are collected within the Douro Demarcated Wine Region and (b) the monthly mean of daily minimum (Tmin, °C), average (Tavg,°C) and maximum (Tmax, °C) temperatures over 1991−2020 (error bars: standard deviation of monthly means), along with the maximum duration (days) of consecutive hot days (Tmax ≥ 25 °C for at least 5 days) for each month.The meteorological statistics are based on the spatial average of all study sites.

Figure 1 .
Figure 1.The spatial distribution of (a) study sites (numbered per Table 1) where the phenology data of two grapevine varieties (TF: Touriga Franca, TN: Touriga Nacional) are collected within the Douro Demarcated Wine Region and (b) the monthly mean of daily minimum (T min , • C), average (T avg , • C) and maximum (T max , • C) temperatures over 1991−2020 (error bars: standard deviation of monthly means), along with the maximum duration (days) of consecutive hot days (T max ≥ 25 • C for at least 5 days) for each month.The meteorological statistics are based on the spatial average of all study sites.

Figure 2 .
Figure 2. A universal procedure to apply different optimization algorithms in order to estimate the parameters of phenology models.The iterative process is denoted by dash lines.

Figure 2 .
Figure 2. A universal procedure to apply different optimization algorithms in order to estimate the parameters of phenology models.The iterative process is denoted by dash lines.

Agronomy 2023, 13 , 679 13 of 23 Figure 3 .
Figure 3. Root Mean Squared Error (RMSE) computed between five phenology model simulations of the optimum parameters obtained from three global optimization algorithms (MLE: Maximum Likelihood Estimation; SA: Simulated Annealing; SCE-UA: Shuffled Complex Evolution) and the corresponding observed flowering data of two dominant Portuguese grapevine varieties (Touriga Franca: filled orange symbols; Touriga Nacional: filled green symbols).The orange and green symbols are slightly displaced in the x-axis to avoid the symbol overlap.The analysis is carried out by gradually expanding the initially defined range of distribution for each model parameter (baseline settings) by up to 100% (the percentage in the x-axis corresponds to each expanded percentage test category).An expanded range of 20-100% corresponds to 120-200%, relative to the initial baseline range.The observed phenology data correspond to blended multi-vineyard observations within the Douro region in the period of 2014-2021.

Figure 3 .
Figure 3. Root Mean Squared Error (RMSE) computed between five phenology model simulations of the optimum parameters obtained from three global optimization algorithms (MLE: Maximum Likelihood Estimation; SA: Simulated Annealing; SCE-UA: Shuffled Complex Evolution) and the corresponding observed flowering data of two dominant Portuguese grapevine varieties (Touriga Franca: filled orange symbols; Touriga Nacional: filled green symbols).The orange and green symbols are slightly displaced in the x-axis to avoid the symbol overlap.The analysis is carried out by gradually expanding the initially defined range of distribution for each model parameter (baseline settings) by up to 100% (the percentage in the x-axis corresponds to each expanded percentage test category).An expanded range of 20-100% corresponds to 120-200%, relative to the initial baseline range.The observed phenology data correspond to blended multi-vineyard observations within the Douro region in the period of 2014-2021.

Figure 4 .
Figure 4. Coefficient of variation (CV, %) for obtained optimal parameter values among the three applied algorithms for (a) Touriga Franca (TF) and (b) Touriga Nacional (TN).These are obtained with the expanded lower and upper boundary of parameter distribution by 100%, relative to the baseline settings.The obtained individual parameter values are presented in TableS2.Note that the CV value for Tb parameter of the Wang model in TN is denoted as it exceeds 100%.

Figure 4 .
Figure 4. Coefficient of variation (CV, %) for obtained optimal parameter values among the three applied algorithms for (a) Touriga Franca (TF) and (b) Touriga Nacional (TN).These are obtained with the expanded lower and upper boundary of parameter distribution by 100%, relative to the baseline settings.The obtained individual parameter values are presented in TableS2.Note that the CV value for T b parameter of the Wang model in TN is denoted as it exceeds 100%.

Figure 5 .
Figure 5.Comparison between observed flowering timing (DOY) and the median of the ensemble model simulations using optimized parameter values (TableS2) from (a) MLE, (b) SA and (c) SCE-UA.Touriga Franca (TF) and Touriga Nacional (TN) are denoted with orange and green symbols, respectively.Mean Bias Error (MBE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and the coefficient of determination (R 2 ) are conventional statistics for evaluating the goodness-offit.The geographic locations of the study sites for each variety are presented in Figure1and Table1.Evaluations for individual model simulations are shown in the supplementary material (FiguresS2-S4).

Figure 5 .
Figure 5.Comparison between observed flowering timing (DOY) and the median of the ensemble model simulations using optimized parameter values (TableS2) from (a) MLE, (b) SA and (c) SCE-UA.Touriga Franca (TF) and Touriga Nacional (TN) are denoted with orange and green symbols, respectively.Mean Bias Error (MBE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and the coefficient of determination (R 2 ) are conventional statistics for evaluating the goodnessof-fit.The geographic locations of the study sites for each variety are presented in Figure1and Table1.Evaluations for individual model simulations are shown in the supplementary material (FiguresS2-S4).

Figure 6 .
Figure 6.Evaluations with the Akaike Information Criterion (AIC) using optimized parameters over different algorithms and models for (a) Touriga Franca (TF) and (b) Touriga Nacional (TN).The

Figure 6 .
Figure 6.Evaluations with the Akaike Information Criterion (AIC) using optimized parameters over different algorithms and models for (a) Touriga Franca (TF) and (b) Touriga Nacional (TN).The lower the AIC value, the better the trade-off between the model parsimony and prediction accuracy.The analysis is carried out under the expanded lower and upper boundary of the parameter distribution by 100%, relative to the baseline settings.

:
Graphic representation of examples of different parameter sets provided for each studied phenology model; Figures S2-S4: Comparison between observed and simulated flowering DOY (day of year) by individual phenology model simulations with optimized parameter values from the Maximum Likelihood Estimation (MLE), Simulated Annealing (SA) and Shuffled Complex Evolution-University of Arizona (SCE-UA) respectively for Touriga Franca (TF) and Touriga Nacional (TN).Conventional statistics such as Mean Bias Error (MBE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and the coefficient of determination (R2) are calculated; Table

Table 1 .
Mean and standard deviation (sd) of the Day Of Year (DOY) for the observed flowering (BBCH65) of two main grapevine varieties (Touriga Franca: TF; Touriga Nacional: TN) within the Douro Demarcated Region (DDR) during the period of 2014-2021.

Table 2 .
Selected parameters of phenology models for calibration.The specified initial lower and upper boundary of uniform parameter distribution corresponds to the baseline setting.The "×" symbol indicates the parameter intended for the calibration of a given model.All model simulations are driven by the daily mean temperature from the first day of the year (t 0 ).

Table 3 .
A brief summary of the features and user-specified settings for the three employed optimization methods.

Table 4 .
Analysis of Variance (ANOVA) for RMSE computed between observations and simulations (with the optimal parameters) from different combinations of optimization algorithms, phenology models and grapevine varieties.ANOVA is conducted under different settings of parameter search space.Asterisks indicate the statistical significance of F statistics at p < 0.01 (df: degree of freedom).The main source of variance is highlighted in bold for each experimental setting.The results for intercept are not shown.Note computation of all statistics (e.g., F value) is based on the full numeric precision, but numbers are all rounded to only three decimal digits for simplicity in this table.