Early Prediction of Coffee Yield in the Central Highlands of Vietnam Using a Statistical Approach and Satellite Remote Sensing Vegetation Biophysical Variables

: Given the present climate change context, accurate and timely coffee yield prediction is critical to all farmers who work in the coffee industry worldwide. The aim of this study is to develop and assess a coffee yield forecasting method at the regional scale in Dak Lak province in the central highlands of Vietnam using the Crop Growth Monitoring System Statistical Tool (CGMSstatTool— CST) software and vegetation biophysical variables (NDVI, LAI, and FAPAR) derived from satellite remote sensing (SPOT-VEGETATION and PROBA-V). There has been no research to date applying this approach to this speciﬁc crop, which is the main contribution of this study. The ﬁndings of this research reveal that the elaboration of multiple linear regression models based on a combination of information from satellite-derived vegetation biophysical variables (LAI, NDVI, and FAPAR) corresponding to the ﬁrst six months of the years 2000–2019 resulted in coffee yield forecast models presenting satisfactory accuracy (Adj.R 2 = 64 to 69%, RMSEp = 0.155 to 0.158 ton/ha and MAPE = 3.9 to 4.7%). These results demonstrate that the CST may efﬁciently predict coffee yields on a regional scale by using only satellite-derived vegetation biophysical variables. This study ﬁndings are likely to aid local governments and decision makers in precisely forecasting coffee production early and promptly, as well as in recommending relevant local agricultural policies.


Introduction
Coffee is one of the most crucial agricultural products in the global market, playing a significant part in the economy of several developing countries in equatorial and subequatorial regions (Africa, America, and Asia) [1][2][3]. Currently, two coffee bean species, Coffee arabica L. (Arabica coffee) and C. canephora Pierre ex A. Froehner (Robusta coffee), account for 99% of coffee production in the global coffee trade [3,4]. Coffee is grown in approximately 80 tropical countries and contributes to the economic base of many of these countries. In addition, about 25 million farmer families produce coffee worldwide, with most being smallholders and families whose source of revenue largely depends on this crop [4]. Coffee is a climate-sensitive perennial plant likely to be highly influenced by changes in climate. Increasing climate variability may lead to coffee yield decrease and coffee area damage and threaten coffee production in producing areas worldwide [5,6]. province of Vietnam [21]. Furthermore, Molina et al. (2018) successfully calibrated the Aquacrop model to predict arabica coffee yield in Colombia [22]. The input databases for this application contained parameters associated with climatic variables, soil, crops, and management practices.
Most of the aforementioned models simulate coffee yields by accounting for the growing conditions. These models mainly depend on the data collected in the field, as well as observed weather data. The accuracy of such models also depends on accurate descriptions of crop management practices (e.g., crop variety, sowing date, fertilization, and irrigation), although collecting such data in a sufficiently accurate manner is difficult at the regional scale [23]. Furthermore, several years of experimental data are necessary to train and calibrate models to the local environmental conditions for these crop models, and when they are applied in other regions, they have to be recalibrated [23].
Due to the limitations of these models, statistical models, such as multiple linear regression, have been widely utilized to link crop yields to climate variables [24,25] or even intermediate output variables from process-based crop models [26]. Despite not being directly based on the mechanisms of plant growth, statistical models can effectively predict crop production [23]. The main benefits of statistical models are their limited dependence on field calibration data and their clear assessment of model uncertainties [27]. Statistical models typically perform better as the availability and quality of observable data improve [23].
Among the various tools and methods that enable the development of statistically based models, the "Crop Growth Monitoring System Statistical Tool" (CGMSstatTool-CGMS statistical tool-CST) [28] is independent software for forecasting crop yield based on initial indicator databases derived from crop models, climate data, or remote sensing data [29]. The CST was developed by the MARS (Monitoring Agriculture with Remote Sensing) unit of the European Union (EU) Joint Research Centre (JRC) to support the development and selection of crop yield forecast models in order to assist national or subnational crop yield forecasting activities [28]. The CST plays a crucial role in scientific decision making in the EU agricultural economy [28]. The CST has been effectively applied to growth monitoring and yield forecasting of some main crops in northeast China [30]. Fall et al. (2021) also used the CST to predict millet yield at a regional scale in Senegal with input data containing weather data combined with variables derived from remote sensing indicators (NDVI) [23]. The CST enables simple examination of data quality, analysis of crop yield time trend, and construction of crop yield forecasting models through three methods: (1) multivariate regression analysis; (2) scenario analysis, which is a method of forecasting that looks for the previous years that are most similar to the current year based on a set of indicators and combines their yields [23]; and (3) the moving average analysis model, which is based on the average yields of the most recent years preceding the target year. The CST calculates a number of statistics that facilitate the choice of the best crop yield forecast model for a given region and time of prediction. Another advantage of the CST is its ability to rapidly test multiple models [23,28].
With the development of satellite imagery, agricultural monitoring systems have used indices derived from the spectral reflectance of vegetation to provide timely and concise information about seasonal vegetative growth [31]. Remote-sensing-derived vegetation indices (e.g., the normalized difference vegetation index, NDVI) and biophysical variables (e.g., the fraction of absorbed photosynthetically active radiation (FAPAR) and the leaf area index (LAI)) can be used to predict crop yield, either directly or indirectly [32,33]. In addition, remote sensing vegetation variables enable estimation of crop growth variability to quantify the relative development and health conditions of crops [34]. Such vegetation indices and biophysical variables are the most common satellite products utilized for these purposes [31]. At the national and regional levels, satellite systems can contribute effectively to early warning of crop stress during the growing period and in forecasting harvest yields [31,35]. In a study of the largest coffee-exporting province in Brazil using a dataset covering the 2002-2009 period, Bernardes et al. (2012) [36] observed correlations between variations in the yield of coffee plots and variations in MODIS-derived EVI and NDVI vegetation indices computed from pure coffee crops (250 m pixels) overlapping the same coffee plots. The vegetation index metrics best correlated to yield were the amplitude and the minimum values over the growing season. The best correlations were obtained between the variation in yield and variation in vegetation indices of the previous year (R 2 = 0.55). In another study, Nogueira (2018) [37] evaluated the relationships between coffee productivity of some coffee plantations in Brazil and values of NDVI, SAVI, and NDWI vegetation indices derived from LANDSAT-8-OLI sensors for different coffee phenological phases. They concluded that the best phenological phases of coffee to determine coffee productivity from spectral indices were the stages of dormancy and flowering. The results also indicated that the NDVI was the best index to estimate the productivity of coffee trees, with the coefficient of determination (R 2 ) ranging from 0.58 to 0.90.
The objective of this research is to develop and assess a coffee yield forecasting method at the regional scale in Dak Lak province in the central highlands of Vietnam using Crop Growth Monitoring System Statistical Tool (CGMSstatTool-CST) software and vegetation biophysical variables (NDVI, LAI, and FAPAR) derived from satellite remote sensing (SPOT-VEGETATION and PROBA-V).
The findings of this study are expected to assist local governments and decision makers in accurately forecasting coffee yields early and in a timely manner, as well as in recommending appropriate strategies for local agriculture.

Study Area
The study was carried out in Dak Lak province, located in the central highlands of Vietnam in the Lower Mekong River Basin. The total area is 13,125 km 2 , and in 2019, the population of Dak Lak province was 2.127 million people. Currently, Dak Lak province includes Buon Ma Thuot city, Buon Ho town, and 13 districts.
In Dak Lak province, agriculture is the main source of local livelihoods. The area's geographic coordinates are from 107 • to 109 • east longitude and from 12 • to 13 • north latitude ( Figure 1), with an average elevation range of 400-800 m. Dak Lak province is dominated by a humid tropical climate. Generally, the area climate varies depending on the altitude: below 300 m, it is hot all year round, in the range of 400-800 m it is hot and humid, and it is cool at altitudes above 800 m [38]. There are two distinct seasons in Dak Lak province: a rainy season from May to October, with approximately 80-85% of annual rainfall, and a dry season from November to April, which is generally dry and sunny (15-20% of annual rainfall). Dak Lak province is an agricultural area with perennial crops such as coffee, pepper, cashew, and fruits, which play an important part in its economy. The region also produces annual crops such as rice, maize, sweet potato, vegetables, sugarcane, groundnut, and soybean [39]. Dak Lak has 209,955 ha of coffee area, accounting for nearly 31% of the country's coffee area [40]. In Dak Lak province, coffee exports represent 86% of total agricultural exports and more than 60% of the total yearly provincial income. In addition, coffee production employs more than 300,000 direct workers and more than 100,000 indirect workers [41]. The vast majority of coffee trees are part of coffee tree plantations, where coffee trees are the main vegetation story. Irrigation has been applied one to four times per year since 2008 across robusta coffee crops in Dak Lak province (on average, 1345 L/tree/year, i.e., 148 mm/year). Irrigation quantities vary based on rainfall patterns during the coffee growing season [42].
The general methodological workflow followed in the present research is presented in Figure 2 and further detailed below.
14, x FOR PEER REVIEW 5 of 21 The general methodological workflow followed in the present research is presented in Figure 2 and further detailed below.

Vegetation Biophysical Variables
Coffee yield forecasting is based on satellite imagery from Copernicus Hub 2022 (source: https://land.copernicus.vgt.vito.be (accessed on 22 December 2021)), namely NDVI, LAI, and FAPAR, available at a decadal (10-day) time step for the entire study area in Dak Lak province and the same years as the official coffee yield statistics ( Table 1). The 2000-2020 time series of decadal LAI, NDVI, and FAPAR products (21 years × 36 dekads/year) derived from the SPOT-VEGETATION and PROBA-V instruments were used in this study. The products are freely available at a 1 km global spatial resolution.   NDVI has theoretical values ranging from −1 to +1, where negative values mostly correspond to clouds, water, and snow, whereas values near zero primarily correspond to rocks and bare soil [23]. NDVI increases progressively with vegetation development.
"The Leaf Area Index is defined as half the total area of green elements of the canopy per unit horizontal ground area. The satellite-derived value corresponds to the total green LAI of all the canopy layers, including the understory which may represent a very significant contribution, particularly for forests. Practically, the LAI quantifies the thickness of the vegetation cover." (source: https://land.copernicus.eu/global/products/lai (accessed on 22 December 2021)) "The FAPAR quantifies the fraction of the solar radiation absorbed by live leaves for the photosynthesis activity. Then, it refers only to the green and alive elements of the canopy." (source: https://land.copernicus.eu/global/products/FAPAR (accessed on 22 December 2021)) In Vietnam, coffee phenology can be divided into five periods: (i) the flower-bud initiation and blooming season is from January to March; (ii) the fruit-setting period is from April to May; (iii) the cherry development period is from May to August; (iv) the maturity stage is from September to October; and (v) the ripening/harvest period is from October to December [3]. Two periods were considered in this study to compute the explanatory variables used in the search for coffee yield prediction models. The first period corresponds to 11 dekads, from mid-February to the end of May (dekads 5 to 15). This period was considered because it corresponds to the crucial period of growth and development of the coffee bush [43]. February to May are normally dry months in Vietnam, and coffee requires irrigation to guarantee blossom and cherry settings [14]. The second period corresponds to 18 dekads, from January to June (dekads 1 to 18). This period was considered because a longer period may be more representative of the global coffee development conditions and consequently result in variables that have a higher explanatory power. Additionally, because the objective of the methodology developed in this research is to produce models that enable forecasting of coffee yield well in advance compared to the harvest period of October to December, we decided to set the coffee yield forecast as the end of June at the latest. Using the first six months of the year to predict coffee yield will give planners sufficient time to consider or find solutions before the end of the coffee season.

Processing of Satellite Images in SPIRITS Software
The NDVI, LAI, and FAPAR satellite image time series were processed in the free Software for Processing and Interpreting of Remote Sensing Image Time Series (SPIRITS) [44] ( Figure 2).
First, images were imported and temporally smoothed with the SWETS algorithm [45], which was set with a maximum of 75% of missing values in each pixel profile, and the lowest physical value, Ymin, for cloud-free land pixels was kept at the default.
Second, the 11 phenological variables presented in Table 2 were computed from each of the 3 biophysical products (NDVI, LAI, and FAPAR), considering 2 periods (dekads 5 to 15 and dekads 1 to 18) by using the "time statistics" function of SPIRITS, which resulted in phenological images ( Figure 2).
Third, zonal statistics were extracted for these phenological variable images for the perennial agricultural vegetation zone of Dak Lak province thanks to an extraction mask derived from the official 2015 land use map of Dak Lak province collected by the Department of Agriculture and Rural Development of Dak Lak province ( Figure 1). This land use map did not contain a class specific to coffee plants but only a broad class relative to agricultural perennial plants accounting for approximately 62.5 to 68.2% of coffee from 2015 to 2018 [40,[46][47][48]. No pure coffee crop mask was available, and it was not possible for the authors to produce such a mask within the framework if this study. The extracted statistics corresponded to the final 33 coffee yield predictors.

Official Coffee Yield Datasets
The coffee yields considered in this study were provincial coffee yields and were computed by dividing the official provincial coffee production by the official provincial coffee area according to the Dak Lak Statistical Yearbook (2009,2014,2018, and 2020) [40,[46][47][48]. The period from 2000 to 2020 was considered. These coffee yields correspond to coffee dry grain yield.

Crop Yield Forecasting Model in the CST Software
In this study, the free software "Crop Growth Monitoring System Statistical Tool" (CgmsStatTool-CGMS statistical tool-CST) [28] was used to generate the coffee yield forecasting models.
The CST approach that we used in this study was multivariate regression analysis in order to assess the linear relationship between coffee yield (Y) and one or more independent variable(s) (the predictor(s) X 1 , X 2 . . . ) with the following Equation (1): In Equation (1), ε is the random error assumed to follow a normal distribution of mean 0 and constant variance σ 2 . Errors for different years are assumed to be independent. In the annotation for the X variables, the subscript n represents which X variable it is. β 0 . . . β n are the regression coefficients to be calculated through the ordinary least square method, minimizing the difference between the observed and fitted yield values. The CST tests various models, potentially using the crop yield time trend and between 1 and 4 independent variables; then, standard statistics and plots are exported to enable assessment of the quality of these models.
CST analysis was carried out as follows: (1) check for possible errors in the database of official yields and indicators, (2) assess both linear and quadratic crop yield time trends at a significance level of 0.025, (3) assess the correlation between the indicators with and without time trend (if any), and (4) search for the best multivariate regression models.
"CST takes the potential time trend into account by adding a term in the model that corresponds to that time trend, if applicable. To increase numerical precision, the regression coefficient for the linear time trend is for "year-offset" rather than "year" itself. The offset is fixed at 1965 by default in CST. Likewise, the regression coefficient for the quadratic time trend is for (year-offset) 2 ." [28] The period of 2000-2019 was used to search for and build the best models through the multivariate regression method of the CST.
As the CST can only work with a database containing a maximum of 30 variables, 3 databases of 30 variables were built from the 33 input variables and were used sequentially. With 20 years of calibration data (2000-2019), the CST allows 16 variables to be tested at a time in the regression analysis. Therefore, we iterated a random selection of 16 input variables to find the best models ( Figure 2).
The automatic selection and ordering of the best models by CST at each CST iteration for a given set of candidate variables was based on the root mean square error of prediction (RMSEp) (Equation (2)). RMSEp indicates the model's quality under prediction conditions [23]. RMSEp calculated by the CST is based on the leave-one-out residual or PRESS residual [28]. Predictions become increasingly precise as RMSEp approaches 0 and R 2 approaches 1. With each CST iteration, the 3 to 4 best sub-models were manually selected based on the RMSEp and the adjusted coefficient of determination (Adj.R 2 ) (Equation (3)). Adj.R 2 is a statistical measure of the model's goodness of fit in a regression model, which shows the proportion of variation explained by the estimated regression line.
where: P i and O i are the predicted and observed values for year i, respectively; O and P express the means of observed and predicted values, respectively; n is the number of samples (years); and k is the number of independent variables in the regression equation. * Note: in Equation (2), P i − O i is the difference between the ith observation and the predicted value for the ith observation based on a model fit to the remaining observations, i.e., without the ith observations (adapted from [28]).
Four other statistical parameters (Equations (4)- (7)) were also used to appreciate the models' performance but not to select them.
R squared (R 2 ) corresponds to the percentage of variance explained by the model (Equation (4)) [23].
The relative root mean square error (RRMSE) is calculated by dividing RMSEp by the mean value of observed data (Equation (5)).
The residual standard deviation (RSD) is the square root of the residual mean square [28] (Equation (7)).
where df is degrees of freedom. Here, df is equal to the sample size minus the number of parameters in the model. For example: where n is the sample size, and the number of parameters is 3.
The year 2020 was used to assess the performance of selected models with an independent year not used in model calibration by comparing the observed and predicted yield for 2020 and computing the related residuals.
The final selection of the best models was based on a combination of model performance in calibration (2000-2019) and in prediction for 2020.

Model Performance
According to the CST time trend analysis mode, Dak Lak province showed a significant  Details of the eight best coffee yield models for Dak Lak province obtained with the multivariate regression method of the CST for the two periods considered (mid-February to end of May and early January to end of June) are presented in Table 3 and Figure 4.    Details of the eight best coffee yield models for Dak Lak province obtained with the multivariate regression method of the CST for the two periods considered (mid-February to end of May and early January to end of June) are presented in Table 3 and Figure 4.
Overall, the forecast coffee yield models performed satisfactorily in both time periods, with the RMSEp varying between 0.155 and 0.178 ton/ha, the RRMSE varying between 7.5% and 8.6%, and the Adjusted-R 2 varying between 62.8% and 68.8% (Table 3 and Figure 4). The models built on the 18-dekad period provided systematically better results than those built on the 11-dekad period when considering RMSEp and RRMSE only. For models computed based on 18 dekads, the RMSEp ranged from 0.155 to 0.158 ton/ha, the Adj.R 2 was between 64.2 and 68.8%, and the RRMSE ranged from 7.5 to 7.6%. For models calculated based on 11 dekads, the RMSEp ranged from 0.174 to 0.178 ton/ha, the Adj.R 2 was between 62.8 and 67.6%, and the RRMSE ranged from 8.4 to 8.6%. s.e. = standard error, Adj.R 2 = adjusted R squared, R 2 = R squared, RSD = residual standard deviation, RMSEp = root mean square error for prediction, RRMSE = relative root mean square error (%).
It seems difficult to clearly identify one best model among those of the 18-dekad period, given that they all achieved very similar global statistical performance when considering all statistical parameters. For example, for the18-dekad period, the best model according to the Adj.R 2 (model 4, Adj.R 2 of 68.8%) is the worst according to the RMSEp (0.158 ton/ha). s.e. = standard error, Adj.R 2 = adjusted R squared, R 2 = R squared, RSD = residual standard deviation, RMSEp = root mean square error for prediction, RRMSE = relative root mean square error (%). Overall, the forecast coffee yield models performed satisfactorily in both time periods, with the RMSEp varying between 0.155 and 0.178 ton/ha, the RRMSE varying between 7.5% and 8.6%, and the Adjusted-R 2 varying between 62.8% and 68.8% (Table 3 and Figure 4). The models built on the 18-dekad period provided systematically better results than those built on the 11-dekad period when considering RMSEp and RRMSE only. The results show that model 0, which corresponded to the linear time trend only, performed less efficiently (RMSEp = 0.202 ton/ha, RRMSE = 9.7%, Adj.R 2 = 41.8%) than models combining a linear time trend with phenological variables derived from the remote sensing data (Table 3 and Figure 4).
For the period considering dekads 1-18 (from January to June), all selected models used three variables, in addition to the time trend. Models 1 and 2 used only the LAI variables, model 3 combined the LAI and NDVI variables, and model 4 combined the LAI and FAPAR variables. For the period considering dekads 5-15, all models used four explanatory variables, in addition to the time trend. Model 5 combined the LAI and NDVI variables, whereas models 6-8 combined the LAI, NDVI, and FAPAR variables. Among the 8 best models, LAI-derived variables occur 18 times, whereas NDVI-derived variables occur 6 times and FAPAR-derived variables occur only 4 times. This observation suggests that LAI-derived variables are more efficient than NDVI-and FAPAR-derived variables for coffee yield forecasting. A relatively high negative or positive correlation was observed between some variables selected in some of the best models (R varies in the range from −0.863 (for Adn_LAI and Dmn_LAI in model 5) to 0.795 (for Vmn_LAI and Dmx_LAI in model 1)) Figure 5). When considering the period of dekad 1 (start of January) to dekad 18 (end of June) of the years 2000 to 2019, the analysis of the Pearson correlation coefficient of the 11 phenological variables between the three biophysical satellite products, LAI, NDVI, and FAPAR (Figure 6), shows a highly variable level of correlation between these phenological variables, from, in absolute values, 0.00 to 0.97, i.e., from no correlation to a very high level of correlation. For this period, the phenological variables derived from FAPAR and NDVI are the most correlated (average absolute correlation of 0.59; third column of Figure 6), whereas those derived from LAI are less correlated to NDVI and FAPAR variables, especially for FAPAR (average absolute correlation of 0.30; first column of Figure 6). The low correlation values observed for at least some phenological variables in each pair of biophysical products (LAI and FAPAR, LAI and NDVI, and FAPAR and NDVI) suggest that these three products may provide some non-redundant (uncorrelated) information and thus be complementary at some point and, consequently, that it is relevant to consider the three of them in the search for the best coffee yield prediction models. The results also showed that models utilizing satellite data from January to June (models 1 to 4) were more suitable for estimating coffee yields in Dak Lak province than models using satellite data from mid-February to May (lower RMSEp and higher Adj-R 2 for models 1 to 4).  column. On the diagonal is the histogram of each variable, which shows the lowest locally fit regression line. The plots below the diagonal are the bivariate scatter plots of each pair of variables. These scatter plots show an ellipse around the mean (the red point), with the axis length reflecting one standard deviation of the column and row variables. The red line is the smoothed regression lines of the bivariate scatter plots of each pair of variables.  Table 4 shows the residuals and percentage residuals of predicted coffee yields for the target year 2020 for the eight selected models. For models based on dekads 1 to 18 (models 1 to 4), the absolute residuals were in the range of 0.054 to 0.134 ton/ha, and the absolute percentage residuals were in the range of 2.2 to 5.5%. For models based on dekads 5 to 15, three models presented absolute residuals in the range of 0.248 to 0.571 ton/ha and absolute percentage residuals in the range of 10.2 to 23.6%, and one model (model 5) with a better performance presented a residual of 0.082 ton/ha and a percentage residual of 3.4%. The best model in terms of prediction for 2020 was model 3, with a residual of 0.054 ton/ha and a percentage residual of 2.2%. The 2020 residuals for models 1-4 (Table 4) were all smaller than the corresponding RMSEp of the period 2000-2019 (Table 3), whereas the 2020 residuals for models 5-8 (Table 4) were generally much higher than the corresponding RMSEp of the period 2000-2019 (Table 3).   Table 4 shows the residuals and percentage residuals of predicted coffee yields for the target year 2020 for the eight selected models. For models based on dekads 1 to 18 (models 1 to 4), the absolute residuals were in the range of 0.054 to 0.134 ton/ha, and the absolute percentage residuals were in the range of 2.2 to 5.5%. For models based on dekads 5 to 15, three models presented absolute residuals in the range of 0.248 to 0.571 ton/ha and absolute percentage residuals in the range of 10.2 to 23.6%, and one model (model 5) with a better performance presented a residual of 0.082 ton/ha and a percentage residual of 3.4%. The best model in terms of prediction for 2020 was model 3, with a residual of 0.054 ton/ha and a percentage residual of 2.2%. The 2020 residuals for models 1-4 (Table 4) were all smaller than the corresponding RMSEp of the period 2000-2019 (Table 3), whereas the 2020 residuals for models 5-8 (Table 4) were generally much higher than the corresponding RMSEp of the period 2000-2019 (Table 3). Observed versus model-predicted coffee yields for the period 2000-2020 are presented in a series of scatterplots in Figure 7. The predicted values used in these plots are those predicted with the models calibrated for the 2000-2019 period.

Coffee Yield Predictions for 2020
These plots revealed the highest R 2 (0.76) for model 4, combining ddn-LAI, rsd-FAPAR, vmn-LAI, and yield linear time trend as predictor variables (Figure 7). For models based on data from January to June, the four selected models (model 1 to 4) indicated an R 2 in the range of 0.73 to 0.76 and a p-value of <0.0001 (Figure 7). The models based on data from mid-February to May presented an R 2 ranging from 0.66 to 0.75 and a p-value of <0.0001 (Figure 7).
In 2006, most models underestimated the official yields by approximately 0.249 to 0.470 ton/ha.

Discussion
The observed positive coffee yield time trend in Dak Lak province over the past 20 years can be explained by a combination of factors, including investments in irrigation infrastructure, a heavy reliance on irrigation on coffee farms, the affordability of fertilizer, and the increasing adoption of new management techniques in the province [3,42].
Existing coffee models usually simulate and forecast coffee yields at a local or regional level by including, as parameters, the main growth and development processes impacted by climate variations. However, in this study, the results showed that it is possible to predict coffee yield at a regional scale in Dak Lak province of Vietnam six months before the harvest based on remote sensing data only. Thus, such models could be an essential tool for indirectly assessing the impacts of weather variability or improvements in farmer practices on coffee yields at the provincial or district level in Vietnam or any other coffee-growing regions or countries under climate change conditions.  [3], most of the models developed in this study achieved a higher accuracy (RMSEp = 0.155 to 0.178 ton/ha, RMSE = 0.123 to 0.134 ton/ha, RRMSE = 7.5 to 8.6%, and MAPE = 3.9 to 5.3%) ( Table 3). The model of Kouadio et al. (2018), which consisted of an extreme learning machine (ELM) model using soil organic matter (SOM), available potassium, and available sulfur as explanatory variables, provided coffee yield estimates for Lam Dong province (in the central highlands of Vietnam) with an RMSE of 0.496 ton/ha, an RRMSE of 13.6%, and MAPE = 7.9% [21]. In addition, the simple processbased model developed by Kouadio et al. (2021) for simulating and forecasting robusta coffee yield at the regional scale in Vietnam showed a RMSE of 0.24 to 0.33 ton/ha and an MAPE = 9 to 14%, and that model was successfully tested using satellite remote sensing data (LAI) and model-based gridded climate data (maximum and minimum temperatures, solar radiation, and rainfall): MAPE ≤ 12% and RMSE ≤ 0.29 ton/ha [3].
The method proposed in this study enabled the development of models that can satisfactorily forecast coffee yield with a low RMSEp and a high Adj.R 2 for Dak Lak province. We also showed that the period considered for the production of the model explanatory variables (dekads 1 to 18 versus dekads 5 to 15) has an important impact on the accuracy of the resulting models, with better accuracy for those considering a more extended period. The models based on a longer period were also composed of fewer explanatory variables (three variables + the time trend) than those based on a shorter period (four variables + the time trend).
When selected in models, the variables aup-FAPAR, dmx-LAI, dmx-FAPAR, and dmx-NDVI presented systematically negative values, which may mean that the smaller the "largest increase between subsequent periods" of the FAPAR and the sooner the "date of maximum" LAI, FAPAR, and NDVI, the higher the coffee yield will be. Furthermore, the variables derived from the LAI product were demonstrated to be more efficient for coffee yield forecast models than those derived from NDVI and FAPAR, although some complementarity was observed between these products for some models (Table 3). This is the first research to date combining NDVI, FAPAR, and LAI remote-sensingderived phenological variables in the CST to create a coffee yield prediction model, which is the main contribution of this study. The data used in this study were derived from the SPOT-VEGETATION and PROBA-V instruments at a 1 km global spatial resolution. Therefore, future studies should consider using more recent and similar products derived at a 300 m spatial resolution.
The main technical limitations we encountered in this research are related to the fact that the CST cannot handle a database containing more than 30 variables and that with 20 years of data used for model calibration, the CST cannot consider more than 16 variables at a time during the multiple linear regression model search. These two technical limitations of the CST make its use more difficult than it should be.
In this research, we used a multiple linear regression technique in order to produce coffee yield prediction models. Such a technique is particularly suited to identify and use the linear relationships between the predictors and the dependent variable. However, the linearity of the relationship between coffee yield and the phenological variables was not assessed in this study, and it may be possible that some variables present a nonlinear relationship with yield. Consequently, further research should involve testing other nonlinear modelling approaches for predicting coffee yield from biophysical variables, such those used in this study.
The findings of this study show that satellite data, such as the NDVI, LAI, and FAPAR products provided by the Copernicus Global Land Service are a good source of information for estimating and forecasting coffee yield in a challenging situation, where there is a deficit of information about management practices, soil characteristics, irrigation schedule, phenology of coffee trees, etc.
We think that a main source of improvement of the coffee yield forecast model developed in this research would probably be the use of a more detailed land use map containing a class specific to coffee. The official 2015 land use map of Dak Lak province used to extract satellite-derived vegetation biophysical variables did not contain a class specific to coffee plants but only a broad class relative to agricultural perennial plants, accounting for approximately 62.5 to 68.2% of coffee from 2015 to 2018 [40,[46][47][48].

Conclusions
The present study is the first research on the development and assessment of a coffee yield forecasting method at the regional scale for Dak Lak province in the central highlands of Vietnam using the Crop Growth Monitoring System Statistical Tool (CGMSstatTool-CST) software and vegetation biophysical variables (NDVI, LAI, and FAPAR) derived from satellite remote sensing data (SPOT-VEGETATION and PROBA-V).
The findings of this research reveal that the elaboration of multiple linear regression models based on satellite-derived vegetation biophysical variables (LAI, NDVI, and FAPAR) corresponding to the first six months of the years 2000-2019 resulted in coffee yield forecast models presenting satisfactory accuracy (Adj.R 2 = 64 to 69%, RMSEp = 0.155 to 0.158 ton/ha, and MAPE = 3.9 to 4.7%). These results demonstrate that the CST may efficiently predict coffee yields on a regional scale by using only satellite-derived vegetation biophysical variables. Our findings are likely to aid local governments and decision makers in precisely forecasting coffee production early and promptly, as well as in recommending relevant local agricultural policies.
Further research should consider applying the developed method to search for coffee yield forecast models at other scales (at district and national levels) with enhanced input data (finer spatial resolution for satellite images and more accurate coffee maps) and with other explanatory variables.  Data Availability Statement: The in situ datasets generated and/or analyzed during the current study and the satellite derived database are available from the corresponding author upon reasonable request. The geographical boundaries of the fields analyzed during the current study are not publicly available due to privacy protection. The satellite images are freely available at https://land.copernicus. vgt.vito.be/.