Long-Term Hindcasts of Wheat Yield in Fields Using Remotely Sensed Phenology, Climate Data and Machine Learning

: Satellite remote sensing offers a cost-effective means of generating long-term hindcasts of yield that can be used to understand how yield varies in time and space. This study investigated the use of remotely sensed phenology, climate data and machine learning for estimating yield at a resolution suitable for optimising crop management in ﬁelds. We used spatially weighted growth curve estimation to identify the timing of phenological events from sequences of Landsat NDVI and derive phenological and seasonal climate metrics. Using data from a 17,000 ha study area, we investigated the relationships between the metrics and yield over 17 years from 2003 to 2019. We compared six statistical and machine learning models for estimating yield: multiple linear regression, mixed effects models, generalised additive models, random forests, support vector regression using radial basis functions and deep learning neural networks. We used a 50-50 train-test split on paddock-years where 50% of paddock-year combinations were randomly selected and used to train each model and the remaining 50% of paddock-years were used to assess the model accuracy. Using only phenological metrics, accuracy was highest using a linear mixed model with a random effect that allowed the relationship between integrated NDVI and yield to vary by year ( R 2 = 0.67, MAE = 0.25 t ha − 1 , RMSE = 0.33 t ha − 1 , NRMSE = 0.25). We quantiﬁed the improvements in accuracy when seasonal climate metrics were also used as predictors. We identiﬁed two optimal models using the combined phenological and seasonal climate metrics: support vector regression and deep learning models (R 2 = 0.68, MAE = 0.25 t ha − 1 , RMSE = 0.32 t ha − 1 , NRMSE = 0.25). While the linear mixed model using only phenological metrics performed similarly to the nonlinear models that are also seasonal climate metrics, the nonlinear models can be more easily generalised to estimate yield in years for which training data are unavailable. We conclude that long-term hindcasts of wheat yield in ﬁelds, at 30 m spatial resolution, can be produced using remotely sensed phenology from Landsat NDVI, climate data and machine learning.


Introduction
Crop management decisions depend on our understanding of how changes to management will affect yield, but management is only one of many determinants of yield. Crop yield varies in space and time according to soil types, local weather conditions and seasonal climate variability as well as management. Moreover, spatial and temporal variation in yield is often much larger than effects of management. Precision agriculture (PA) aims to optimise crop management across farms and fields to sustainably improve yield and profit [1,2]. While many sources of information inform PA decision-making, the primary source is geographically referenced yield data recorded by yield monitors mounted on harvesting machinery [3]. While yield monitors are standard in most modern harvesting temporal gaps caused by cloud contamination during the growing season [48]. SWGC estimates crop growth curves using data from a local neighbourhood around each cell, where the data are weighted according to geographical distance from the central cell. It combines spatial smoothing by use of spatial weights with temporal smoothing by growth curve estimation to improve estimation of land surface phenology from Landsat NDVI. Because it is not dependent on individual cells having sufficient cloud-free images within the growing season, SWGC enables phenology detection at more cells than non-spatial approaches which typically exclude cells with insufficient observations (e.g., [44,49]).
To support decision-making about how to optimise crop management within fields, this study aims to determine whether phenological and seasonal climate metrics obtained from SWGC estimation have utility for estimating wheat yield. We use the SWGC estimated growth curves to identify the timing of phenological events and derive phenological metrics that describe the timing and degree of crop growth stages occurring at each cell. Detected phenology is then combined with daily weather data to produce seasonal climate metrics that describe the water availability and growing degree days during different growth stages of the crop. We aim to use the metrics as predictors of wheat yield. Previous studies that combined phenological metrics with climate data to predict yield reported large differences in accuracy using different statistical and machine learning models [24,31,32]. We therefore compare six of these models for our purpose: multiple linear regression, linear mixed models, generalised additive models, random forests, support vector regression using radial basis functions and deep learning.
The specific objectives of this study are to: (1) Investigate relationships between phenological and seasonal climate metrics derived from Landsat NDVI using SWGC estimation with wheat yield; (2) assess and compare statistical and machine learning models for estimating wheat yield using phenological metrics as predictors; (3) quantify improvements in accuracy of yield estimation when seasonal climate metrics are also used as predictors; and (4) identify an optimal model and produce long-term hindcasts of wheat yield at 30 m resolution.

Study Area
The study area is located in the Western Australian grainbelt at around 31.28 S and 118.16 E (Figure 1a). It is approximately 17,000 ha in size. Grain crops are grown in a dryland system that is heavily reliant on winter rainfall during the May to October growing season. Average growing season rainfall is between 200 and 300 mm with large interannual variability (Table 1). Wheat is the main crop grown, with barley, lupins, canola and pasture included in cropping rotations. Besides crop fields, the study area includes areas of remnant vegetation (native trees and shrubs) and salt scalds caused by dryland salinity.

Landsat Data
All bands of Landsat-7 (path/row: 111/082) data from 2003 to 2019 were obtained from the United States Geological Survey (USGS) web site (https://earthexplorer.usgs. gov accessed on 2 February 2021). There were 366 Landsat-7 images available during 2003 to 2019. Data pre-processing including geometric correction, top of atmosphere reflectance correction and surface reflectance correction were completed using the USGS Earth Resources Observation and Science (EROS) Centre Science Processing Architecture (ESPA) online interface (https://espa.cr.usgs.gov accessed on 3 February 2021). Cells with cloud contamination or cloud shadow were removed using the cell quality control band. The corrected red and near-infrared reflectances were used to calculate NDVI for all available image dates.

Wheat Yield Data
Yield monitor data were obtained for 44 paddocks (5281 ha) across the study area ( Figure 1b) during 2003 to 2019. The yield data were cleaned to remove high (above the 99th percentile) and low (below the 1st percentile) outliers and kriged to produce 30-m resolution raster maps using the R 'gstat' package [50,51]. The number of wheat paddocks and the total area of wheat grown vary from year to year ( Table 1). The total number of paddock-year combinations is 426. Yields ranged from 0 to 4 tonnes per hectare (t ha −1 ) with substantial spatial and year to year variation. The total number of 30-m resolution cells containing yield data during the 17-year period was 421,766.

Wheat Yield Data
Yield monitor data were obtained for 44 paddocks (5281 ha) across the study area ( Figure 1b) during 2003 to 2019. The yield data were cleaned to remove high (above the 99th percentile) and low (below the 1st percentile) outliers and kriged to produce 30-m resolution raster maps using the R 'gstat' package [50,51]. The number of wheat paddocks and the total area of wheat grown vary from year to year ( Table 1). The total number of paddock-year combinations is 426. Yields ranged from 0 to 4 tonnes per hectare (t ha −1 ) with substantial spatial and year to year variation. The total number of 30-m resolution cells containing yield data during the 17-year period was 421,766.

Climate Data
Point-source and gridded weather data, at 5-km resolution, are available for Australia via the Long Paddock SILO data base (https://www.longpaddock.qld.gov.au/silo accessed on 3 February 2021). Because wheat yield in dryland cropping systems can be highly influenced by intense winter rainfall events, we use point-source weather data from an actively recording weather station instead of gridded data. Daily point-source weather data from 2002 to 2019 were obtained from the Long Paddock 'Patched Point' Database for the nearest recording weather station to the study area, Nungarin (ID = 10112, 31.18 S and 118.10 E). 'Patched Point' data have had temporal data gaps filled with an estimate obtained by interpolating data from surrounding weather stations. Consequently, they form a complete daily data record with no missing data.

Spatially Weighted Growth Curve Estimation
The spatially weighted growth curve (SWGC) estimation method estimates growth curves for each year using data from a spatial neighbourhood around each cell, where the data are weighted according to distance from the central cell using a distance-decay kernel [48]. Spatial weights are applied using a truncated or 'moving window' Gaussian kernel, where the distance weights are set to zero for all cells (x, y) outside of a rectangular region centred on the location for which the growth curve is being estimated. The moving window size is specified by MAXD and has width and length equal to (2 * MAXD + 1) metres (m). Within the window, distance weights are given by: where w x,y is the weight for cell (x, y), d x,y is the distance of (x, y) from the location for which the growth curve is to be estimated and b is the Gaussian kernel bandwidth. To ensure filling of spatial gaps in Landsat-7 data caused by the scanline corrector failure in 2003, we use bandwidth b = 60 m and MAXD = 200 m. This bandwidth ensures sufficient data with non-zero weights in the estimation of SLC failure gaps. MAXD is chosen to be as small as possible while retaining all data with weights within three decimal places from zero. Crop growth is modelled using an asymmetric double Lorentz function for the two main phases of crop development as measured by NDVI, which we refer to as the vegetative phase (from germination to maximum vegetative growth) and the grain-fill phase (from maximum growth to senescence). The curve takes the form: where x is the day of year, 0 ≤ c ≤ 0.9 is the minimum NDVI, 0.1 ≤ d ≤ 1 is the maximum NDVI, 0 ≤ e ≤ 260 is the day of year at which maximum NDVI is observed and b and f are the shape parameters for the curve before and after maximum NDVI is observed. Figure 2a demonstrates SWGC estimation using spatial weights and an asymmetric double Lorentz function.

Phenological Metrics
Phenological metrics (PM) are derived from the estimated SWGC growth curves as follows. We define the start of growing season (SOS) and end of growing season (EOS) as the days of the year when NDVI is 20% higher than minimum NDVI (c in Equation (2)). The peak of season (POS) is the day of the year that NDVI is at its maximum (e in Equation (2)). Peak of season vegetative growth (POSV) is identified by maximum NDVI (d in Equation (2)). Three integrated NDVI measures are created that sum NDVI in three stages of the crop growth cycle: iNDVI for the whole growing season (SOS to EOS), VLAD for the vegetative stage (SOS to POS) and GLAD for the grainfill phase (POS to EOS). Growth periods and metrics are shown in Figure 2b. Figure 2a demonstrates SWGC estimation using spatial weights and an asymmetric double Lorentz function.

Phenological Metrics
Phenological metrics (PM) are derived from the estimated SWGC growth curves as follows. We define the start of growing season (SOS) and end of growing season (EOS) as the days of the year when NDVI is 20% higher than minimum NDVI (c in Equation (2)). The peak of season (POS) is the day of the year that NDVI is at its maximum (e in Equation (2)). Peak of season vegetative growth (POSV) is identified by maximum NDVI (d in Equation (2)). Three integrated NDVI measures are created that sum NDVI in three stages of the crop growth cycle: iNDVI for the whole growing season (SOS to EOS), VLAD for the vegetative stage (SOS to POS) and GLAD for the grainfill phase (POS to EOS). Growth periods and metrics are shown in Figure 2b.

Seasonal Climate Metrics
Seasonal climate metrics (SCM) are produced by combining detected phenology with daily weather data. Water availability has been shown to be the main driver of wheat yield in dryland cropping systems of Western Australia [52]. It is defined as the sum of growing season (May to October) rainfall and one-third of summer (November to April) rainfall. We modify this using estimated phenological dates to define annual water availability (AWavail) as the sum of rainfall falling between SOS and EOS and one-third of rainfall falling during the 180-day period preceding SOS. We also consider total rainfall falling during the growing season (GSR), vegetative phase (VR) and grainfill phase (GR). In addition, growing degree days, defined as the sum of one half of daily maximum temperature minus minimum temperature, are calculated for the three periods: growing season (GSDD), vegetative phase (VDD) and grainfill phase (GDD).

Data Exploration
We explore the relationships between the phenological and seasonal climate metrics and yield using the Pearson correlation coefficient (R). We consider both the correlation between phenological and seasonal climate metrics and across all years and the mean of their annual correlations with yield. To visualise differences in relationships between the metrics and yield and how the relationships vary with seasonality, we plot the metrics against yield for each year.

Seasonal Climate Metrics
Seasonal climate metrics (SCM) are produced by combining detected phenology with daily weather data. Water availability has been shown to be the main driver of wheat yield in dryland cropping systems of Western Australia [52]. It is defined as the sum of growing season (May to October) rainfall and one-third of summer (November to April) rainfall. We modify this using estimated phenological dates to define annual water availability (AWavail) as the sum of rainfall falling between SOS and EOS and one-third of rainfall falling during the 180-day period preceding SOS. We also consider total rainfall falling during the growing season (GSR), vegetative phase (VR) and grainfill phase (GR). In addition, growing degree days, defined as the sum of one half of daily maximum temperature minus minimum temperature, are calculated for the three periods: growing season (GSDD), vegetative phase (VDD) and grainfill phase (GDD).

Data Exploration
We explore the relationships between the phenological and seasonal climate metrics and yield using the Pearson correlation coefficient (R). We consider both the correlation between phenological and seasonal climate metrics and across all years and the mean of their annual correlations with yield. To visualise differences in relationships between the metrics and yield and how the relationships vary with seasonality, we plot the metrics against yield for each year.

Statistical and Machine Learning Models
We test six predictive models for yield estimation: multiple linear regression (MLR), linear mixed models (LMM), generalised additive models (GAM), random forests (RF), support vector regression using radial basis functions (SVR) and deep learning using multilayer perceptrons (DL). We perform all analyses in the R environment [53].
Multiple linear regression (MLR) is used as a baseline against which to compare other predictive models. Feature selection for MLR is performed using a stepwise forward selection process to find the optimal set of explanatory variables to use in each model. The forward selection process starts with the intercept only model and adds predictors to the model one at a time. At each step, the single variable that improves the goodness-of-fit of the model as measured by the Akaike Information criterion (AIC) [54] is added until a minimum AIC is attained. The ordering in which predictors are added provides a measure of the importance of each predictor.
Linear mixed models (LMMs) are an extension of MLR that allows both fixed and random effects [55,56]. Fixed effects have a common linear relationship for all the data, as is the case for predictors in MLR. Random effects can be used to account for structure in the data. We estimate LMMs using the optimal set or predictors identified by stepwise forward selection for MLR plus one random effect that allows the relationship between a single important predictor and yield to vary by year. The predictor with maximum mean annual correlation with yield is used as the random effect. We use the 'lmer' function from R package 'lme4 [57] to perform linear mixed modelling.
Generalised additive models (GAMs) are generalised linear models with a linear predictor composed of smooth functions of the predictors [58,59]. Practical variable selection for GAMs can be performed by adding penalty terms that can shrink components of the linear term to zero, thereby eliminating predictors from the model [60]. We adopt this approach, using thin plate regression splines [61] for the smooths. We use the 'gam' function from R package 'mgcv' [62] for estimation of GAMs. To prevent overfitting and improve generalisation, we also impose an additional constraint to enforce smoother models by setting the 'gamma' parameter equal to six.
A regression tree is a hierarchic structure, where a test is applied at each level to either one predictor or a linear combination of predictors that may have one of two outcomes. The result is a partitioning of the predictor space into disjoint regions that each correspond to a single prediction. Random forests (RF) perform nonlinear regression by model averaging of many regression trees where each tree uses a random number of predictors sampled with replacement according to a uniform probability distribution [63]. We use the 'ranger' function with default parameters from R package 'ranger' for random forests [64]. Support vector regression (SVR) identifies the optimal regression hyperplanes by minimising the linear regression coefficients within specified error tolerances. Nonlinear regression is performed by first transforming the data into a high-dimensional feature space using a nonlinear kernel function [65]. In this study, we use Gaussian radial basis kernels to transform the data. We use the 'svm' function from R package 'kernlab' to estimate the kernel scale parameter and perform epsilon-sensitive loss regression [66].
Deep learning neural networks are artificial neural networks or multilayer perceptrons with multiple inner layers [67]. We train a deep learning network with three inner layers with 64, 128 and 64 nodes respectively. Rectified linear unit (ReLU) activation functions are used for the first two inner layers, and a linear activation function is used for the third. To prevent overfitting and improve generalisation, we perform regularisation by dropping out 20% of the nodes in each layer during training [68]. Estimation is performed using root mean square propagation (RMSProp) [69]. We use the Tensorflow interface [70] for training and prediction of deep learning models, accessed via R package 'keras' [71].

Yield Predictors
Using cells in all paddocks with wheat data, we test two sets of predictors for yield prediction: (1) Phenological metrics derived from the SWGC estimated growth curves; and (2) phenological metrics combined with seasonal climate metrics ( Table 2). To avoid collinearity, for all models except linear mixed models, we do not use predictors that are sums of other predictors. These include iNDVI (equal to VLAD + GLAD), GSR (VR + GR) and GSDD (VDD + GDD). For linear mixed models, we include one random effect that is chosen to be the predictor that has maximum mean annual correlation with yield. That predictor may be one of the summed predictors. Table 2. Sets of yield predictors tested: phenological (PM) and phenological plus seasonal climate metrics (PM + SCM). Model comparison aims to determine the optimal machine learning model for producing long-term spatial hindcasts of yield to improve our understanding of crop response to environmental conditions and management. We use a 50-50 train-test split on paddockyears where 50% of paddock-year combinations are randomly selected and used to train each model. The remaining 50% of paddock-years are used as test data to assess model accuracy. Accuracy is measured using the coefficient of determination (R 2 ) between observed (y i ) and predicted (ŷ i ) yields, mean absolute error (MAE), root mean square error (RMSE) and normalised root mean square error (NRMSE), defined as follows:

Predictor Set Metrics
The coefficient of determination provides the proportion of variance in the observed data explained by a model, relative to observed mean. The MAE and RMSE provide measures of the model error. NRMSE is the ratio of the model error and mean observed value. The optimal model is identified by having highest R 2 and lowest MAE, RMSE and NRMSE.

Optimal Model Validation
After the optimal model is identified we perform further validation of the results using the test data. To understand how the model performs in different years, we consider scatterplots of observed versus predicted yield for each year.

Yield Hindcasts
The machine learning model with highest R 2 and lowest MAE, RMSE and NRMSE is selected as the optimal yield prediction model. The model is then trained using all the yield data, and the estimated model parameters are used to predict yield in each year to form a set of long-term yield hindcasts.

Spatially Weighted Growth Curve Estimation
The SWGC method for spatially weighted growth curve estimation was applied to each cell in the study area (n = 189,280) for 2003-2019. The SWGC method for spatially weighted growth curve estimation was applied to each cell in the study area (n = 189280) for 2003-2019. Figure 3 shows an example of the estimated growth curves for a cell in a field where wheat was grown in 10 years out of 17 years. It shows that there are different numbers of cloud-free Landsat images in different years. In 2003, there were only two cloud-free images during the May-October growing season. Years 2005 and 2009 have few cloud-free images during the vegetative growth phase, while 2006, 2011, 2014 and 2018 have few images during the grainfill phase. In some years, when there were no cloud-free images available around the peak of season, such as 2011 and 2012, SWGC estimates higher peak NDVI values than observed.

Phenological and Seasonal Climate Metrics
We calculated the phenological and seasonal climate metrics for all cells where wheat was grown. Maps of all metrics are included in the Supplementary Material to this article (Supplementary Material 1). Table 3 shows the correlation between phenological and seasonal climate metrics and yield across all years and the mean of their annual correlations. Figure 4 shows scatter plots of the phenological and seasonal climate metrics against yield for each year, with data coloured to show their correlation. Annual water availability (AWavail) has maximum correlation with yield across all years (R = 0.64), but its annual correlation with yield is lower and varies from year to year. This indicates that AWavail has potential usefulness for explaining temporal yield variability. Integrated NDVI over the entire growing season (iNDVI) has the next highest overall correlation with yield (R = 0.56) with strong positive annual correlations with yield that range from 0.10 in 2014 to 0.72 in 2015, with a mean annual correlation of 0.45. This indicates that iNDVI has usefulness for explaining both temporal and spatial variability. As seen in Figure 4, yield tends to increase with increasing iNDVI until it reaches a maximum, sometimes linearly (e.g., 2003, 2008), but it can appear that the rate of increase slows and the amount of yield variability increases with increasing values of iNDVI. The relationship between the peak of season NDVI (POSV) and yield is similar to the relationship between iNDV and yield but is less strong. Vegetative phase (early season)-integrated phenological metric (VLAD) has higher correlation with yield than grainfill phase (late season)-integrated metric (GLAD) across all years, but lower mean annual correlation with yield. Similarly, vegetative phase rainfall (VR) has higher correlation with yield than grainfill phase rainfall (GR) across all years, but VR has lower mean annual correlation with yield. However, grainfill phase degree days (GDD) has higher correlation with yield across all years and annually than vegetative phase degree days (VDD), so the relative importance of vegetative phase and grainfill phase metrics varies depending on the metric. Start of season timing (SOST) is negatively correlated with yield and end of season timing (EOST) is positively correlated with yield, indicating that early sowing and late harvest result in increased yield, but this is not the case in all years.  Table 4 shows the accuracy statistics for estimating yield, calculated for the independent test data, for each of the machine learning models. The time taken to estimate each model is also shown. The nonlinear models (GAM, RF, SVR and DL) had higher accuracy than MLR. The LMM included a random effect that estimated different linear iNDVI-yield relationships for each year. This resulted in the highest accuracy of all models that used only phenological metrics as predictors (R 2 = 0. 67 Table 4 shows the accuracy statistics for estimating yield, calculated for the independent test data, for each of the machine learning models. The time taken to estimate each model is also shown. The nonlinear models (GAM, RF, SVR and DL) had higher accuracy than MLR. The LMM included a random effect that estimated different linear iNDVI-yield relationships for each year. This resulted in the highest accuracy of all models that used only phenological metrics as predictors (R = 0.67, MAE = 0.25 t ha −1 , RMSE = 0.33 t ha −1 , NRMSE = 0.25).

Model Comparison
Use of seasonal climate metrics improved accuracy for all models except LMM. By estimating different iNDVI-yield relationships for each year, the LMM using only phenological metrics appeared to capture much of the interannual information provided by the seasonal climate metrics to the other models. Adding seasonal climate metrics to the model provided little improvement. Use of seasonal climate metrics improved accuracy for all models except LMM. By estimating different iNDVI-yield relationships for each year, the LMM using only phenological metrics appeared to capture much of the interannual information provided by the seasonal climate metrics to the other models. Adding seasonal climate metrics to the model provided little improvement.
Two equally optimal models were identified: support vector regression with radial basis functions (SVR) and deep learning (DL) models (R 2 = 0.68, MAE = 0.25 t ha −1 , RMSE = 0.32 t ha −1 , NRMSE = 0.25) using combined phenological and seasonal climate metrics. There is little difference between the predictions of the two models across all years (R = 0.98). Figure 5 shows that the annual error (observed−predicted) distributions using the SVR and DL models are also similar. However, model estimation for SVM was more computationally intensive, taking almost 20 times as long to run than DL (Table 4). Because the SVM and DL models perform similarly but DL is faster, we select DL as the optimal machine learning model. = 0.32 t ha −1 , NRMSE = 0.25) using combined phenological and seasonal climate metrics. There is little difference between the predictions of the two models across all years (R = 0.98). Figure 5 shows that the annual error (observed−predicted) distributions using the SVR and DL models are also similar. However, model estimation for SVM was more computationally intensive, taking almost 20 times as long to run than DL (Table 4). Because the SVM and DL models perform similarly but DL is faster, we select DL as the optimal machine learning model.   Figure 6 shows the scatterplots of observed versus predicted yield using the deep learning model. The accuracy of the DL model varies across different years. It explained nearly 70% of the yield variability in 2015 (as it does overall, see Table 4), but much less  Figure 6 shows the scatterplots of observed versus predicted yield using the deep learning model. The accuracy of the DL model varies across different years. It explained nearly 70% of the yield variability in 2015 (as it does overall, see Table 4), but much less in many years. R 2 was lower in low-yielding years (e.g., 2007, 2010, 2012, 2019) than in high-yielding years (e.g., 2015, 2017, 2018), but there are exceptions: 2005 and 2016 had high yields and low R 2 . It might be expected that the error statistics MAE and RMSE would be lower in lower-yielding years, and this was generally the case. In general, the predicted yields had a lower range than observed and the model over-predicts low yields and under-predicts high yields.

Yield Hindcasts
Yield hindcasts for all wheat fields from 2003 to 2019 were produced using the deep learning (DL) model trained on all available data. Figure 7 shows the maps of the observed and predicted yield for each year. High-resolution maps can be viewed in the Supplementary Material to this article (Supplementary Material 2). There is evidence of reduced range of predicted versus observed yields, as suggested by Figure 6. However, the predicted yield maps show spatial consistency with the observed yield maps, with each showing similar spatial patterns of yield variation. There is a degree of spatial smoothing evident in the predicted maps, when viewed closely. This is due to the spatial smoothing component of spatially weighted growth curve estimation [48], which is propagated into the predicted yields.
Remote Sens. 2021, 13, x FOR PEER REVIEW 13 of 21 in many years. R was lower in low-yielding years (e.g., 2007, 2010, 2012, 2019) than in high-yielding years (e.g., 2015, 2017, 2018), but there are exceptions: 2005 and 2016 had high yields and low R . It might be expected that the error statistics MAE and RMSE would be lower in lower-yielding years, and this was generally the case. In general, the predicted yields had a lower range than observed and the model over-predicts low yields and under-predicts high yields.

Yield Hindcasts
Yield hindcasts for all wheat fields from 2003 to 2019 were produced using the deep learning (DL) model trained on all available data. Figure 7 shows the maps of the observed tary Material to this article (Supplementary Material 2). There is evidence of reduced range of predicted versus observed yields, as suggested by Figure 6. However, the predicted yield maps show spatial consistency with the observed yield maps, with each showing similar spatial patterns of yield variation. There is a degree of spatial smoothing evident in the predicted maps, when viewed closely. This is due to the spatial smoothing component of spatially weighted growth curve estimation [48], which is propagated into the predicted yields.

Discussion
Motivated by the need for long-term sequences of yield data to support precision agriculture, this study investigated the use of Landsat NDVI for estimating wheat yields over a 17,000 ha study area in WA for 17 years from 2003 to 2019. For this purpose, we tested the use of a new method for estimating crop growth curves from sequences of Landsat NDVI that may contain spatial and temporal gaps caused by cloud contamination

Discussion
Motivated by the need for long-term sequences of yield data to support precision agriculture, this study investigated the use of Landsat NDVI for estimating wheat yields over a 17,000 ha study area in WA for 17 years from 2003 to 2019. For this purpose, we tested the use of a new method for estimating crop growth curves from sequences of Landsat NDVI that may contain spatial and temporal gaps caused by cloud contamination during the growing season: spatially weighted growth curve (SWGC) estimation. We used the estimated growth curves to identify the timing of phenological events and derive phenological metrics to describe the timing and degree of crop growth stages occurring at each cell. We then combined the detected phenology with climate data to produce seasonal climate metrics that summarise water availability and growing degree days during different growth stages of the crop. We investigated the relationships between the phenological and seasonal climate metrics and yield, and found that in general, the remotely sensed phenological metrics tend to correlate more highly with annual yields than the seasonal climate metrics. While annual water availability (AWavail) had maximum correlation with yield across all years, its correlation with yield in any one year was lower and varied from year to year. In contrast, integrated NDVI over the growing season (iNDVI) had high correlation with yield across all years and annually, but the relationship between iNDVI and yield varied from year to year. We interpret this as suggesting that iNDVI has use for explaining both spatial and temporal variability and AWavail may be useful for explaining temporal yield variability or the interannual variability in the iNDVI-yield relationship. In general, the remotely sensed phenological metrics tend to correlate more highly with annual yields than the seasonal climate metrics. Across all years, SOST is negatively correlated with yield and EOST is positively correlated with yield, which reflects agronomic knowledge about the impact of early sowing and growing season length on yield [14][15][16][17][18]. However, this is not the case in all years, suggesting that other unmeasured factors affect these relationships.
We used the phenological metrics as predictors of yield in six statistical and machine learning models: multiple linear regression (MLR), linear mixed models (LMM), generalised additive models (GAM), random forests (RF), support vector regression using radial basis functions (SVR) and deep learning (DL). The nonlinear models (GAM, RF, SVR and DL) all had higher accuracy than MLR. The LMM included a random effect that estimated different iNDVI-yield relationships for each year. This resulted in the highest accuracy of all models using phenological metrics only (R 2 = 0.67, MAE = 0.25 t ha −1 , RMSE = 0.33 t ha −1 , NRMSE = 0.25). We then added the seasonal climate metrics to each of the models and quantified improvements in accuracy using the combined set of predictors. Use of seasonal climate metrics improved the accuracy for all models except LMM. By estimating different iNDVI-yield relationships for each year, the LMM using only phenological metrics appeared to capture much of the interannual information provided by the seasonal climate metrics to the other models. The nonlinear models all performed similarly. This contrasts with the range of accuracies reported across machine learning models by other studies that have compared machine learning models for yield estimation [24,31,32]. This may be because we have used a smaller study area than used in those national-scale studies, and the range of yields recorded for our study area is likely smaller than the range of yields across nations.
Two equally optimal models were identified: SVR and DL models (R 2 = 0.68, MAE = 0.25 t ha −1 , RMSE = 0.32 t ha −1 , NRMSE = 0.25) using combined phenological and seasonal climate metrics. They offer only marginally higher accuracy than the LMM using phenological metrics only. This means that long-term hindcasts of yield can be performed using only phenological metrics derived from Landsat NDVI. However, the nonlinear models that also use seasonal climate metrics have an additional advantage over the LMM. The LMM relies on having sufficient data from each year so that linear iNDVI-yield relationships can be estimated for each year. This means that the model cannot be used to estimate yield in years for which there is no training data. It cannot be used to extrapolate farther into the past, or to predict yield in future years without obtaining additional training data and re-fitting the model. Because they do not explicitly encode annual iNDVI-yield relationships but instead use the seasonal climate metrics to distinguish between different seasonal patterns, the nonlinear SVM and DL models can be used to make predictions for other years. This is advantageous for operational yield hindcasting and while we have focused on generating hindcasts in this study, it also has implications for yield forecasting. Yield forecasts are generally produced prior to the end of the growing season, when actual yield is unknown. Early forecasts, made before the crop is sown, can help farmers make decision about crop management, such as which crop and cultivar to plant, and how much fertiliser to apply at sowing. However, early forecasts cannot make use of in-season remotely sensed information and generally rely on the use of seasonal forecasts and crop models [72,73]. In contrast, remotely sensed phenology can contribute to forecasts made during the growing season that are useful for management decision about fertiliser, weed and disease management and also for industry-wide decision-making to support logistics, such as scheduling of grain transportation by road and rail, marketing and policy-making [10,[74][75][76]. Use of phenological metrics, climate data and machine learning has shown promise for forecasting wheat yield with 2-months lead time in Australia [31]. Our work supports this, and to enable the possibility of extending our work from considering hindcasts only to also considering forecasts, we recommend using climate metrics in nonlinear models, such SVM and DL found optimal by our study, rather than LMMs.
Of the two models identified as optimal, we prefer DL because it is considerably faster to implement. Validation of the DL model showed that it performed better in some years than others ( Figure 5). Lower R 2 was observed for low-yielding years, suggesting that the phenological and seasonal climate metrics may not provide sufficient information to capture the complex relationship between crop growth and yield in dry years. For some years, poorer accuracy may be due to fewer cloud-free Landsat images being available during the growing season ( Figure 2). For example, in 2003 there was only one cloud-free image between May and October. In 2005, 2009 and 2014 there were a few cloud-free images during the vegetative or early part of the growing season. In 2011 and 2012, there were a few cloud-free images available during the peak of season. Accuracy was highest in years with a reasonable number of cloud-free images regularly distributed throughout the growing season: 2008, 2013, 2015, 2017 and 2018. This suggests that yield could be mapped with higher accuracy if more cloud-free images were available. This could be achieved by combining data from multiple sensors [12,77,78], or by fusing Landsat data with MODIS data which has a more frequent revisit capacity [79][80][81].
We trained a DL model using all the available data and produced hindcast maps of yield at 30 m resolution for each year from 2003 to 2019. The maps showed spatial and temporal consistency with observed yield. However, the range of predicted yields in some years was lower than observed, indicating that the model tends to under-predict high yields and over-predict low yields. This is also demonstrated in plots of observed versus predicted yields for the test data. This is a common result of using statistical and machine learning models that fit observations in an average sense: higher errors are observed for extreme observations. Moreover, it is not uncommon for yield estimation [29,30]. There is no evidence of systematic error or bias evident in some other studies, such as under-prediction of high yields [32], or over-prediction of yields [24].
There is a degree of spatial smoothing in the yield hindcasts caused by the use of SWGC for growth curve estimation (see Evans and Shen [48] for discussion of the smoothing effect of SWGC). This affects the precision of the hindcasts, but not their value. A model has value if it helps the user make a better decision. Our goal of producing long-term hindcasts of yield at within-field resolution is to support precision agriculture decisions about how to vary crop management within fields. Currently, most decisions about in-field crop management are based on interpolated yield maps that have been smoothed [2], thereby reducing the yield range and the yield variability. Moreover, farmers may not have a sufficiently long record to quantify effects of seasonal variability and climate change. While our hindcasts may not be as spatially precise as yield monitor data, they do form a longer temporal record that can be used to understand impacts of seasonal climate conditions on yield.
Although we are primarily interested in estimating yield within fields, estimation of yield at field-scale is important to agricultural planning and policy-making. We therefore averaged estimated and predicted yields over each field-year combination so that field-average results could be considered. The accuracy statistics for field-average yield prediction using the deep learning model, calculated for the test data, was R 2 = 0.85, MAE = 0.14 t ha −1 and RMSE = 0.19 t ha −1 . This suggests an added utility of yield estimation using this approach beyond our original goal of supporting on-farm decision-making: accurate estimates of field-average yields can be produced over large areas to support off-farm decision-making.

Conclusions
To support decision-making about how to optimise crop management within fields, this study aimed to determine whether phenological and seasonal climate metrics obtained from SWGC estimation have utility for estimating wheat yield.
We investigated the relationships between phenological and seasonal climate metrics and yield and found that integrated NDVI over the growing season (iNDVI) could explain the spatial and temporal variability in yield, but the relationship between iNDVI and yield varied from year to year. Annual water availability had the highest correlation with yield across all years, suggesting its potential for explaining temporal variability in yield or for explaining the interannual variability in the iNDVI-yield relationship.
We assessed and compared six statistical and machine learning models for estimating wheat yield using two sets of predictors: phenological metrics only and combined phenological and seasonal climate metrics. Using only phenological metrics, accuracy was highest using a linear mixed model with a random effect that allowed the relationship between iNDVI and yield to vary by year (R 2 = 0.67, MAE = 0.25 t ha −1 , RMSE = 0.33 t ha −1 , NRMSE = 0.25). For all other models, accuracy was higher when seasonal climate metrics were also used as predictors. We identified two equally optimal models using the combined phenological and seasonal climate metrics: support vector regression and deep learning models (R 2 = 0.68, MAE = 0.25 t ha −1 , RMSE = 0.32 t ha −1 , RMSE = 0.25). While the linear mixed model using only phenological metrics performed similarly to the nonlinear models that also used seasonal climate metrics, the nonlinear models can be more easily generalised to estimate yield in years for which training data are unavailable. We selected the deep learning model as optimal because it is faster to implement than the support vector regression. We performed further validation of the deep learning model by comparing observed and predicted yields for each year and found that it performed better in higher-yielding years and in years with a reasonable number of cloud-free images regularly distributed throughout the growing season. We used the model to produce yield hindcasts for all years from 2003 to 2019.
We conclude that long-term hindcasts of wheat yield in fields, at 30 m spatial resolution, can be produced by using SWGC estimation to detect remotely sensed phenology from Landsat NDVI and creating phenological and seasonal climate metrics to use as predictors of yield in machine learning models. However, better accuracy could be obtained if more regular time-sequences of NDVI were available. This might be achieved by combining remotely sensed data from multiple sources or by fusing Landsat data with more frequent MODIS observations.