Developing and Testing Remote-Sensing Indices to Represent within-Field Variation of Wheat Yields: Assessment of the Variation Explained by Simple Models

: One important issue faced by wheat producers is temporal and spatial yield variation management at a within-field scale. Vegetation indices derived from remote-sensing platforms, such as Landsat, can provide vital information characterising this variability and allow crop yield indicators development to map productivity. However, the most appropriate vegetation index and crop growth stage for use in yield mapping is often unclear. This study considered vegetation indices and growth stages selection and built and tested models to predict within-field yield variation. We used 48 wheat yield monitor maps to build linear-mixed models for predicting yield that were tested using leave-one-field-out cross-validation. It was found that some of the simplest models were not improved upon (by more complex models) for the prediction of the spatial pattern of the high and low yielding areas (the within-field yield ranking). In addition, predictions of longer-term average yields were generally more accurate than predictions of yield for single years. Therefore, the predictions over multiple years are valuable for revealing consistent spatial patterns in yield. The results demonstrate the potential and limitations of tools based on remote-sensing data that might provide growers with better knowledge of within-field variation to make more informed management decisions.

within-field variation on a single year, it is also important to identify areas with consistently low yields year after year.
Analysing the spatial variation of crop yields (and its drivers) over multiple seasons requires multi-year crop growth information. Combine-mounted yield monitors can provide this information, but long-term records for fields are not common. One alternative to providing the information is remote-sensing imagery from earth-observing satellites. The Landsat series of satellites is of particular use for this because of its long history, that started in 1972. The long history has proven Landsat's excellence in providing a window into the past, useful for any agricultural monitoring and modelling [4]. Specifically in this study, together with the spatial resolution of 30-m pixels, this long history makes the Landsat dataset valuable for identifying stable long-term patterns of spatial variation within large broadacre cropping fields.
Information derived from satellite imagery can help develop crop yield indicators to map field-specific productivity [1]. These yield indicators often directly capture crop growing conditions, such as biomass, using extracted vegetation indices. According to previous studies, the most widely used vegetation index that correlates with yield is the Normalised Difference Vegetation Index (NDVI) [5]. However, over time, other vegetation indices have been developed that provide alternative options to NDVI depending on the user's purpose and considerations. For instance, Enhanced Vegetation Index (EVI) was developed to be more sensitive in areas of high biomass [6]. Determining a suitable vegetation index is a crucial step in any study seeking to develop yield indices. In some cases, a combination of more than one vegetation index might improve analysis results [7]. Furthermore, the crop growth stage at which imagery is selected should be considered before extracting vegetation indices to explore the crop yield predictive ability within a growing season [8]. The selection of an appropriate growth state is essential since it relates to how dense the biomass is that hypothetically reflects the wheat yield.
Improved prediction of actual yield does not necessarily correspond to improved prediction of the within-field spatial pattern of yield. Therefore, a particular focus of the current study is on whether areas of a field predicted as high/low yielding correspond to areas of high/low actual yields. If this spatial pattern can be assessed well for individual seasons, then, in the future, predictions might be used to identify, for instance, areas of consistent poor growth over multiple growing seasons, which might relate to soil constraints. In this regard, this study looks not only at how well the spatial variation of yield for individual growing seasons is assessed, but also at how well the spatial variation of average yield over multiple growing seasons is assessed.

Which vegetation indices perform best? Can combinations of vegetation indices
representing information about biomass and chlorophyll provide improvements compared to a single index? 2. What growth stage, automatically detected based on analysis of remote-sensing time series, is best? Can data from multiple growth stages help predictions? 3. Can the raw Landsat bands add further information to improve predictions compared to those based on the vegetation indices alone, or do pre-formulated vegetation indices contain most of the valuable information?
These questions are addressed while considering a variety of metrics to assess different aspects of the predictions of within-field yield variation: 1. How well do predictions represent the within-field variation of actual yields and of yield ranking?
2. How well do predictions represent within-field yield variation over multiple years as compared with for a single year?
Findings from this study will be important for further work looking at the drivers of this variation and for underpinning software (e.g., ConstraintID; https://constraintid.net.au/) that can overlay remote-sensing data with soil data, with the aim of diagnosing reasons for consistent spatial variation of crop yields within fields.

Study Area
The study area is in Australia's northern grain-growing region (as defined by the Grains Research and Development Corporation of Australia, GRDC), encompassing Queensland and New South Wales (Figure 1). This work focuses on the winter wheat crop, commonly planted from April to July and harvested from September into December.

Crop Yield Monitored Data
This study used 48 wheat yield maps collected from 23 fields between 2001 to 2016. The original yield monitor data were cleaned and block kriged to 30-m grids in previous work [12]. A subset of those data used, as some of them were excluded because (i) there appeared to be differential within-field management (since our ultimate interest with undertaking the current study is the within-field variation of soil constraints, which would be confounded by the presence of within-field variation in management); or (ii) there was a substantial portion of missing values. This pre-processing resulted in the 48 yield maps summarised in Table 1.  [23], and cloud and cloud-shadow were masked from images using the Fmask algorithm [24]. The data all sit on grids with 30-m spacings, and here the data from six bands of each satellite were used, representing the blue, green, red, near-infrared, and two shortwave infrared portions of the electromagnetic spectrum. Only images with at least 75% coverage of each yield map's pixels were included; images with cloud coverage of more than 25% of a yield map were excluded. Incomplete images occurred because of either (i) partial cloud coverage or (ii) the 'SLC (Scan Line Corrector)-off' issue with Landsat-7. Where images were incomplete but had ≥75% coverage, gaps were filled by regression kriging as follows. First, the image (from the same season as the incomplete image) with the highest correlation with the incomplete image was selected as the covariate (provided it covered the missing pixels). Next, a linear model was fitted to predict the missing pixels before the residuals from this linear model were kriged and added to the linear function to give the fill values.

Selection of Vegetation Indices
Eight vegetation indices were tested for studying within-field yield variation. A study by [7] classified the indices into two groups, including indices that provide a representation of canopy structure (VISTR) and indices that are good for describing crop photosynthetic activities (VICHL). The indices are shown in Table 2.

Chlorophyll-related indices (VICHL)
Chlorophyll Vegetation Index (CVI) × The VISTR indices, NDVI, EVI, EVI2, RVI, VARIgreen, and DVI, represent the canopy structure and are positively correlated with leaf area index (LAI) and biomass. They are formulated based on the captured light within the visible and near-infrared wavelengths, primarily through positive relationships with vegetation reflective bands, such as NIR and green reflectance, and negative relationships with vegetation absorption bands, such as red and blue reflectance. Based on a literature review, the chosen indices have shown good yield prediction in previous crop analyses. The NDVI is a benchmark because it is the most widely used and known index for vegetation fluctuations. The EVI and EVI2 indices were considered because they are sensitive to very dense biomass, able to reduce the disturbance effect from the canopy background and the atmosphere, reduce aerosol perturbation of leaf chlorophyll, and are less prone to saturation [25]. The RVI is a good predictor of leaf water content and is sensitive to green vegetation, so it is well-correlated with the LAI and leaf biomass [26]. The VARIgreen index is less sensitive to atmospheric effects, is suitable for wheat classification, and has an excellent correlation with biomass variation over the entire wheat growing period, so it will not saturate at high leaf biomass [27], [28]. Finally, DVI is sensitive to soil background change [29]. The VICHL indices, CVI and, GCVI, indicate nitrogen supply and are a good predictor for crop yield [7]. They are formulated mostly based on two vegetation reflective bands, the NIR, and green bands, which are primarily reflected by healthy vegetation. Both CVI and GCVI have a high correlation with crop chlorophyll content [30]. On the other hand, GCVI is not saturated at high leaf biomass and has a good relationship with the LAI of cereal crops.

Imagery Selection
This work examined the performance of vegetation indices at different stages, where remotely-sensed imagery would have the most potential to hold information indicative of final crop yields. The duration of these stages was fixed at 25 days to allow data capture from up to three separate Landsat images (potentially data for the same field every 8 days, when two of the Landsat satellites are operational). Each interval was defined relative to each vegetation index's peak values in a time series of field medians; broadly speaking, these peaks correspond to the greatest biomass period. Therefore, the stages were defined as pre-peak, peak, and post-peak stages. For instance, the peak NDVI stage was defined as the 25-day period centred on the date of the peak field-median NDVI; the pre-peak and post-peak NDVI periods were defined as the preceding and following 25-day periods, respectively. Note that the data representing any stage could be a composite of up to three images from the 25-day period.

Set of Covariates
We created models (see Section 2.3 for modelling details) based on various combinations of vegetation indices and stages. We started with a single index from a single-stage and compared different ways of using information from multiple indices or multiple stages. The combinations investigated in this analysis were:


A single index, one stage (e.g., NDVI-peak stage);  Two indices from different VI groups (see Table 2), both from the same stage (e.g., a combination of NDVI and CVI in peak stage);  A single index, all stages (e.g., NDVI from the pre-peak, peak, and post-peak stages);  Combinations selected via a stepwise analysis, see Section 2.3.2, allowing any combinations of vegetation index and stage or raw Landsat bands and stage.

The Linear Mixed-Effect Model
The linear model provides a simple general form for relationships between covariates and yield. Typically, this linear model's residuals (the data minus the linear function's predicted values) are assumed to be independent. However, in the context of the yield maps used as training data in this work, this assumption is not appropriate since the errors at locations within any particular yield map will be more closely related than the errors from locations in distinct yield maps. The linear-mixed model framework [31] is designed to represent such data; here, the linear function is referred to as the fixed effects, while the residuals, assumed to be normally distributed with some correlation structure, are referred to as the random-effects. Covariate combinations and yield monitored data were paired at each pixel location. A random-effects component (random intercept and slopes) was included for each specific yield map. Models were fitted using the lme4 [32] package for R 4.0.2 [33].

Stepwise Analysis
As noted previously, in addition to manually selected covariates, a stepwise analysis was applied to select combinations of up to six covariates to predict yield. This work used a well-known model-selection criterion, namely the Akaike Information Criterion (AIC) [34], to determine the covariate to be added to the linear-mixed model at each step of the algorithm. This resulted in six models (from the six steps) of increasing complexity. The lower the AIC, the better the fitted model. However, model assessment via crossvalidation (see Section 2.3.3) was still applied to each of the six models to provide a better indication of predictive performance.

Cross-Validation
Predictions from the linear-mixed model approach were assessed using a form of cross-validation; this was done for all sets of manually selected covariates (see Section 2.2.5) and for the sets of covariates from each step of the stepwise analysis (Section 2.3.2). Since this work's main aim was to investigate approaches to predict and map within-field yield variation based on remote-sensing data, a leave-one-yield-map-out cross-validation [8,12] method was adopted (which this work refers to from here on as simply crossvalidation). In this approach, each of the 48 yield maps, in turn, was withheld from the training dataset and a linear-mixed model fitted based on the remaining training data. The concordance correlation coefficient [35] was used to assess the relationship between observed and predicted values of the withheld data. The CCC can range from −1 to 1, where the closer the value to 1, the closer the agreement between observed and predicted values, and values close to 0 indicate a lack of association between observed and predicted values. The CCC was first calculated based on the agreement between observed and predicted yields, CCCY. Since a primary objective in this work was to determine how well the predictions represent the spatial patterns of within-field variation of yield, the CCC was also calculated based on the agreement between ranked predictions and ranked withheld data (i.e., the n withheld data, representing all the pixels within any given yield map, were ranked from 1 to n, as were the predictions of these data), referred to here as CCCRank. The CCCRank measures the agreement between pixel rankings based on observed and predicted yields, and as such provides a way to assess how well the high predicted yields corresponded with high actual yields, and the low predicted yields corresponded with low actual yields; in this work, it is this measure that this work focus on most. Each of these measures was calculated for each withheld yield map in turn, and the spread of the 48 CCC values was summarised.
In addition to assessing predictions for any given year, an approach to investigate predictions of longer-term average yields was also applied. Our dataset included 15 fields with more than one year of yield data (Table 1). For a given field with yield maps from T years, a total of P pixels were identified that were present in all T yield maps. The withheld data and predictions for this field were labelled as and (for in 1, … , and in 1, … , ), respectively. Then the predictions (for each of the P pixels) of the T-year average yield were calculated as the average of over the T years, and compared with the average of over the T years (again using CCCRank). For simplicity, this validation measure was applied only in the final comparison of the validation study.

Single-Index Models
This work applied cross-validation for models based on a single vegetation index in each stage (pre-peak, peak, and post-peak). In most cases, using vegetation index data from the peak stage gave the highest median CCCRank (Figure 2). From the peak stage, NDVI, EVI, EVI2, RVI (all from the VISTR group of indices), and GCVI (from the VICHL group) gave the highest median CCCRank values (0.59-0.63), all giving reasonably similar results. Table 3a shows another performance metric (CCCY) for the models with the five largest CCCRank values. The values of the median CCCY were all less than 0.2, indicating that predictions of actual yield were not very reliable.

Multi-Index Models
Multi-index models were assessed for combinations of indices from the two vegetation index groups; six VISTR and two VICHL indices (Figure 3; see Appendix A and B for the full results). Similarly, to the single-index analysis, all VISTR-VICHL combinations showed the best performance in the peak stage. In this stage, there were generally only small differences between models based on the different indices. Furthermore, the results did not show any notable improvements over those based on a single vegetation index, with the best median CCCRank value being 0.63. This was also the case in terms of CCCY (Table 3b), which showed similar ranges of values to those from the single-index analysis.

Multi-Stages Model
In the multi-stage analysis, this work used data from a single vegetation index from all three stages (pre-peak, peak, and post-peak) in the same model. Similarly, in the singlestage analysis, the multi-stage analysis also showed only relatively small differences in CCCRank between indices (Figure 4; see Appendix C for full results). Most of the indices showed a slightly better agreement in the peak stage alone, rather than in the multi-stage analysis (NDVI, EVI, EVI2, VARIgreen, and DVI). This was also the case in terms of actual yield predictions (assessed by the median CCCY; Table 3c).

Stepwise Analysis.
Aside from the analyses that used manually selected covariates (a single-index, multiple indices, and multiple stages), this work also applied a stepwise method to select models with up to six covariates. The covariates involved were selected based on the lowest AIC among all the models and subsequently assessed using cross-validation. The first covariate selected was GCVI from the post-peak stage (Table 4), which gave a median CCCRank of 0.56 (Table 3d and Figure 5). The variables selected in the second and third steps were from the pre-peak and peak stages, which improved the median CCCRank value to over 0.6. At the fifth and sixth steps, data from the SWIR1 band were selected, first from the post-peak stage, and next from the peak stage. Cross-validation results from the models selected in steps two, three, four, and five were not significantly different ( Figure  5). Based on the cross-validation results, the model from step two, which involved the combination of GCVI post-peak and EVI pre-peak, gave the largest median CCCRank (0.63). This was still no better than the best model from the single-stage analysis (RVI Peak), indicating no better representation of the spatial pattern of yield. In terms of actual yield predictions, results did improve slightly but the best median CCCY values were only around 0.2 (Table 3d). Results showed the best models when considering both metrics were those from Steps 2-4 of the stepwise algorithm, which showed CCC values close or similar to those of the best model in terms of CCCRank (RVI Peak, GCVI post-peak, and EVI pre-peak) and in terms of CCCY (NDVI peak). Table 4. AIC summary for models in stepwise analysis. Covariate sets are referred to as S1-S6.

Comparing Multi-Year Average Prediction and Single Year Prediction
A performance measure was calculated for fields with more than one year of yield data. For this analysis, this work focused on just the model of NDVI peak because it was considered as a well-known vegetation index and had quite a high median CCCRank compared to other models. As shown in the three years of data, similar spatial patterns of yield were observed (and predicted) for all three years. In this field, there were 825 pixels in all three years, and only these common pixels are analysed here and shown in Figure  6. These median values (0.47 for CCCY and 0.75 for CCCRank) indicate how well yields for individual years were predicted in this field. An indication of how well longer-term average yields were predicted is provided by directly comparing the three-year-average maps (those on the final row of Figure 6); the CCCY and CCCRank for the three-year-average maps were 0.52 and 0.83, respectively.
Results in Figure 6 illustrate the process of multi-year validation for one field; Table  5 summarises the results from similar analyses of all 15 fields where multi-year data were available. The table shows that all the final mean and median CCC values for both metrics were larger in the multi-year analysis than in the single-year analysis. In particular, the CCCRank from the multi-year analysis was on average 0.12higher than that from the singleyear analysis (p < 0.05, from a paired t-test). Moreover, over the 15 fields analysed for Table  5, the CCCRank for individual fields ranged from 0.11 (Field 3, Year 2008) to 0.89 (Field 9, Year 2002), whereas the multi-year CCCRank values ranged from 0.42 for Field 3 to 0.86 for Field 13. Thus, the multi-year results suggest that using multi-year predictions reduces the risk of very poor predictions.

Selection of Vegetation Indices and Stages for Simple Yield-Prediction Models
Several of the vegetation indices tested performed similarly in terms of the median CCCRank. Figure 2 depicts this similarity for the best five CCCRank within all stages (NDVI, EVI, EVI2, RVI, and GCVI). All five indices ranged from 0.51-0.55 during the pre-peak stage, 0.59-0.63 during the peak stage, and 0.54 to 0.56 during the post-peak stage. These top five models came from both groups of indices, where NDVI, EVI, EVI2, and RVI are categorised as canopy structural-related indices (VISTR), and GCVI is a chlorophyll-related index (VICHL). Thus, this study did not find any evidence to suggest that there is an advantage to using a canopy structural-related index (VISTR) over a chlorophyll-related index (VICHL) or vice versa. A previous study from [7] also reported similar findings, where the VISTR and VICHL gave a similar performance for predicting actual wheat yields.
Two possible ways of incorporating extra information (compared with a singleindex, single-stage model) were investigated: using data from two different vegetation indices (one canopy structural-related index and one chlorophyll-related index, both from the same stage; multi-index models) and using data from all stages (for the same vegetation index; multi-stage models). Neither of these approaches resulted in large improvements in predictions (a best value of CCCRank of 0.63 from the multi-index models and of 0.64 for the multi-stage models) compared with the single-index single-stage models. A similar result was also reported by [7], who also found only a small increase in predictive power when assessing a combination of structural and chlorophyll-related indices [7] assumed that this similarity occurred because structural and chlorophyllrelated indices respond to different aspects of the crop (morphological and physiological). In our study, one possible reason for the lack of improvement might be the overlap in information (correlation) between the vegetation index data from the same index (different stages) or from the same stage (different indices). However, results from a stepwise analysis-where data from different vegetation indices, different stages, and the raw Landsat bands could be added to the model-also gave a best value of the median CCCRank of 0.63. In this case, the information added to the model at each step would not be so correlated with information already in the model. Therefore, results here (based on this dataset of yield monitor data and the pool of simple linear models of vegetation indices tested) suggest linear functions of a single vegetation index (any of NDVI, EVI, EVI2, RVI, and GCVI) from around the time of its season peak would give reasonably accurate representations of the spatial patterns of within-field yield variation (a median CCCRank of around 0.6).

Another Metric Assesing Properties of Yield Predictions
In terms of actual yield predictions, the best value of the median CCCY was only 0.2. The validation statistics in terms of this metric were smaller than for yield rankings because it is more challenging to predict actual yields and their within-field variation (based on just remote-sensing data) than to predict the within-field yield rankings. One possible reason for the very modest results in terms of CCCY (in comparison with other studies, e.g., [7]) is the cross-validation approach used. This form of cross-validation would provide a sterner and more relevant prediction test than internal metrics (e.g., R 2 ) or cross-validation with random splitting. Other studies also recommended the use of this cross-validation strategy since it provides a realistic and accurate prediction [3,4]. In any case, the results indicated that the predictions of actual yield for any single year, based on the models involved in this work, should be used cautiously.

Stepwise Results Revealed Some Simple Models
Initially, this work hypothesised that the best single-covariate model would be one consisting of a vegetation index coming from the peak stage. In terms of the median CCCRank from cross-validation (Figure 2), our results backed up this hypothesis. However, the first variable to be added in the stepwise analysis was GCVI from the post-peak stage, which gave the smallest AIC of all the single-covariate models. Although selected based on the AIC, this model did not perform so well in terms of the median CCCRank from crossvalidation; this highlights the importance of validation methods and metrics tailored to assess the properties that are considered most important for a particular application, which in this case was defined as an assessment of spatial patterns of within-field yield variation.
There were two interesting observations that can be made from this stepwise analysis: (i) the first three covariates to be added were all different vegetation indices from different stages and (ii) the peak stage information was not included until step three. The first three indices were GCVI post-peak, EVI pre-peak, and EVI2 peak. A possible reason is that there might be overlapping information in different vegetation indices from the same stage or in the same vegetation index from different stages. Therefore, the most useful models include different indices and different stages. Furthermore, in steps 5 and 6, data from a raw band (SWIR1) were added to the model. The selection of SWIR1 was possibly because, in the previous steps, all the vegetation indices added to the model were formulated from the visible and NIR bands. Therefore, the SWIR1 band might provide additional information, such as moisture information [36], that was not included in the previous steps.
In terms of CCCRank, there was no notable improvement from the stepwise models (Table 3d) compared with the single-index models (Table 3a). However, in terms of CCCY, the cross-validation results suggested that the models from steps 2-4 gave the best allaround performance. The CCCY values were around 0.2, which were slightly higher than other model performances. Therefore, if one is willing to use a model of more than a single index, then one of these models might be preferable.

Multi-Year Analyses and Implications
There are many possible reasons for poor prediction performance. Notably, for a particular year, there could be issues with the yield monitor data or with the failure of the remote-sensing data to capture the information most relevant to yield variation (because of the timing of imagery or an index that is not sensitive within a particular field and date). These issues can lead to predictions that show a good correlation with the data for a particular year, but a poor correlation for the same field in other years. Therefore, in this work, a long-term-average analysis was also applied for some fields that presented more than one year of yield data (fields with two or three years in our dataset). The multi-year yield analysis (applied with the NDVI peak single-covariate model) showed that the mean and median of both metrics were better for the multi-year than for the single-year results. These results suggested that a single year's validation might provide a conservative assessment of long-term-average yield prediction performance. In terms of the mean over all the multi-year fields, the improvement from the multi-year results was 0.05 for CCCY and 0.12 for CCCRank.
Based on these multi-year improvements, it might be reasonable to expect predictions of averages over more years, such as five or ten years (our dataset had fields with at most three years of data), to give larger improvements. Besides, the results also give confidence that predictions from remote-sensing vegetation indices over multiple years are a valuable tool for precision agriculture to reveal consistent spatial patterns in yield. Previous studies [2,37] have used consistent yield patterns to target soil sampling to identify the reason behind the low yielding areas; given that persistently low yield can indicate the presence of a soil constraint. Therefore, the methodology identified from this work could be used to develop tools to provide growers with important information about the potential presence and location of soil constraints on a paddock scale. In further work, the current analysis will be built on by using predictions based on models in this paper together with data on other factors that potentially drive the spatial variation on crop yield (e.g., topography, climate, soil).

Conclusions
This work concluded that there were only marginal differences in the performance of the different vegetation indices tested. The well-known indices, such as NDVI, RVI, EVI, and EVI2, mostly showed good predictions of the spatial pattern of yield, but only modest performance in terms of predictions of actual yield (when assessed via withinfield metrics with leave-one-yield-map-out cross-validation). In terms of vegetation index group, indices from both the structural-related group (VISTR) and the chlorophyll-related group (VICHL) gave a similar performance in the single-index, single-stage analysis. In terms of stages, data from the peak stage gave the best performance, and combining data from the same vegetation index but from multiple stages did not improve predictions. Results from a stepwise analysis revealed some simple combinations of different vegetation indices from different stages could have better predictions in terms of actual yield predictions but did not improve over the single-index models in terms of yield ranking predictions alone. Moreover, the results showed that longer-term average yield predictions were generally more accurate than those of yield for single years, and also reduced the risk of having predictions with poor performance.