Winter Wheat Yield Assessment from Landsat 8 and Sentinel-2 Data: Incorporating Surface Reflectance, Through Phenological Fitting, into Regression Yield Models

A combination of Landsat 8 and Sentinel-2 offers a high frequency of observations (3–5 days) at moderate spatial resolution (10–30 m), which is essential for crop yield studies. Existing methods traditionally apply vegetation indices (VIs) that incorporate surface reflectances (SRs) in two or more spectral bands into a single variable, and rarely address the incorporation of SRs into empirical regression models of crop yield. In this work, we address these issues by normalizing satellite data (both VIs and SRs) derived from NASA’s Harmonized Landsat Sentinel-2 (HLS) product, through a phenological fitting. We apply a quadratic function to fit VIs or SRs against accumulated growing degree days (AGDDs), which affects the rate of crop development. The derived phenological metrics for VIs and SRs, namely peak, area under curve (AUC), and fitting coefficients from a quadratic function, were used to build empirical regression winter wheat models at a regional scale in Ukraine for three years, 2016–2018. The best results were achieved for the model with near infrared (NIR) and red spectral bands and derived AUC, constant, linear, and quadratic coefficients of the quadratic model. The best model yielded a root mean square error (RMSE) of 0.201 t/ha (5.4%) and coefficient of determination R2 = 0.73 on cross-validation.


Introduction
The combination of data acquired by Landsat 8 and Sentinel-2 remote sensing satellites can provide a high temporal resolution (3-5 days) [1], which is critical for various applications requiring dense satellite data time series. Previously, such (or better) high temporal resolution was available mainly for Earth observation sensors, which acquire daily data over Earth's surface, but at a coarser spatial resolution (>250 m) [2,3]. The latter, for example, includes space-borne remote sensing sensors, such as Moderate Resolution Imaging Spectroradiometer (MODIS), Visible Infrared Imaging Radiometer Suite (VIIRS), Advanced Very High Resolution Radiometer (AVHRR), and SPOT-VEGETATION. Taking into account an increased frequency of observations at moderate spatial resolution (<30 m), the assumption Remote Sens. 2019, 11, 1768 3 of 19 existing remote sensing-based studies for crop yields are focused on using vegetation indices and do not exploit spectral information.
Though a combination of Landsat 8 and Sentinel-2 offers a high frequency of observations, discrepancies in available cloud-free data, however, will still exist. With more and more applications transitioning to a higher spatial resolution, it is important to explore and analyze the effect of those discrepancies on the quality of the resulting products. In the case of crop yield assessment, one of the important questions is as follows: How do discrepancies in the dates of satellite observations influence the performance of empirical crop yield models at the regional scale? The present paper aims to address this question by documenting the results of a study building empirical models for winter wheat yield assessment with Landsat 8 and Sentinel-2 data in Ukraine. Not performing satellite data smoothing may lead to poorer performance of the empirical crop yield model, which would be attributed not to the lack of correlation with satellite-derived variables but rather to observation irregularities. We will not limit satellite data fitting to vegetation indices only, and will also explore surface reflectance values in multiple spectral bands for the assessment of winter wheat yields. Finally, we will evaluate the use of multi-source data, namely Landsat 8 and Sentinel-2, for crop yield assessment. In particular, we assess quantitively how the performance of an empirical model of crop yield depends on the combined use of Landsat 8 and Sentinel-2 data, compared to a single-sensor usage.

Study Area and Reference Data
The study was performed for Kirohohradska oblast in Ukraine for 2016 to 2018. An "oblast" is a high-level administrative division of the country, and each oblast is further divided into districts. There are 24 oblasts in Ukraine and Autonomous Republic of Crimea. Kirovhradska oblast is located in the central part of Ukraine and consists of 21 districts with a geographical area ranging from 65 to 165 thousand ha and cropland area ranging from 27 to 112 thousand ha ( Figure 1). all existing remote sensing-based studies for crop yields are focused on using vegetation indices and do not exploit spectral information. Though a combination of Landsat 8 and Sentinel-2 offers a high frequency of observations, discrepancies in available cloud-free data, however, will still exist. With more and more applications transitioning to a higher spatial resolution, it is important to explore and analyze the effect of those discrepancies on the quality of the resulting products. In the case of crop yield assessment, one of the important questions is as follows: How do discrepancies in the dates of satellite observations influence the performance of empirical crop yield models at the regional scale? The present paper aims to address this question by documenting the results of a study building empirical models for winter wheat yield assessment with Landsat 8 and Sentinel-2 data in Ukraine. Not performing satellite data smoothing may lead to poorer performance of the empirical crop yield model, which would be attributed not to the lack of correlation with satellite-derived variables but rather to observation irregularities. We will not limit satellite data fitting to vegetation indices only, and will also explore surface reflectance values in multiple spectral bands for the assessment of winter wheat yields. Finally, we will evaluate the use of multi-source data, namely Landsat 8 and Sentinel-2, for crop yield assessment. In particular, we assess quantitively how the performance of an empirical model of crop yield depends on the combined use of Landsat 8 and Sentinel-2 data, compared to a single-sensor usage.

Study Area and Reference Data
The study was performed for Kirohohradska oblast in Ukraine for 2016 to 2018. An "oblast" is a high-level administrative division of the country, and each oblast is further divided into districts. There are 24 oblasts in Ukraine and Autonomous Republic of Crimea. Kirovhradska oblast is located in the central part of Ukraine and consists of 21 districts with a geographical area ranging from 65 to 165 thousand ha and cropland area ranging from 27 to 112 thousand ha ( Figure 1). The reason for selecting this region is that it is a top 10 wheat producer in Ukraine and because of the availability of reference crop yield and harvested area data at the district scale for 2016 to 2018. Winter wheat is one of the major crops in Kirovhradska oblast, accounting for 20% of the production of all crops in the region. Winter wheat is mainly rain-fed in the region and usually planted in September to October. After dormancy during the winter, it re-starts its growth in early spring, reaching maturity by the end of June. The harvest of winter crops is typically undertaken in July. The reason for selecting this region is that it is a top 10 wheat producer in Ukraine and because of the availability of reference crop yield and harvested area data at the district scale for 2016 to 2018. Winter wheat is one of the major crops in Kirovhradska oblast, accounting for 20% of the production of all crops in the region. Winter wheat is mainly rain-fed in the region and usually planted in September to October. After dormancy during the winter, it re-starts its growth in early spring, reaching maturity by the end of June. The harvest of winter crops is typically undertaken in July.
Reference data on the crop yield and harvested area at the district level were collected from the Department of Agro-Industry Development of Kirovohrad State Administration (http://apk.kr-admin.gov. ua). The data were made available online as the harvest progressed and were based on farm surveys of all large agricultural enterprises (that account for more than 90% of all winter crops produced in the region) and samples of household farms, the same way that official statistics are collected [33]. The final estimates for winter crop yields and harvested areas were available at the end of November and were used as a reference in this study. The reference data went through a quality-control procedure, which resulted in two districts in 2017 and five districts in 2018, being excluded from the analysis, because the reported production was >+/−10% than the area multiped by yields. The average winter wheat yield among the districts was 4.0 ± 0.3 t/ha in 2016, 3.5 ± 0.4 t/ha in 2017, and 3.6 ± 0.4 t/ha in 2018.

Landsat-8/OLI and Sentinel-2A /MSI Datasets
Remote sensing images acquired by the Operational Land Imager (OLI) instrument aboard the Landsat-8 satellite and by the Multi-Spectral Instrument (MSI) aboard the Sentinel-2A/B satellites were used in the study. Landsat 8/OLI captures images of the Earth's surface in nine spectral bands at a 30 m spatial resolution (15 m for panchromatic band) [34], while Sentinel-2A/B/MSI captures images of the Earth's surface in 13 spectral bands at a 10, 20, and 60 m spatial resolution [35]. We used NASA's Harmonized Landsat Sentinel-2 (HLS) product [36], which provides a Level-2 Nadir BRDF (Bidirectional Reflectance Distribution Function)-Adjusted surface Reflectance (NBAR) at a 30 m spatial resolution. HLS combines, in a single data set, observations of the land surface from the Landsat 8's OLI [37] and Sentinel-2's MSI [38]. MSI data are also spectrally adjusted to OLI spectral bands, so to generate a single harmonized dataset. We used HLS version 1.4. A detailed description of HLS product generation is described in [37].
The HLS product is organized using the Sentinel-2 110 km × 110 km tiles. The following tiles cover the study region ( HLS data were used to extract surface reflectances (SRs) and to calculate corresponding vegetation indices (VIs). We used spectral bands common to Landsat 8 and Sentinel-2A/B, namely: Green (wavelength~0.560 µm), red (~0.660 µm), near infrared (NIR) (~0.865 µm, we used band 8A from Sentinel-2), and two short-wave infrared (SWIR) (~1.5 and~2.2 µm). This is an advantage of using analysis ready data (ARD), such as HLS, since data have been harmonized and as if they are acquired by a single sensor. We used a quality assessment (QA) layer provided in HLS to disregard unreliable pixels identified either as cloud, adjacent to cloud, shadow, snow/ice, or water.

Meteorological Data
We used air temperature derived from NASA's Modern-Era Retrospective analysis for Research and Applications (MERRA2) reanalysis data set [42] to compute the growing degree days (GDDs) for winter wheat. Santamaria-Artigas et al. (2019) [43] showed that MERRA2 is one of the best data sources for computing GDD for winter wheat in terms of error (<2K compared to in situ measurements), spatial resolution, and low delay in data availability. The data are provided on a regular grid that has 576 points in the longitudinal direction and 361 points in the latitudinal direction, corresponding to a resolution of 0.625 • × 0.5 • . We used the time-averaged two-dimensional data collection (M2SDNXSLV) to extract daily averaged two-meter air temperature (T2MMEAN). The data sets were extracted from the netCDF format, transformed to the geo-referenced raster, subset for study areas, and linearly interpolated to the 30-m spatial resolution.

Meteorological Data
We used air temperature derived from NASA's Modern-Era Retrospective analysis for Research and Applications (MERRA2) reanalysis data set [42] to compute the growing degree days (GDDs) for winter wheat. Santamaria-Artigas et al. (2019) [43] showed that MERRA2 is one of the best data sources for computing GDD for winter wheat in terms of error (<2K compared to in situ measurements), spatial resolution, and low delay in data availability. The data are provided on a regular grid that has 576 points in the longitudinal direction and 361 points in the latitudinal direction, corresponding to a resolution of 0.625° × 0.5°. We used the time-averaged two-dimensional data collection (M2SDNXSLV) to extract daily averaged two-meter air temperature (T2MMEAN). The data sets were extracted from the netCDF format, transformed to the geo-referenced raster, subset for study areas, and linearly interpolated to the 30-m spatial resolution.  The crop-specific masks were generated on a yearly basis, so the signal for building empirical yield models was extracted for winter crops only. Satellite data, namely SRs and VIs, were first smoothed before incorporating them into the crop yield model. This was performed through phenological fitting against meteorological data, namely accumulated GDD (AGDD), since temperature is a primary factor affecting the rate of crop development [44]. The following subsections describe in detail the methodologies used.

Winter Crop Type Mapping
In order to provide in-season winter crop maps (2016-2018), we implemented an automatic approach, which was initially developed for the coarse spatial resolution MODIS sensor [45] and further extended and prototyped for Landsat 8/OLI and Sentinel-2/MSI at a 30 m spatial resolution [22]. The approach uses a phenological metric during the green-up stage of winter crop development, namely the maximum NDVI, which is used as a main feature to discriminate winter crops from spring and summer crops. A cropland mask [46] is an essential data input for the method, to constrain the analysis to cultivated areas only, and to reduce potential confusion with natural vegetation and The crop-specific masks were generated on a yearly basis, so the signal for building empirical yield models was extracted for winter crops only. Satellite data, namely SRs and VIs, were first smoothed before incorporating them into the crop yield model. This was performed through phenological fitting against meteorological data, namely accumulated GDD (AGDD), since temperature is a primary factor affecting the rate of crop development [44]. The following subsections describe in detail the methodologies used.

Winter Crop Type Mapping
In order to provide in-season winter crop maps (2016-2018), we implemented an automatic approach, which was initially developed for the coarse spatial resolution MODIS sensor [45] and further extended and prototyped for Landsat 8/OLI and Sentinel-2/MSI at a 30 m spatial resolution [22]. The approach uses a phenological metric during the green-up stage of winter crop development, namely the maximum NDVI, which is used as a main feature to discriminate winter crops from spring and summer crops. A cropland mask [46] is an essential data input for the method, to constrain the analysis to cultivated areas only, and to reduce potential confusion with natural vegetation and forested areas. To perform discrimination in an automatic way, a Gaussian mixture model (GMM) [47] is applied with the number of GMM components set to 2. Setting this number is based on the assumption that there will be two modes in the distribution of maximum NDVI (phenological metric), which would correspond to winter and non-winter crops, respectively. This method was run in an automatic unsupervised way during the vegetation season for 2016 to 2018 without the need to provide ground truth. However, this method requires ancillary data inputs and a priori knowledge about the crop calendar. We refer readers to [22,45] for details.

Winter Wheat Yield Assessment
Crop yield assessment was done through the development of an empirical regression model between satellite-derived features (regressors) and reference crop yields at the district scale ( Figure 1). Satellite-derived features were averaged over the crop mask (Section 3.2) for each district and correlated with the available reference winter wheat yields (Section 2.1). Before averaging satellite-based features over pixels identified as winter crops, one has to ensure that satellite data are phenologically normalized, so averaging is performed for the same stages of crop growth development. Though a combination of Landsat 8 and Sentinel-2 offers a high frequency of observations, discrepancies in available cloud-free data, however, will still exist ( Figure 3).
We used a quadratic model between satellite data (VI or SR) and AGDD [26,48,49]: where y sat is the VI (NDVI, EVI2, or DVI) or SR (green, red, NIR, SWIR1, or SWIR2); and a 0 , a 1 , and a 2 are the constant, linear, and quadratic coefficients, respectively. GDD was defined as the average daily maximum (T max ) and minimum temperatures (T min ) minus a base temperature (T base ): where if [(T max + T min )/2 < T base ], then GDD = 0 [50], and with T base = 0. AGDD was calculated by summing GDDs (Equation (5)).
Having derived per-pixel relationships between VI/SR and AGDD (Equation (4)), it is possible to derive various phenological metrics to correlate with yields or incorporate coefficients (a 0 , a 1 , a 2 ) directly into the yield model. The generalized relationship between crop yields and satellite-derived features can be expressed as follows: where y l yield is the crop yield for district l; n r is the number of regressors used; ω j n r j=0 are regression coefficients, and r l j are satellite-derived features averaged for district l over the crop mask: where A l is the set of winter crop pixels; r l j,p is the satellite-derived feature for the winter crop pixel at location p; and |·| is the cardinality number, i.e., the number of elements in the set.
Crop yield assessment was done through the development of an empirical regression model between satellite-derived features (regressors) and reference crop yields at the district scale ( Figure 1). Satellite-derived features were averaged over the crop mask (Section 3.2) for each district and correlated with the available reference winter wheat yields (Section 2.1). Before averaging satellite-based features over pixels identified as winter crops, one has to ensure that satellite data are phenologically normalized, so averaging is performed for the same stages of crop growth development. Though a combination of Landsat 8 and Sentinel-2 offers a high frequency of observations, discrepancies in available cloud-free data, however, will still exist ( Figure 3). We used a quadratic model between satellite data (VI or SR) and AGDD [26,48,49]: Multiple approaches were considered in the study that varied in the regressors used (VI, SR, or combination of SRs) and the number of regressors, as well as phenological metrics. For VIs, two phenological metrics were derived from the quadratics relationship: Peak and area under curve (AUC) (Figure 4). where ysat is the VI (NDVI, EVI2, or DVI) or SR (green, red, NIR, SWIR1, or SWIR2); and a0, a1, and a2 are the constant, linear, and quadratic coefficients, respectively. GDD was defined as the average daily maximum (Tmax) and minimum temperatures (Tmin) minus a base temperature (Tbase): where if [(Tmax + Tmin)/2 < Tbase], then GDD = 0 [50], and with Tbase = 0. AGDD was calculated by summing GDDs (Equation 5).
Having derived per-pixel relationships between VI/SR and AGDD (Equation 4), it is possible to derive various phenological metrics to correlate with yields or incorporate coefficients (a0, a1, a2) directly into the yield model. The generalized relationship between crop yields and satellite-derived features can be expressed as follows: where is the crop yield for district l; is the number of regressors used; | =0 are regression coefficients, and are satellite-derived features averaged for district l over the crop mask: where is the set of winter crop pixels; , is the satellite-derived feature for the winter crop pixel at location p; and |•| is the cardinality number, i.e., the number of elements in the set.
Multiple approaches were considered in the study that varied in the regressors used (VI, SR, or combination of SRs) and the number of regressors, as well as phenological metrics. For VIs, two phenological metrics were derived from the quadratics relationship: Peak and area under curve (AUC) (Figure 4). The peak approach assumes that yield is correlated to the peak biomass during a crop growth season [7-9,23,37,51,52], which would result in the following regression (we removed l notations for simplicity): The peak approach assumes that yield is correlated to the peak biomass during a crop growth season [7][8][9]23,37,51,52], which would result in the following regression (we removed l notations for simplicity): where p VI is the peak of the VI during the vegetation season. The accumulation, or integrated, approach assumes that yield is proportional to the accumulated biomass over the crop growth season [18,[53][54][55][56]: where AGDD min and AGDD peak define ranges, for which accumulation is performed. AGDD min was determined based on the minimum VI values, namely 0.05 for DVI and 0.15 for NDVI and EVI2. These thresholds for VIs were determined empirically to make sure that accumulation was performed for vegetated stages and not bare land. AGDD peak is the AGDD value, for which the VI's peak is achieved and is equal to − a 1 2a 2 (Equation (4)). The p-value was also calculated to test the null hypothesis that a coefficient for this regressor is equal to zero, i.e., the regressor has no effect on the dependent variable.
It should be noted that almost all existing approaches dealing with satellite data and crop yields utilize VIs (e.g., Equations (1)-(3)), which incorporate two or more spectral bands into a single variable. With fitted SRs, one can explore trajectories in a multi-dimensional space of SRs for different spectral bands. Figure 5 shows examples of 1-D and 2-D trajectories for DVI and NIR-red, respectively, for different winter wheat yield values. The accumulation, or integrated, approach assumes that yield is proportional to the accumulated biomass over the crop growth season [18,[53][54][55][56]: where and define ranges, for which accumulation is performed. was determined based on the minimum VI values, namely 0.05 for DVI and 0.15 for NDVI and EVI2. These thresholds for VIs were determined empirically to make sure that accumulation was performed for vegetated stages and not bare land.
is the AGDD value, for which the VI's peak is achieved and is equal to − 1 2 2 ⁄ (Equation 4).
The p-value was also calculated to test the null hypothesis that a coefficient for this regressor is equal to zero, i.e., the regressor has no effect on the dependent variable.
It should be noted that almost all existing approaches dealing with satellite data and crop yields utilize VIs (e.g., Equations (1)-(3)), which incorporate two or more spectral bands into a single variable. With fitted SRs, one can explore trajectories in a multi-dimensional space of SRs for different spectral bands. Figure 5 shows examples of 1-D and 2-D trajectories for DVI and NIR-red, respectively, for different winter wheat yield values. Therefore, the following question arises: What is the added value that can be achieved, when incorporating surface reflectances into empirical regression crop yield models? However, adding more regressors in the model (Equation 6), while having a limited number of reference data, might lead to model overfitting-the model's performance will improve on training (calibration) data, while failing to generalize, i.e., to provide a reasonable performance on independent (previously unseen) data. In order to prevent overfitting of the model (Equation 6) with multiple regressors, we used a ridge regression [57] with the L2 (or Tikhonov's) regularization [58,59]: where Y, R, and W are vectors of yields, regressors, and weights of the regression, respectively, and is the regularization coefficient. The regularization term, ‖ ‖ 2 , prevents coefficients of the regression increasing during the minimization process, and therefore prevents overfitting, and might improve the generalization ability of the model. Such a constraint can either force the coefficients, W, to be "sparse" or reflect prior knowledge, namely correlation between regressors. Table 1 summarizes the regression models used in the study. The optimal regularization parameter, α, is determined during the training (calibration) process by dividing calibration data into training and testing datasets, where α is selected in such a way that it maximizes a model's performance for testing data. Therefore, the following question arises: What is the added value that can be achieved, when incorporating surface reflectances into empirical regression crop yield models? However, adding more regressors in the model (Equation (6)), while having a limited number of reference data, might lead to model overfitting-the model's performance will improve on training (calibration) data, while failing to generalize, i.e., to provide a reasonable performance on independent (previously unseen) data. In order to prevent overfitting of the model (Equation (6)) with multiple regressors, we used a ridge regression [57] with the L 2 (or Tikhonov's) regularization [58,59]: where Y, R, and W are vectors of yields, regressors, and weights of the regression, respectively, and α is the regularization coefficient. The regularization term, α W 2 , prevents coefficients of the regression increasing during the minimization process, and therefore prevents overfitting, and might improve the generalization ability of the model. Such a constraint can either force the coefficients, W, to be "sparse" or reflect prior knowledge, namely correlation between regressors. Table 1 summarizes the regression models used in the study. The optimal regularization parameter, α, is determined during the training (calibration) process by dividing calibration data into training and testing datasets, where α is selected in such a way that it maximizes a model's performance for testing data. Note that validation data were not used for selecting the optimal regularization parameter. Table 1. Summary of the types of regression models used in the study.

Implementation and Performance Evaluation
VIs (NDVI, EVI2, and DVI-Equations (1)-(3)) and SRs (green, red, NIR, SWIR1, SWIR2) were derived from HLS over the study area. Regression models of yields as a function of satellite-derived features were evaluated using a standard set of metrics, such as coefficient of determination (R 2 ), root mean square error (RMSE) between estimate and reference, and relative RMSE (RRMSE): where y est and y re f are estimate and reference values, respectively; y re f = 1 n n i=1 y re f is the average of the reference values, and n is the number of samples.
The developed crop yield models were evaluated with two versions of leave-one-out cross-validations (CVs). Within spatial CV, we iteratively removed all samples for a particular district, which was reserved for testing purposes, and the model was calibrated on the rest of the districts using all three years (2016-2018). Within temporal CV, the model was trained on two years, while the third year was reserved for testing. Spatial CV allowed us to explore the spatial robustness of the empirical models, while temporal CV allowed us to explore the temporal robustness. Figure 6 shows results of the validation for 2016 to 2018, when comparing the areas of winter crops from official statistics and those derived from satellite data, namely Landsat 8 and Sentinel-2, using the GMM approach.

Winter Crop Type Mapping
Overall, there is a good correspondence between satellite-derived and reference areas, with an average R 2 of 0.91 and a bias of +4.3 thousand ha. This bias might be attributed to the confusion between winter crops and early spring cereal crops [60].
The developed crop yield models were evaluated with two versions of leave-one-out crossvalidations (CVs). Within spatial CV, we iteratively removed all samples for a particular district, which was reserved for testing purposes, and the model was calibrated on the rest of the districts using all three years (2016-2018). Within temporal CV, the model was trained on two years, while the third year was reserved for testing. Spatial CV allowed us to explore the spatial robustness of the empirical models, while temporal CV allowed us to explore the temporal robustness. Figure 6 shows results of the validation for 2016 to 2018, when comparing the areas of winter crops from official statistics and those derived from satellite data, namely Landsat 8 and Sentinel-2, using the GMM approach.    All VIs (DVI, EVI2, and NDVI) showed a very good correspondence to a quadratic model (Equation (4)), with an average R 2 > 0.9. Among SRs, the best performance was for NIR, with an average R 2 = 0.88. Other SRs, except green, showed a variable performance, with R 2 ranging from 0.7 to 0.9, with the green spectral band yielding an average R 2 of 0.6. Nevertheless, all VIs and SRs (except green) exhibited good performance with a relative error of fitting less than 10% (for the green band, the error was 10.2%). It should also be noted that the fitting results were stable across multiple years.

Comparison Between VI-Based Yield Models
A comparison of VI-based empirical models for different years is presented in Table 3. The biggest impact of smoothing satellite data for peak-VI models was observed in 2016, where fewest cloud-free data were available compared to 2017 and 2018 (Figure 3a). The coefficient of determination, R 2 , was 0.179, 0.056, and 0.088 for the peak-DVI, EVI2, and NDVI models, respectively, without phenological fitting, when the peak was calculated directly from available HLS datasets; those R 2 values increased to 0.332, 0.282, and 0.485, when calculating the peak p VI (Equation (8)) from phenological fitting. The best performance for 2016 was, however, for the AUC-DVI approach, with an R 2 = 0.588. This approach gave the best performance among all VI-based approaches considered, providing a consistent performance of R 2 = 0.589 and R 2 = 0.608 for 2017 and 2018, respectively (Figure 7). without phenological fitting, when the peak was calculated directly from available HLS datasets; those R 2 values increased to 0.332, 0.282, and 0.485, when calculating the peak (Equation (8)) from phenological fitting. The best performance for 2016 was, however, for the AUC-DVI approach, with an R 2 = 0.588. This approach gave the best performance among all VI-based approaches considered, providing a consistent performance of R 2 = 0.589 and R 2 = 0.608 for 2017 and 2018, respectively (Figure 7). Performance of EVI2 and NDVI varied within 2016 to 2018, yielding a good performance for one year and failing to produce good results for another year. Interestingly, all models consistently improved from 2016 to 2018 thanks, most probably, to the availability of more cloud-free data (Figure 3a). On average, the number of cloud-free observations for winter crops within the March to June period for 2018 was 1.3 and 1.8 times larger compared to 2017 and 2016, respectively. This shows the importance of having dense and high-frequency satellite observations at a moderate spatial resolution (10-30 m) for crop yield studies.

Impact of the Combined Use of Landsat 8 and Sentinel-2 Data
We used the AUC-DVI crop yield model as a benchmark to evaluate the combined use of Landsat 8/OLI and Sentinel-2/MSI (both A and B satellites) data as compared to a single sensor usage. Performance metrics, namely R 2 and RRMSE, are shown in Figure 8. As with the results from the previous subsection, the biggest impact was observed in 2016. With a single sensor usage (OLI or MSI), it was impossible to derive satellite-based features for some districts in 2016, because of not having enough cloud-free observations. In 2016 (March-June period), there were on average 2.6 ± 1.4 cloud-free observations from Landsat 8/OLI and 3.6 ± 1.6 cloud-free observations from Sentinel-2/MSI. With a single sensor usage in 2016, there was a poor correlation observed between crop yields and satellite-derived features, even with phenological fitting. Performance of EVI2 and NDVI varied within 2016 to 2018, yielding a good performance for one year and failing to produce good results for another year. Interestingly, all models consistently improved from 2016 to 2018 thanks, most probably, to the availability of more cloud-free data (Figure 3a). On average, the number of cloud-free observations for winter crops within the March to June period for 2018 was 1.3 and 1.8 times larger compared to 2017 and 2016, respectively. This shows the importance of having dense and high-frequency satellite observations at a moderate spatial resolution (10-30 m) for crop yield studies.

Impact of the Combined Use of Landsat 8 and Sentinel-2 Data
We used the AUC-DVI crop yield model as a benchmark to evaluate the combined use of Landsat 8/OLI and Sentinel-2/MSI (both A and B satellites) data as compared to a single sensor usage. Performance metrics, namely R 2 and RRMSE, are shown in Figure 8. As with the results from the previous subsection, the biggest impact was observed in 2016. With a single sensor usage (OLI or MSI), it was impossible to derive satellite-based features for some districts in 2016, because of not having enough cloud-free observations. In 2016 (March-June period), there were on average 2.6 ± 1.4 cloud-free observations from Landsat 8/OLI and 3.6 ± 1.6 cloud-free observations from Sentinel-2/MSI. With a single sensor usage in 2016, there was a poor correlation observed between crop yields and satellite-derived features, even with phenological fitting. In 2017 and 2018, the combined use of Landsat 8/OLI and Sentinel-2/MSI outperformed a single sensor usage, though less impact was observed in 2018, when both Sentinel-2A and -2B provided a nominal five-day revisit cycle. Though Landsat 8/OLI has a 16-day revisit cycle (8 days for overlapping scenes) compared to 5 days from both Sentinel-2's, it provided a valuable contribution, accounting for, on average, 30% to 40% of total cloud-free observations in the March to June period in the study area. These results highlight the importance of multi-source satellite data usage for crop yield studies.

Comparison Between VI-and SR-Based Yield Models
Results of the cross-validation (CV) of different models incorporating VIs, SRs, and derived AUC In 2017 and 2018, the combined use of Landsat 8/OLI and Sentinel-2/MSI outperformed a single sensor usage, though less impact was observed in 2018, when both Sentinel-2A and -2B provided a nominal five-day revisit cycle. Though Landsat 8/OLI has a 16-day revisit cycle (8 days for overlapping scenes) compared to 5 days from both Sentinel-2's, it provided a valuable contribution, accounting for, on average, 30% to 40% of total cloud-free observations in the March to June period in the study area. These results highlight the importance of multi-source satellite data usage for crop yield studies.

Comparison Between VI-and SR-Based Yield Models
Results of the cross-validation (CV) of different models incorporating VIs, SRs, and derived AUC (Equation (9)) and coefficients a 0 , a 1 , and a 2 of the parabola (Equation (4)) are presented in Table 4. Table 4. Results of the cross-validation of different models. For each group, the best models are marked in italics. Among the single regressor-based models, the best performance was achieved using an NIR spectral band, with an RMSE = 0.236 t/ha (6.3%) and R 2 = 0.63 for temporal CV. This model outperformed VI-and SR-based models. Among VIs, the best performance was achieved using DVI, with an RMSE = 0.257 t/ha (6.9%) and R 2 = 0.6.

Spatial CV
Having two AUC-SR-based regressors did not improve on the model compared to the AUC-NIR based model. The best results were achieved for the AUC-NIR + green model (RMSE = 0.244 t/ha, 6.5%, R 2 = 0.62), which did not improve on the single regressor AUC-NIR; however, it did improve the AUC-VI-based models. This shows the importance of adding separate SR values, rather than using VIs with pre-calculated weights for spectral bands. Surprisingly, the AUC-NIR outperformed the AUC-NIR + red model and other combinations of spectral bands, meaning that AUC-derived features for green, red, and SWIR1 spectral bands did not add much information in terms of winter wheat yields. SWIR2 data were not included in the analysis, because of considerable correlation between SWIR1 and SWIR2.
The addition to the AUC of the coefficients of the quadratic relationship (Equation (4)) as regressors, which define the full dynamics in the multi-dimensional spectral space ( Figure 5), improved estimations of winter wheat yields. In particular, the AUC, coefficients-NIR + red model showed the best overall results, with RMSE = 0.201 t/ha (5.4%) and R 2 = 0.73 (Figure 9). It shows that not only is the AUC important but also the full trajectories itself, defined by coefficients a 0 , a 1 , a 2 . Since this model has eight regressors (AUC, a 0 , a 1 , a 2 for NIR and red), it easily overfits for a small sample size, such as in this study, and, therefore, the role of regularization (Equation (10)) becomes extremely important. The derived optimal regularization coefficient was α = 10 −5 . Without regularization, the AUC, coefficients-NIR + red model yields an RMSE = 0.296 t/ha (7.9%) and R 2 = 0.49 performance, which is at least 30% worse than with regularization. Adding all spectral bands (NIR, red, green, SWIR1) to the AUC, coefficients-based model did not improve the AUC, coefficients-NIR + red model; however, it showed the second best result among all the models considered in the study along with AUC, coefficients-DVI model. It is also worth noting that performance metrics within spatial CV were consistently better than metrics for temporal CV, meaning a better robustness of the models in the spatial context, rather than temporal. important. The derived optimal regularization coefficient was = 10 −5 . Without regularization, the AUC,coefficients-NIR + red model yields an RMSE = 0.296 t/ha (7.9%) and R 2 = 0.49 performance, which is at least 30% worse than with regularization. Adding all spectral bands (NIR, red, green, SWIR1) to the AUC,coefficients-based model did not improve the AUC,coefficients-NIR + red model; however, it showed the second best result among all the models considered in the study along with AUC,coefficients-DVI model. It is also worth noting that performance metrics within spatial CV were consistently better than metrics for temporal CV, meaning a better robustness of the models in the spatial context, rather than temporal. Figure 9. Estimates vs. reference winter wheat yields within the temporal cross-validation for the best AUC,coefficients-NIR + red model.

Discussion
Remote sensing has always been a valuable source of information for crop yield. Previous success has been achieved chiefly by using coarse spatial resolution satellite data, thanks to its high temporal resolution (daily), which is needed to capture the behavior of crop growth. Vegetation indices that incorporate two or more spectral bands with pre-defined weights have usually been applied to correlate with crop yields. With recent advancements in technologies, high temporal resolution satellite data have been available at moderate (10-30 m, combination of free Landsat 8 and Sentinel-2) and high (~3 m, commercial Planet) spatial resolutions. However, most existing studies continue to exploit vegetation indices, disregarding mainly surface reflectance values (SRs) in the available spectral bands. This study aimed to address this question and explored the use of SRs, namely green (~0.560 µ m), red (~0.660 µ m), NIR (~0.865 µ m), and SWIR1 (~1.5 µ m), when estimating winter wheat yields. In order to account for discrepancies in satellite data acquisitions, we applied a quadratic relationship (Equation (4)) to SRs of individual spectral bands against accumulated GDD (Equation (5)). This relationship was previously applied to VIs only and has never been applied to SRs. Our results showed that this quadratic relationship can be effectively applied to SRs as well, with a relative error of fitting ranging from 5% to 12%, and R 2 from 0.59 to 0.91. The best results of phenological fitting for SR values were observed for NIR (average R 2 = 0.88), while the green spectral band showed the worst performance of R 2 = 0.6. While these results show the adequacy of a quadratic function, we would like to emphasize the potential limitations of this fitting model. The use of a parabola implies that the upward phase of crop growth is mirrored in the senescence, which is the case for our study area. However, in the regions, where water is more limiting to crop growth than sunlight, such an assumption might be incorrect. Therefore, when fitting the parabola to the satellitederived VIs or SRs, one should always check the fitting performance, and in case of poor fitting, other Figure 9. Estimates vs. reference winter wheat yields within the temporal cross-validation for the best AUC, coefficients-NIR + red model.

Discussion
Remote sensing has always been a valuable source of information for crop yield. Previous success has been achieved chiefly by using coarse spatial resolution satellite data, thanks to its high temporal resolution (daily), which is needed to capture the behavior of crop growth. Vegetation indices that incorporate two or more spectral bands with pre-defined weights have usually been applied to correlate with crop yields. With recent advancements in technologies, high temporal resolution satellite data have been available at moderate (10-30 m, combination of free Landsat 8 and Sentinel-2) and high (~3 m, commercial Planet) spatial resolutions. However, most existing studies continue to exploit vegetation indices, disregarding mainly surface reflectance values (SRs) in the available spectral bands. This study aimed to address this question and explored the use of SRs, namely green (~0.560 µm), red (~0.660 µm), NIR (~0.865 µm), and SWIR1 (~1.5 µm), when estimating winter wheat yields. In order to account for discrepancies in satellite data acquisitions, we applied a quadratic relationship (Equation (4)) to SRs of individual spectral bands against accumulated GDD (Equation (5)). This relationship was previously applied to VIs only and has never been applied to SRs. Our results showed that this quadratic relationship can be effectively applied to SRs as well, with a relative error of fitting ranging from 5% to 12%, and R 2 from 0.59 to 0.91. The best results of phenological fitting for SR values were observed for NIR (average R 2 = 0.88), while the green spectral band showed the worst performance of R 2 = 0.6. While these results show the adequacy of a quadratic function, we would like to emphasize the potential limitations of this fitting model. The use of a parabola implies that the upward phase of crop growth is mirrored in the senescence, which is the case for our study area. However, in the regions, where water is more limiting to crop growth than sunlight, such an assumption might be incorrect. Therefore, when fitting the parabola to the satellite-derived VIs or SRs, one should always check the fitting performance, and in case of poor fitting, other fitting models should be explored to better capture crop phenology.
By having those quadratic relationships in place for SRs, multiple phenological metrics can be derived. A similar approach with multiple phenological metrics has been successfully applied to cropland and grassland mapping in the USA for multiple years using Landsat and MODIS data, though only EVI was used for fitting [50]. Therefore, phenological metrics, such as area under curve (AUC) (Figure 4) and coefficients of the quadratic relationship (Equation (4)), can be used to estimate crop yields. Since adding multiple regressors into a regression model might lead to model overfitting (especially in the case of a small sample size), we proposed the use of ridge regression [53], which incorporates an L 2 regularization [58,59]. Our best winter wheat yield model was a combination of AUC and constant, linear, and quadratic coefficients of the quadratic model for NIR and red spectral bands, with an RMSE = 0.201 t/ha (5.4%) and R 2 = 0.73 (Figure 9). This model outperformed VI-based models. Adding other spectral bands, namely green and SWIR1, did not improve the model performance for winter wheat yields. These results confirm previous findings for winter wheat yields studies, where NIR and red play the most important role in capturing the biophysical parameters of winter wheat [12,61]. For other crops, for example, corn and soybean, a similar approach can be applied to identify spectral bands that maximize the correlation with crop yields. However, these should be investigated further.
We also investigated-in a numerical way-the efficiency of the combined use of Landsat 8/OLI and Sentinel-2/MSI for winter wheat yield estimation. All three years featured in the study (2016-2018) had different scenarios of satellite data acquisitions, which benefited such an analysis. The most substantial impact was observed in 2016, when only one of the two Sentinel-2 satellites was available, however, was not fully operational (revisit cycle >10 days). Using a single sensor only in 2016, either Landsat 8/OLI or Sentinel-2A/MSI, did not give any reasonable results for the crop yield model, since each satellite contributed with 2.6 ± 1.4 and 3.6 ± 1.6 cloud-free observations in the March to June period, respectively, which was not enough to capture the growth dynamics of winter wheat. The single-sensor-only models provided R 2 = 0.01 and R 2 = 0.34 for Landsat 8 and Sentinel-2A, respectively. A combination of satellites increased the number of observations to 6.2 ± 2.2, and the performance of the winter wheat yield model improved to R 2 = 0.59. For 2017 to 2018, when data from Sentinel-2B started to be available and both Sentinel-2A/B became operational with a combined five-day revisit cycle, that impact was not so dramatic; nevertheless, the combination of Landsat 8 and Sentinel-2 improved the performance of the winter wheat yield model by 18% and 15% in 2017 and 2018, respectively, compared to Sentinel-2A/B only. Though Landsat 8 has a nominal 16-day revisit cycle (8-day for overlapping scenes), which is by far less than two satellites Sentinel-2A/B, it consistently added on average 3.2 ± 1.6 cloud-free observations (during the March to June period), which constituted on average 39% of the total observations. These results further highlight the importance of multi-source data usage in these types of studies.

Conclusions
In this paper, we focused on the estimation of winter wheat yield at a regional scale in Ukraine, using the combined Landsat 8 and Sentinel-2 data at a 30-m spatial resolution. We emphasized the importance of normalizing (smoothing) satellite data in crop yield studies through, for example, phenological fitting. Not doing such normalization may result in a considerable decrease in crop yield model performance. Unlike most existing approaches utilizing vegetation indices only, we derived per-pixel phenological metrics directly from all spectral bands (green, red, NIR, and SWIR), when fitting them against AGDD. The derived phenological metrics were correlated against crop yields using a ridge regression (with L 2 regularization) in order to avoid model overfitting. The best results were achieved for the model with NIR and red spectral bands and derived area under curve (AUC), constant, linear, and quadratic coefficients of the quadratic model. The best model yielded RMSE = 0.201 t/ha (5.4%) and R 2 = 0.73 on cross-validation.
We also analyzed the importance of multi-source data usage for crop yield estimation. For the area considered in this study, both Landsat 8/OLI and Sentinel-2A/B/MSI data were valuable, and the combined use yielded better performance than a single sensor usage for all three years (2016-2018). These results reiterate the need for high-frequency observations at moderate and higher spatial resolutions for crop yield studies.
Further works should be directed towards an investigation of other crops, such as corn and soybeans, as well as identification of efficient ways of incorporating synthetic aperture radar (SAR) observations to improve the frequency of data in areas with high cloud cover.