Quantifying Uncertainty and Bridging the Scaling Gap in the Retrieval of Leaf Area Index by Coupling Sentinel-2 and UAV Observations

Abstract: Leaf area index (LAI) estimates can inform decision-making in crop management. The European Space Agency's Sentinel-2 satellite, with observations in the red-edge spectral region, can monitor crops globally at sub-field spatial resolutions (10–20 m). However, satellite LAI estimates require calibration with ground measurements. Calibration is challenged by spatial heterogeneity and scale mismatches between field and satellite measurements. Unmanned Aerial Vehicles (UAVs), generating high-resolution (cm-scale) LAI estimates, provide intermediary observations that we use here to characterise uncertainty and reduce spatial scaling discrepancies between Sentinel-2 observations and field surveys. We use a novel UAV multispectral sensor that matches Sentinel-2 spectral bands, flown in conjunction with LAI ground measurements. UAV and field surveys were conducted on multiple dates, coinciding with different wheat growth stages, that corresponded to Sentinel-2 overpasses. We compared chlorophyll red-edge index (CIred-edge) maps derived from the Sentinel-2 and UAV platforms. We used Gaussian process regression machine learning to calibrate a UAV model for LAI, based on ground data. Using the UAV LAI, we evaluated a two-stage calibration approach for generating robust LAI estimates from Sentinel-2. The agreement between Sentinel-2 and UAV CIred-edge values increased with growth stage: R² ranged from 0.32 (stem elongation) to 0.75 (milk development). The CIred-edge variance between the two platforms was more comparable later in the growing season due to a more homogeneous and closed wheat canopy. The single-stage Sentinel-2 LAI calibration (i.e., direct calibration from ground measurements) performed poorly (mean R² = 0.29, mean NRMSE = 17%) when compared to the two-stage calibration using the UAV data (mean R² = 0.88, mean NRMSE = 8%). The two-stage approach reduced both errors and biases by more than 50%.
By upscaling ground measurements and providing more representative model training samples, UAV observations provide an effective and viable means of enhancing Sentinel-2 wheat LAI retrievals. We anticipate that our UAV calibration approach to resolving spatial heterogeneity would enhance the retrieval accuracy of LAI and additional biophysical variables for other arable crop types and a broader range of vegetation cover types.


Introduction
The agricultural sector is under increasing pressure to manage resources more sustainably, whilst increasing production in order to meet the demands of a growing population [1,2]. These challenges satellites. Matese et al. [39] included a statistical comparison between Normalised Difference Vegetation Index (NDVI) maps derived from UAV and high-resolution (6.5 m) RapidEye satellite observations. Although agreement was achieved between the UAV and satellite data, NDVI uses the visible red spectral region, which is strongly absorbed by chlorophyll and saturates at intermediate to high chlorophyll concentrations [40]. There is no recorded evaluation of co-located satellite and UAV red-edge measurements, which avoid these saturation issues.
Calibrating LAI models from Sentinel-2 red-edge 20-m spectral observations using ground LAI measurements is challenged by major scale mismatches and in-situ sampling problems. The mismatch arises because typical in-situ measurements of LAI using gap-fraction approaches (e.g., [22]) only record the canopy structure close to the user, i.e., within a radius of ~1 m, an order of magnitude smaller than the satellite pixel. The heterogeneity in crop LAI at a scale of 2 to 3 m would therefore introduce potentially strong biases into a calibration [41]. Accessing multiple points within a field in order to collect a sufficient number of LAI measurements to capture the sub-pixel variability would be impossible without the observer causing some damage to the crop. To avoid crop damage, observers tend to collect data around the edges of fields or along farm vehicular access tracks (referred to hereafter as tramlines). Bias problems in retrievals will be amplified by this non-random sampling of crop variability.
Thus, recent research has aimed to use UAV sensors to bridge this scaling gap between ground and EO data, including the calibration of EO-derived VIs using UAVs (e.g., [38,42]). However, few studies have applied a UAV calibration to retrieve measurable biophysical parameters describing the crop canopy, such as LAI. Another weakness in these past studies is that the UAV sensors measured different spectra to the satellite. UAV sensors should have the same spectral bands as those of satellite-based sensors for robust scaling studies [43].
Here, for the first time, we show how a UAV multispectral sensor with bands matching those of Sentinel-2 can be used to calibrate Sentinel-2 retrievals of crop LAI, with a robust characterisation of error. In conjunction with LAI ground measurements at multiple field sites and growth stages, we deployed a UAV-mounted multispectral sensor which has the same red-edge and near-infrared central wavelengths as Sentinel-2. These UAV sensor data, once calibrated, allow upscaling of ground LAI measurements over larger spatial extents at very high spatial resolutions (around 5 cm). We spatially aggregated these LAI data to allow a direct spatial match to observations retrieved from Sentinel-2 at 20 m. Using a machine learning approach, we further evaluated the performance of a novel two-stage Sentinel-2 LAI retrieval calibration that incorporates these up-scaled UAV LAI estimates. Based on UAV multispectral observations, Revill et al. [26] performed a robust analysis of Sentinel-2 bands for retrieving wheat canopy variables, identifying the sensor's red-edge 705-nm and near-infrared 865-nm bands as being the most informative for estimating wheat LAI. This present study extends these findings by investigating the error associated with the retrieval of wheat LAI from Sentinel-2 at the native resolution of these spectral bands (i.e., 20 m).
We performed a series of experiments in order to answer the following research questions:
1. How do maps of red-edge VIs compare between high-resolution UAV imagery and coarser-spatial-resolution Sentinel-2 data throughout the winter wheat season?
2. How do observation scale and scale mismatch (i.e., between in-situ data and satellite resolution) influence the accuracy of wheat LAI estimates?
3. To what extent can models calibrated with high-spatial-resolution UAV-derived winter wheat LAI estimates reduce uncertainty and bias in the retrieval of LAI from Sentinel-2 data?
We discuss the potential for UAVs to support the improved calibration of vegetation parameters on the basis of this research.

Field Sites
This research involved acquiring multi-temporal data within multiple winter wheat (Triticum aestivum L.) fields located on two commercially managed farms during the 2017/2018 growing season. These two farms (referred to hereafter as Farm 1 and Farm 2) are within UK Ordnance Survey (OSGB) 10-km-grid squares with references of NT51 and NT84 (centred around latitude/longitude of 55.42/−2.71 and 55.70/−2.24, respectively), located in the central and southern regions of Scotland, respectively (Figure 1). The wheat fields were sown on the 19th and 29th September 2017 and harvested on the 23rd and 20th August 2018 for Farms 1 and 2, respectively, at a seed rate of between 280 to 350 seeds/m². The fields on Farm 1 had a predominantly sandy clay loam soil texture, whereas, for Farm 2, the soil texture was sandy silt loam.

Figure 1. Locations of two farms (red dots) and map insets highlighting the field sites (red border) and within-field areas of interest (black cross-hatching) where field campaign data, including ground measurement and UAV surveys, were carried out.

Ground Measurements
Ground measurements were recorded on three dates for each farm (Table 1) at regular points that were within the field areas of interest and were positioned using a standard handheld Global Navigation Satellite System (GNSS). Across the farms and growing season, a total of 80 sets of measurements were recorded and included LAI and growth stage observations in accordance with the Zadoks decimal code [44]. For each of the 80 measurement points, three replicate LAI measurements were made within a two-metre square using a SunScan device (Delta-T Devices, Cambridge), which were then combined to provide an average. SunScan LAI measurements typically include all plant components, including the wheat leaves, stems, and ears. We therefore applied a bias correction to the LAI values, detailed in Revill et al. [26], which was based on a regression analysis between SunScan and leaf area meter measurements from destructive sampling.
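The replicate averaging and linear bias correction described above can be sketched as follows; the slope and intercept here are placeholders, since the actual coefficients come from the SunScan versus destructive-sampling regression reported in Revill et al. [26]:

```python
import numpy as np

# Three replicate SunScan LAI readings per measurement point (example values).
replicates = np.array([
    [3.1, 3.4, 3.2],   # point 1
    [1.8, 2.0, 1.9],   # point 2
])

# Average the replicates to one LAI value per point, as in the field protocol.
lai_mean = replicates.mean(axis=1)

# Hypothetical linear bias correction; SLOPE and INTERCEPT are placeholders
# for the SunScan-vs-leaf-area-meter regression coefficients in [26].
SLOPE, INTERCEPT = 0.9, -0.1
lai_corrected = SLOPE * lai_mean + INTERCEPT
```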

UAV Surveys
UAV flights covering the within-field areas of interest for each farm site (Figure 1) were carried out during the three LAI ground measurement dates in each season. The UAV survey consisted of flying a hexa-copter (DJI Matrice 600) at a height of 100 m above ground level and a constant speed of 3.1 m s⁻¹. The platform was equipped with a real-time kinematic GNSS and was flown autonomously using a pre-planned flight mission, created using the DJI Ground Station Pro software, for each of the within-field areas, in order to ensure consistent and comparable multi-date coverage.
Multispectral imagery was acquired throughout the UAV flights using a MAIA camera system (SAL Engineering/EOPTIS), which is composed of an array of nine 1.2-megapixel monochromatic sensors that collect data simultaneously using a global shutter. The nine sensors of the MAIA camera have band-pass filters with the same central wavelengths as the first nine bands (i.e., bands 1 to 8A) of the ESA Sentinel-2 Multispectral Instrument [45]. During the UAV flights, these sensors imaged the field areas from a nadir position that was maintained using a 3-axis stabilisation gimbal (DJI Ronin-MX). In order to ensure complete coverage of the flying area, successive imagery along the flight lines was acquired with a 20% to 30% overlap. At this flying altitude, the MAIA camera system had a ground sampling distance of approximately 4.7 cm and an image footprint of approximately 60 × 45 m.

UAV Imagery Post-Processing
Image processing, including radial and radiometric calibration, was applied to the raw MAIA/Sentinel-2 imagery using the MultiCam Stitcher Pro software (v.1.1.8). The multiband images were first co-registered in order to correct the offsets between the nine sensors on the camera system and, thus, to provide a series of single images with multiple spectral bands. The radial calibration then involved a per-pixel correction for vignetting (i.e., the effect of a reduction in illumination from the centre to the edge of the image).
To calculate ground-leaving reflectance, two identical white ground targets were placed at opposite ends of the UAV flight extents. Each target comprised a 1.0-m² panel that was made of a Lambertian-reflectant polyvinyl chloride (PVC)-coated material ('Odyssey' trademark material, Kayospruce Ltd.), which has previously been used in ESA remote-sensing fieldwork campaigns (see [46]). At the beginning of each UAV flight, the spectral reflectance of these targets was measured using an Analytical Spectral Devices (ASD) FieldSpec Pro. This target reflectance was converted to absolute reflectance, following the guidelines outlined by the National Environment Research Council (NERC) Field Spectroscopy Facility [47]. These data were then convolved with the spectral response of each MAIA/Sentinel-2 band to correct the reflectance digital number (DN) recorded at pixels for each of the MAIA/Sentinel-2 spectral bands to absolute reflectance [45]:

DN_i = DN × (Rt_i / Pt_i)

where, for each MAIA/Sentinel-2 band i, the corrected value, DN_i, is calculated by scaling the raw DN by the ratio of the reference target spectrum, Rt_i, to the same value measured by the camera, Pt_i. We applied corrections to the recorded position of each MAIA/Sentinel-2 image by matching the GNSS timestamp to that of the UAV platform. Consequently, with the timestamps matched, we were able to use the more precise real-time kinematic GNSS position recorded in the UAV flight log. The collected images were then loaded into Agisoft PhotoScan Professional (v.1.3.3), where a photogrammetric workflow was applied to align and produce an orthomosaic of the multiband images covering each of the field sites, which was then spatially resampled to a ground sampling distance of 5 cm.
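The per-band DN-to-reflectance correction amounts to an empirical line through the origin, which can be sketched as below; the target reflectance and camera DN values are illustrative, not measured values from the campaign:

```python
import numpy as np

# Raw digital numbers for one MAIA/Sentinel-2 band (toy 2x2 image).
dn = np.array([[1200.0, 1350.0],
               [ 900.0, 1100.0]])

# Reference target reflectance for this band (Rt_i), from the ASD FieldSpec
# spectrum convolved with the band's spectral response, and the target's
# value as recorded by the camera in the same band (Pt_i). Values are
# illustrative placeholders.
rt_i = 0.85    # absolute reflectance of the white target
pt_i = 1500.0  # camera DN over the target

# Scale every pixel by the target ratio to convert DN to absolute
# surface reflectance, as in the correction equation above.
reflectance = dn * (rt_i / pt_i)
```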

Sentinel-2 Data
Data from the multispectral instrument sensor on-board the two ESA Sentinel-2 satellite platforms (i.e., Sentinel-2A and -2B) was downloaded free of charge from the Copernicus Open Access Hub [48]. After cloud and cloud shadow screening, we identified six Sentinel-2 images for further analysis that were acquired within ±5 days of the UAV flight dates (Table 1). These images had already been atmospherically corrected by ESA, from Level-1C top-of-atmosphere to Level-2A bottom-of-atmosphere reflectance products, using the Sen2Cor algorithm [49]. These images were also available in cartographic geometry (UTM/WGS84 projection).
Although the spatial resolution of the Sentinel-2 near-infrared and visible bands is 10 m, observations in the red-edge band centred on 705 nm, which have been demonstrated to improve the retrieval of wheat variables [26], are available at 20-m resolution. In our analysis, we thus use only the Sentinel-2 Level-2A output image products that were provided by ESA at a ground sampling distance of 20 m.

Characterising Sentinel-2 Sub-Pixel Variability
We quantified the average sub-pixel variability of the Sentinel-2 dataset using the temporally and spatially corresponding multispectral high-resolution UAV data. Using QGIS (v.2.18), we first generated a grid of 20 × 20 m cells by vectorising the 20-m resolution Sentinel-2 705-nm red-edge band data, subset to match the extent of the UAV coverage over the field sites for each observation date. For each of the six observation dates, we then calculated the red-edge chlorophyll index (CIred-edge; [40]) for the UAV and Sentinel-2 datasets:

CIred-edge = (R_865 / R_705) − 1

where R_x refers to the reflectance measured by either Sentinel-2 or the UAV sensor at wavelength x nm. Using the CIred-edge in this analysis allowed us to distinguish between areas of low and high vegetation cover. Using the 20-m vector grid, we sampled the variability of the UAV CIred-edge data for all of the overlapping Sentinel-2 pixels (Figure 2). For each farm and observation date, summary statistics quantifying the differences between the two sensors were reported. These metrics included the mean, standard deviation and coefficient of variation (CV), calculated as the ratio of the standard deviation to the mean CIred-edge, along with the coefficient of determination (R²) and the normalised root-mean-square error (NRMSE) describing the fit of Sentinel-2 to UAV values:

NRMSE = 100% × sqrt( (1/n) Σᵢ (Eᵢ − Oᵢ)² ) / (O_max − O_min)

where Eᵢ and Oᵢ represent the Sentinel-2 and UAV CIred-edge values, respectively, and n is the number of 20-m vector grid cells included in the comparison between the two sensors.
Figure 2. Sample CIred-edge image over a field on Farm 2, shown for the high-resolution (5 cm) UAV CIred-edge imagery (acquired on 23 May 2018) with an overlapping sampling grid generated from the Sentinel-2 20-m pixels.
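The per-cell index and variability metrics can be sketched as below, for a single 20-m grid cell of fine-scale UAV pixels. The band pairing in the index and the range normalisation of the NRMSE follow the reconstructions above and should be read as assumptions, since both have common variants:

```python
import numpy as np

def ci_red_edge(r_nir, r_705):
    """Red-edge chlorophyll index: NIR over 705-nm red-edge reflectance, minus 1."""
    return r_nir / r_705 - 1.0

def nrmse(est, obs):
    """RMSE between paired values, normalised by the observed range, in %.
    (Range normalisation is an assumption; mean normalisation is also common.)"""
    rmse = np.sqrt(np.mean((est - obs) ** 2))
    return 100.0 * rmse / (obs.max() - obs.min())

# Toy example: a 4x4 block of fine-scale UAV pixels falling inside one
# 20-m Sentinel-2 grid cell.
rng = np.random.default_rng(0)
uav_nir = rng.uniform(0.35, 0.45, (4, 4))
uav_705 = rng.uniform(0.10, 0.15, (4, 4))
uav_ci = ci_red_edge(uav_nir, uav_705)

# Sub-pixel variability metrics for the cell.
mean_ci = uav_ci.mean()
sd_ci = uav_ci.std(ddof=1)
cv_pct = 100.0 * sd_ci / mean_ci
```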

Impacts of Scale on LAI Retrievals
LAI was estimated from the UAV data through the calibration of a Gaussian process regression (GPR) machine learning algorithm [50] using the ground measurements. GPR provides a non-linear, non-parametric and probabilistic approach to establishing relationships between inputs (i.e., reflectance values) and outputs (i.e., LAI ground measurements), allowing for both the predictive mean and variance to be estimated [51,52]. Han et al. [27] demonstrated that GPR performed favourably for the estimation of wheat yields when compared to alternative machine learning algorithms. Furthermore, when compared to ground measurements, Revill et al. [26] achieved high agreement with GPR-modelled wheat LAI estimates retrieved using the same UAV multispectral sensor used in this present study.
The GPR modelling approach was implemented using the Machine Learning Regression Algorithms toolbox (MLRA v.1.24; [53]) available in the ARTMO (Automated Radiative Transfer Models Operator v.3.26; [54]) software package. In order to quantify the impact of spatial scale on the LAI retrieval accuracy, two GPR model calibration approaches were applied using the combined LAI dataset consisting of all 80 measurements that were recorded across the farms and measurement dates. Revill et al. [26] identified the 705-nm and 865-nm spectral bands as being the most informative for estimating LAI; we therefore only include these two Sentinel-2 bands within the GPR model calibration. The first calibration approach involved training the GPR model on the LAI measurements using the average UAV 705-nm and 865-nm reflectance observed within a one-metre radius of each ground measurement point location. This radius accounted for the locations of the three LAI replicate measurements. The second GPR calibration involved re-training the retrieval algorithm with the average of the 705-nm and 865-nm reflectance within a 20 × 20 m square with the ground measurement point at the centroid. Through the spatial aggregation involved in this second calibration (i.e., from 2 m to 20 m), we thus mimic LAI that could be retrieved from 20-m scale Sentinel-2 observations. Both calibration approaches were applied to a combined dataset consisting of all multi-date measurements acquired from both farms, and the retrieved LAI was directly compared to the corresponding point ground measurements.
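A GPR calibration of this kind can be sketched with scikit-learn, as below. This is a minimal stand-in for the ARTMO MLRA implementation the study actually used: the kernel choice (RBF plus a noise term) and all data are assumptions, and the synthetic LAI is generated only so the example runs end to end.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy training data: per-point mean reflectance in the 705-nm and 865-nm
# bands (averaged within 1 m of each ground point) and measured LAI.
rng = np.random.default_rng(1)
n = 40
refl = np.column_stack([rng.uniform(0.08, 0.20, n),    # 705 nm
                        rng.uniform(0.30, 0.50, n)])   # 865 nm
lai = 8.0 * (refl[:, 1] - refl[:, 0]) + rng.normal(0, 0.1, n)  # synthetic LAI

# GPR with an RBF kernel plus a white-noise term (kernel choice assumed).
gpr = GaussianProcessRegressor(
    kernel=RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01),
    normalize_y=True,
).fit(refl, lai)

# GPR returns both a predictive mean and a standard deviation, giving the
# per-prediction retrieval uncertainty mentioned in the text.
lai_pred, lai_sd = gpr.predict(refl[:5], return_std=True)
```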

Sentinel-2 LAI Retrieval Approaches
For the retrieval of LAI from Sentinel-2 observations, two GPR model calibration approaches were investigated: single-stage and two-stage calibrations (Figure 3). In order to perform an independent validation of both GPR model approaches, an equal (i.e., 40/40) split was randomly applied to the 80 ground measurements to generate training and validation datasets. The single-stage calibration involved using the training measurements to fit the retrieval model directly to the overlapping Sentinel-2 20-m reflectance data. The two-stage calibration approach involved, first, calibrating the fine-scale UAV data with the training dataset (see Section 2.3.2); the resultant fine-scale UAV LAI estimates were then spatially averaged to the overlapping Sentinel-2 20-m pixel grid cells. Second, the spatially aggregated LAI was used to calibrate the LAI retrieval algorithm from the Sentinel-2 spectral bands (i.e., 705 and 865 nm) used in this analysis, sampled at 20 m. The two calibration approaches were then applied to the Sentinel-2 observations covering measurement points included in the validation dataset and compared to the spatially aggregated (20 m) UAV LAI data, which was also estimated using the validation dataset.
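The two-stage calibration can be sketched as the following pipeline. All arrays are synthetic placeholders standing in for the UAV reflectance mosaics and Sentinel-2 pixels, and scikit-learn's GPR is used as an assumed stand-in for the ARTMO MLRA toolbox; the structure (fit at fine scale, aggregate to 20 m, re-fit at satellite scale) is the point:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gpr(x, y):
    # Shared helper: GPR with an RBF + noise kernel (kernel choice assumed).
    kernel = RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01)
    return GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x, y)

rng = np.random.default_rng(2)

# Stage 1: calibrate fine-scale UAV reflectance (705 / 865 nm) against the
# 40 training ground measurements, then predict LAI for every UAV pixel.
train_refl = rng.uniform(0.1, 0.5, (40, 2))
train_lai = 6.0 * (train_refl[:, 1] - train_refl[:, 0]) + rng.normal(0, 0.1, 40)
uav_model = fit_gpr(train_refl, train_lai)

uav_refl = rng.uniform(0.1, 0.5, (1600, 2))    # fine-scale UAV pixels
uav_lai = uav_model.predict(uav_refl)

# Aggregate the fine-scale LAI to the Sentinel-2 20-m grid: here, 16 cells
# of 100 UAV pixels each stand in for the vectorised pixel grid.
cell_lai = uav_lai.reshape(16, 100).mean(axis=1)

# Stage 2: calibrate the Sentinel-2 retrieval against the aggregated UAV LAI.
s2_refl = rng.uniform(0.1, 0.5, (16, 2))       # 20-m Sentinel-2 reflectance
s2_model = fit_gpr(s2_refl, cell_lai)
s2_lai = s2_model.predict(s2_refl)
```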

Figure 3. Overview of the approaches for calibrating GPR models used for retrieving LAI from Sentinel-2 data, including the single-stage calibration based on ground measurements and the indirect (two-stage) calibration involving, first, upscaling ground measurements using high-resolution UAV data and, second, calibrating the Sentinel-2 retrieval model based on the UAV LAI data. Validation of both approaches was performed using UAV data averaged per Sentinel-2 pixel.

Multiscale Chlorophyll Index Inter-Comparisons
The mean CIred-edge values derived from the UAV data were always higher than those from Sentinel-2 (mean NRMSE = 27%; Table 2). From a linear fit comparing Sentinel-2 to UAV values, for both farms, the R² generally increased with developmental stage. The discrepancy between UAV and Sentinel-2 CIred-edge was typically larger for the earlier growth stages when compared to observations later in the season. The variance in per-farm CIred-edge values from both sensors also narrowed with increasing growth stage. For the majority of growth stages, however, the variation in values derived from the UAV sensor (mean CV = 17%) was broader than that of Sentinel-2 (mean CV = 12%). The mean standard deviation of the UAV CIred-edge values (1.07) was nearly double that of the Sentinel-2 data (0.64). The difference in variance can also be seen from a probability distribution function of values from the two sensors for Farm 1 (Figure 4).

Table 2. Summary statistics for CIred-edge, derived from samples extracted from a Sentinel-2 20-m pixel grid, quantifying multi-temporal differences between coarse-scale Sentinel-2 values and fine-scale UAV values averaged per Sentinel-2 grid cell. Metrics, reported per farm and growth stage (GS), include the number of pixels sampled, mean value, standard deviation (SD), coefficient of variation (CV), coefficient of determination (R²) and normalised root-mean-square error (NRMSE).


Linking UAV LAI Retrieval Accuracies to Spatial Scale
There was evidence of substantial LAI heterogeneity from comparing UAV retrievals of LAI at the scale of the in-situ data (2 m). The impact of spatial scale on LAI retrievals was investigated by comparing the UAV-estimated LAI at 2-m and at 20-m spatial resolutions to in-situ measurements recorded on the ground (Figure 5). From a linear fit, the correlation with ground measurements of the LAI retrieval model calibrated with the 20-m resolution data was poor (R² = 0.56) when compared to that of the calibration with the 2-m spectral data (R² = 0.80). The overall error between the measured and retrieved LAI was more similar for the two spatial resolutions; the NRMSEs were 11% and 13% for the 2-m and 20-m calibrations, respectively. However, bias was clearly greater when using the 20-m resolution data (particularly for LAI measurements > 3 m² m⁻²), with a regression slope of 0.57, compared to 0.80 for the 2-m calibration.

Sentinel-2 LAI Retrieval Calibration Approaches
The two-stage LAI calibration approach for Sentinel-2 had a much better agreement with the validation dataset (R² = 0.88) than the single-stage calibration (R² = 0.29) (Figure 6). The two-stage approach could, thus, explain ~3 times as much of the variation among 20-m pixels. The error of the single-stage calibrated model (NRMSE = 17%) was more than twice that of the two-stage calibration (NRMSE = 8%). The single-stage calibration also had a large negative bias (regression slope = 0.35) when compared to the two-stage model (0.81). The single-stage calibration, thus, has major problems in identifying changes in LAI, and picking out high and low values accurately.

The distribution of observed-minus-estimated LAI bias values for the single-stage calibration was broader and slightly positively skewed when compared to that of the two-stage calibration (Figure 7), using the same independent calibration/validation datasets. The two-stage calibration bias had a narrower and more symmetrical distribution. The single-stage calibration had a mean LAI error of 1.0, compared to 0.45 for the two-stage process. Furthermore, the single-stage approach produced very poorly constrained estimates in some cases, shown by the long tails of the error distribution.
The Sentinel-2 two-stage calibration was capable of resolving the within-field spatial variability in LAI, whereas the single-stage calibration captured less than half of this variability. We compared LAI maps of sample fields derived from the different calibration approaches to those from the fine-scale UAV LAI estimates (Figure 8). The variability captured by the two-stage approach was visually comparable to the fine-scale UAV LAI estimates. In particular, the two-stage approach better resolved areas of high and low LAI. The single-stage calibration, on the other hand, appeared to underestimate LAI at higher values.
The bias and distribution of LAI estimates derived from the two calibration approaches for the field samples, in comparison with the UAV LAI data, can also be seen in the distribution plots (Figure 8). The shape of the two-stage LAI distribution matched the UAV estimates much more closely than that of the single-stage calibration. The single-stage calibration LAI estimates also had a much narrower distribution, which generally peaked at LAI values lower than the two-stage calibration estimates, indicating a consistent and strong bias.

Discussion
Through a series of experiments using UAV data linked to ground measurements, we have analysed and tested approaches for minimising spatial scaling biases and uncertainties when monitoring wheat using Sentinel-2 20-m resolution data. We quantified variation in the spectral reflectance of fields at scales below the resolution of a Sentinel-2 pixel and found that heterogeneity within crops is an important consideration for calibrating satellite sensors. We then showed that UAV data could provide a robust bridge from in-situ data resolution to satellite pixel scale. We developed a robust calibration approach to account for and correct this heterogeneity, and to effectively quantify retrieval error for LAI. This approach should be valid across other vegetation types for producing higher-quality biophysical retrievals.

How Do UAV and Sentinel-2 Chlorophyll Index Maps Compare across Multiple Growth Stages?
When comparing the CI red-edge values derived from the Sentinel-2 and UAV imagery, we found that the correlation between the values from the two platforms generally increased as crop growth stage advanced. During early phenological periods, it is likely that the influence of bare soil on the reflectance spectra would have been greater, with the wheat canopy being highly heterogeneous across the field. In particular, the earlier growth stages noted for both farms (i.e., growth stages 31 and 39) correspond to the start of significant periods of canopy development, including stem elongation and the development of yield-forming leaves [11]. During this stage, when both crop ground cover and fractional light interception increase linearly with LAI [55], farmers target their main applications of nitrogen fertiliser. When compared to the relatively coarse-scale (20 m) Sentinel-2 observations, the 5-cm spatial resolution of the UAV dataset was sufficient to resolve the small discontinuities in the canopy. The distribution of values across the field was also much narrower for Sentinel-2 when compared to that of the UAV, further suggesting that these variations in canopy cover were not detected by the satellite sensor.
By later growth stages (i.e., growth stages 51 to 77), the leaf canopy is closed, with the crop entering the ear emergence, flowering and grain-filling phases. Correspondingly, at these growth stages, the spectral reflectance and spatial variability of the canopy across the field are generally more homogeneous. Increased uniformity in the wheat canopy was also evident in the statistics, where the variance and distribution of CI red-edge values estimated from the Sentinel-2 observations were comparable to those of the UAV. Fawcett et al. [56] also demonstrated a similarly high agreement between UAV and Sentinel-2 CI red-edge values during the later growth stages for maize crops.
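The CI red-edge values compared here follow the standard chlorophyll red-edge index formulation, CI red-edge = (R_NIR / R_red-edge) - 1. A minimal sketch; the choice of Sentinel-2 band 7 (783 nm) and band 5 (705 nm) as the NIR and red-edge inputs is a common convention and our assumption, not stated in this excerpt:

```python
import numpy as np

def ci_red_edge(nir, red_edge):
    """Chlorophyll red-edge index: CIred-edge = (R_NIR / R_red-edge) - 1.
    For Sentinel-2 this is commonly computed from band 7 (783 nm, NIR)
    and band 5 (705 nm, red edge); the spectrally matched UAV bands
    would be used in the same way."""
    nir = np.asarray(nir, dtype=float)
    red_edge = np.asarray(red_edge, dtype=float)
    return nir / red_edge - 1.0
```

The function accepts scalars or whole reflectance rasters, so the same code yields per-pixel CI red-edge maps for both platforms.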

Canopy Heterogeneity and Impact of Scale on LAI Retrieval Accuracies
Our high-resolution data confirmed that LAI was highly heterogeneous at scales below the satellite resolution (Table 2, Figure 5). The linear fit between the UAV and ground LAI was significantly better when sampled at two metres than at 20 m. The performance of the GPR LAI retrieval model when calibrated with the 2-m resolution data (R² = 0.8 and NRMSE = 11%) is also comparable to other studies with this sensor on wheat trial plots [26], where Revill et al. reported an R² of 0.84 and NRMSE of 9%. Together, these results suggest that a resolution of 2 m is required to capture the variability in LAI measurements within the fields sampled. At coarser resolutions, the UAV retrieval of LAI became more biased. When investigating the retrieval of winter wheat LAI from EO data, Reichenau et al. [41] showed that even averaging data at a spatial resolution of 5 m can significantly reduce the variability in LAI when compared to ground measurements.
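The loss of within-field variability with coarser sampling can be illustrated by block-averaging a fine-resolution LAI raster to a coarser grid. The gamma-distributed synthetic field below is purely illustrative and not derived from the study data:

```python
import numpy as np

def block_average(lai, factor):
    """Aggregate a fine-resolution LAI raster to a coarser grid by
    averaging non-overlapping factor x factor blocks (edges cropped)."""
    rows = (lai.shape[0] // factor) * factor
    cols = (lai.shape[1] // factor) * factor
    trimmed = lai[:rows, :cols]
    return trimmed.reshape(rows // factor, factor,
                           cols // factor, factor).mean(axis=(1, 3))

# Aggregating a heterogeneous field suppresses within-field variance:
rng = np.random.default_rng(0)
fine = rng.gamma(shape=4.0, scale=1.0, size=(400, 400))  # synthetic fine-scale LAI
coarse = block_average(fine, 40)  # e.g. 5-cm pixels averaged to a 2-m grid
assert coarse.std() < fine.std()
```

This mirrors the effect reported above: the coarser the aggregation, the more within-field LAI variability is smoothed away relative to ground measurements.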

UAV Observations Reduced Uncertainty and Biases in Sentinel-2 LAI Retrievals
The two-stage calibration process for Sentinel-2 LAI estimates, which incorporates ground measurements upscaled via the UAV, demonstrated a clear improvement over a single-stage process without UAV data. The two-stage process reduced the error in LAI retrievals by >50% when validated using independent ground data. As a key variable for quantifying vegetation canopy structure, accurate and consistent LAI estimates can be used directly to determine crop phenology and status (e.g., [12,57]). Past research has also demonstrated the value of LAI data for updating state variables in process-based crop models in order to improve the spatial and temporal simulation of crop dynamics, including yield and net carbon land-atmosphere exchanges (e.g., [14][15][16][17][18]).
The enhancement in LAI retrieval achieved with the two-stage GPR model calibration approach can be attributed both to an increase in spectral purity and to the volume of samples used to train the GPR LAI retrieval models. With regard to spectral sample purity, when the model is trained directly on ground measurements (i.e., the single-stage calibration), LAI measurement points are typically constrained to the vicinity of the field tramlines (i.e., access tracks) owing to equipment access and efforts to minimise crop damage, which biases the training dataset. Indeed, a negative bias in LAI was clear in the single-stage calibration results (Figures 6 and 7). With the two-stage approach, however, the fine-scale UAV observations can resolve the spectral reflectance around and beyond the measurement points more precisely. UAV sampling reduced the influence of tramlines by covering the broader field and avoiding the access constraints of ground sampling. These more representative spectral samples can be used to estimate LAI across the spatial extents of the UAV images, including areas that would not otherwise be ground-sampled, and then aggregated to match the resolution of the co-registered Sentinel-2 pixel grid.
In comparison to alternative approaches for retrieving crop variables from EO spectral reflectance, the performance of machine learning methods, including GPR, can be more sensitive to the number of ground measurements used in the training dataset [28,29]. Furthermore, unless prior knowledge of the field is available, for example through the identification of management zones [58][59][60], the within-field scale variability in in-situ LAI is typically insufficiently sampled. Our two-stage approach addresses these limitations in machine learning LAI retrieval techniques by producing accurate, high-resolution maps of upscaled ground measurements. These upscaled measurements, which cover multiple overlapping Sentinel-2 pixels, can then be sampled and used to produce a larger training dataset when calibrating the Sentinel-2 retrieval algorithm. For instance, the numbers of Sentinel-2 pixels that overlapped UAV imagery on Farms 1 and 2 were 80 and 161, respectively, each of which could be used to train the LAI retrieval model.
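The two-stage workflow described above can be sketched with scikit-learn's Gaussian process regression. All arrays below are synthetic stand-ins for the real reflectance and LAI data, and the RBF-plus-noise kernel is an illustrative choice rather than the kernel used in this study:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)

# Synthetic stand-ins: reflectance in the matched spectral bands at the
# ground plots, plus hand-held LAI measurements at those plots.
n_plots, n_bands = 30, 5
uav_plot_spectra = rng.uniform(0.05, 0.5, size=(n_plots, n_bands))
ground_lai = 6.0 * uav_plot_spectra[:, -1] + rng.normal(0, 0.1, n_plots)

# Stage 1: calibrate a UAV LAI model on the ground measurements.
stage1 = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
stage1.fit(uav_plot_spectra, ground_lai)

# Predict LAI for every fine-scale UAV pixel, then average the predictions
# within each co-registered 20-m Sentinel-2 pixel (here: 100 per pixel).
n_s2, per_pixel = 50, 100
uav_scene = rng.uniform(0.05, 0.5, size=(n_s2 * per_pixel, n_bands))
uav_lai = stage1.predict(uav_scene)
lai_20m = uav_lai.reshape(n_s2, per_pixel).mean(axis=1)

# Stage 2: calibrate the Sentinel-2 retrieval on the upscaled UAV LAI.
s2_reflectance = uav_scene.reshape(n_s2, per_pixel, n_bands).mean(axis=1)
stage2 = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
stage2.fit(s2_reflectance, lai_20m)
lai_map, lai_sd = stage2.predict(s2_reflectance, return_std=True)
```

The key point of the design is the second fit: because every overlapping 20-m pixel supplies a training sample, the Sentinel-2 model is calibrated on far more points (80 and 161 here per farm) than the handful of ground plots available to a single-stage calibration, and the GPR's predictive standard deviation provides a per-pixel retrieval uncertainty.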

Research Implications and Limitations
Calibrating Sentinel-2 data using fine-scale UAV estimates can produce LAI datasets that are less biased and better capture the within-field variability when compared to ground measurements. In an operational context, our two-stage UAV calibration approach could, for instance, be performed over a sample of within-field areas that capture the variability in crop growth. The resulting calibration could then be applied to larger extents of the Sentinel-2 dataset in order to generate improved estimates of wheat LAI at the farm and landscape scales.
In our study, for the first time, we have deployed a UAV sensor with Sentinel-2 matching band-pass filters. Past studies comparing UAV to satellite-scale observations (e.g., [38,56]) have used UAV instruments with spectral bands different to those of the satellite sensor. For instance, research in Dash et al. [61], which compared NDVI derived from a UAV to that from the RapidEye satellite sensor, attributed the disparities in NDVI to differences in sensor spectral properties. In our research, by matching spectral observations, we reduce errors due to band-positioning offsets. This UAV instrument, linked to the GPR modelling, provided a unique opportunity to scale up the ground measurements, whilst tracking the uncertainty of hand-held instruments, to the spatial extents of the UAV flights, which could then be compared to the Sentinel-2 data. We recommend that future studies take advantage of such band-matching approaches.
Our study is limited to the use of spatially and temporally concurrent data (i.e., ground, UAV and Sentinel-2 acquisitions) for only one wheat growing season across two farm sites. Calibration and validation across additional growing seasons and sites would be of value to further test the performance of our two-stage calibration approach. While the calibration itself is site- and crop-specific, we would not expect our approach to be limited to wheat. We anticipate a comparably strong performance of the two-stage calibration when applied to other arable crops, and even to other vegetation types. Research in Pla et al. [38], for instance, demonstrated the calibration of Sentinel-2 NDVI using UAV data for quantifying damage to rice crops across large spatial extents. Given the typical scale mismatches between ground measurements and Sentinel-2 data, we would also expect the two-stage calibration to be equally valid for improving the retrieval of additional crop variables that covary with LAI, including fAPAR and canopy chlorophyll content [7,62,63]. In particular, our approach overcomes widespread calibration challenges related to resolving the spatial heterogeneity of vegetation, which are even more pressing in unmanaged/natural ecosystems. We would, therefore, recommend that the two-stage calibration, involving the use of fine-scale UAV observations, be adopted as a standard when using Earth observations for vegetation upscaling activities.

Conclusions
Even within managed crop systems, we show there is considerable and important within-field variation in LAI at scales finer than the resolution of current satellite imagers. We show that UAV multispectral observations at the cm scale, acquired from a sensor designed to match Sentinel-2 spectral bands, improve interpretation of the satellite signal. Furthermore, the fine-scale resolution of the UAV sensor provides a tool for accurately upscaling LAI ground measurements, collected in coordination with the UAV flights, to satellite resolution. The within-field variance in spectral data resolved from the UAV observations was linked to wheat growth stage. Consequently, the Sentinel-2 and UAV platform data were more comparable at the later growth stages, when the vegetation canopy appeared more homogeneous due to a reduced influence of bare soil. Models used to retrieve LAI from Sentinel-2 observations calibrated directly from ground measurements performed poorly and were unable to explain the variance in LAI throughout the growing season. In contrast, our novel two-stage model calibration, involving the use of upscaled UAV LAI estimates, demonstrated a clear improvement in the accuracy of LAI retrievals from Sentinel-2 data, strongly reducing bias. We anticipate that our approach would also improve the retrieval of additional biophysical variables for other arable crop types and a broader range of vegetation cover types. Repeating our analysis at additional field sites would be necessary in order to test the robustness of the two-stage calibration. Nonetheless, our study has highlighted the value of UAV observations for effectively providing a link between point measurements on the ground and 20-m resolution multispectral observations made from the Sentinel-2 satellite.