The Impact of Pan-Sharpening and Spectral Resolution on Vineyard Segmentation through Machine Learning

Abstract: Precision viticulture benefits from the accurate detection of vineyard vegetation from remote sensing, without a priori knowledge of vine locations. Vineyard detection enables efficient, and potentially automated, derivation of spatial measures such as length and area of crop, and hence required volumes of water, fertilizer, and other resources. Machine learning techniques have provided significant advancements in recent years in the areas of image segmentation, classification, and object detection, with neural networks shown to perform well in the detection of vineyards and other crops. However, what has not been extensively quantitatively examined is the extent to which the initial choice of input imagery impacts detection/segmentation accuracy. Here, we use a standard deep convolutional neural network (CNN) to detect and segment vineyards across Australia using DigitalGlobe Worldview-2 images at ∼50 cm (panchromatic) and ∼2 m (multispectral) spatial resolution. A quantitative assessment of the variation in model performance with input parameters during model training is presented from a remote sensing perspective, with combinations of panchromatic, multispectral, pan-sharpened multispectral, and the spectral Normalised Difference Vegetation Index (NDVI) considered. The impact of image acquisition parameters, namely the off-nadir angle and solar elevation angle, on the quality of pan-sharpening is also assessed. The results are synthesised into a 'recipe' for optimising the accuracy of vineyard segmentation, which can provide a guide to others aiming to implement or improve automated crop detection and classification.


Introduction
Viticultural practices worldwide have been transformed over the last two decades by the application of precision viticulture (PV). PV is the implementation of high-precision spatial information for the adoption of site-specific vineyard management plans, optimisation of vineyard production potential, and reduction of environmental impact [1]. Spatial information is essential for determining planting locations and automating crop harvesting. When companion spectral information is also available, products such as vine health, vigour, forecasted estimates of wine grade/quality and yield, and continual health and infection monitoring can also be derived. PV enables growers to be responsive to the spatial variability of the crop environment and vine performance within a vineyard block [2]. Detection of vineyard vegetation in remote sensing imagery is complicated by the stage of vegetation, soil colour, surface conditions (e.g., soil wetness), and image acquisition conditions (e.g., solar angle and shadowing) [24,25]. Darker (due to mineralogy) or wetter soils will have lower red reflectance, and hence will be more similar to photosynthetic vegetation (p.s.v.) at red wavelengths. Similarly, shadowing of interrow materials will reduce their red reflectance [19], and this effect will be more significant at certain sun and vinerow angles [26]. The incorporation of multiple-wavelength information can compensate for many of these similarities encountered when using a single visible wavelength band. For example, the incorporation of NIR (760-900 nm) information [27], or ratio indices such as the Normalised Difference Vegetation Index (NDVI) and Ratio Vegetation Index (RVI) [28], can discriminate shadowed vines from dark soil and other surfaces that are dark at visible wavelengths. Soil-line vegetation indices, such as the perpendicular vegetation index (PVI) amongst others, can also be utilised to derive vegetation properties.
Their utility, however, is contingent on the accuracy of the derived soil-line gradient, and they may be less strongly correlated with grapevine biophysical variables (e.g., leaf-area index) than ratio indices [29]. The inclusion of ratio indices and multiple NIR wavelength channels can assist in the differentiation of vineyards from other agricultural crops planted in similar patterns, such as orchards and olive groves [5]. Misclassification of non-vineyard areas can occur for crops with a similar planting pattern (e.g., comparable interrow spacing), while true vineyards with a planting pattern that is less common in the region (e.g., gridwise or distributed [5]) may be missed. Classification accuracy is improved by image acquisition in summer, so that both vine and soil are visible [30]. This discussion highlights that many factors impact the resulting accuracy of vineyard block detection in satellite imagery. These factors translate into non-trivial decisions in the initial selection of that imagery: which wavelengths and/or spectral indices should be incorporated, whether sharpening should be employed, and in which season imagery should be acquired, in order to optimise detection success. Despite the aforementioned studies mentioning the utility of multispectral information in vineyard detection, a systematic evaluation of the separate and combined impacts of pan-sharpening of multiple wavelength bands, the inclusion of spectral indices, and the role of viewing and solar angles has not been undertaken.
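As a concrete illustration of the ratio indices discussed above, the sketch below computes NDVI and RVI from red and NIR reflectance arrays. This is a minimal sketch assuming NumPy; the toy reflectance values are illustrative and are not taken from the study imagery.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Normalised Difference Vegetation Index: (NIR - R) / (NIR + R)."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

def rvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Ratio Vegetation Index: NIR / R."""
    return np.asarray(nir, dtype=np.float64) / (np.asarray(red, dtype=np.float64) + eps)

# Toy reflectances: a vigorous vine canopy (high NIR, low red) versus dry soil.
canopy_ndvi = ndvi(np.array([0.45]), np.array([0.05]))  # close to +0.8
soil_ndvi = ndvi(np.array([0.30]), np.array([0.20]))    # close to +0.2
```

Because both indices divide by a red-band term, surfaces that are confusable in a single visible band (shadowed vines versus dark soil) separate more cleanly once NIR information is included.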
This paper quantitatively investigates how the initial choices in remote sensing imagery for vine block detection impact the resulting detection accuracy of a deep convolutional neural network (see Section 2.2 for more detail), in order to identify what combination of imagery parameters optimises the resulting vine block segmentation. In particular, the following issues are evaluated using Worldview-2 imagery: (i) whether the incorporation of multispectral information enhances vineyard detection capabilities; (ii) the sensitivity of Gram-Schmidt (GS) pan-sharpening to image acquisition parameters; and (iii) whether the inclusion of GS pan-sharpened VNIR multispectral data and derived vegetation indices, rather than the off-sensor resolution of the VNIR bands, improves detection accuracy.

Methodology
This work targeted vineyards in wine regions across multiple states in Australia. The locations of wine regions were identified from Wine Australia's Geographical Indications (GIs) [31], which provide the geographic boundaries of official wine zones, regions, or subregions. Nine images were utilised in the analysis, originating from Tasmania [32], and the GIs Wrattonbully (South Australia), Riverland (South Australia), Barossa (South Australia), Riverina (New South Wales), Geographe (Western Australia), South Burnett (Queensland) and Goulburn Valley (Victoria) [31].

Satellite Imagery
Multispectral visible to near-infrared (VNIR) imagery from DigitalGlobe's Worldview-2 satellite was utilised in this work. Details of the platform and the resolutions of the imagery are provided in Table 1, and image metadata in Appendix Table A1. The dates of image acquisition were chosen from within the period of maximum correlation between image data and grape properties, which has been shown to be during veraison [33], i.e., the onset of ripening and changing berry colour [34]. The timing of veraison varies with both broad climate zone and local climate factors such as temperatures, rainfall and soil moisture [35], but in the southern hemisphere typically falls within late summer. The accuracy of vineyard vegetation classification via satellite imagery is highest in summer, when the vines are experiencing a growth period and the vegetation canopy reaches its greatest extent [30,33,36]. Ideally, imagery is obtained when the visibility of the vinerows is high, the interrows are clear, and the soil is dry; these factors increase the vinerow and interrow contrast. Additional potential selection criteria for imagery include cloud coverage, observation angle (off-nadir angle compared to a vertical downwards view), sun elevation angle (measured from the horizon, such that 90° corresponds to the sun directly overhead) and sun azimuthal angle (compared to compass directions). Given limitations in the availability of recent imagery within the pre-defined timeframe and GIs of interest, the secondary criteria considered were a restriction of cloud coverage (to <10% of the scene) and minimisation of the off-nadir angle.
Raw data were processed, orthorectified, radiometrically calibrated and atmospherically corrected through DigitalGlobe's GBDX platform, utilising the Advanced Image Preprocessor algorithms [37]. Orthorectification involved correcting the imagery for geometric distortions due to the curvature of the Earth, surface relief (terrain) and the sensor geometry (such as orientation/tilt of the image, and movement of the sensor relative to the terrain). The orthorectified imagery provided had been registered to the SRTM + USGS NED digital elevation model (DEM), with accuracy derived through comparison to ground control points. The horizontal accuracy for WorldView-2 orthorectified products is 3.5 m CE90 (i.e., 90 percent of all products will achieve this accuracy) [38]. The 4.2 m CE90 map-scale ortho products achieve a point RMSE of 2.0 m or better, hence we estimate the RMSE of the product used here to be similar. The vertical accuracy of orthorectified images is 3.6 m CE90. Additional post-processing was undertaken by the imagery provider, including adaptive smoothing to reduce noise. Radiometric calibration and atmospheric corrections converted the pixel values from digital numbers into true reflectance values (percent of solar flux reflected by the surface), and corrected for atmospheric and solar illumination effects. Although data processing into true reflectance values is not required for accurate vineyard detection from single-band data (e.g., calibrated digital numbers [12]), it is important when using multispectral data or comparing multiple images from different sensors and viewing conditions.
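The radiometric step above converts digital numbers (DNs) into reflectance. A minimal sketch of the standard top-of-atmosphere (TOA) reflectance conversion is shown below; the calibration constants are placeholders that would normally come from the image metadata, and this stops short of the atmospheric correction applied by the provider.

```python
import math

def toa_reflectance(dn: float, abs_cal_factor: float, effective_bandwidth: float,
                    esun: float, earth_sun_dist_au: float,
                    sun_elevation_deg: float) -> float:
    """Convert a digital number (DN) to top-of-atmosphere reflectance."""
    # Band-averaged TOA spectral radiance (W m^-2 sr^-1 um^-1).
    radiance = abs_cal_factor * dn / effective_bandwidth
    # Solar zenith angle is 90 degrees minus the solar elevation angle.
    theta_s = math.radians(90.0 - sun_elevation_deg)
    return (math.pi * radiance * earth_sun_dist_au ** 2) / (esun * math.cos(theta_s))
```

Note the explicit dependence on the solar elevation angle: the same surface imaged at a lower sun elevation yields a lower uncorrected signal, which is one reason calibrated reflectance matters when comparing images acquired under different conditions.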
Where used in our experiments, pan-sharpening was undertaken to fuse the high spatial resolution information of the panchromatic band with the lower spatial resolution multispectral bands, obtaining both high spatial and high spectral sensitivity. A number of broad families of pan-sharpening methodologies exist, differing in the way in which spatial details are generated and blended [13]: component substitution (CS) methods, which include techniques such as intensity-hue-saturation (IHS) [39], Gram-Schmidt (GS) [40], adaptive Gram-Schmidt (GSA) [14], and principal component analysis (PCA) [41]; multi-resolution analysis (MRA) methods, which include wavelet-based methods [42], high-pass filtering (HPF) [43] and generalized Laplacian pyramids (GLP) [44]; and hybrid techniques, such as generalized band-dependent spatial detail (BDSD) [45]. In this work, spectral pan-sharpening was undertaken through the ENVI software package using the Gram-Schmidt routine [40,46] with Worldview-2 parameters. Gram-Schmidt was chosen as it is able to sharpen more than three spectral bands, performs well in preserving the quality of the multispectral information (minimising spectral distortions) compared to other methods [47], is less computationally complex than some other methods (e.g., Intensity-Hue-Saturation), and is still widely utilised for satellite imagery (e.g., [48]). Gram-Schmidt is also appropriate as it is from the CS family of methods, which have been shown to have higher tolerance to aliasing, shift, and other visual artefacts due to, for example, spatial misalignment between the multispectral and panchromatic bands [49]. This was considered particularly important due to the narrow width of the vinerows requiring detection, which is similar to the pan-sharpened pixel size.
Although recent work comparing fusion algorithms for landcover segmentation (the particular application of relevance in this work) demonstrated that the GS fusion method had strong performance, other members of the CS family, such as PCS and GSA, surpassed it [14,47]. Therefore, although GS is appropriate in the context of this work (vinerow segmentation), future work will focus on implementing a range of CS and other pan-sharpening methods to assess whether any significant gains in machine-learning segmentation can be achieved.
The Gram-Schmidt (GS) algorithm is based on Gram-Schmidt vector orthogonalisation. A simulated panchromatic band is constructed and all bands (including multispectral) are decorrelated, then back-transformed at high resolution [40]. The best pan-sharpening results, determined from minimal visible distortion of RGB 'natural look' imagery and minimal creation of null or negative pixels, were obtained using cubic convolution resampling. Spectral distortions, however, likely differ for each wavelength band [47], and spectral indices derived from band ratios may have a multiplicative effect on spectral distortions (thereby reducing their quality) [50][51][52], so these distortions were assessed prior to vine block segmentation.
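A minimal NumPy sketch of the component-substitution idea underlying GS pan-sharpening is given below. It simulates a low-resolution panchromatic band as the plain mean of the multispectral bands (operational implementations such as ENVI's use sensor-specific spectral weights) and then injects the pan-band detail into each band with a Gram-Schmidt-style gain; it is an illustration of the family of methods, not a reproduction of the ENVI routine.

```python
import numpy as np

def gs_pansharpen(ms_up: np.ndarray, pan: np.ndarray) -> np.ndarray:
    """Component-substitution pan-sharpening with Gram-Schmidt-style gains.

    ms_up : (bands, H, W) multispectral bands already resampled to the pan grid.
    pan   : (H, W) panchromatic band.
    """
    # 1. Simulate a low-resolution pan band (here a plain band mean).
    pan_sim = ms_up.mean(axis=0)
    # 2. Histogram-match the real pan band to the simulated one (mean/std matching).
    pan_m = (pan - pan.mean()) * (pan_sim.std() / pan.std()) + pan_sim.mean()
    # 3. Inject the spatial detail into each band, scaled by the GS gain
    #    g_b = cov(band, simulated pan) / var(simulated pan).
    detail = pan_m - pan_sim
    var = pan_sim.var(ddof=1)
    out = np.empty_like(ms_up, dtype=np.float64)
    for b in range(ms_up.shape[0]):
        g = np.cov(ms_up[b].ravel(), pan_sim.ravel())[0, 1] / var
        out[b] = ms_up[b] + g * detail
    return out
```

If the real pan band happened to equal the simulated one, no detail would be injected and the multispectral bands would pass through unchanged; the distortions examined below arise precisely where the real pan band and the band-average simulation diverge, e.g., for bands outside the panchromatic passband.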

Machine Learning
A computer vision technique known as semantic segmentation was used to detect and delineate the boundaries of vineyard blocks [53]. The objective in this method is to segment image regions by classifying every pixel in the image as belonging to exactly one of a number of designated categories. In this paper, we consider only binary classification, namely that each pixel should be classified as belonging to a vineyard (e.g., pixels could contain grapevine vegetation or inter-vinerow materials), or not belonging to a vineyard.
In the last five years, deep convolutional neural networks trained by supervised learning have been empirically shown to nearly always outperform alternative machine learning methods in computer vision problems (e.g., [54]), and semantic segmentation is no exception [55]. Hence, we also used deep convolutional neural networks (specifically a U-net semantic segmenter; see Section 2.2.2). Human-labelled 'ground truth' masks that segment each training image are required for such supervised learning (see Section 2.3). Following training, the weights in the neural network should reproduce the segmented regions in these training annotations with very high accuracy, and ideally also generalise to correctly segment unlabelled data not used in training.

Data Models
As shown in Table 2, five different input data models were investigated (labelled M1, M2, M3, M4, and M5), with the aim of determining whether multispectral information was important for identifying vineyard pixels, and whether derived information, such as pan-sharpened spectral bands or a vegetation index (NDVI), might enable better performance than raw PAN and MS channels. A vegetation index is included to test whether the spectral properties of grapevines can be sufficiently summarised (in order to differentiate them from other row crops) by the information captured in only two additional bands (in this case, red and near-infrared wavelengths). Although many other vegetation indices are available, NDVI is widely applied in viticulture (e.g., [56]).
Note also that we undertook preliminary investigations into models that included all eight multispectral bands pan-sharpened to higher resolution. These were difficult to pursue due to the need for additional computational resources that were not readily available to us (pan-sharpened bands require 16 times more RAM per image per channel than the corresponding MS band). For this reason, and because we found the performance of such models to be worse than that of the M3 baselines, we did not pursue them further as primary data models in this paper.
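Because the data models mix a ∼50 cm panchromatic band with ∼2 m multispectral bands, the channels must be brought to a common grid before being stacked into a single network input. The sketch below illustrates one way this can be done (nearest-neighbour upsampling for simplicity; the actual band combinations per model are those defined in Table 2):

```python
import numpy as np

def stack_channels(pan: np.ndarray, ms: np.ndarray, scale: int = 4) -> np.ndarray:
    """Stack a (H, W) panchromatic band with (bands, H/scale, W/scale)
    multispectral bands by nearest-neighbour upsampling the MS bands to
    the panchromatic grid. Returns a (1 + bands, H, W) array."""
    ms_up = np.kron(ms, np.ones((1, scale, scale)))  # replicate each MS pixel scale x scale
    return np.concatenate([pan[None, ...], ms_up], axis=0)
```

The factor-of-16 memory cost noted above is visible here: each upsampled MS channel holds `scale**2 = 16` times as many pixels as its un-sharpened counterpart.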

Neural Network Architecture
Several alternative designs for deep convolutional neural network semantic segmenters were investigated: SegNet [57], U-net [58] and DeepLab-v3 [59]. Both U-net and DeepLab-v3 provided better results than SegNet, and U-net was selected as it is relatively small in model size (as measured by the number of learned parameters) while showing no significant difference in performance from DeepLab-v3. We hypothesise that, while DeepLab-v3 is known to be better than U-net for multi-class semantic segmentation [59], this is due to the greater diversity of imagery it tends to be applied to in such cases, such as that obtained from self-driving cars. In contrast, here, and more broadly in crop detection in satellite imagery, we have a relatively simple segmentation task (binary in this case), with relatively little viewpoint and object-scale diversity, which may explain why we found no benefit from DeepLab-v3.

Neural Network Training
We found it useful to use transfer learning [60]. It has recently been shown that the "fine-tuning" method of transfer learning, where all weights in a network previously trained on one dataset are updated during training on a new dataset, tends to provide better results than the more well-known approach where a pre-trained neural network is used only to convert input data into a feature set for training a new model [60]. The fine-tuning method can therefore be thought of as an alternative to randomly initialising the weights of a deep network. Empirically, fine-tuning tends to work best when training for a relatively small number of epochs with a very low learning rate [60].
For the purposes of this paper, we made use of a U-net model (which we label as M0) that was originally trained on many more images than used in this paper (nine of these are listed in Table A1). We used the weights of model M0 for transfer learning using fine-tuning, and hence the architectures of our models in this paper were all identical to that of M0, except for the very first layer of weights, which differed in the number of input channels due to the differing numbers of channels in the data models investigated. We therefore used random weights for the first weights layer only.
We created a dataset for training by tiling each image (and the corresponding label masks) into patches of size 256 × 256 from Images 1, 2 and 3 (Table A1). The other six images were not accessible for training models for this paper. The total percentage of pixels in the vineyard class in the three training images was very small (0.1%, 0.2% and 5% in Images 1, 2 and 3, respectively). This means that there are many times more pixels in the 'not vineyard' class than in the 'vineyard' class, a classic example of the 'class imbalance' problem in machine learning [61]. To enable our models to learn effectively despite class imbalance, a form of compensation known as minority class oversampling [62] was used. The version we devised followed from the need to subdivide each image into small tiles, as is typical in semantic segmentation, in order to speed up training on GPUs with 12 GB of RAM. Each such patch was assigned as an 'oversample' patch if any pixel in it was labelled as within a vineyard. Labelled 'edge cases' were also available, that is, examples of areas that resembled vineyards but were verified to be something else, such as strawberry fields or ploughed paddocks. Patches in this category were also assigned as 'oversample' patches, since the total area of these cases was very low. During training using stochastic gradient descent [63], sequences of batches of patches were randomly selected such that all 'oversample' patches were selected exactly once during one epoch of training. For each epoch, an equal number of patches not in the 'oversample' list were randomly chosen, and then not used again until all such patches had been used once, which took about 10 epochs. The downside of oversampling in this way is that the neural network may well overfit to outliers in the oversample patches. This was the main reason we used transfer learning, which is known to be a good way to help avoid overfitting.
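The oversampling scheme described above can be sketched as follows. This is a simplified stand-in for the training pipeline: patch identifiers stand in for 256 × 256 tiles, and the class/method names are illustrative.

```python
import random

class OversampledPatchSampler:
    """Minority-class oversampling over image tiles (a sketch).

    Every 'oversample' patch (any patch containing a vineyard pixel, or a
    labelled edge case) appears exactly once per epoch; background patches
    are drawn without replacement across epochs until the background pool
    is exhausted, then the pool is reshuffled and refilled."""

    def __init__(self, oversample_ids, background_ids, seed=0):
        self.oversample = list(oversample_ids)
        self.background = list(background_ids)
        self.rng = random.Random(seed)
        self._pool = []  # background patches not yet used in the current cycle

    def _draw_background(self, n):
        drawn = []
        while len(drawn) < n:
            if not self._pool:  # refill only once every background patch is used
                self._pool = self.background[:]
                self.rng.shuffle(self._pool)
            drawn.append(self._pool.pop())
        return drawn

    def epoch(self, batch_size):
        """Yield shuffled batches: all oversample patches once, plus an
        equal number of background patches."""
        patches = self.oversample[:] + self._draw_background(len(self.oversample))
        self.rng.shuffle(patches)
        for i in range(0, len(patches), batch_size):
            yield patches[i:i + batch_size]
```

With, say, five oversample patches and fifty background patches, every background patch is consumed exactly once over ten epochs, mirroring the roughly ten-epoch background cycle noted above.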
We used the standard fine-tuning approach of stochastic gradient descent with momentum and weight decay, with a very low learning rate (0.0001). We found 20 epochs of training to suffice for convergence on validation data. When we report validation results below, the patches used for training were excluded from use in calculation of performance measures.

Measuring Vineyard Detection Performance
The performance of the automated vineyard detection was measured through comparison to an independent, manually labelled dataset of vine block boundaries. This ground-truth boundary shapefile was generated through reference to the Worldview-2 imagery (particularly a pan-sharpened 'natural colour' image composite, pan-sharpened colour-infrared composites, and pan-sharpened NDVI), as well as the online mapping datasets Google Earth and Google Street View [64]. The Google Street View dataset typically pre-dated the satellite images and, in some circumstances, was inaccurate due to land-use change. A number of quantitative metrics for measuring the validity of vineyard detections were considered, chosen for their widespread usage in either the geospatial and remote sensing or the machine learning and semantic segmentation fields. The metrics utilised all correspond to the case where one of the binary categories is designated as a "positive" class (in this case, 'vineyard'): "precision" (also known as "map user's accuracy"), "recall" (also known as "map producer's accuracy"), "area ratio" and "Jaccard Index" (JI). For completeness, we also consider metrics that treat both classes (in this case, 'vineyard' and 'not vineyard') equally: "overall accuracy" with the associated kappa statistic (which requires calculation of the expected accuracy of a random classifier).
Through comparison to the ground-truth data, the following categories of pixel classification can be defined: true positives (TP), i.e., vineyard pixels correctly classified; false positives (FP), non-vineyard pixels incorrectly classified as vineyards; true negatives (TN), non-vineyard pixels correctly classified; and false negatives (FN), vineyard pixels incorrectly classified as non-vineyard.
The aforementioned performance metrics, which are referenced to a designated positive class, are:
• Precision = TP/(FP + TP). This provides a measure of the fraction of predictions that really are vineyard.
• Recall = TP/(TP + FN). This provides a measure of the fraction of actual vineyard correctly predicted as vineyard.
• Jaccard Index = TP/(TP + FP + FN). Also expressed as "intersection over union" (IOU), this is a measure of the spatial overlap between pixels predicted to be in vineyards and pixels labelled as being in vineyards.

• Area ratio = (TP + FP)/(TP + FN). This is the ratio of the spatial area (in number of pixels) of predicted vineyards to real vineyards; it also equals the ratio of recall to precision. However, even when a high agreement between predicted and actual vineyard area is achieved, the predicted vineyard block boundaries could potentially be non-overlapping with the real boundaries. Such a case is penalised by the Jaccard Index but ignored by the area ratio.
Those performance metrics that consider each class equally are:
• Overall Accuracy = (TP + TN)/(TP + TN + FP + FN). This is the fraction of all pixels that were correctly classified.
• Kappa statistic = (Overall Accuracy − Expected Accuracy)/(1 − Expected Accuracy) [65,66]. The expected accuracy estimates the overall accuracy that could be obtained from a random classifier with the same class marginals: Expected Accuracy = [(TP + FP)(TP + FN) + (TN + FN)(TN + FP)]/N², where the denominator is the square of the total number of observations N. The kappa statistic therefore measures the level of agreement between classification and ground-truth beyond that which could originate through chance. A large and positive kappa (near one) indicates that the overall accuracy is high and exceeds the accuracy that could be expected to arise from random chance; this can be interpreted as the classifier providing a statistically significant improvement in the classification of 'vineyard' and 'not vineyard' over random assignment of pixels to the binary classes.
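The definitions above can be collected into a single helper (a sketch; in practice the four counts would come from comparing a predicted mask against the ground-truth mask pixel by pixel):

```python
def segmentation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary segmentation metrics from pixel counts ('vineyard' is the positive class)."""
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    jaccard = tp / (tp + fp + fn)           # intersection over union
    area_ratio = (tp + fp) / (tp + fn)      # equals recall / precision
    overall = (tp + tn) / n
    # Expected accuracy of a chance classifier with the same marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (overall - expected) / (1 - expected)
    return {"precision": precision, "recall": recall, "jaccard": jaccard,
            "area_ratio": area_ratio, "overall_accuracy": overall, "kappa": kappa}
```

For example, TP = 80, FP = 20, TN = 880, FN = 20 gives precision = recall = 0.8, Jaccard Index ≈ 0.67, area ratio = 1.0, overall accuracy = 0.96, expected accuracy = 0.82, and kappa ≈ 0.78.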
A number of implicit relationships exist between the above metrics; for example, the area ratio is the ratio of recall to precision.
For vineyard identification, we consider the metrics referenced to a positive class to be of more value than those that are not, since the total number of pixels in the vineyard class is far smaller than that in the non-vineyard class.

Pan-Sharpening
The impact of pan-sharpening on the spectral values of grapevine vegetation (along vinerows) was examined, considering wavelength, the NDVI spectral index, and the viewing conditions under which the imagery was acquired (vinerow statistics are provided in Table A2). Although the Gram-Schmidt algorithm used for pan-sharpening (PS) is considered to obtain good spatial results whilst minimising spectral distortions, the potential for some spectral distortion is known (see Section 2.1) and was observed here. Here, we compare the pan-sharpened pixel values with the original multispectral values for vinerow vegetation (not on a whole-image basis). Figure 1 illustrates that the magnitude of spectral distortion varied with wavelength but was typically within 50% of un-sharpened reflectance values. The Gram-Schmidt algorithm generally reduced mean reflectance values locally within the vinerows. Statistically, spectral differences between the two profiles were more significant at visible wavelengths (where the PS profile was more than 1σ below the mean un-sharpened profile); however, this result is likely to be highly dependent on the characteristics (e.g., homogeneity) of the vineyard vegetation selected. Spectral shape was generally preserved, and the difference in spectral slopes was less significant than the differences in mean pixel values. Spectral distortions were expected to be largest in the two wavelength bands (coastal, 400-450 nm, and NIR2, 860-1040 nm) which do not overlap with the panchromatic band (450-800 nm). In the spectral profiles shown in Figures 1 and 2, the differences across the longer wavelengths (>700 nm) remained generally small, while the largest differences were observed at visible wavelengths, particularly at blue and red wavelengths. Image acquisition parameters appear to have a significant impact on the magnitude of distortion of the reflectance values through PS.
Ideally, imagery would be acquired at nadir (0° off-nadir angle) and with the sun at zenith (90° solar elevation). From Figure 2, imagery obtained under near-ideal conditions (green and black profiles) certainly led to better performance of Gram-Schmidt pan-sharpening, incurring minimal spectral distortions across all wavelengths (i.e., a spectral profile ratio near 1.0). The improvement was observed most significantly in the blue and red wavelength bands, where the spectral distortions in images obtained under poorer viewing conditions were typically substantial. For example, the blue (Image 7) and magenta (Image 9) profiles both have a factor of 2 distortion in the spectral reflectance at blue and red wavelengths, and have off-nadir angles greater than 26.7° and solar elevations less than 49.1°. In general, pan-sharpening reduced the reflectance from the un-sharpened values by a factor of <2. These results indicate that, under non-ideal viewing conditions, the pan-sharpening process is likely to substantially change the visual appearance and 'colours' of the scene captured through R, G, B composite imagery ('natural look'), as the relationship between the R, G, B bands is altered. More significantly, PS may distort the interpretation of vegetation characteristics derived from the relationship between red-wavelength reflectance and other bands (examined below). It is unclear which of the two image viewing conditions examined had the greater impact on sharpened spectral quality (Figure 3). The off-nadir angle was more strongly correlated with the 'spectral profile ratio', having a Pearson's correlation coefficient of 0.66 between the angle and the mean spectral profile ratio across visible wavelengths, compared to a correlation coefficient of −0.39 with the solar elevation angle.
This result indicates that minimising the off-nadir view may be more important to image quality than optimising the solar elevation angle (which is likely to cause significant interrow shadowing for most vinerow orientations); however, neither of these correlations is statistically significant given the number of images available for examination.

Figure 2. Comparison between the spectral profile ratio (un-sharpened over pan-sharpened) of vinerows imaged under different conditions. Off-nadir angles and solar elevation angles, respectively, are given for each image profile in brackets. All profiles showed a strong vegetation signature in both pan-sharpened and un-sharpened images (pixel counts are provided in Table A2). The dashed line at a spectral ratio of 1.0 indicates no spectral distortions introduced through the pan-sharpening process.

Figure 3 (caption, beginning truncated). [...] Table A1. Symbol colours provide the mean un-sharpened to pan-sharpened spectral ratio (plotted in Figure 2) across bands 1 to 5 (visible wavelengths). A weak positive correlation is observed between the magnitude of the spectral distortions and the degrees off-nadir (Pearson correlation coefficient 0.66).

Figure 4 illustrates the impact of pan-sharpening on the vegetation index NDVI, which is derived from the normalised difference in reflectance of red and near-infrared wavelengths. In a general vinerow, mean NDVI values are increased through the pan-sharpening process, and are within a factor of 1.3 of the un-sharpened values. The distribution of per-pixel NDVI values may also be skewed, as indicated by the non-circular, mean-centred, one-standard-deviation envelopes, as the algorithm increased the variance in NDVI values across each vinerow. No significant correlation between NDVI change and image acquisition parameters was observed; however, the least skewed mean values originated from the images taken under near-ideal conditions (black and green profiles).
Similar trends were also observed when comparing un-sharpened and pan-sharpened NDVI values of generally homogeneous (at the panchromatic spatial resolution) vegetation regions, namely irrigated ovals/sports fields. As observed for vinerows, in all cases the mean sharpened NDVI of ovals/sports fields was on the order of 10% greater than the un-sharpened mean, and the variance generally greater. This illustrates that the change in the mean and distribution of NDVI values observed for vinerows is not related to the periodic variation in vinerow-interrow signatures being separated in the pan-sharpened imagery, but encompassed within the vinerow ROIs in the un-sharpened imagery.

Figure 4 (caption, beginning truncated). [...] Table A2), while the dashed lines indicate the mean-centred 1 standard deviation ellipses. The aspect ratio of the ellipses (semi-minor over semi-major axis length) is provided in brackets (a circle would have an aspect ratio of 1) to highlight whether the distribution of pixel values was significantly skewed by the pan-sharpening process.
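The 'spectral profile ratio' used throughout this section, and its correlation with the acquisition angles, can be computed as below. This is a sketch: in practice, the per-band means would be taken over the vinerow ROI pixels summarised in Table A2, and one ratio vector would be produced per image.

```python
import numpy as np

def spectral_profile_ratio(unsharp_means, sharp_means) -> np.ndarray:
    """Per-band ratio of mean un-sharpened to mean pan-sharpened vinerow
    reflectance; a value of 1.0 indicates no distortion in that band."""
    return (np.asarray(unsharp_means, dtype=np.float64)
            / np.asarray(sharp_means, dtype=np.float64))

def pearson(x, y) -> float:
    """Pearson correlation coefficient between two samples (e.g., per-image
    off-nadir angles and mean visible-band spectral profile ratios)."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])
```

With only nine images, correlations of the magnitudes reported (0.66 and −0.39) are suggestive rather than statistically significant, as noted above.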

Quantitative Data Model Comparisons
The impacts of spatial resolution, spectral resolution, and pan-sharpening (and, implicitly, image acquisition parameters) on model performance are quantified in Table 3 (with additional metrics in Table A3). Four key metrics (defined in Section 2.3) are utilised to compare the vineyard block predictions with ground-truth labels of vineyard boundaries, and thereby assess the accuracy of the resulting segmentation. Several visual examples of vineyard segmentations from models M1-M5 are provided in Figures 5-7. The results for recall and precision are visualised in Figure 8, and for precision and area ratio in Figure 9. Although strongly correlated in this case (Table A4), it is conventional within machine learning to visually compare precision and recall. The accuracy and JI results are not visualised, as they were both also highly correlated with recall. In contrast, precision and area ratio were only weakly correlated (Table A4). In general, the best predictive performance was achieved by the M2 model, trained and run on the panchromatic band and all eight multispectral channels at un-sharpened (coarse) resolution. This model typically achieved higher recall, a higher Jaccard Index, and an area ratio close to 1.0 (and a better Kappa statistic; Table A3). Precision on one image was slightly poorer than that achieved by M4 (pan-sharpened R-RE-NIR bands), and on another the area ratio was slightly poorer than that achieved with M4. In general, the results for the M1, M3 (pan-sharpened RGB bands) and M4 models included more false detections compared to M2, while M3 obtained poorer area prediction and spatial overlap, with more misses of true vineyards.
The performance of M5 was generally very similar to M2, differing most in recall for Image 2. The variance in performance across models was image dependent, likely due to variations in vineyard vegetation and image acquisition/quality parameters (i.e., the off-nadir and solar elevation angles) that were not adequately captured in the model training dataset. From Figure 8, all models for Images 3 and 6 obtained relatively similar precision, recall, and area ratio measures (each varied by <2%), while for Images 1 and 2 the improvement of M2 in precision and recall was more substantial. When comparing precision and area ratio in Figure 9, M2 provided a significant improvement in the estimation of vineyard area for Image 2, although for Image 1 the area ratio of M1 was the most accurate, albeit with lower precision than M2. In general, the precision, recall, and JI metrics were more sensitive to the choice of model (Table 4), while variation in area ratio between models was smaller. In contrast, the area ratio and Jaccard Index (spatial area overlap) were more sensitive to variations between images.
Although the measures of accuracy reported here capture differing sources of error, they generally agreed on the relative ranking of the models. M2 was generally the highest performing, with M5 the most similar, and M1 and M3 typically only slightly poorer (with the exception of Image 1). For all images, the best performing model had a Kappa statistic indicating >77% better agreement between predictions and ground truth than would likely be obtained by chance (Table A3). Figures 8 and 9 also provide some insight into the generalisability of the results to different images. A model whose performance is robust to imagery obtained at different locations and times (assuming the choice of input bands is fixed) would result in symbols of the same colour in Figure 8 clustering together. This was generally not the case. The same result is seen in the range of the coefficient of variation between models in Table 4.

Figure 8. Comparison of precision and recall for different images (symbol type) and models (symbol colour). The incorporation of un-sharpened multispectral data generally enhances recall but reduces precision. The worst recall and precision results were generally obtained from the use of pan-sharpened RGB multispectral bands.

Figure 9. Comparison of precision and area ratio for different images (symbol type) and models (symbol colour). There is no clear relationship between model parameters and area ratio, suggesting it is more strongly related to image characteristics. Image 1 shows the largest variation in area ratio between models.
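As a concrete reference for how these four metrics relate to the underlying confusion-matrix counts, the comparison of a predicted mask against ground-truth labels can be sketched as follows. This is an illustrative implementation (not the evaluation code used in the study), assuming predictions and labels are available as binary pixel masks:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Compare a predicted binary mask against a ground-truth mask
    (1 = vineyard pixel, 0 = background)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # correctly detected vineyard pixels
    fp = np.sum(pred & ~truth)   # false detections
    fn = np.sum(~pred & truth)   # missed vineyard pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    jaccard = tp / (tp + fp + fn)          # spatial overlap (JI)
    area_ratio = pred.sum() / truth.sum()  # predicted vs. true vineyard area
    return precision, recall, jaccard, area_ratio
```

Note that an area ratio near 1.0 can coexist with a low Jaccard Index if false detections and misses happen to balance, which is why the two metrics are reported separately.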

Interplay between Spatial Resolution and Spectral Values (Image Fusion)
Spectral analysis with pan-sharpened multispectral values should typically be undertaken with caution. Pan-sharpening algorithms typically introduce distortions to the spectral reflectances that vary in magnitude with wavelength, image characteristics, and choice of algorithm. Pan-sharpened satellite imagery was necessary for vinerow delineation and vine canopy extraction, due to the spatial resolution required to visually differentiate vegetation from interrows. This work, however, has assessed the costs and benefits associated with utilising pan-sharpened multispectral data for automated segmentation of vine block boundaries through a CNN. Although vineyard detection can be achieved using a single spatially fine but spectrally broad panchromatic band (e.g., [4,12,19,30] and results from our model M1), there are many factors that can reduce contrast between vines and interrow materials and reduce the overall accuracy of grapevine vegetation classification. These factors include the visibility of the vines and interrows, grapevine growth state, interrow spacing, surface conditions (e.g., soil wetness), and image acquisition conditions. The incorporation of other wavelengths, particularly the near-infrared, can compensate for these similarities and increase detection accuracy [27]. The Shannon-Nyquist theorem, which concerns the discrete sampling frequency needed for the reconstruction of a continuous signal [67], can be used to estimate the spatial resolution required for vinerow detection. Through application of the theorem, periodic patterns can be reliably detected in imagery with a pixel size at least 2× smaller (finer) than the pattern period [30]. Given that interrows in Australia are typically 3 m to 3.3 m wide, while vinerows are typically <1 m wide, the vine planting period would be expected to be less than 4.3 m, requiring imagery with a pixel size no larger than 2.15 m for detection of vineyard rows.
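The Nyquist-based estimate above reduces to simple arithmetic; the following sketch uses the representative widths quoted in the text:

```python
# Back-of-envelope Nyquist estimate of the pixel size needed to detect
# the periodic vinerow pattern (widths are representative values).
interrow_width = 3.3  # m, upper end of typical Australian interrow spacing
vinerow_width = 1.0   # m, approximate upper bound on vinerow width

# The vine planting period is one vinerow plus one interrow.
period = interrow_width + vinerow_width  # 4.3 m

# Nyquist criterion: pixel size must be no larger than half the period.
max_pixel_for_detection = period / 2  # 2.15 m
print(max_pixel_for_detection)
```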
In order to precisely delineate the vinerow vegetation and reliably separate the canopy edges from the interrows, greater spatial sensitivity is required. For this task, at least three pixels overlapping each interrow are desired (two mixed pixels of vinerow vegetation and interrow, and one pure interrow pixel), and hence resolutions finer than 1.1 m. From Table 1, this necessitates pan-sharpening of the Worldview-2 multispectral imagery. Figures 1 and 2 examine the impact of pan-sharpening on grapevine vegetation values, revealing that it introduced a non-trivial distortion (typically a reduction) of reflectance values but generally preserved the shape of the vegetation spectral profile and the spectral slopes. The magnitude of spectral distortion was smaller when comparing image slopes rather than mean pixel values, implying that spectral slope classification methods [68] may be more robust than pixel-based classifiers such as K-Means and Maximum Likelihood (e.g., [69]). Spectral distortions increased with poorer image acquisition conditions (larger off-nadir angle and smaller solar elevation angle). The distortions were generally largest at visible wavelengths, particularly in the blue and red bands. Significant changes to coastal and NIR2 band reflectance values were expected, as these bands do not spectrally overlap with the panchromatic band. Hence, the fusion of spatial information from the simulated low-resolution Pan band (computed as a weighted linear combination of the multispectral bands, with weight 0 for bands 1 and 8; see Table 1) into the coastal and NIR2 bands will depend on the spatial-spectral relationships between the other multispectral bands and the PAN band. However, significant spectral distortions were not observed in the coastal band, nor in the NIR2 band (likely due to spectral redundancy, i.e., partial overlap with the NIR1 band [70]).
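The simulated low-resolution Pan band described above can be sketched as a weighted linear combination over the multispectral cube. The weights below are hypothetical placeholders to illustrate the structure (zero weight on the non-overlapping coastal and NIR2 bands), not the actual Worldview-2 spectral-response weights:

```python
import numpy as np

def simulated_pan(ms_cube, weights):
    """Simulate a low-resolution panchromatic band as a weighted linear
    combination of multispectral bands, as used in Gram-Schmidt sharpening.

    ms_cube : array of shape (n_bands, rows, cols)
    weights : length n_bands; bands outside the panchromatic passband
              (coastal and NIR2 for Worldview-2) receive weight 0.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the result stays in reflectance units
    # Contract the band axis: result has shape (rows, cols).
    return np.tensordot(w, ms_cube, axes=1)

# Hypothetical weights for 8 bands, zero on bands 1 (coastal) and 8 (NIR2).
weights = [0.0, 0.15, 0.2, 0.2, 0.15, 0.15, 0.15, 0.0]
```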
Although ideally all non-overlapping multispectral bands would be excluded from the pan-sharpening, the implementation of Gram-Schmidt sharpening in the ENVI software outputs a high-spatial-resolution transformation of all bands. The distortions at red wavelengths may be attributed to sun-angle differences. Changes in sun angle are known to strongly affect red-wavelength vegetation reflectance, largely depending on the foliar distribution and leaf area index [71]. The most severe spectral distortions in Figure 2 originated from images taken under the lowest sun elevation angles and largest off-nadir angles. This suggests that the initial radiometric corrections for sun elevation and differences in atmospheric path length across the image were insufficient. Another plausible explanation is that the manual delineation of vinerows was confounded by the presence of interrow shadow and displaced from the centre of the vinerow, and hence did not constitute a pure grapevine vegetation signal. The latter explanation is unlikely, as multiple image products (false-colour composites and vegetation indices) were utilised in the selection of grapevine ROIs.
The results demonstrate the sensitivity of pan-sharpening quality to solar and sensor viewing angles, which in turn directly impacts the accuracy of vine block detection using sharpened multispectral data (see Section 4.2 below). It is difficult to assess the generalisability of these conclusions to other vine detection studies (at the same scale). One key uncontrolled variable that may impact the robustness of the spectral results is the intra-image difference in the vegetation ROIs used for spectral profiles, due to differences in vigour, canopy structure, or even cultivars/grape varieties. These differences in vegetation account for an unknown fraction of the variance between images, and could only be further mitigated by aiming to capture as much variance in grapevine vegetation within each image as possible. A potential avenue for future work is to extend this study by incorporating shortwave infrared (SWIR) reflectance from the Worldview-3 satellite. The response of vegetation at these wavelengths is strongly related to water content, has been shown in viticultural studies to be highly sensitive to water stress [72], and can be used to infer other biophysical parameters and forecast plant production/yield. The incorporation of SWIR reflectance may also aid in discriminating grapevine vegetation from other types of row crops, thereby increasing the accuracy of automated grapevine detection and image segmentation. However, similarly to reflectance at shorter wavelengths (VIS-NIR), SWIR reflectance is sensitive to variation in the bidirectional reflectance distribution, and the pan-sharpened bands would be expected to vary in quality with solar and sensor viewing angles, as demonstrated in this work. Previous works have shown that vegetation indices derived from SWIR bands can likewise be strongly impacted by sun-target-sensor geometry (e.g., [73]).
An additional caveat is that the Worldview-3 SWIR bands are approximately three times coarser in spatial resolution than the multispectral bands, and hence more significant distortion of spectral values would be expected as a result of pan-sharpening.

Performance Validation
The spectral and spatial resolution of input imagery had a significant impact on the accuracy of predicted vineyard block boundaries. Despite a mean coefficient of variation in precision, recall, and JI on the order of 10% across images and 3% across models (Table 4), the validity of the predictions remained high. Classification mistakes (i.e., false detections) typically constituted <12% of the predictions, and the predicted vineyard area was generally within 5% of the actual area. Missed vineyard pixels were a slightly larger source of error, constituting <23% of the predictions; this was likely due to the variation in vineyard vegetation being insufficiently captured in the labelled training data. The mean Jaccard Index for the highest performing model (M2) was 0.8, outperforming the current state-of-the-art CNN architectures of the top-ranked participants in recent image segmentation competitions (e.g., [74,75]). Ref. [4] utilised an object-based classifier for a similar vineyard detection task, with input imagery consisting of pan-sharpened Worldview-2 multispectral reflectance, spectral ratios, and derived textural features. They reported correctness (analogous to precision) and completeness (analogous to recall) over 89%, averaging 92% and 93%, respectively. Ref. [5] utilised similar methods and input imagery to [4], and reported user's accuracy (analogous to precision) and producer's accuracy (analogous to recall) for vineyards ranging from 92.5-97.5% and 94.87-100%, respectively, depending on image region. These results are comparable to the highest-performing model discussed here (M2), which achieved a mean correctness/precision of 89% and a mean completeness/recall of 88% across all images (also a mean area ratio of 1.01 and mean accuracy of 99.8%, calculated from Table 3), without the incorporation of pan-sharpening, spectral indices, or textural features.
The results clearly illustrate the importance of incorporating multispectral data rather than simply using a high-resolution panchromatic band. The sensitivity to chlorophyll absorption (red wavelengths) and plant tissue structure (near-infrared wavelengths), and hence the ability to differentiate vegetation based on attributes such as vigour, canopy density, and structure, translated into improved vineyard detection and more accurate prediction of vineyard area. However, the incorporation of a high-resolution subset of the multispectral data (the pan-sharpened visible red, green, and blue bands) resulted in generally similar or poorer performance across all measures compared to the use of the panchromatic band alone. This can be attributed both to the spectral distortions introduced by the pan-sharpening process (quantified in Section 3.1), particularly at blue and red wavelengths, and to the insufficient sensitivity to vegetation when only visible wavelengths are used. When the red, red edge, and near-infrared wavelengths were incorporated at pan-sharpened resolution to capture the unique spectral response of green vegetation, model performance slightly improved on two images compared to the panchromatic band only, but still did not exceed that of the coarse un-sharpened multispectral data. When the spectral response of vegetation was instead summarised through a high-resolution spectral index (NDVI), a greater improvement was observed, with performance then approaching that of the un-sharpened multispectral model. This improvement can be explained by the enhanced sensitivity to vegetation, while the spectral distortions introduced through pan-sharpening are minimised by using a normalised difference product of the sharpened spectral bands rather than the bands themselves (see Section 3.1).
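The robustness of the normalised difference can be seen in the idealised case where pan-sharpening scales the red and NIR bands by a common multiplicative gain: the gain cancels exactly in the ratio. In practice the distortions are wavelength-dependent, so the cancellation is only partial. A minimal sketch, using hypothetical reflectance values:

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index."""
    return (nir - red) / (nir + red)

# Hypothetical vegetation reflectances and a common multiplicative
# distortion g (e.g., a gain introduced by sharpening):
red, nir, g = 0.05, 0.40, 0.9
unsharpened = ndvi(nir, red)
sharpened = ndvi(g * nir, g * red)  # g cancels in the normalised difference
```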
The machine learning model we used was trained to perform binary image segmentation, i.e., to classify each pixel as either part of a vineyard or not. In future work, it would be interesting to extend the model to more than two classes, enabling classification of vineyards by grape variety. Similar multi-class problems have been tackled using semantic segmentation applied to high-resolution satellite imagery, for example where the pixel categories include forest, water, urban, and several others [76].

Conclusions
This work quantified the impact of initial choices in Worldview-2 remote sensing imagery (namely the wavelengths used, their spatial resolution, and, implicitly, the image acquisition parameters) on the accuracy of vineyard boundary detection via a machine learning methodology. The incorporation of multispectral information at its native off-sensor resolution was found to enhance vineyard detection capability and spatial area prediction, compared to the performance of the algorithm on a single panchromatic band. However, pan-sharpening, a frequently used technique in remote sensing for precision viticulture, was found to cause significant spectral distortions that were dependent on both wavelength and image acquisition parameters. These distortions led to poorer vineyard detection performance when the model was run on high-spatial-resolution visible wavelength bands, and on high-spatial-resolution red, red edge, and near-infrared bands (chosen to enhance sensitivity to green vegetation). The use of the high-resolution NDVI index resulted in similar (but slightly poorer) performance to the coarse multispectral model, as the spectral distortions introduced by pan-sharpening were reduced by taking the normalised difference ratio of the spectral bands. In summary, the imagery choices that optimised the automated vineyard segmentation were the panchromatic band together with the un-sharpened multispectral bands as input. If pan-sharpening is required, results may be optimised by minimising the off-nadir angle and maximising the solar elevation angle. These results provide valuable information for others working more broadly on crop detection, and on the derivation of grapevine or other vegetation characteristics at fine spatial scales from space.

Acknowledgments: Innovation Connections grants ICG000351 and ICG000357 from the Australian Federal Government's Department of Industry, Innovation, and Science are gratefully acknowledged.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: