Classification of grapevine varieties using UAV hyperspectral imaging

The classification of different grapevine varieties is a relevant phenotyping task in Precision Viticulture, since it enables estimating the growth of vineyard rows dedicated to different varieties, among other applications concerning the wine industry. This task can be performed with destructive methods that require time-consuming work, including data collection and analysis in the laboratory. Unmanned Aerial Vehicles (UAV), however, provide a more efficient and less prohibitive approach to collecting hyperspectral data, despite acquiring noisier data. Therefore, the first task is to process these data in order to correct and downsample large amounts of samples. In addition, the hyperspectral signatures of grape varieties are very similar. In this work, a Convolutional Neural Network (CNN) is proposed for classifying seventeen varieties of red and white grapevines. Rather than classifying single samples, these are processed together with their neighbourhood. Hence, the extraction of spatial and spectral features is addressed with 1) a spatial attention layer and 2) Inception blocks. The pipeline goes from processing to dataset elaboration, finishing with the training phase. The fitted model is evaluated in terms of response time, accuracy and data separability, and compared with other state-of-the-art CNNs for classifying hyperspectral data. Our network proved to be much more lightweight, with a reduced number of input bands, a lower number of trainable weights and, therefore, reduced training time. Despite this, the evaluated metrics showed much better results for our network (~99% overall accuracy), compared with previous works barely achieving 81% OA.


Introduction
Understanding vegetation development is a crucial aspect of crop management that impacts the effectiveness and productivity of agricultural efforts. Precision Agriculture (PA) involves observing agricultural variables that affect crop production, which enables accounting for spatial and temporal variations, resulting in enhanced crop performance, reduced costs, and improved sustainability. Additionally, it provides a forecasting tool to accurately supply crop needs, such as water and nutrients. When applied to vines, this concept is known as Precision Viticulture (PV), with a wide variety of applications, ranging from the detection of biomass (Di Gennaro and Matese, 2020), water content (Santesteban et al., 2017; Gutiérrez et al., 2021) and other compounds (Peng et al., 2022), to vigour estimation (Bramley, 2010; Campos et al., 2019), detection of plant diseases, pest surveillance (Mendes et al., 2022), analysis of grape maturity (Soubry et al., 2017) or yield estimation (Hassanzadeh et al., 2021). These applications focus either on grape clusters, leaves, stems or vineyard supports (Singh et al., 2022).
The characterization of vineyard plots using UAVs is particularly challenging due to their variability regarding tree structure, inter-row spacing and surrounding elements (bare soil, shadowed areas, grassing, etc.). Therefore, highly detailed images are relevant for discriminating vegetation, soil and weeds, which have been previously shown to affect grape estimations (Ammoniaci et al., 2021; Sassu et al., 2021). Consequently, a significant effort of previous studies is oriented toward the segmentation of the canopy (Pádua et al., 2022). In this regard, UAVs help to support decision-making systems by gathering precise information that enables the estimation of biophysical and performance-related features (Bramley, 2010).
This study examines hyperspectral data for classifying a significant number of grapevine varieties. First, data are explored to shed some light on the initial clustering of the hyperspectral signatures of these varieties, and then the feature reduction problem is studied over them. Once the shortcomings of working with these data are presented, a Deep Learning training pipeline is described to provide a phenotyping tool applied to vineyards. The study's findings aim to demonstrate the classification capabilities of UAV hyperspectral data. In comparison with previous work, the proposed network is proven to perform well over UAV and satellite-based imagery.

Related work
The purpose of this section is to offer an overview of the research on hyperspectral data, encompassing both conventional and novel approaches. Given that our case study is focused on grape classification, the primary techniques for this task are outlined below.
Processing of hyperspectral signature. Remotely sensed data is subject to various factors, such as sensor-related random errors, atmospheric and surface effects, and acquisition conditions. Therefore, radiometric correction is performed to obtain accurate data from the Earth's surface. Although the literature in this field covers numerous topics, it primarily focuses on satellite imaging. While some of the techniques studied can be applied to UAV imaging, other topics are irrelevant to our case study. For instance, atmospheric effects like absorption are not significant in close-range work. However, due to low flight altitudes, UAV instability, and varying viewing angles, pre-processing operations can be challenging (Jakob et al., 2017).
Most studies that involve the classification of satellite images use standard datasets with radiometric corrections, provided by the Grupo de Inteligencia Computacional (GIC) (M. Graña et al.). In the case of UAV hyperspectral imaging, various corrections are necessary to obtain precise data, including geometric and radiometric corrections and spectral calibrations (Adão et al., 2017). Geometric distortions are primarily caused by UAV instability and the acquisition technique, with push-broom sensors showing higher geometric distortions that can be reduced using stabilizers. Geometric correction can be achieved through an Inertial Navigation System (INS), Global Positioning System (GPS), and Digital Elevation Model (DEM). Although commercial software is available for this approach, it requires a high-precision system for accurate correction. Alternatively, Ground Control Points (GCPs) have been extensively utilized to ensure correct positioning (Ramirez-Paredes et al., 2016). In addition, dual acquisition of visible and hyperspectral imagery enables matching both data sources (Jurado et al., 2021; Xue et al., 2021a; Ramirez-Paredes et al., 2016), with visible data being more geometrically accurate. Another technique that has been shown to assist with geometric correction is feature matching among overlapping images (Akhoundi Khezrabad et al., 2022).
In a similar way to geometric distortions, radiometric anomalies can also be fixed with software tools provided by the hyperspectral manufacturer. The aim is to convert the Digital Numbers (DN) of the sensor to radiance and reflectance of the Earth's surfaces, regardless of acquisition conditions. The latter, reflectance, is the representation Deep Learning techniques should consume so that they can be applied over any hyperspectral dataset. The coefficients required for this correction are generally calibrated in the laboratory, but they may vary over time (Adão et al., 2017), which may affect the radiometric correction. Grayscale tarps, whose reflectance is known, can be used to support this process and perform linear interpolations to calibrate the acquired DN (Lucieer et al., 2014) using the empirical line method (ELM) (Aasen et al., 2018; Sousa et al., 2022). To perform the linear interpolation for the radiometric correction, it is necessary to have dark and grey/white references, which are usually obtained from isotropic materials that have a grayscale palette and exhibit near-Lambertian behaviour (Jakob et al., 2017; Sagan et al., 2022; Duan et al., 2013). An alternative approach is to acquire radiance samples, which can be used with fitting methods such as the least-squares method (LSM) to transform DNs. However, factors such as the vignetting effect, which causes a drop in intensity, can result in a different radiometric response in pixels of the same bands (Yang et al., 2017).
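The empirical line method described above amounts to a per-band linear fit between panel DNs and their known reflectances. A minimal sketch with NumPy follows; the function name and panel layout are hypothetical, not from the cited works:

```python
import numpy as np

def empirical_line_correction(dn, panel_dn, panel_reflectance):
    """Fit a per-band linear model (gain, offset) mapping sensor digital
    numbers (DN) to reflectance from reference panels of known reflectance,
    then apply it to a DN array. Shapes: dn (..., bands),
    panel_dn (n_panels, bands), panel_reflectance (n_panels,)."""
    bands = panel_dn.shape[1]
    gain = np.empty(bands)
    offset = np.empty(bands)
    for b in range(bands):
        # Least-squares line through (DN, reflectance) pairs for this band
        gain[b], offset[b] = np.polyfit(panel_dn[:, b], panel_reflectance, 1)
    return dn * gain + offset
```

With two or more grayscale panels per band, the fit reduces to the linear interpolation mentioned above; more panels make the line robust to measurement noise.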
There have been studies on atmospheric corrections for UAV flights, which involve correlating the at-sensor radiance to the surface hemispherical-directional reflectance function (HDRF) measured in the laboratory beforehand. Grayscale palettes in Lambertian materials can be used for this purpose, as documented in previous works (Lucieer et al., 2014). While physically-based methods have also been explored, they tend to be more time-consuming. In addition, public repositories such as the Aviris data portal (California Institute of Technology) offer atmospherically corrected datasets. Other widespread hyperspectral processing software, such as ENVI and Headwall SpectralView, include semi-automatic correction algorithms (Queally et al., 2022; Jia et al., 2020; Sagan et al., 2022).
Traditional hyperspectral classification. Deep Learning methods have recently become the preferred approach for classifying hyperspectral imagery. However, earlier techniques relied on comparing the acquired data to reference reflectance shapes that were ideally measured in a laboratory. The primary objective of these methods was to measure the similarity between labelled and unlabelled spectral shapes. Spectral libraries, containing data measured from a spectrometer, were used for this purpose. For instance, Kokaly et al. (2017), Dutta et al. (2017), and Matusik et al. (2003) provided spectral libraries for minerals, trees, and daily surfaces, respectively. These methods varied from the widely used Euclidean distance to more sophisticated techniques such as Spectral Angle Matching (SAM), Cross-Correlogram Spectral Matching (CCSM), and probabilistic approaches like Spectral Information Divergence (SID) (Pu, 2017). Among these techniques, SID and CCSM have been found to perform better in mineral classification from Aviris data. Similarly, van der Meer (2006) proved that SAM introduced much more confusion than SID and Spectral Correlation Matching (SCM) in a case study of mineral classification. SAM has the advantage of being invariant to different scales, making it useful for heterogeneous acquisition devices and conditions. Other techniques derived from SAM include Spectral Correlation Angle (SCA), based on the Pearson correlation coefficient, and Spectral Gradient Angle (SGA) (Ren et al., 2022). In addition to similarity, angular and probabilistic measures, the literature describes error and colourimetric methods. The former group includes the widely used Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Relative Absolute Error (MRAE), Back-Projection MRAE (BPMRAE) and the Peak Signal-to-Noise Ratio (PSNR) (Agarla et al., 2021). More recently, Kumar et al. (2021) introduced three new metrics (Dice Spectral Similarity Coefficient (DSSC), Kumar-Johnson Spectral Similarity Coefficient (KJSSC), and a hybrid of the previous two, KJDSSC_tan) that outperformed traditional techniques on mineral and vegetation classification.
Colourimetric measures involve measuring distance in various colour spaces, such as proLab. Agarla et al. (2021) have compared all these techniques to assess their correlation and determine the most significant ones. Additionally, spectral derivatives, which are finitely approximated considering the spectral signatures and wavelength distance, are used to remove or compress illumination variations resulting from acquisition conditions (Fernandes et al., 2019; Pu, 2017). These techniques are still applied when the reflectance profile of different materials exhibits notable variations. Furthermore, semi-automatic classification methods have been reviewed for situations where classification involves only a few labels. For instance, Ahmed et al. (2021) used Principal Component Analysis (PCA) to extract features from multispectral imagery and proposed a vegetation index to label different trees. Pádua et al. (2020) described similar work on classifying chestnuts with phytosanitary problems using two new indices, ExNIR and ExRE, where Ex refers to Excess. However, these techniques are not suitable for differentiating a significant number of vineyard varieties, since they all exhibit similar shapes. We refer the reader to Shanmugam and SrinivasaPerumal (2014) for an in-depth review of spectral matching and of the attributes conditioning the construction of spectral libraries.
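As an illustration of two of the most cited measures above, SAM and SID can be computed directly from a pair of spectra. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def sam(a, b):
    """Spectral Angle Mapper: angle (radians) between two spectra.
    Scale-invariant, so it tolerates brightness differences."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def sid(a, b, eps=1e-12):
    """Spectral Information Divergence: symmetric KL divergence between
    spectra normalized to probability distributions."""
    p = a / (a.sum() + eps)
    q = b / (b.sum() + eps)
    return (np.sum(p * np.log((p + eps) / (q + eps)))
            + np.sum(q * np.log((q + eps) / (p + eps))))
```

Note that `sam(a, 3 * a)` is zero, which is the scale invariance the text attributes to SAM.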
Hyperspectral transformation and feature extraction. In this section, we discuss the transformations that facilitate classification using traditional methods. Due to the extensive coverage of land by satellite imagery, it is uncommon for hyperspectral pixels to depict the spectral signature of a single material. Therefore, there is a prevalent topic in the hyperspectral literature, which involves breaking down the acquired Earth's surfaces by analyzing the hyperspectral images. The problem is illustrated with y = MF + n, where M holds the spectral signatures of the different materials, F is the weight vector, n is an additive noise vector and y is a b × 1 vector, where b is the number of bands. Hence, the difficulty of finding a solution for F and n is lowered if M is fixed, i.e., the end-member signatures are known. The Multiple end-member spectral mixture analysis (MESMA) was the initial approach taken, followed by the Mixture-Tuned Matching Filtering Technique (MTMF), which eliminates the need to know end-members in advance. This approach was further refined with the Constrained Energy Minimization (CEM) method, which effectively suppresses undesired background signatures.
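When the end-member signatures are known, the abundance weights of the linear mixing model can be estimated per pixel with non-negative least squares. A sketch using SciPy (a fully constrained variant would additionally force the weights to sum to one):

```python
import numpy as np
from scipy.optimize import nnls

def unmix(y, M):
    """Estimate non-negative weights F in y = M F + n, given a known
    end-member matrix M of shape (bands, n_materials) and a measured
    spectrum y of shape (bands,)."""
    F, _residual = nnls(M, y)
    return F
```

The residual returned by `nnls` estimates the noise term and can flag pixels whose materials are missing from the library.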
The current state-of-the-art techniques for Linear Mixture Models (LMM) can be categorized based on their dependency on libraries. Additionally, the level of supervision and computational cost also determine the classification of these techniques. A taxonomy of methods is described by Borsoi et al. (2021) and Bhatt and Joshi (2020). Non-negative matrix factorization (NMF) has also attracted attention, as it can extract sparse and interpretable features (Hruška et al., 2018). Recently, the incorporation of spatial information into hyperspectral unmixing has been investigated (Shi and Wang, 2014). This involves considering the surrounding pixels using kernels of varying sizes and shapes, such as squared, cross, or adaptive. Weights can also be assigned based on the distance to the centre and the measured similarity using functions like SID, SAM, Euclidean distance, etc. Current state-of-the-art methods, such as NMF, have been combined with spectral information (Zhang et al., 2022a).
Besides discerning materials, the results of Hyperspectral Imaging (HSI) present a large number of layers that can be either narrowed or transformed, as many of them present a high correlation. Otherwise, the large dimensionality of HSI data leads neural networks and other classification algorithms to be hugely complex. Accordingly, the most frequent projection method is PCA (Jiang et al., 2022; Shenming et al., 2022; Lu et al., 2022), which projects an HSI cube X of size h × w × b into Y = XW, where Y has a size of h × w × k, and W is a matrix of size b × k. In this formulation, k is the number of target features (Amigo and Benson, 2019). Independent Component Analysis (ICA) is a variation of PCA that not only decorrelates data but also identifies normalized basis vectors that are statistically independent (Pu, 2017). Linear Discriminant Analysis (LDA) is another commonly used technique, but it is primarily applied after PCA to increase inter-class and reduce intra-class distance (Shenming et al., 2022). In the literature, it is also referred to as Partial Least-Squares Discriminant Analysis (PLS-DA), mainly as a classifier rather than a feature selection method.
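The PCA projection just described can be sketched in a few lines: the cube is flattened to pixels × bands, mean-centred, and projected onto the leading eigenvectors of the band covariance matrix. Illustrative NumPy code (in practice a library implementation would normally be used):

```python
import numpy as np

def pca_cube(cube, k):
    """Project an (h, w, b) hyperspectral cube onto its k leading principal
    components, returning an (h, w, k) cube (Y = XW after mean-centring)."""
    h, w, b = cube.shape
    X = cube.reshape(h * w, b)
    X = X - X.mean(axis=0)
    # Eigendecomposition of the b x b band covariance matrix
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # b x k projection matrix
    return (X @ W).reshape(h, w, k)
```

By construction, the variance of the projected components is non-increasing, which is why keeping only the first k features retains most of the signal.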
Instead of projecting features into another space, these can be narrowed to the subset with maximum variance according to the classification labels of HSI samples. There are many techniques in this field, including the Successive Projection Algorithm (SPA), which reduces collinearity in the feature vector. The Competitive Adaptive Reweighted Sampling (CARS) method selects features with Monte-Carlo sampling and iteratively removes those with small absolute regression coefficients. Two-Dimensional Correlation Spectroscopy (2DCS) aims to characterize the similarity of variance in reflectance intensity. Liu et al. (2019) used the Ruck sensitivity analysis to discard bands with a value below a certain threshold. Agilandeeswari et al. (2022) calculated the band entropy, vegetation index, and water index for wavelength subsets, generating a narrower cube only with bands above three different thresholds. Finally, the work of Santos-Rufo et al. (2020) presents an in-depth evaluation of methods based on Partial Least Squares (PLS) regression. To this end, HSI data from olive orchards were first narrowed and then classified with Linear Discriminant Analysis (LDA) and K-Nearest Neighbours (KNN). In conclusion, the Lasso method (Friedman et al., 2010) as well as Genetic algorithms (Mehmood et al., 2012) showed the best performance with LDA.
Remote sensing data typically contains inherent noise, which means it is rarely used as-is. To address this issue, Gutiérrez et al. (2018) used a combination of Standard Normal Variate (SNV) and de-trending to remove the scatter effect. Then, they smoothed the hyperspectral signature by applying Savitzky-Golay filtering with different step sizes over the first and second derivatives.
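A preprocessing chain of this kind, SNV followed by a Savitzky-Golay derivative, can be sketched as follows; the window length and polynomial order are illustrative, not those of the cited work:

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard Normal Variate: per-spectrum mean removal and unit scaling,
    which suppresses multiplicative scatter effects. spectra: (n, bands)."""
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, keepdims=True)
    return (spectra - mu) / sd

def smooth_derivative(spectra, window=11, poly=2, deriv=1):
    """Savitzky-Golay smoothing of the first (or second) spectral derivative,
    applied along the band axis."""
    return savgol_filter(spectra, window_length=window, polyorder=poly,
                         deriv=deriv, axis=1)
```

The derivative step doubles as the illumination-compression technique mentioned in the previous section, since additive offsets vanish under differentiation.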
Classification of hyperspectral imaging with ML and DL. This section focuses on reviewing studies related to the classification of vineyard varieties using HSI. Although there are numerous studies on HSI classification, only those relevant to vineyard varieties will be discussed. In addition, state-of-the-art DL networks achieving high accuracy in HSI classification will also be briefly reviewed.
Although there exists considerable research on segmentation, only a few studies have addressed the classification of vineyard varieties using RGB and multispectral imagery. In these studies, binary masks or grayscale maps were first extracted to distinguish soil, shadows, and vineyards. Clustering, line detection, or machine learning (ML) algorithms and artificial neural networks (ANNs) were then applied to segment vineyard rows (Fuentes-Peñailillo et al., 2018; Karatzinis et al., 2020; Hajjar et al., 2021; Pádua et al., 2020, 2022; Poblete-Echeverría et al., 2017). Geometrical information from depth maps, digital elevation models (DEMs), LiDAR data, and photogrammetric reconstructions was also assessed (Kerkech et al., 2020a; Aguiar et al., 2022; Jurado et al., 2020). DL approaches for semantic segmentation and skeletonization algorithms have also been discussed (Kerkech et al., 2020b; Barros et al., 2022; Nolan et al., 2015). Further insight into this field is provided in Li et al. (2020).
The classification of different vineyard varieties has been previously achieved with traditional methods and proximal hyperspectral sensing. In the work of Gutiérrez et al. (2018), samples were selected by manually averaging the variety signature and filtering those with high correlation to such a signature. Support Vector Machine (SVM) and Multilayer Perceptron (MLP) were then trained with k-fold cross-validation to distinguish thirty varieties (80 samples for each one), with the latter obtaining a recall close to one. Kicherer et al. (2017) also presented a land phenotyping platform that segments grapes from the depth map and discerns between sprayed and non-sprayed leaves. To this end, several learning models were tested: LDA, Partial Least Squares (PLS), Radial Basis Function (RBF), MLP and soft-max output layer (PNET), with RBF and PLS showing the best results. Besides phenotyping, other work is aimed at detecting diseases (Nguyen et al., 2021; Bendel et al., 2020b,a) and plagues (Mendes et al., 2022). However, these applications formulate a binary problem where signatures of distinct classes are significantly different regarding scale and shape (Bendel et al., 2020b). Despite this, previous learning models are also implemented (MLP, RBF, PLS and LDA) and almost achieved perfect discrimination performance (Bendel et al., 2020a). Nguyen et al. (2021) conducted a study similar to ours, where they attempted to differentiate healthy and infected leaves with a comparable spectral signature. However, their data was obtained from land, and they used the flattened layer of 2D and 3D convolutional networks as input for Random Forest (RF) and SVM algorithms. They found that combining PCA reduction (50 features) and RF resulted in the best performance (97%), and RF improved SVM classification regardless of data reduction. Additionally, ML and DL techniques have been extensively applied to various crops, such as maize, sugarcane, rice, and bread wheat, using both satellite and proximal imaging. Transfer learning, attention-based, and residual models are commonly used in the literature (Zhang et al., 2022b). A lightweight CNN composed of several inception blocks was also developed to classify up to 15 plant species (Liu et al., 2022). The authors compared their proposed CNN model to other models commonly used for classifying RGB images, including AlexNet, VGGNet, and GoogLeNet. They found that the best results were achieved using a combination of six RGB and Near Infrared features, with an accuracy of 94.7%. The use of PCA with only six features achieved an accuracy of 88%. Nezami et al. (2020) also applied a 3D CNN to classify three tree species using both hyperspectral and visible imaging, as well as canopy height models as input, with an overall accuracy below 95%.
DL for the classification of satellite hyperspectral imaging is more mature than for UAV imaging. A standard dataset is available for this purpose, on which several experiments have been conducted. Among them, the top-performing models based on overall accuracy (OA) are discussed below. Zhong et al. (2020) both published an HSI dataset and proposed a simple CNN with a Conditional Random Field (CRF) to extract spatial relations among data, even with the presence of gaps. They obtained an OA of 98% and 94% over their own HSI dataset. Moraga and Duzgun (2022) presented an Inception-based model with parallel convolutional pipelines of increasing size, achieving near-perfect classification. Chakraborty and Trehan (2021) proposed the SpectralNet model, which combines wavelet decompositions with a traditional convolutional path (OA: 98.59%-100%). Roy et al. (2020) developed HybridSN, which includes both spectral-spatial and spatial feature learning using 3D and 2D convolutional layers (OA: 99.63%-100%). Roy et al. (2021) later introduced a network based on residual blocks and spectral-spatial attention modules with varying architecture (start, middle and ending ResBlock) (OA: 98.77%-99.9%). Lastly, the A-SOP network (Xue et al., 2021b) proposed a module composed of matrix-wise operations that output a second-order pooling from the attention weights, after extracting the first-order features (OA: 98.68%-100%).

Materials and methods
The structure of this section is as follows: firstly, a brief explanation of the study area and sensors is provided. Next, the challenges of classifying vine varieties are introduced through the collected data. Subsequently, UAV imagery is utilized to differentiate between phenotypes of white and red grapevine variants. To achieve this, a CNN architecture is proposed, which is evaluated against previously reviewed work, obtaining remarkable OA results.

Study area
The vineyards used as study areas in this work are situated in the Northern region of Portugal, specifically in Vila Real (Figure 1). Each vineyard plot is dedicated to either red or white grapevine variants, and each grapevine variety is cultivated in one or more contiguous rows. Data acquisition was conducted with two different UAV flights to collect HSI swaths.

Material
The DJI Matrice 600 Pro (M600) hexacopter equipped with a Headwall Nano-Hyperspec was used for the UAV flights. The Ronin-MX gimbal was employed to minimize geometric distortions in HSI acquisition. The lens has a focal length of 12 mm, covering a 21.1° field of view. The HSI swaths have 270 spectral bands and a width of 640 pixels, with the height depending on the flight plan. The spectral range goes from 400 nm to 1,000 nm, with a uniform sampling of 2.2 nm and a full width at half maximum of 6 nm. The UAV's location was captured at different timestamps using two positioning antennas, and angular data were recorded using an Inertial Measurement Unit (IMU). The flight was planned using Universal Ground Control Station at an altitude of 50 m with a 40% side overlap. The red and white varieties were surveyed with 8 and 5 swaths, respectively. Table 1 provides a summary of the number of samples concerning each grape variety.

Pre-processing of hyperspectral data
This section briefly describes the process of obtaining reflectance data from raw hyperspectral imagery, which is illustrated in Figure 2. The hyperspectral data were collected using a drone and processed using Headwall SpectralView™ software. Several swaths were captured for each study area, and a white sample was marked from the white area in a grayscale tarp, while a dark reference was obtained by collecting a hyperspectral sample with the lens cap on before the flight. The sensor exposure and frame period were adjusted before the flight by pointing at a bright reference to avoid clamping samples from white surfaces. The white and dark references were then used to convert the raw data to reflectance. The ortho-rectified swath in Figure 2 was obtained using high-resolution DEMs (25 m) from Copernicus' observation programme (European Environment Agency, 2017) and the drone's GPS and IMU data. However, non-ortho-rectified swaths were used for the analyses presented in this paper to work with smaller image sizes and avoid distorting the hyperspectral signatures.
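The white/dark-reference conversion above amounts to a per-band two-point normalization. A minimal sketch of the underlying arithmetic (SpectralView performs this internally; the function name is ours):

```python
import numpy as np

def dn_to_reflectance(raw, dark, white, white_reflectance=1.0):
    """Two-point conversion of raw DN to reflectance using a dark reference
    (lens cap on) and a white reference (tarp sample) measured per band.
    raw: (..., bands); dark, white: (bands,)."""
    return white_reflectance * (raw - dark) / (white - dark)
```

If the tarp panel is not perfectly reflective, `white_reflectance` holds its calibrated value so that the output stays in absolute reflectance units.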

Transformation of hyperspectral data
The analysis of the corrected reflectance data involved several steps to observe and differentiate the spectral signatures from different varieties. Initially, PCA was employed to assess the clustering of each variety using the first two principal components. The distribution of the transformed signatures can be visualized in Figure 3. To determine the most suitable feature transformation algorithm for the collected HSI data, we evaluated four algorithms: NMF, PCA, FA, and LSA. These were selected since they do not require the sample labels for their execution and can therefore work over not yet observed data. Additionally, LDA was included in these tests, despite requiring the labels. Note that the starting data are composed of 140 bands after discarding the first and last layers, which are typically noisier. For each algorithm and varying numbers of features, from 5 to 95, two tests were performed.
First, the Distance-based Separability Index (DSI) was computed using the transformed manifold (Guan and Loew, 2020). Subsequently, an SVM model was trained to predict the labels of the samples. Through this evaluation, it was determined that FA outperformed the other algorithms in terms of both metrics, especially with more than thirty-five features. The results of these experiments are summarized in Figure 5, which supports the use of FA with forty features in the following sections. Furthermore, feature transformation algorithms, such as FA, help to perform material unmixing to make the processing more robust against different background surfaces, including soil, low vegetation and other man-made structures. Pixels from UAV-based hyperspectral swaths depict more than one material. Therefore, the classification of these data ought to work under different kinds of surfaces. As proposed in recent work, the material unmixing could be performed with NMF. To this end, hyperspectral swaths in reflectance units could be flattened (n ← h ⋅ w) into a matrix X of dimensionality n × b, where b is the number of bands. Then, this flattened matrix could be decomposed into weight (n × k) and component (k × b) matrices, where k is the number of target surfaces (end-members). However, the number of materials visible in a single image (or vineyard varieties) is not known in nature. Therefore, material unmixing in nature is rather suited for the classification of significantly different signatures than for performing fine-grained classification. On the other hand, the feature space can be transformed and narrowed to a few more representative features, instead of unmixing materials. In this regard, FA is also aimed at decomposing X into W × H + Θ without the non-negativity restriction, with Θ being the measurement error (Bandalos, 2018). For this reason, FA is proven to be highly suitable for our case study, beyond outperforming the rest of the feature transformation methods in terms of separability and classification accuracy using an SVM model.
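The evaluation protocol behind these results (fit the transform on training spectra only, then score an SVM on the transformed test split) can be sketched with scikit-learn, assuming it is available; the function name and hyper-parameters are illustrative:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_fa_svm(X_train, y_train, X_test, y_test, n_features=40):
    """Fit FA and standardization on the training spectra only, transform
    both splits with the fitted models, and score an SVM on the test split.
    X_*: (n_samples, bands), y_*: (n_samples,)."""
    fa = FactorAnalysis(n_components=n_features).fit(X_train)
    Z_train, Z_test = fa.transform(X_train), fa.transform(X_test)
    scaler = StandardScaler().fit(Z_train)
    svm = SVC(kernel="rbf").fit(scaler.transform(Z_train), y_train)
    return svm.score(scaler.transform(Z_test), y_test)
```

Fitting only on the training split avoids leaking test statistics into the transform, which matters when comparing transformation algorithms fairly.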

Automated training
The classification of vineyard varieties with UAV data can hardly be approached with 1D algorithms due to the high similarity of spectral signatures. This shortcoming was already observed in Figure 5, where SVM did not perform well for any number of features (OA always below 50%). In this section, a method based on deep learning is described to classify 3D hyperspectral patches. Throughout this section, the proposed method is tuned to achieve a high generalization performance.

Dataset
Once radiometrically corrected, hyperspectral swaths were manually labelled as depicted in Figure 7 to distinguish different vineyard varieties. The Normalized Difference Vegetation Index (NDVI) was first extracted to differentiate vegetation from the ground, and images were then thresholded to create a binary mask from each swath. Next, these binary masks were annotated with Sensarea (Bertolino, 2012) by marking each row with a different polygon and colour, according to the variety. Some rows were marked with more than one polygon, in order to avoid annotating small vegetation clusters that do not belong to vineyards but to other low vegetation. For this reason, different polygons were labelled with the same colours, also because some varieties were repeated in several rows. Accordingly, Table 1 shows the number of collected samples for each variety and the number of cultivated rows.
Figure 7: Workflow for manually labelling HSI swaths. First, the false RGB image is displayed. Then, the NDVI is extracted, followed by thresholding and marking with polygons using the Sensarea software. Finally, a Boolean operation, ∧, is performed between the polygon and binary masks to obtain the final labelled regions.
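The NDVI thresholding and mask intersection in this workflow can be sketched as follows; the band indices and threshold are illustrative and depend on the sensor configuration:

```python
import numpy as np

def vegetation_mask(cube, red_band, nir_band, threshold=0.4):
    """Binary vegetation mask from a reflectance cube (h, w, b) via
    NDVI = (NIR - Red) / (NIR + Red), followed by thresholding."""
    red = cube[:, :, red_band].astype(float)
    nir = cube[:, :, nir_band].astype(float)
    ndvi = (nir - red) / (nir + red + 1e-12)   # epsilon avoids division by zero
    return ndvi > threshold

# The final labelled region is the AND (the Boolean operation in Figure 7)
# of this mask and each annotated polygon mask:
#   labelled = vegetation_mask(cube, r, n) & polygon_mask
```

Vegetation reflects strongly in the near infrared and weakly in red, so vine pixels yield NDVI well above the threshold, whereas soil pixels fall near zero.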
HSI cubes and masks were then split into 3D patches whose size is a matter of discussion in Experimentation and analysis. Dividing the hyperspectral swaths into patches for classifying pixels using their neighbourhood helps to partially suppress noise. Individual pixels are not substantial enough by themselves; instead, aggregations learned by kernels help to mitigate the noise. The window size used in this study is 23 × 23 by default, whereas previous work has used patches whose x and y dimensions range from 7 (Roy et al., 2021) to 64 (Chakraborty and Trehan, 2021). Only Liu et al. (2022) reported patches of much larger dimensionality (200 × 200) for RGB images. The larger the patch, the deeper the network can be, though it also increases the number of trainable weights, the training time and the amount of data to be transferred into/from the GPU. Configurations using larger patch sizes are more suited to images with notable spatial features, such as close-range RGB imagery, whereas ours ought to be primarily discerned through spectral features.
Instead of inputting the label of every patch's pixel, patches were reduced to a single label corresponding to the centre of an odd-sized patch. Thus, the classification was performed per pixel rather than as an overall semantic segmentation. Based on this, the hyperspectral samples were processed using the following steps: 1) separating the training and test samples at the outset, 2) fitting the FA and standardization only to the training samples to emulate a real-world application, and 3) transforming both the training and test samples using the fitted models (see Figure 6). Standardization is utilized to eliminate the mean and scale reflectance to unit variance. By employing this approach, the CNN restricts the range of input HSI values, although the initial values are expected to differ due to various sensor exposures, frame periods, and environmental conditions across different flights. Regarding feature reduction, spectral bands were transformed and narrowed to k ← 40 features with FA. None of the fitted models requires the pixels' labels, and thus they are very convenient for their application in new and unlabelled areas. The resulting dataset is composed of 542,167 and 1,115,960 patches for red and white varieties, which were split into training (68%), validation (12%) and test (20%) subsets. With this partitioning, a total of 368k samples are used for training on red varieties, whereas 758k are applied to white variety classification.
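The extraction of odd-sized patches labelled by their centre pixel can be sketched as follows (a naive loop kept for clarity; unlabelled centres, encoded here as 0, are skipped):

```python
import numpy as np

def extract_patches(cube, labels, size=23):
    """Split an (h, w, k) cube into odd-sized 3D patches, each labelled by
    its centre pixel; centres with label 0 (background) are skipped."""
    assert size % 2 == 1, "patch size must be odd so the centre is defined"
    r = size // 2
    patches, targets = [], []
    h, w, _ = cube.shape
    for i in range(r, h - r):
        for j in range(r, w - r):
            if labels[i, j] == 0:
                continue
            patches.append(cube[i - r:i + r + 1, j - r:j + r + 1, :])
            targets.append(labels[i, j])
    return np.stack(patches), np.array(targets)
```

In practice, adjacent patches overlap heavily, so a production pipeline would use strided views rather than copies to keep memory manageable for the ~1.6M patches reported above.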

Implementation
In order to make this paper self-contained, a brief introduction to DL is given in this section. Deep Learning refers to layered representations that evolve in a learning process as they are exposed to input data. Typically, the depth of these models is large enough to automatically transform data and learn meaningful representations of it. Although these models can work over any kind of structured data, even 1D, it is here proposed that a pixel's neighbourhood may help with the phenotyping problem. Convolutional Neural Networks (CNN) have achieved remarkably good results in the computer vision field. Convolutions are designed to learn local features, rather than global ones, by applying element-wise transformations while sliding over input data. Filters are defined as rank-3 tensors given by width, height and depth: the width and height determine how large the neighbourhood of every element is, whereas the depth is the number of different learned filters. Hence, a single filter is applied element-wise to compose a response map from input data, whereas the whole filter stack is known as a feature map. If several convolution operations are concatenated, they evolve from learning low-level details (e.g., edges) to high-level concepts. Since individual filters are applied element-wise, the learned patterns are invariant to the position within the image (Chollet, 2021). However, kernels need not be applied at every element; instead, information can be compressed using steps greater than one, also known as the stride value. Another key concept in CNNs is that not every node is connected, thus partially tackling the overfitting problem. Training and test errors ought to remain similar during training; diverging errors indicate that the network is either memorizing the training data (overfitting) or generalizing excessively (underfitting). To avoid both situations, the capacity of the model must be tuned in terms of complexity so that it generalizes while reaching a low training error.
Trainable CNN layers are typically defined by a matrix of weights W and biases b applied over input data, f(x; W, b), with f being an activation function that allows solving non-linear problems. In this work, ReLU and Leaky ReLU have been applied to tackle the vanishing gradient problem, together with batch normalization; the latter operates similarly to the standardizer applied during preprocessing. W and b are updated during training with the objective of minimizing a loss function that compares ground truth and predicted values for supervised classification. This is done by an optimizer that changes these trainable parameters using the error gradient scaled by the learning rate, η. The greater η, the faster convergence is achieved, though it can also lead to significant oscillations. This process is known as Gradient Descent (GD) (Kattenborn et al., 2021). As faster convergence may only be desirable at the beginning, the decay and momentum concepts were introduced to downscale η during training, thereby avoiding abrupt changes.
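The interplay of learning rate, decay and momentum can be illustrated with a minimal gradient-descent loop on a toy quadratic objective. This is a didactic sketch, not the optimizer used in the paper:

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, decay=0.01, steps=300):
    """Minimal gradient descent with momentum and learning-rate decay."""
    velocity = np.zeros_like(w)
    for step in range(steps):
        step_lr = lr / (1.0 + decay * step)          # downscale eta over time
        velocity = momentum * velocity - step_lr * grad_fn(w)
        w = w + velocity
    return w

# Minimize f(w) = ||w - 3||^2, whose gradient is 2 (w - 3)
w_opt = sgd_momentum(lambda w: 2 * (w - 3.0), np.zeros(4))
print(np.round(w_opt, 3))  # [3. 3. 3. 3.]
```

A larger `lr` would make the early steps bigger (faster initial convergence) but amplify the oscillations that momentum introduces, which is exactly the trade-off the decay term mitigates.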
Besides convolutions and normalization, there exist other layers to narrow data, avoid overfitting and output probabilistic values. The pooling operations, with max and average being the most popular, are aimed at downsampling input data. Dropout layers introduce some noise into the training by zeroing out some output values, thus getting rid of happenstance patterns. Weight regularization also seeks to make the model simpler by forcing the weights to be small. Finally, the output units of the model transform features to complete the classification task: for a multi-class problem, Softmax represents the probability distribution of a variable with C possible values.
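Two of the layers named above, max pooling and Softmax, are simple enough to express directly in NumPy. A minimal sketch:

```python
import numpy as np

def softmax(z):
    """Map logits to a probability distribution over C classes."""
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling over an (H, W) response map."""
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())                     # 1.0: a valid distribution
pooled = max_pool2d(np.arange(16.0).reshape(4, 4))
print(pooled)                          # [[ 5.  7.] [13. 15.]]
```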
The kind of problem and label representation is coupled with the cross-entropy function measuring the error. Sample labels were not one-hot encoded in order to reduce the storage footprint, and therefore, a sparse categorical cross-entropy as defined in Equation 1 is used for training in this multi-class problem. Otherwise, one-hot encoding requires transforming labels into binary vectors of size C that activate the indices of the sample label(s), with C being the number of unique labels.
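For a single sample with integer label y and predicted probability vector ŷ, the sparse categorical cross-entropy of Equation 1 takes the standard form (written here to be consistent with the definitions that follow):

```latex
\mathcal{L}(y, \hat{y}) = -\log \hat{y}_{y}, \qquad \hat{y} \in [0, 1]^{C}, \; \sum_{i=1}^{C} \hat{y}_{i} = 1
```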
where ŷ is the model's output as a vector of size C, with ŷᵢ indicating the probability of the sample belonging to the i-th class, and y is the ground truth given as an integer value.
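A batch-averaged version of this loss can be sketched in NumPy (Keras's `SparseCategoricalCrossentropy` computes the same quantity from logits or probabilities):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_pred):
    """y_true: integer labels (N,); y_pred: class probabilities (N, C)."""
    eps = 1e-12                                   # avoid log(0)
    picked = y_pred[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked + eps))

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = sparse_categorical_crossentropy(np.array([0, 1]), y_pred)
print(round(loss, 4))  # 0.2899, i.e. -(log 0.7 + log 0.8) / 2
```

Because only the probability at the true index is read, labels can stay as integers; no C-sized binary vector is ever materialized, which is the storage saving mentioned above.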

Architecture and training
Several architectures were checked over both datasets, transitioning from networks with a few layers to the network proposed in Figure 8. Hyper-parameter tuning was also used to define the best values for dropout, activation and convolutional layers, including the number of filters, the percentage of zeroed weights, or the gradient of the Leaky ReLU activation. Similarly, the final activation was also checked, despite Softmax being usual for multi-class output.
The input of the network is a single patch of size 23 × 23 × 40. The spatial attention layer proposed by Xue et al. (2021b) is appended as an additional layer that is shown to help with the classification. The original paper proposed first and second-order pooling before and after the spatial attention module to transform the original features, as we did with FA. Attention-based kernels have recently been used for HSI classification to provide more discriminative spectral-spatial features from input data. Xue et al. (2021b) applied an attention-based kernel as a link between the first data transformation, coined as a first-order feature operator, and a second-order kernel. Roy et al. (2021) used an attention-based kernel in a similar fashion to Xue et al. (2021b), though theirs was replicated throughout the network pipeline. It was also part of a residual network where the input is simultaneously processed through 1) an attention kernel and 2) convolutional layers; the outputs of both pipelines are then concatenated at the end of each network block.
In this work, the attention-based kernel of Xue et al. (2021b) is applied over the input data and the result is concatenated with the original input. The first operation in the attention-based thread is to normalize the data, from which the kernel will learn the weights. Such a kernel is a correlation matrix that learns the cosine distance between the central pixel and its neighbours. Then, learned weights are normalized through a softmax function, which is shown to provide better convergence. With this in mind, the attention-based kernel can be formulated as in Equation 5, where ‖·‖₂ refers to L2 normalization, F ∈ ℝ^(N²×B) is a 3D patch resized from ℝ^(N×N×B), F_central and A are intermediate states, and F_SA is the result that is later concatenated with the original input. Then, two similar blocks are included as Part II and Part III in Figure 8. Both share the same structure: Inception block, normalization, activation and dropout. This is a frequent follow-up to convolutional layers (Li and Zhang, 2022; Xue et al., 2021b), with dropout being greater (0.4) for middle layers than for the last and initial layers (0.2). Instead of using max-pooling to downsample the network, strides of size 2 in convolutional layers were observed to perform better. The network specifications are shown in Table 2.
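The mechanism described above can be sketched in NumPy. This is a deliberately simplified illustration of the idea (L2-normalize, cosine similarity between the central pixel and its neighbours, softmax weighting), not the exact A-SPN implementation of Xue et al. (2021b):

```python
import numpy as np

def spatial_attention(patch):
    """Simplified spatial attention over an (N, N, B) patch: weight every
    neighbour by its softmaxed cosine similarity to the central pixel."""
    N, _, B = patch.shape
    flat = patch.reshape(N * N, B)                     # resize to (N^2, B)
    norm = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    centre = norm[(N * N) // 2]                        # central pixel (odd N)
    scores = norm @ centre                             # cosine similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax normalization
    return (weights[:, None] * flat).reshape(N, N, B)

patch = np.random.rand(23, 23, 40)
out = spatial_attention(patch)
print(out.shape)  # (23, 23, 40): same shape, ready for concatenation
```

Because the output keeps the input shape, it can be concatenated with the original patch along the band axis, as done in Part I of the proposed network.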
The Inception block was first proposed by Szegedy et al. (2014). The first proposal consisted of a module with four parallel layers that are later concatenated: convolutional layers with different kernel sizes (1 for spectral features, and 3 and 5 to obtain aggregations from surrounding pixels), and a max-pooling layer that works directly over the input data. Accordingly, a response map with a large number of filters is obtained, and the importance of each of them is determined by the following layers, which will again downsample the data. However, this layout was considered prohibitive at the time if the input layer had a large number of filters, especially for kernels of larger size. Therefore, 1 × 1 convolutions aimed at reducing the data were attached before each of the Inception threads (max-pool and convolutions with kernel size greater than 1).
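The naïve variant can be reproduced with plain NumPy to make the channel arithmetic explicit. This is a shape-level sketch with random kernels (no training), using a small toy input rather than the paper's 23 × 23 × 40 patches:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D convolution: x (H, W, Cin), k (kh, kw, Cin, Cout)."""
    kh, kw, _, cout = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    H, W, _ = x.shape
    out = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + kh, j:j + kw], k, axes=3)
    return out

def inception_naive(x, c1=4, c3=8, c5=2, seed=0):
    """Naive Inception block (Szegedy et al., 2014): parallel 1x1, 3x3 and
    5x5 convolutions plus a 3x3 max-pool, concatenated channel-wise."""
    rng = np.random.default_rng(seed)
    cin = x.shape[2]
    b1 = conv2d_same(x, rng.standard_normal((1, 1, cin, c1)))
    b3 = conv2d_same(x, rng.standard_normal((3, 3, cin, c3)))
    b5 = conv2d_same(x, rng.standard_normal((5, 5, cin, c5)))
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    pool = np.array([[xp[i:i + 3, j:j + 3].max(axis=(0, 1))
                      for j in range(x.shape[1])]
                     for i in range(x.shape[0])])
    return np.concatenate([b1, b3, b5, pool], axis=-1)

x = np.random.rand(9, 9, 8)
out = inception_naive(x)
print(out.shape)  # (9, 9, 22): 4 + 8 + 2 conv filters + 8 pooled channels
```

The output channel count (c1 + c3 + c5 + Cin) grows with every stacked block, which is precisely why the second proposal prepends 1 × 1 reduction convolutions.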
In this work, both proposals were used: the naïve one is checked in the experimentation to increase the network's capacity, whereas the second is part of the proposed network. The latter compresses spatial data even more and is subsequently connected to the network output. Finally, the model is fitted with training data and its performance is assessed with validation samples. For supervised problems like ours, data is composed of both samples and ground truth. In this work, the training samples were split into several sets according to the hardware limitations, and the model was iteratively trained over a significant number of iterations (500). Besides mitigating storage limitations, this leave-p-out cross-validation also helps to generalize by not training over the complete dataset. Furthermore, each of these clusters is further split into small batches during a single iteration. The batch size must be large enough to include a balanced representation of samples; in this work, it was set to 2¹⁰. This phase can be terminated early if no improvements are observed during 20 epochs. A summary of the hyperparameters used in this study can be found in Table 3.

Data sampling and regularization
It can be observed from Figure 10 that the dataset is clearly imbalanced: vineyard rows differ in length, and so does the number of examples for each variety. Instead of generating new feasible batches to upsample, a subset was obtained with different techniques. The objective is not to equalize the number of samples for every variety but rather to make the distribution more balanced. Accordingly, the subsampling is performed by determining how many groups are downsampled; the larger this number, the more balanced the dataset becomes, at the expense of reducing the number of hyperspectral samples. In this regard, Figure 10 compares the original distribution observed in a training batch, the utilized downsampling technique, and a minority downsampling approach which leads to a huge decrease in usable patches. Besides balancing the dataset, which is split into several batches to make it fit in the GPU, the CNN is monitored with a callback that saves the current best model and prevents saving an overfitted one.
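The "downsample only the n largest groups" idea can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the n_groups largest classes are randomly reduced to the size of the next-largest class, leaving minority classes untouched.

```python
import numpy as np

def downsample_majorities(labels, n_groups=3, seed=0):
    """Randomly drop samples from the n_groups largest classes so that
    none of them exceeds the size of the next-largest remaining class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    order = np.argsort(counts)[::-1]              # class sizes, largest first
    cap = counts[order][n_groups]                 # size of the next class down
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        if n > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

labels = np.repeat([0, 1, 2, 3, 4], [500, 400, 300, 120, 80])
kept = downsample_majorities(labels, n_groups=3)
print(np.unique(labels[kept], return_counts=True)[1])  # [120 120 120 120  80]
```

Unlike full minority downsampling (capping every class at the rarest class size, 80 here), this keeps far more usable patches while still flattening the head of the distribution.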
Batches were probabilistically (p = 0.1) transformed by performing rotations and orientation flips so that learned features are invariant to the flight's positioning conditions. Every possible transformation, regardless of probability, is shown in Figure 9. With this approach, each batch of the training dataset was processed twice, each time with a different random seed and, therefore, differently transformed patches. Hence, the regularization was controlled by the proposed downsampling and transformation sequences. Several considerations were also taken into account during the CNN design: 1) the CNN must not have a large number of trainable parameters, to avoid overfitting, yet a sufficient number to cope with underfitting, and 2) dropout layers were included to randomly reset some output weights and thus lead to proper generalization.
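A minimal sketch of such probabilistic augmentation, assuming square (N, N, B) patches so that 90° rotations preserve the shape:

```python
import numpy as np

def augment(patch, p=0.1, rng=None):
    """With probability p, rotate (90/180/270 degrees) and/or flip an (N, N, B)
    patch so that learned features become invariant to flight orientation."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p:
        patch = np.rot90(patch, k=rng.integers(1, 4), axes=(0, 1))
    if rng.random() < p:
        patch = np.flip(patch, axis=rng.integers(0, 2))
    return patch

rng = np.random.default_rng(42)
patch = np.random.rand(23, 23, 40)
out = augment(patch, p=1.0, rng=rng)   # p=1 forces a transformation
print(out.shape)  # (23, 23, 40)
```

Only spatial axes are transformed; the band axis is never permuted, since the spectral ordering carries physical meaning.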

Experimentation and analysis
The effectiveness of the proposed network is evaluated in this section. To this end, the classification experiments are jointly performed over various hyperspectral swaths; finally, the F1-score measures the model's performance by leveraging both the precision and recall metrics. Several representative neural networks are compared with our method: LtCNN (Liu et al., 2022), JigsawHSI (Moraga and Duzgun, 2022), SpectralNET (Chakraborty and Trehan, 2021), HybridSN (Roy et al., 2020), Nezami et al. (2020) and A-SPN (Xue et al., 2021b). From these, only a few address airborne-sensing imagery (Liu et al., 2022), whereas the rest are focused on satellite data. Hence, several considerations must be addressed: 1) some of these manuscripts apply different transformations to the input data, and 2) the number of spectral bands also differs from our sensing tool. Therefore, the preprocessing pipeline was selected as the one providing better performance over our input data, either our pipeline or the one proposed in the reference work. However, FA showed a higher performance for every network if input data was transformed according to this fitting method, rather than the following: • Chakraborty and Trehan (2021) (SpectralNet) used FA with only 3 features.
Regarding implementation, all the tests were performed on a PC with an AMD Ryzen Threadripper 3970X 3.6 GHz, 256 GB RAM, an Nvidia RTX A6000 GPU and Windows 10. The proposed CNN, as well as the compared networks, were implemented with Keras (version 2.10.0) and TensorFlow (version 2.10.1) in Python. CUDA 11.8 and cuDNN 8.6 were installed to reduce the fitting time. Not every network from previous work could be applied as published; for example, LtCNN is designed for large image patches (200 × 200), and thus convolutional striding and max-pooling cannot be applied once patches reach a size of 1 × 1. For LtCNN (Liu et al., 2022), the kernel size and max pooling's strides were reduced as depicted in the Additional data accompanying this work.

Classification results
Table 4 shows the overall results of our method in comparison with state-of-the-art networks for classifying HSI datasets. Most of them are considerably unstable due to operating with noisy UAV data rather than satellite imagery. In addition, the second best performing network is that of Nezami et al. (2020), which is the only one previously checked against UAV datasets, for discerning different tree species. Similar to ours, it is also a shallow CNN with only a few layers; however, its convolutions are applied sequentially, rather than operating with stacked features extracted from various parallel convolutions. The confusion matrix in Figure 11 shows the OA of the proposed network for every grape variety. Classification over the majority of varieties showed uniform results, with most of them being close to 99%. Note that these percentages were rounded, and therefore, some of these results are below 99%, while others are above. When averaged, these results lead to an OA of ≈ 98.8%, as shown in Table 4.

Training time and network capacity
The benchmark on the classification problem is relevant to show that the training time is not excessive and that the proposed network does not have much more capacity than the problem warrants. Regarding capacity, Figure 13 shows that the training and test signatures are similar, and therefore do not show overfitting or underfitting behaviour. Furthermore, the training OA and loss worsen as a new dataset is introduced, whereas the validation metrics remain similar; thus, the network is not memorizing the input data. Along with this, the network is parameterized by only nearly 560k parameters, while other state-of-the-art models exceed ten million (see Figure 12). The number of parameters is derived from the proposed architectures, using an input of size 23 × 23 × 40. Note that the network of Liu et al. (2022) was decimated in our experimentation, with pooling operations of a lower size than proposed, to adapt it to smaller patches. Finally, the response time for training the proposed network is below an hour, whereas others require up to several hours. Note that every available sample was used during training, instead of using strides; otherwise, the training time could be reduced. Further insight into these results is provided by Table 4.

Separability
The output of the proposed network can be assessed in terms of separability by removing the final Dropout and Dense layers. Data is transformed and flattened according to the network's learnt weights. It is subsequently embedded with uMAP (McInnes et al., 2020) to compress high-dimensionality data into a few features, thus allowing us to visualize the new data representation. The same procedure can be followed over the original data to compare how the data manifold was uncrumpled. As shown in Figure 14, different labels were not perfectly unmixed, although the improvement in comparison to the starting representation is notable. To provide this result, the last densely connected layer was connected to a uMAP fitting with two components; hence, 2,592 features were narrowed down to two to represent the embedding in a two-dimensional chart.

Impact of window size
The patch size is one of the most relevant parameters, if not the most relevant, concerning the network architecture. Larger patches are assumed to also work, as irrelevant spatial features can be zeroed out; however, they come at the expense of increasing the training time. On the contrary, smaller patches come at the risk of not being sufficient for classifying samples as accurately as done by the proposed network. Figure 15 shows the whole battery of metrics obtained with patch sizes ranging from 9 to 31. According to the obtained results, patches have been split with dimensionality 23 to balance network capacity and accuracy, despite a higher patch size achieving slightly better results. Accordingly, the highest patch size reached an OA of 99.57%, whereas the lowest (size of 9) reached 82.5%; the selected dimensionality achieves an OA of 99.20%, thus balancing network size and capacity. On the other hand, Figure 16 depicts the training time and network size as the patch dimensions increase. The number of training splits was calculated according to the patch size, and therefore, the lowest size also had a lower number of subdivisions. This led to a considerable time bottleneck in patch-wise transformations, since they are performed on the Central Processing Unit (CPU). Therefore, it can be observed that the selected size is also intended to balance training time with capacity and accuracy.

Ablation study
The proposed network is validated in this section by removing and transforming some of the network features, while the rest remain unchanged. The proposed changes are the following: 1. Two convolutional layers were included before Part II to extract spectral and spatial features.
2. Both Inception blocks were modelled using the naïve version (Szegedy et al., 2014) (see Figure 17). The obtained results are shown in Table 5. Removing the SA layer led to a slight decrease in performance, similar to exchanging the first Inception block. Unsurprisingly, using the naïve version of the Inception layer twice led to a significant performance decrease for every metric, as it kept transforming the spectral dimensionality in a deep layer. The second Inception version also transforms spectral features, but rather provides them as an additional layer (concatenation) that can be weighted according to their contribution to the output. Following this reasoning, swapping the first Inception block with the first version of it did not involve a huge performance drop. Improvements of the proposed architecture over the c) variant were very small, which may suggest that using either of them does not greatly change performance. Similar results to this last setup were achieved by removing the spatial attention layer; it did not lead to a significant performance drop, though better and, especially, more stable results were obtained using it.

Analysis of errors
As observed in previous sections, our architecture achieved a high OA and AA. Still, there is a margin for improvement that must be tackled by finding the weaknesses in the overall labelling, transformation and classification pipeline. Instead of predicting randomly selected samples, another experiment is to predict every hyperspectral swath sample, thus allowing us to determine where errors take place within the study area. As observed in Figure 18, these errors are spatially clustered instead of being sparse over the study area. If compared against the RGB mosaic of the hypercubes, errors are observed to belong to 1) small vegetation clusters, mainly proceeding from small vegetation mistakenly labelled as vineyard, and 2) samples surrounded by ground or metallic vineyard supports. Note that these are hard to notice during labelling, since they present signatures similar to the target leaves and they are surrounded by vegetation, thus complicating the definition of a geometrical shape for rapidly tagging what is relevant or not. Still, some errors are present in grape samples surrounded by ground and other surfaces, since these have a notable impact on the sample's neighbourhood, thus distorting the final probability. Note that every boundary sample is partially surrounded by ground, and therefore, its signature should be expected to be a fusion of the signatures of the target variety and the ground. However, that fusion may not be relevant enough to tell varieties apart, thereby leading to more errors on samples heavily surrounded by ground.

Training over satellite imagery
The main shortcoming of the compared CNNs focused on satellite imagery is that they obtain poor performance over the proposed UAV datasets and vineyard varieties. Hyperspectral imagery from UAVs is noisier than satellite observations, with the spectral signature of the latter being more smoothed out. Therefore, previous work did not overcome noise in the classification of vineyards, and most of it showed poor performance. Only one other architecture tested over UAV samples managed to reach an OA near 80%, whereas others showing worse results proposed huge networks with millions of parameters, thus overcomplicating the classification with a large training time.
The aim of this section is to remark on how our network behaves over satellite hyperspectral imagery. Although the network was not designed for this purpose, several tests were launched over publicly available datasets frequently used for comparison. The number of labels in these data ranges from nine to sixteen, and the number of spectral bands also differs from our imaging device. However, FA was fitted to obtain only 40 features per pixel, as proposed for UAV imagery, and the architecture of the network was likewise kept unchanged. According to the number of samples in each dataset, the batch size was adapted as shown in Table 6. Every dataset had unlabelled samples, which were removed from the training and test datasets to establish a fair comparison with previous work. Unlike our UAV datasets, labels in satellite imagery were imbalanced, with some of them having only a few dozen examples. Hence, balancing was not applied to satellite datasets, to avoid levelling the rest of the classes down to others with scarce examples.
From the results, it can be observed that the proposed network can also be applied to this kind of imagery, despite it not being the network's main target. All of the datasets converged early to classification metrics over 99%, and despite not providing state-of-the-art results, these are very close to the current best-performing networks. As we did not intend to tune the network for satellite imagery, the learning rate remained as before, and the batch size was scaled according to the number of samples.

Training over fewer examples
Another conducted experiment was to train the proposed CNN with a lower amount of information. In this regard, the training was repeated to learn from a percentage of training samples ranging from 10% to 100% (of the 68% training split). In Figure 19, it can be observed that the OA drops to 92% with 10% of the training data, although the network is still able to learn relevant features and provide a high OA. It is hypothesized that, as the amount of training data increases, the number of learned spatial features is notably higher, whereas lower amounts of information are enough for learning the spectral features that enable classifying samples from their neighbourhood.

Transfer learning
Transfer learning has been widely studied to take advantage of trained networks with a notable capability of separating a variable number of classes. The underlying concept is that a network which has been successfully applied to one case study, and has learned relevant features, may be applied to another case study with a similar outcome. It does not necessarily involve training the whole network, which typically has a large number of parameters. Instead, some layers that learn more abstract features, presumably the first ones, are not trained; their weights are preserved, and deeper layers are trained to specialize in another application.
The objective of this section is to carry out several experiments to conclude whether weights learned over the classification of other UAV datasets can be exploited to make the training faster, and even more accurate. This experiment was approached by using the publicly available WHU-Hi HSI datasets for classifying rural materials, including different vegetation crops (Zhong et al., 2020). These datasets are primarily designed for semantic segmentation applications, although the outlined transformation procedure can also be applied to them. It is important to note that these datasets have a lower level of detail (LOD) compared to ours, resulting in ground-truth masks that bear a closer resemblance to satellite imagery. Despite this reduced LOD, they have proven valuable in expediting the training process due to their smaller size. Conversely, the materials depicted in the WHU-Hi datasets exhibit significant dissimilarities from those present in our imagery. The ground-truth masks in these datasets appear smoother due to the lower LOD, leading to a heightened emphasis on spectral features, while spatial features contribute less significant information. Consequently, the learned weights from these datasets serve as initial weights, initiating a re-training process focused on acquiring spectral and spatial features from imagery collected with a high LOD.
Table 7 shows the outcome of this experimentation. Using previously trained weights considerably contributed to improving the metric results, in comparison with the default weight initialization. By default, weights in Keras are initialized so that the variance is guaranteed to be similar across the network layers (Xavier initialization). Note that not every dataset contributed evenly to improving the results; the weights from the Han Chuan and Long Kou datasets seem to contribute better to separating hyperspectral samples.

Discussion and conclusions
This work presents a CNN model that applies current state-of-the-art procedures to classify grapevine varieties. Spatial-attention layers were proved to enhance the results, whereas Inception blocks were compared to discuss which one could provide better results over hyperspectral imagery. Note that the majority of CNN works in the literature are oriented towards RGB imagery, and therefore, their transition to hypercubes is not trivial. Also, only a few works have addressed the classification of UAV-based HSI, since it is considerably noisier. Despite this network being proposed for phenotyping applied to grapevines, it was also tested over standard satellite HSI datasets which are frequently used to establish fair comparisons. The results showed that samples from satellite imagery were also classified with notable accuracy (over 99.7%). In contrast, most of the compared models achieved poor results over UAV-based imagery (∼81% at most). Our training time and number of parameters were also lower than those of most of the compared classification models; some of them are composed of millions of parameters, whereas ours consists of only ∼560k parameters.
Another explored step, perhaps the most relevant, is the preprocessing of reflectance. First, hypercubes are composed of a huge number of bands that cannot be processed by CNNs in a reasonable response time. Furthermore, this hampers the fitting stage, and most of these bands may be either redundant or irrelevant to the classification. A significant number of feature reduction and transformation methods were checked, with Factor Analysis emerging as the one providing better results in an automatic pipeline. The latter point is relevant, since other methods may be capable of providing better clusters at the expense of requiring the number of different materials collected in the hyperspectral imagery to be known in advance. Using FA, hypercubes with 270 bands were narrowed down to only 40 features, thus lowering the number of trained parameters and the response time.
Regardless of the high OA and AA, the analysis of errors also depicted some clustered samples that were mislabelled. When checked against false-colour RGB imagery from the hypercubes, most of them were low vegetation labelled as grape varieties, or samples surrounded by other surfaces, which hardened the classification.
As future work, we would like to extend this network to integrate as many more varieties as possible. It also ought to be explored whether a different growth stage has some effect on the classification; proving this would enable this work to be used over any study area at any time of the year, as long as it comprises some of the varieties over which the model has been trained. Also, a considerable effort is being made to collect information about crops receiving grants from national and European funds. For instance, it should be inferred whether a crop is abandoned or not to decide whether these funds are granted. Similarly, estimating the area of vineyard rows, and even the harvested varieties, could also help in this and other decision-making processes. Finally, the labelling may greatly benefit from using the Digital Elevation Model collected by the surveying UAV, in order to avoid some labelling errors and speed up this manual task.

Figure 1 :
Figure 1: Overview of the areas surveyed with UAV hyperspectral imaging for the classification task. Two different vineyard crops are depicted according to their main variety: a) red and b) white.
Similar to Moraga and Duzgun's work in 2022, the FSKNet model also employs a combination of 2D and 3D convolutional layers with an intermediate separable convolution to reduce training latency while achieving comparable overall accuracy results. The FSKNet model achieves an OA above 99% with significantly fewer parameters and a shorter training time. This work focuses on single-output classification, where the output is either hot-encoded or given as a single value. However, other approaches have gained attention, such as contrastive learning and multi-instance segmentation, which propose outputs of higher dimensionality. Zhu et al. (2021) investigated pixel-wise classification of HSI patches using semantic segmentation, using the popular U-Net architecture with additional convolutional LSTM and attention-based mechanisms. Xin et al. (2022) used transformers to independently encode spatial and spectral features and then combine them. Contrastive learning has also been used to address the lack of labelled datasets, where HSI patches and 1D data are jointly used during training so that the network learns by comparing pairs of samples (Guan and Lam, 2022). Finally, Meerdink et al. (2022) accurately labelled HSI with multi-instance learning by creating bags of samples with distinct labels.

Figure 2 :
Figure 2: Conversion of a) hyperspectral DNs into b) reflectance using a white and a dark reference. The three grey levels are sampled in a).
, without clear distinctions between different labels. Additionally, to further explore the clustering patterns, 50 features were extracted with PCA and subsequently narrowed to three components using uMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) (McInnes et al., 2020) (Figure 4). Similar conclusions are drawn from this last investigation, showing no clear distinction among varieties.

Figure 3 :
Figure 3: Distribution of hyperspectral samples according to the two components extracted using PCA.

Figure 4 :
Figure 4: 3D distribution of hyperspectral samples obtained by narrowing 50 components calculated with PCA using uMAP.

Figure 5: Results of experiments conducted to compare different feature transformation algorithms with different numbers of components. PCA, FA, NMF and LSA (Truncated SVD) are evaluated using the DSI metric and the OA obtained by training an SVM model. The baseline DSI and accuracy are obtained from the original data with 140 features.
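The comparison in Figure 5 can be approximated with scikit-learn's decomposition classes. The sketch below reproduces only the SVM-OA side of the experiment (the DSI metric is not implemented here) on synthetic stand-in data; the component count and SVM settings are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FactorAnalysis, NMF, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the hyperspectral dataset (140 bands)
X, y = make_classification(n_samples=400, n_features=140, n_informative=20,
                           n_classes=4, random_state=0)
X = X - X.min()  # NMF requires non-negative input

methods = {
    "PCA": PCA(n_components=10),
    "FA": FactorAnalysis(n_components=10),
    "NMF": NMF(n_components=10, init="nndsvda", max_iter=500),
    "LSA": TruncatedSVD(n_components=10),  # Truncated SVD
}

scores = {}
for name, transformer in methods.items():
    Xt = transformer.fit_transform(X)
    Xtr, Xte, ytr, yte = train_test_split(Xt, y, test_size=0.3, random_state=0)
    scores[name] = SVC().fit(Xtr, ytr).score(Xte, yte)  # overall accuracy
```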

Figure 6: Overview of dataset preparation. First, a binary mask was generated using the NDVI, and rows were organized into different groups to distinguish vineyard classes. Once pixels were processed as described in Pre-processing of hyperspectral data, both reflectance and labels were split into patches. The signatures on the right side show the original and transformed reflectance, including the variance per feature.

Figure 8: Scheme of the proposed CNN, highlighting four different parts as well as the structure of the Inception blocks.

Figure 9: Transformations that can be probabilistically performed over every hyperspectral patch.

Figure 10: From top to bottom: initial distribution of samples per label, the proposed narrowing, with only three groups being downsampled, and equalization of samples, leaving only seventy instances per group.
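The equalization step, capping every class at seventy instances, can be sketched as a per-label random subsampling of indices. The helper name and seed are illustrative assumptions:

```python
import numpy as np

def equalize_samples(labels, cap=70, rng=None):
    """Return indices keeping at most `cap` randomly chosen samples per class."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = []
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(int(i) for i in idx)
    return np.array(sorted(keep))

# Toy label vector: class 0 and 2 exceed the cap, class 1 does not
labels = np.array([0] * 200 + [1] * 50 + [2] * 90)
idx = equalize_samples(labels)
```

Classes already below the cap (here, class 1 with 50 samples) are kept whole, matching the figure's description of only the oversized groups being downsampled.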

Figure 11: Confusion matrix for classifying red and white varieties altogether.
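The evaluation metrics reported alongside this confusion matrix (OA, AA and Cohen's kappa) all derive directly from it. A minimal sketch with a toy 2×2 matrix (the matrix values are illustrative, not the paper's results):

```python
import numpy as np

def oa_aa_kappa(cm):
    """Compute Overall Accuracy, Average Accuracy and Cohen's kappa
    from a confusion matrix (rows = true labels, cols = predictions)."""
    cm = cm.astype(np.float64)
    total = cm.sum()
    oa = np.trace(cm) / total                            # fraction correct
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))           # mean per-class recall
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

cm = np.array([[48, 2], [4, 46]])
oa, aa, kappa = oa_aa_kappa(cm)
```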

Figure 12: Response time for training the network, as well as the number of parameters, for every compared network, including ours.

Figure 13: Training and validation accuracy and loss during training.

Figure 14: Clustering of samples according to the feature transformation performed by uMAP over 1) the initial hyperspectral features and 2) the features extracted by the CNN before the final Softmax layer.

Figure 15: Overall accuracy obtained for patches of different sizes, from 9 to 31.

Figure 16: Training time, in minutes, and number of parameters as the window size increases.

Figure 18: Errors observed in the classification of red varieties, together with the hyperspectral signatures of a few samples from different surfaces.

Figure 19: OA observed by training the proposed CNN with a percentage of the training dataset.

Table 1: Summary of acquired information regarding different grapevine varieties. For each variety, the number of field rows and samples obtained by labelling UAV imagery is shown.

Table 2: Layer specifications of the proposed network. Inception blocks are named but not expanded here to keep the table readable.

Table 3: Hyperparameters used during training.

Table 4: Overall results in terms of Overall Accuracy (OA), Average Accuracy (AA) and Kappa coefficient with different methods.

Table 5: Overall results in terms of OA, AA and Kappa coefficient with different CNN schemes.

Table 6: Classification of hyperspectral imagery from satellite platforms in terms of OA and Kappa coefficient.