Texture Extraction Techniques for the Classification of Vegetation Species in Hyperspectral Imagery: Bag of Words Approach Based on Superpixels

Texture information allows characterizing the regions of interest in a scene. It refers to the spatial organization of the fundamental microstructures in natural images. Texture extraction has been a challenging problem in the field of image processing for decades. In this paper, different techniques based on the classic Bag of Words (BoW) approach for solving the texture extraction problem in the case of hyperspectral images of the Earth surface are proposed. In all cases the texture extraction is performed inside regions of the scene called superpixels and the algorithms profit from the information available in all the bands of the image. The main contribution is the use of superpixel segmentation to obtain irregular patches from the images prior to texture extraction. Texture descriptors are extracted from each superpixel. Three schemes for texture extraction are proposed: codebook-based, descriptor-based, and spectral-enhanced descriptor-based. The first one is based on a codebook generator algorithm, while the other two include additional stages of keypoint detection and description. The evaluation is performed by analyzing the results of a supervised classification using Support Vector Machines (SVM), Random Forest (RF), and Extreme Learning Machines (ELM) after the texture extraction. The results show that the extraction of textures inside superpixels increases the accuracy of the obtained classification map. The proposed techniques are analyzed over different multi and hyperspectral datasets focusing on vegetation species identification. The best classification results for each image in terms of Overall Accuracy (OA) range from 81.07% to 93.77% for images taken at a river area in Galicia (Spain), and from 79.63% to 95.79% for a vast rural region in China with reasonable computation times.


Introduction
Monitoring vegetation species in a natural area is an important task in the context of human intervention planning. Specifically, the observation of the dynamic behavior of the vegetation provides useful insights for biodiversity conservation and forestry, among other fields. Hyperspectral remote sensing imagery has proven to be a powerful tool in this field, with applications ranging from land cover change detection [1] to mapping vegetation species [2,3]. Although satellite-based remote sensing is a way of obtaining consistent and comparable data, Unmanned Aerial Vehicles (UAVs) provide a more flexible platform with higher spatial resolution. The price of multi- or hyperspectral sensors used on board UAVs has decreased during the last few years, so they are now widely used, even by small companies, for an increasing number of tasks.
In the case of images for land cover analysis, supervised classification solves the problem of, given a hyperspectral image and its reference data, obtaining and distinguishing the different vegetation species or artificial elements present in the scene. In order to perform this classification, texture features can be extracted from the image [4], thus improving the classification accuracy. These features characterize the visual structures present in the scene. As it is a powerful visual cue, texture supplies information to identify objects or uniform regions of interest in the images. Texture can be differentiated from color in the sense that it refers to the spatial organization of a set of basic elements or primitives over the image called textons. These can be defined as the fundamental microstructures in natural images and the atoms of preattentive human visual perception [5].
Texture classification deals with designing algorithms or processing schemes for declaring a given texture region as belonging to one out of a set of categories (in a context where training samples have been provided). Research on texture features is mainly focused on three well-established approaches: Bag of Words (BoW)-based [6], Convolutional Neural Network (CNN)-based [7], and attribute-based [8]. The goal of BoW texture feature extraction is the statistical representation of texture images as histograms over a texton dictionary. The approach of CNNs aims to leverage large labeled datasets to learn high quality features, which can then be categorized using a simple classifier. In the case of the attribute-based approach, there are three essential issues: the identification of a universal texture attribute vocabulary, the establishment of an annotated benchmark texture dataset, and the estimation of texture attributes from images based on low level texture representations. One of the first attempts was carried out in [9], where a set of seventeen human comprehensible attributes (seven related to color and ten to structure) for color texture characterization were introduced.
Different papers focused on the classification of vegetation species using texture features in color, multi-, or hyperspectral imagery can be found in the literature. The simplest methods to characterize vegetation using textures are based on color histograms, statistical measures (mean, standard deviation, skewness, kurtosis, or entropy, among others), and clustered centers of filter bank responses. Following this approach, a classification scheme for the canopy cover mapping of spekboom in a large semiarid region in South Africa is presented in [10]. The scheme is based on a set of spectral features and vegetation indices, including several statistical measures in sliding windows of several sizes. A different scheme for natural roadside vegetation classification is presented in [11]. This scheme learns two individual sets of BoW dictionaries from color and filter-bank texture features.
Two simple methods for texture extraction, based on the analysis of patterns in the neighborhood of a pixel, are Local Binary Pattern (LBP) and Gray-Level Co-occurrence Matrix (GLCM). LBP is used in [12] for the classification of tree species using hyperspectral data and an aerial stereo camera system. Feature extraction is performed following a patch-based approach. On the other hand, a large number of publications on the classification of vegetation are based on the GLCM texture method. Vegetation mapping in complex urban landscapes using a hybrid method combining Random Forest and GLCM texture analysis at nine different window sizes is presented in [13]. The classification is done using ultra-high resolution imagery acquired at low altitudes. A crop classification method for hyperspectral images combining spectral indices and GLCM texture information is proposed in [14]. An object-based GLCM texture extraction method for the classification of man-planted forests in mountainous areas using satellite data is presented in [15]. As a preprocessing step, the texture features of the segmented image objects are enhanced using a 2D Gabor filter. Using very high resolution images acquired by UAVs, a study to identify the most relevant image parameters for tree species discrimination is conducted in [16]. Specifically, classification of savannah tree species is carried out by using chromatic coordinates, spectral indices, the canopy height model, and GLCM texture measures in different window sizes. Similarly, the potential of combining spectral measures and GLCM texture information for crop classification in time-series UAV images is investigated in [17].
More elaborate texture methods based on local invariant descriptors such as SURF and SIFT can also be used for characterizing vegetation species. For example, a methodology for vegetation segmentation in cornfield images obtained by UAVs is presented in [18]. Specifically, it focuses on finding an appropriate set of different color vegetation indices and local descriptors for vegetation characterization. The classification of weeds growing among crops using a BoW model based on SIFT or SURF features is presented in [19]. Finally, a study on the application of SIFT to cropland mapping in the Brazilian Amazon based on vegetation index time series is conducted in [20].
A classification scheme based on textures can be combined with other types of features obtained by UAVs to improve the classification results [10,16]. Among them are spectral features, vegetation indices, and morphological measures. For the detection of the extent of trees and shrubs, the canopy height model (CHM) is the one most commonly used. LiDAR sensors have been widely used in order to collect high resolution information on forest structure. Surface reconstruction by image matching can also be used to estimate CHM. It is achieved by exploiting the redundancy of multiple overlapping aerial images [21,22]. CHM is not used in this paper since the available datasets in many cases do not provide multiple images for the same area.
For the classification of an image using textures, it is necessary to delimit the regions over which the texture features are computed. Most of the vegetation classification methods proposed in the literature use regular patches [6,8,10,12,13,15,16,19]. In other cases, segmentation or object detection algorithms are used to divide the image into regions [14,17,18]. A technique commonly used for the extraction of uniform regions in images is segmentation based on superpixels [23,24]. A superpixel is a set of neighboring pixels (segment) which are similar in terms of low-level properties (such as spatial proximity, color, intensity, or other criteria). Superpixel methods differ from other segmentation methods in that the size and regularity of the superpixels are similar throughout the image. Superpixels provide a convenient and compact representation of images that reduces the computational cost of the processing algorithms [25]. In the schemes presented in this paper, a texture feature vector is computed for each superpixel.
In this paper, different techniques for vegetation classification in multi- and hyperspectral images based on texture extraction and BoW are proposed. The techniques are grouped into three categories: codebook-based, descriptor-based, and spectral-enhanced descriptor-based schemes. The main contribution of this work is that in all the presented schemes the texture algorithms are computed inside superpixels, in contrast to most of the methods previously published in the literature, in which the vegetation textures are extracted from patches or objects. Moreover, some of the descriptor-based methods have not been applied before to multi- and hyperspectral images. Finally, a detailed comparison of the different techniques is carried out in terms of classification accuracy for several land cover remote sensing datasets.
The rest of the paper is organized into four sections. Section 2 presents a description of the proposed schemes involving superpixel computation and texture extraction. The experimental results for the evaluation in terms of classification performance and computational cost are presented in Section 3. The discussion is carried out in Section 4. Finally, Section 5 summarizes the main conclusions.

Methods
Three different schemes for texture extraction in order to obtain superpixel descriptors (i.e., a vector which describes the texture or visual properties of each superpixel in the scene) were proposed. The main novelty is that the texture features were computed inside these irregular patches of the images called superpixels and that the schemes were adapted to profit from the information available in all the bands of the hyperspectral images. Different texture extraction techniques can be derived from the proposed schemes depending on the algorithms selected for their stages as it will be explained throughout the paper.
The different stages of the proposed schemes, shown in Figure 1, are the following.
Superpixel extraction. This is a particular type of segmentation stage. As previously mentioned, a superpixel is a set of pixels which are similar in terms of spatial proximity, color, intensity, or other properties. There is a relationship between these superpixels and the objects present in the scene.
In this stage, a set of S superpixels was extracted from the image. The computed superpixels are irregular, which is a difference between this process and other similar ones (e.g., creation of a grid of square patches). The differences in size and shape among superpixels are due to the adaption of each superpixel to the objects appearing in the scene.
In our case, the algorithm used for superpixel extraction was Simple Linear Iterative Clustering (SLIC) [26], although other options such as those based on watershed [27] or Efficient Topology Preserving Segmentation (ETPS) [28] would obtain similar results. SLIC clusters pixels into superpixels, taking into account their relative position and spectral values, so both spatial and spectral information are considered. This algorithm is an adaptation of k-means for superpixel generation that begins by defining the initial cluster centers. Each pixel is associated to the nearest cluster center. Then, the cluster centers are adjusted to be the mean of all pixels belonging to the cluster. The assignment and update steps are iteratively repeated until the convergence criterion (a maximum number of iterations or an error value) is met. Finally, a postprocessing step enforces connectivity by reassigning disjoint pixels to nearby superpixels. SLIC offers good results for segmenting hyperspectral images [24].
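The assignment and update steps described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `slic_like` and the parameters `grid_step` and `compactness` are hypothetical, every pixel is compared against every center (SLIC restricts the search to a local window), and the final connectivity postprocessing is omitted.

```python
import math


def slic_like(image, grid_step, compactness, max_iter=10):
    """Simplified SLIC-style clustering (illustrative sketch only).

    image[y][x] is a list with one value per spectral band.
    Returns labels[y][x], the superpixel index assigned to each pixel.
    """
    height, width = len(image), len(image[0])
    n_bands = len(image[0][0])

    # Seed the cluster centers on a regular grid: (row, column, spectrum).
    centers = [
        (float(cy), float(cx), list(image[cy][cx]))
        for cy in range(grid_step // 2, height, grid_step)
        for cx in range(grid_step // 2, width, grid_step)
    ]
    labels = [[0] * width for _ in range(height)]
    # Weight of the spatial term relative to the spectral term.
    weight = (compactness / grid_step) ** 2

    for _ in range(max_iter):
        # Assignment step: each pixel goes to the nearest center, using a
        # squared distance that mixes spectral and spatial differences.
        for y in range(height):
            for x in range(width):
                best_k, best_d = 0, math.inf
                for k, (cy, cx, spectrum) in enumerate(centers):
                    d_spec = sum((a - b) ** 2 for a, b in zip(image[y][x], spectrum))
                    d_xy = (y - cy) ** 2 + (x - cx) ** 2
                    d = d_spec + weight * d_xy
                    if d < best_d:
                        best_k, best_d = k, d
                labels[y][x] = best_k

        # Update step: each center becomes the mean of its member pixels.
        new_centers = []
        for k, old in enumerate(centers):
            members = [(y, x) for y in range(height) for x in range(width)
                       if labels[y][x] == k]
            if not members:
                new_centers.append(old)  # keep an empty cluster unchanged
                continue
            n = len(members)
            new_centers.append((
                sum(y for y, _ in members) / n,
                sum(x for _, x in members) / n,
                [sum(image[y][x][b] for y, x in members) / n for b in range(n_bands)],
            ))
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return labels
```

A larger `compactness` gives more weight to spatial proximity and therefore more regular superpixels, which matches the role of the regularity parameter discussed in the set-up description.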
After the segmentation was computed, most of the subsequent stages in the proposed schemes were calculated at the superpixel level instead of at the pixel level. In particular, only one label from the reference data was considered for each superpixel, which is the one associated to the central pixel of this superpixel. Moreover, texture extraction was performed inside each superpixel.
Keypoint detection and description. A set of points of interest or keypoints were extracted from the image for each band and each superpixel. This stage was used in two of the schemes as shown in Figure 1b,c. These keypoints may be extracted in the positions given by a keypoint detector [29] or densely at each pixel position over a fixed grid. In addition, they should be distinctive and robust to image transformations.
Given a keypoint and its neighboring pixels, a set of features was computed, obtaining a local texture descriptor. In our case, the algorithms used to create texture descriptors were Scale-Invariant Feature Transform (SIFT) [30], Histogram of Oriented Gradients (HOG) [31], Dense SIFT (DSIFT) [32], and Local Intensity Order Pattern (LIOP) [33]. The SIFT and HOG algorithms include both a descriptor and a keypoint detector. For LIOP a fixed grid was created, because this technique includes only a descriptor algorithm and no keypoint detector. DSIFT uses a similar dense approach, but the detector is built into the technique.
This process was applied to each spectral band, and then the descriptors from all the bands were grouped for each superpixel taking into account their location in the XY plane. At the end of the process, a variable number of keypoints (with their corresponding descriptors) was assigned to each superpixel. The dimension of each descriptor was denoted as D and their number as N, as shown in Figure 1b,c.
Codebook generation. The objective of this stage was to create a texton dictionary with K codewords based on all the bands of the input image. This codebook can be learned [34,35] or predefined [36]. In this paper the codebook was learned using the k-means [6] or Gaussian Mixture Modeling (GMM) [37] algorithms. The size and nature of the codebook greatly affect the performance of the classification. The key was to generate a compact yet discriminative codebook.
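As an illustration of how a codebook with K codewords can be learned with k-means, the following sketch applies Lloyd's iterations to a list of vectors (pixel-vectors or descriptors). The function name `kmeans_codebook` and the deterministic, evenly spaced initialisation are assumptions made for brevity; the experiments rely on a library implementation rather than on this code.

```python
def kmeans_codebook(vectors, k, max_iter=100):
    """Learn k codewords from a list of equal-length vectors (sketch)."""
    n = len(vectors)
    dim = len(vectors[0])
    # Assumption: initialise with evenly spaced samples instead of a
    # randomised scheme, so that the example is reproducible.
    centers = [list(vectors[i * n // k]) for i in range(k)]

    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        # Assignment: accumulate each vector into its nearest codeword.
        for v in vectors:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += v[d]
        # Update: each codeword becomes the mean of its assigned vectors;
        # a codeword with no assigned vectors is left unchanged.
        new_centers = [
            [s / counts[j] for s in sums[j]] if counts[j] else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Calling `kmeans_codebook(pixel_vectors, k=K)` on B-dimensional pixel-vectors yields a codebook of size K × B; calling it on stacked D-dimensional descriptors yields K codewords of size D.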
Feature encoding. Given the codebook and the computed local features (i.e., vector descriptors corresponding to each superpixel), this encoding process mapped the latter to one or a variable number of codewords. The result was a feature coding vector per superpixel. Thereby, the aim of this process was mapping each superpixel description (object representation) to one or more codewords. This is a core component of the scheme, influencing texture classification in terms of both accuracy and speed. The feature encoding algorithms employed in this paper were Vector of Locally Aggregated Descriptors (VLAD) [38] and Fisher Vectors (FV) [37]. Once the desired vector representation for each superpixel was obtained, this representation was used as representative of the superpixel for the later stages.
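To make the encoding step concrete, a minimal sketch of VLAD is given next: each descriptor is assigned to its nearest codeword and the residuals (descriptor minus codeword) are accumulated per codeword, giving a vector of K × D values. The function name `vlad_encode` and the plain L2 normalisation are assumptions; library implementations offer further normalisation options. A superpixel without descriptors is mapped to the zero vector.

```python
import math


def vlad_encode(descriptors, codebook, normalize=True):
    """Encode a set of descriptors as a VLAD vector of length K * D (sketch)."""
    k, dim = len(codebook), len(codebook[0])
    encoding = [0.0] * (k * dim)
    for desc in descriptors:
        # Hard assignment to the nearest codeword (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(desc, codebook[c])),
        )
        # Accumulate the residual in the slot reserved for that codeword.
        for d in range(dim):
            encoding[nearest * dim + d] += desc[d] - codebook[nearest][d]
    norm = math.sqrt(sum(v * v for v in encoding))
    if normalize and norm > 0:
        encoding = [v / norm for v in encoding]
    return encoding
```

Because the output length depends only on K and D, superpixels with different numbers of descriptors are all represented by vectors of the same size, which is what the classifier requires.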
Dimensionality reduction. In this stage the dimension of a set of vectors obtained in a previous stage (e.g., descriptors or coding vectors) was reduced. This reduction was performed if the number of bands B of the image was higher than the dimension D of the descriptors or coding vectors, in which case the image was reduced to D bands (see Figure 1 for details). This step was also used in Figure 1c in order to transform the descriptors from dimension D to D_red. The techniques used for the reduction could be any of the traditional aggregation functions, such as the sum or the mean, or any other algorithm related to feature extraction such as, for example, Principal Component Analysis (PCA). PCA is a popular method for feature extraction [39,40]. It estimates projections of the original data so that most of the variance is concentrated in a few components.
Feature classification. This last stage was not part of the texture extraction schemes; it performed a classification of the images based on the features produced by the preceding texture extraction technique. Texture features were the inputs to a superpixel-level classification, i.e., the training and testing sets consisted of superpixels described by their texture features. Once the classification finished, the same class was assigned to all the pixels in each superpixel. SVMs were selected as classifiers. They are usually presented as standard non-contextual classifiers for remote sensing classification [41], and can handle scenarios with a low number of training samples [42]. Results were also obtained with two other standard classifiers in remote sensing, RF and ELM [43].
Figure 1 illustrates the three different proposed texture extraction schemes, showing the different stages according to the previous description. All of them have the image as input and one feature vector per superpixel as output.

Codebook-Based Scheme
The first scheme (named codebook-based from now on and shown in Figure 1a) began by performing two tasks in parallel: segmenting the image and creating a codebook. In terms of codebook generation, a texton dictionary with K codewords was created. The final set of codewords obtained was of size K × B (B being the number of bands of the input image).
Given the generated codebook and the computed superpixel segmentation, the next stage was the feature encoding. A vector representation of each superpixel, a texture vector, was obtained by mapping each superpixel to one or more codewords. In the case of k-means, this assignment can be done using the centroid with the shortest Euclidean distance to the superpixel.
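The nearest-centroid assignment can be sketched as follows. The aggregation into a normalised histogram over the K codewords is one plausible reading of "mapping each superpixel to one or more codewords" in the spirit of the classic BoW representation, and is an assumption of this example, as are the function names.

```python
def nearest_codeword(vector, codebook):
    """Index of the codeword with the shortest Euclidean distance to vector."""
    return min(
        range(len(codebook)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(vector, codebook[c])),
    )


def superpixel_histogram(pixel_vectors, codebook):
    """Normalised histogram of codeword assignments for one superpixel.

    Assumption: every pixel-vector of the superpixel votes for its nearest
    codeword and the votes are divided by the number of pixels.
    """
    counts = [0] * len(codebook)
    for v in pixel_vectors:
        counts[nearest_codeword(v, codebook)] += 1
    total = len(pixel_vectors)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

With this reading, every superpixel is described by a vector of K values regardless of its size or shape.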

Descriptor-Based Scheme
The second scheme in Figure 1 (named descriptor-based) is more complex than the first one. In parallel to the superpixel generation, a conditional dimensionality reduction step was performed: only if the number of bands B of the image was larger than the dimension D of the descriptors to be created, the image was reduced to D spectral bands. The next stage was keypoint detection and description. An algorithm such as SIFT or HOG was applied over each one of the bands of the image. The algorithm carried out two sequential stages. First, N keypoints (points of interest) were detected and then, using their neighboring pixels, a local texture descriptor algorithm was applied to obtain a set or pool of texture features of dimension D. Each one of these vectors was assigned to a superpixel according to the location of the keypoint. The number of keypoints per superpixel is variable, and a particular superpixel may even contain no keypoints at all. This variable number of descriptors per superpixel has no consequences in this scheme, because the descriptors are used only to compute the codebook, for which they are stacked together. Further implications will be pointed out when describing the feature encoding stage in the next scheme.
After keypoint detection and description, the codebook generator was applied to the stacked descriptor vectors from all bands, obtaining K codewords of size D each. Another conditional dimensionality reduction stage was then performed in case the number of bands of the input image is lower than the descriptor dimension. After the dimensionality reduction (to the dimension of the input image in the first step or to the dimension of the descriptors later), the dimensions of both the codewords and the image pixel-vectors were equal (which means that now B is equal to D).
The output obtained (a texture vector describing each superpixel) is equivalent to the one obtained by the codebook-based scheme in Figure 1a.

Spectral-Enhanced Descriptor-Based Scheme
The last scheme (named spectral-enhanced descriptor-based), shown in Figure 1c, differs slightly from the previous one. The main novelty is that the feature coding stage operates at superpixel level and that a concatenation of spectral information to the texture descriptors at the end of the texture extraction process is performed.
As can be observed in the figure, the input hyperspectral image for the encoding stage in the previous schemes was replaced here by the image descriptors obtained for the different superpixels. If a superpixel has no associated texture descriptor, the resulting vector is zero. However, if it has one or more descriptors, all of them are compared to each codeword in order to obtain the resulting vector.
As the feature encoding process took the texture descriptors as input (unlike the previous schemes), some kind of spectral information needed to be added. With this objective, a new stage called central pixel extraction was executed. It searched for the central pixel of each superpixel in the spatial coordinates of the image and extracted the corresponding spectral values (central pixel-vector). The resulting data structure once the central pixels were extracted consisted of S vectors (as many as superpixels in the segmentation), each one of dimension B (the number of bands). Finally, a concatenation was performed: a new vector per superpixel was created by stacking the texture vector from the feature encoding with the pixel-vector from the central pixel extraction stage. The output is equivalent to the one obtained by the previous schemes but differs in the dimension of the superpixel feature vector: B + D_red.
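The central pixel extraction and the final concatenation can be sketched as follows. Two details are assumptions of this example, since the text does not fix them: the central pixel is taken as the member pixel closest to the centroid of the superpixel, and the spectral values are placed before the texture values in the stacked vector. The function names are hypothetical.

```python
def central_pixel(coords):
    """Member pixel (y, x) closest to the centroid of a superpixel.

    Assumption: for an irregular superpixel the centroid may fall outside
    the region, so the nearest member pixel is returned instead.
    """
    n = len(coords)
    cy = sum(y for y, _ in coords) / n
    cx = sum(x for _, x in coords) / n
    return min(coords, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)


def spectral_enhanced_vector(image, coords, texture_vector):
    """Stack the central pixel-vector (B values) with the texture vector."""
    y, x = central_pixel(coords)
    return list(image[y][x]) + list(texture_vector)
```

For an image with B bands and a reduced texture vector of D_red values, the returned list has B + D_red entries, in agreement with the dimension stated above.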

Dataset Description
Three datasets were used to evaluate the schemes proposed: a set of two hyperspectral images (from now on, the standard dataset), a set of multispectral scenes from river basins for which only vegetation classes are taken into account (Galicia dataset), and a large set of multispectral images from a vast region in China (Gaofen dataset). The standard dataset was used for comparison purposes, as its scenes are commonly used in land cover classification papers. Both the Galicia dataset and the Gaofen dataset were used because they contain large images with a wide range of vegetation species (both forests and crops). For the Galicia dataset only vegetation classes are classified, although the images also contain other materials, while for the remaining two datasets all classes, including non-vegetation ones, are classified.
The previously mentioned standard dataset corresponds to two images commonly used in the remote sensing literature: Pavia University (Pavia) and Salinas Valley (Salinas) [44]. Pavia was obtained by the ROSIS-03 (Reflective Optics System Imaging Spectrometer) sensor over the city of Pavia, Italy, with a spatial resolution of 2.6 m/pixel and covering the spectral range from 430 to 860 nm. Its dimensions are 610 × 340 pixels and 103 bands, and its corresponding latitude and longitude are 45°11'23.66"N and 09°08'57.06"E, respectively. Salinas was obtained by the AVIRIS (Airborne Visible Infrared Imaging Spectrometer) sensor with a spectral range from 400 to 2500 nm. The main properties of this image are a resolution of 3.7 m/pixel, dimensions of 512 × 217 pixels, and 224 spectral bands. Its corresponding latitude and longitude are 36°39'33.8"N and 121°39'58.7"W. Figure 2 shows the false color composite images and the reference data corresponding to this dataset, while Table 1 displays the classes available in the reference data and the number of disjoint superpixels used for classification in training (15%) and testing (85%). Fifteen percent of the superpixels corresponds to between 14% and 15% of the pixels in the image.
The Galicia dataset is made up of four multispectral images, and the objective of their creation was to monitor the interaction of masses of native vegetation with artificial structures and river beds. Four locations in the Galician provinces of A Coruña and Pontevedra were selected in an area comprised between Eiras Dam and River Mestas, with a distance of approximately 145.6 kilometers end-to-end. The images were captured by a MicaSense RedEdge multispectral camera [45] mounted on a custom UAV. Its 5 discrete sensors provide spectral channels at wavelengths of 475 nm (Blue), 560 nm (Green), 668 nm (Red), 717 nm (Red edge), and 840 nm (Near infrared). The spatial resolution is 8.2 cm/pixel at a height of 120 m.
The four images in the dataset are the following: River Oitavén (Oitavén from now on), of size 6689 × 6722 pixels and located at 42° […]
Figure 3 shows the false color composite images and their reference data (constructed in a long-term process involving forestry experts and the authors of the paper) corresponding to each one of the scenes, while Table 2 shows the classes available in the reference data for classification and the number of superpixels used for training (15%) and testing (85%). Moreover, as the objective is the identification of plant species, only vegetation classes are considered.
Finally, the Gaofen dataset was used [47]. It is a large-scale land use collection containing 160 annotated Gaofen-2 (GF-2) satellite scenes. GF-2 is the second satellite of the High-definition Earth Observation System promoted by the China National Space Administration. The spectral range goes from 0.45 to 0.89 µm (blue to near-infrared) and the spatial dimension of the images is 6908 × 7300 pixels. Some of the advantages of the dataset are its large coverage, wide distribution, and high spatial resolution. Notably, this dataset has high intra-class and low inter-class differences. The images cover an area of more than 50,000 km² in China. Specifically, it can be divided into two sets. The first is a large-scale classification set made up of 150 high-resolution images acquired from more than 60 different cities in China, in which 5 major categories are annotated. From now on, this set will be named GID5 (Gaofen Image Dataset 5 classes). The second is a fine land-cover classification set composed of 30,000 multi-scale image patches coupled with 10 pixel-level annotated images and made up of 15 sub-categories. From now on, this set will be named GID15 (Gaofen Image Dataset 15 classes). Figure 4 shows the false color composite images and their reference data for two images from GID5 and two others from GID15.
Table 3 shows the classes available in the reference data and the number of superpixels used for training (15%) and testing (85%).
Table 3. Gaofen dataset. Classes available in the reference data and number of superpixels used for training (15%) and testing (85%). NA values indicate the non-existence of samples for a specific vegetation class and image, while "-" implies the non-existence of samples for the specific non-vegetation class [47].

Accuracy Assessment and Set-Up Description
The classification accuracy obtained by classifying the features provided by the proposed schemes was reported in terms of the usual measures in remote sensing. The first measure Overall Accuracy (OA) is the most widely used [48]. It provides the percentage of correctly classified pixels, and it is presented for every experiment. Besides, Quantity Disagreement (QD) and Allocation Disagreement (AD), which measure the disagreement between classification map and reference data in terms of proportion and spatial allocation of the classes, respectively, were also provided [49].
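As a worked illustration of these three measures, the sketch below computes them from a confusion matrix of pixel counts. It assumes the usual definitions in terms of class proportions: OA is the sum of the diagonal, QD is half the sum of the absolute differences between the row and column totals, and AD is the sum over classes of the smaller of the commission and omission proportions. The function name and the orientation of the matrix are choices of this example.

```python
def accuracy_measures(confusion):
    """Return (OA, QD, AD) in percent from a square confusion matrix.

    confusion[i][j] is the number of pixels assigned to class i by the
    classifier whose reference class is j (an orientation chosen here).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p = [[count / total for count in row] for row in confusion]

    mapped = [sum(p[i]) for i in range(n)]                            # row totals
    reference = [sum(p[i][j] for i in range(n)) for j in range(n)]    # column totals

    oa = sum(p[g][g] for g in range(n))
    # Quantity disagreement: mismatch in the proportion of each class.
    qd = 0.5 * sum(abs(mapped[g] - reference[g]) for g in range(n))
    # Allocation disagreement: mismatch in the spatial allocation of each
    # class, limited by the smaller of commission and omission.
    ad = sum(min(mapped[g] - p[g][g], reference[g] - p[g][g]) for g in range(n))
    return 100.0 * oa, 100.0 * qd, 100.0 * ad
```

With these definitions the three values add up to 100%, so QD and AD split the total disagreement (100% minus OA) into its quantity and allocation components.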
The input data for the experiments were standardized (a mean of 0 and a standard deviation of 1). In addition, with the aim of evaluating the computational cost associated to each method, execution time measures were performed. All the results presented in the paper are the mean value of 3 independent runs for each scenario, each one being obtained under identical experimental conditions.
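The standardization can be sketched as follows, under the assumptions that it is applied independently to each band (feature) and that the population standard deviation is used; neither detail is stated in the text, and the function name is hypothetical.

```python
import statistics


def standardize_bands(pixels):
    """Scale each band to zero mean and unit standard deviation (sketch).

    pixels is a list of pixel-vectors, each with one value per band.
    Assumption: a constant band (zero deviation) is mapped to zeros.
    """
    n_bands = len(pixels[0])
    columns = [[p[b] for p in pixels] for b in range(n_bands)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(p[b] - means[b]) / stds[b] if stds[b] > 0 else 0.0
         for b in range(n_bands)]
        for p in pixels
    ]
```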
Regarding the configuration parameters, as mentioned above, SLIC was used as the superpixel extraction algorithm. There are two parameters for SLIC: the superpixel size and the regularity of the superpixels. The superpixel size is the desired average size of each superpixel (in terms of area in pixel units). A value of 10 was selected for the standard dataset (as the images in this dataset are small) and a value of 1100 for the other two datasets. For superpixel regularity, the larger the value, the more regular the superpixels obtained. A value of 20 was selected for all datasets. These values were decided experimentally and depend mainly on the resolution of the images and on the size of the structures present in them; in general, bigger superpixels are more adequate for higher resolution images.
The classification was performed by using SVM, Random Forest, and ELM. More precisely, for SVM the LIBSVM implementation version 3.24 in C/C++ was chosen [50], selecting a linear kernel. The parameter C was tuned, and a value of 0.02 (the same for all datasets) gave the best results. In the case of Random Forest, the OpenCV implementation was used [51]. The only parameter set for this algorithm is the number of trees. After a search over the considered datasets, a value of 200 trees was chosen. Lastly, the ELM implementation selected was [52]. The number of neurons in the hidden layer was 250 for the small images and 500 for the bigger ones. These are standard values for the datasets considered [43].
The classification is performed at superpixel level, i.e., all the pixels in a superpixel are assigned the same label. As far as the training and testing features are concerned, two disjoint sets were set up for each image. Specifically, after segmenting the images using SLIC, 15% of the superpixels from each class were randomly taken for training and the remaining 85% for testing in the general scenario. For training, only one label per superpixel, the label of the spatially central pixel, is considered. Selecting 15% of the superpixels is equivalent to choosing approximately 13% of the pixels of each class. This percentage of training samples is reasonable, as shown in [43]. Results for 10% and 20% of training samples were also obtained for all the images in the Galicia dataset, as this is the most representative dataset in the study, containing large images and including only vegetation classes.
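The per-class random selection of training superpixels can be sketched as follows. The function name, the fixed seed, and the rounding rule (nearest integer, with at least one training superpixel per class) are assumptions of this example rather than details taken from the experiments.

```python
import random


def stratified_split(superpixel_labels, train_fraction=0.15, seed=0):
    """Split superpixels into disjoint training and testing sets per class.

    superpixel_labels maps each superpixel id to its class label (the
    label of its central pixel). Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    by_class = {}
    for sp_id, label in superpixel_labels.items():
        by_class.setdefault(label, []).append(sp_id)

    train_ids, test_ids = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        # Assumption: round to the nearest integer, keeping at least one
        # training superpixel for every class.
        n_train = max(1, round(train_fraction * len(ids)))
        train_ids.extend(ids[:n_train])
        test_ids.extend(ids[n_train:])
    return train_ids, test_ids
```

Changing `train_fraction` to 0.10 or 0.20 reproduces the alternative training percentages considered for the Galicia dataset.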
Regarding other set-up details, the VLFeat library version 0.9.21 was used [32]. Specifically, the implementations of the texture extraction related algorithms SIFT, DSIFT, LIOP, HOG, GMM, VLAD, and FV available in the library were used. All the experiments were carried out using C/C++ compiled with gcc 7.5.0. Additionally, an eighth-generation Intel Core i7-8700K CPU at 3.70 GHz and a 64-bit Ubuntu 18.04 were used for all the experiments, including the execution time evaluation.
The classification accuracy results were obtained for each one of the three datasets described in Section 2.4, considering different texture extraction techniques. The selection of specific algorithms for each stage of the proposed texture extraction schemes resulted in 16 different techniques for the experiments. The mapping between these techniques, the schemes they correspond to, and the specific algorithms used is presented in Table 4. The row Without Texture Features shows the configuration in which no feature extraction is performed and the central pixel of each superpixel is used as input for the classification.
Table 4. Texture extraction techniques considered for the schemes proposed in Figure 1, detailing the configuration of the different stages.

Experimental Results
The aim of this section is to analyze how the use of the different schemes influences the results of the classification. Classification results and computational efficiency in terms of Overall Accuracy (OA), Quantity Disagreement (QD), Allocation Disagreement (AD), and execution time are presented. The results are obtained for the three hyperspectral and multispectral datasets described above and for the texture extraction techniques that were described in Table 4. Three classifiers are considered: SVM, RF, and ELM.
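The three accuracy measures can be derived from a single confusion matrix. The following Python sketch is a minimal illustration that assumes the definitions of quantity and allocation disagreement introduced by Pontius and Millones, which are not spelled out in this section. The function name `agreement_measures` and the example matrix are hypothetical and not part of the experimental code.

```python
def agreement_measures(confusion):
    """Return (OA, QD, AD) in percent from a square confusion matrix.

    confusion[i][j] is the number of test samples predicted as class i
    whose reference class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Work with proportions so that the three measures sum to 100%.
    p = [[count / total for count in row] for row in confusion]
    row_sum = [sum(p[g]) for g in range(n_classes)]
    col_sum = [sum(p[i][g] for i in range(n_classes)) for g in range(n_classes)]

    # Overall accuracy: proportion of samples on the diagonal.
    oa = sum(p[g][g] for g in range(n_classes))
    # Quantity disagreement: mismatch in the proportion of each class.
    qd = 0.5 * sum(abs(row_sum[g] - col_sum[g]) for g in range(n_classes))
    # Allocation disagreement: errors that could be fixed by swapping locations.
    ad = 0.5 * sum(
        2.0 * min(row_sum[g] - p[g][g], col_sum[g] - p[g][g])
        for g in range(n_classes)
    )
    return 100.0 * oa, 100.0 * qd, 100.0 * ad


# Hypothetical two-class example: OA = 85, QD = 5 and AD = 10 (up to rounding).
print(agreement_measures([[40, 5], [10, 45]]))
```

Under these definitions OA + QD + AD = 100%, which is consistent with lower QD and AD values accompanying higher OA values.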
The classification accuracy results for the standard dataset are shown in Table 5 for the SVM classifier, Table 6 for the RF classifier, and Table 7 for the ELM classifier. The listed techniques are grouped according to the texture scheme followed: codebook-based, descriptor-based, and spectral-enhanced descriptor-based. For each technique, 15% of the superpixels were randomly selected for training and the remaining 85% for testing. The best results for each image in terms of OA are highlighted with a gray background.
The results in Tables 5-7 show the same trends. As the images in the standard dataset are very small, they do not benefit from a superpixel-level classification regardless of the classifier considered, so for each classifier all techniques offer similar results and, in general, the OA values are low. The techniques from the descriptor-based and spectral-enhanced descriptor-based schemes present a slightly higher OA; the SIFT-based ones, specifically SIFT + GMM + FV and SIFT + k-means + VLAD, are the best methods for the Salinas and Pavia images, respectively. The QD and AD values are low for all the experiments and lower for the higher OA values, as expected. As the standard dataset does not focus on vegetation, experiments with the Galicia dataset, which contains only vegetation classes, were performed.
Table 5. Classification results in terms of OA (%), QD (%), and AD (%) obtained by the techniques detailed in Table 4 for the images of the standard dataset and using an SVM classifier. Fifteen percent of the superpixels are used for training. The best OA results are in a gray background.
Table 6. Classification results in terms of OA (%), QD (%), and AD (%) obtained by the techniques detailed in Table 4 for the images of the standard dataset and using an RF classifier. Fifteen percent of the superpixels are used for training. The best OA results are in a gray background.
Table 7. Classification results in terms of OA (%), QD (%), and AD (%) obtained by the techniques detailed in Table 4 for the images of the standard dataset and using an ELM classifier. Fifteen percent of the superpixels are used for training. The best OA results are in a gray background.
The results for the Galicia dataset are detailed in Tables 8-10 for the SVM, RF, and ELM classifiers, respectively. Unlike the standard dataset, the Galicia dataset is made up of larger images, so the execution time is very relevant. For this reason, execution times are also displayed in the tables. The best results for each image in terms of OA are highlighted with a gray background.
It can be observed that in this case, two techniques based on k-means as codebook generator, k-means + BoW and LIOP + k-means + VLAD, offer the best results for all the classifiers. For only one of the images, Mestas, with the RF classifier, the best result is achieved by a different technique: SIFT + GMM + FV + Spec. The AD and QD values are lower for higher OA values, as expected. Regarding execution times, the highest ones correspond to the techniques using a SIFT-based keypoint detection and description algorithm. In contrast, the methods with the lowest computational cost are those based on LIOP or HOG as keypoint detection and description algorithms, while those based on the simpler codebook-based scheme present reasonable computational costs. Focusing on the best two techniques, LIOP + k-means + VLAD displays lower execution times than k-means + BoW.
In order to determine whether the previous results for the Galicia dataset are statistically reliable, Table 11 shows results varying the percentage of superpixels considered in the training set: 10%, 15%, and 20% of the superpixels from each of the five vegetation classes were selected for training. For each percentage, the superpixels are randomly picked. The four best techniques from the previous tables were chosen to perform this comparison: k-means + BoW, GMM + FV, LIOP + k-means + VLAD, and SIFT + k-means + VLAD + Spec. It can be observed that the standard deviation decreases as the size of the training set increases. The highlighted best results show that the k-means + BoW technique outperforms the other methods 7 out of 16 times and presents the lowest standard deviation values. LIOP + k-means + VLAD obtains the best results 2 out of 16 times and competitive results for the remaining experiments.
Table 8. Classification results in terms of OA (%), QD (%), AD (%), and execution times (seconds) obtained by the techniques detailed in Table 4 for the Galicia dataset images and using an SVM classifier. Fifteen percent of the superpixels are used for training. The best OA results are in a gray background.
Table 9. Classification results in terms of OA (%), QD (%), AD (%), and execution times (seconds) obtained by the techniques detailed in Table 4 for the Galicia dataset images and using an RF classifier. Fifteen percent of the superpixels are used for training. The best OA results are in a gray background.
It can be concluded from Table 11 that k-means + BoW offers the most consistent results, because it outperforms the other techniques in terms of OA and obtains reasonable execution times. LIOP + k-means + VLAD is also an interesting technique, although it was experimentally verified that its results are highly dependent on the tuning of its input parameters, which is a resource-expensive process.
Finally, Table 12 shows the results obtained for the Gaofen dataset, a dataset with a large number of scenes. It is divided into GID5 (150 scenes) and GID15 (10 scenes) [47], as detailed in the dataset description. The trends are similar to those of the Galicia dataset: the codebook-based scheme is the best, followed by the descriptor-based scheme and the spectral-enhanced descriptor-based scheme. Specifically, the best techniques are again those based on k-means as codebook generator, in particular k-means + BoW and k-means + VLAD.
Table 10. Classification results in terms of OA (%), QD (%), AD (%), and execution times (seconds) obtained by the techniques detailed in Table 4 for the Galicia dataset images and using an ELM classifier. Fifteen percent of the superpixels are used for training. The best OA results are in a gray background.

Discussion
In this work, different texture schemes based on BoW for vegetation classification using a superpixel approach were studied. We considered multi- and hyperspectral remote sensing images taken by UAVs and satellites. In all the presented schemes, the texture algorithms were computed inside superpixels, in contrast to most of the methods previously published in the literature, in which the vegetation textures are extracted from patches or objects. A detailed comparison of the different techniques was carried out in terms of classification accuracy for several land cover remote sensing datasets. In particular, the Galicia dataset contained five classes of vegetation (oak, meadows, autochthonous vegetation, eucalyptus, and pines), while in GID5 five classes were considered, three of them corresponding to vegetation, and in GID15 fifteen classes were considered, eight of them corresponding to vegetation. The best classification results for each image ranged from 81.07% to 93.77% for the Galicia dataset, and from 79.63% to 95.79% for the Gaofen dataset. The techniques and algorithms used in this work included several keypoint detectors and descriptors (HOG, LIOP, SIFT, and DSIFT), algorithms for codebook generation (k-means and GMM), algorithms for feature encoding (histogram-based, VLAD, and FV), and, finally, some algorithms for feature classification (SVM, RF, and ELM). Additionally, SLIC was used for superpixel generation and PCA for dimensionality reduction.
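As an illustration of the simplest combination named above, k-means for codebook generation followed by histogram-based (BoW) encoding, the following Python sketch builds a small codebook and encodes one superpixel as a histogram of codeword occurrences. It is a sketch under stated assumptions, not the VLFeat-based implementation used in the experiments: the function names, the initialization with the first k points, the fixed number of iterations, and the normalization of the histogram are all hypothetical choices, and the feature vectors are assumed to be given as plain lists of numbers.

```python
def nearest(codebook, x):
    """Index of the codeword closest to feature vector x (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], x)),
    )


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the first k points are the initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(v) / len(group) for v in zip(*group)]
    return centers


def bow_histogram(codebook, superpixel_vectors):
    """Normalized histogram of codeword occurrences inside one superpixel."""
    hist = [0.0] * len(codebook)
    for x in superpixel_vectors:
        hist[nearest(codebook, x)] += 1.0
    return [h / len(superpixel_vectors) for h in hist]


# Hypothetical toy data: a codebook of two words learned from four vectors,
# then used to encode a superpixel containing four feature vectors.
codebook = kmeans([(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)], k=2)
print(bow_histogram(codebook, [(0.0, 0.2), (0.1, 0.9), (9.0, 10.0), (0.0, 0.0)]))
```

Each superpixel, whatever its shape or size, is thus described by a vector whose length equals the codebook size, which is what allows irregular regions to be fed to a standard classifier.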
In previous works, studies were carried out to determine the most suitable set of parameters, including textures, for the classification of vegetation in remote sensing images. However, in these works the only texture method considered was GLCM. Specifically, the authors of [10] present a classification scheme for the canopy cover mapping of spekboom in a large semiarid region in South Africa using multispectral imagery (red, green, blue, and near-infrared bands). Three classes were considered (spekboom, tree, and background) and the classification scheme is a decision tree with 47 features grouped into two broad categories: per-pixel features (spectral information) and sliding-window features (statistics of the pixels inside a small local neighborhood). The decision tree obtained a mean absolute canopy cover error of 5.85%. The authors of [14] present a crop classification method for hyperspectral images combining 40 spectral indices, spectral features (several class-pair distances), and GLCM texture information in an object-oriented approach. Eight classes were considered (Chinese cabbage, Japanese cabbage, lettuce, radish, pasture, pole bean, and forest) and the classification accuracy obtained was 97.84%. Finally, the authors of [16] performed the classification of savannah tree species from very high resolution images acquired by UAVs. Two flights capturing multispectral imagery (red, green, blue, and near-infrared bands) were made to obtain image mosaics with longitudinal and lateral overlap. The method uses chromatic coordinates, spectral indices, the canopy height model, and GLCM texture measures computed with different window sizes. Nine classes of trees and shrubs (with an abundance of more than ten individuals within the samples) were considered and an outline of each single-stem individual was drawn onto the image. The overall accuracy obtained was 77% on average.
For the detection of the extent of trees and shrubs, the canopy height model (CHM) is the most commonly used approach. Information on height is obtained from different sources, in some cases through sensor fusion with LiDAR information. In other cases, this complementary information is obtained through surface reconstruction from aerial images [21,22]. In this paper, information on height was not considered, as only single images are available in the dataset for the areas under study.
Other works use simple texture methods for vegetation classification, specifically LBP and GLCM. For example, the authors of [12] applied LBP textures to the classification of tree species using hyperspectral data and an aerial stereo camera system. In the classification step, a pixel-based approach and a patch-based BoW approach were used. Four classes were considered (spruce, beech, mixed, and non-tree) and the classification accuracy obtained was approximately 60%. In [13], vegetation mapping in complex urban landscapes is performed using ultra-high resolution color imagery acquired by a UAV at low altitudes. A hybrid method combining Random Forest and GLCM texture analysis at nine different window sizes was used. Six typical land covers, three of them vegetated, were considered (grass, trees, shrubs, bare soil, impervious surface, and water) and the classification accuracies ranged from 86.2% to 91.8%. The authors of [15] propose an object-based GLCM texture extraction method for the classification of man-planted forests in mountainous areas using high resolution satellite data, including panchromatic and multispectral bands. The method uses a multi-resolution segmentation algorithm to generate image objects and enhances the texture features of the objects using a 2D Gabor filter. Four classes were considered (non-vegetation, natural forest, rubber trees, and crops) and the classification accuracy obtained was 91.4%. The authors of [17] propose a method combining spectral measures and GLCM texture information for crop classification in time-series UAV images composed of three bands (green, red, and near-infrared). The object-oriented approach extracted meaningful objects via multi-resolution segmentation, and classification was carried out on object units. Four vegetation classes (highland kimchi cabbage, cabbage, potato, and fallow) were considered. 
In six multi-temporal images, combining texture features with spectral information led to an increase of 7.72% in OA, compared to the classification result with spectral information only (from 83.13% to 90.85%).
For the classification of an image using textures, it is necessary to delimit the regions on which the texture features are computed. None of the works cited above uses superpixels to obtain the textures. Instead, they use patches, segments, or objects. Superpixels were used in [24], which proposes a scheme for natural roadside vegetation classification using images taken on the ground (not remote sensing) with color cameras. Six classes were considered (brown grass, green, road, soil, tree, and soil) and the scheme learns two individual sets of BoW dictionaries from color and filter-bank texture features using the nearest Euclidean distance, which were aggregated into class probabilities for each superpixel. Experimental evaluations on a natural image dataset obtained 75.5% accuracy for classifying six objects.
For keypoint detection and description, we considered four algorithms: HOG, LIOP, SIFT, and DSIFT. Few works in the literature have proposed descriptor-based methods for the classification of vegetation in images, although this approach is common in the classification of scenes. For example, the authors of [18] present a methodology for vegetation segmentation in cornfield images obtained by autonomous agricultural vehicles. A collection of outdoor color images, acquired under different illumination conditions and at different plant growth states, was selected. The method focuses on finding an appropriate set of color vegetation indices and local descriptors for vegetation characterization. Three different classes were considered (vegetation, light-brown soil, and dark-brown soil), and an accuracy value of 95.3% was achieved. In [19], the classification of weeds growing among crops using a BoW model based on SIFT or SURF features is presented. In that work, a small-sized robot was developed for vision-based precision control of volunteer potatoes (weed) in a sugar beet field. The highest classification accuracy (96.5%) was obtained using SIFT, Out-of-Row Regional Index (ORRI), and SVM. Finally, the authors of [20] study the application of SIFT to cropland mapping in the Brazilian Amazon based on vegetation index time series. They used a dense temporal SIFT BoW algorithm, which is able to capture the temporal locality of the data. The dataset was made of 46 MODIS images acquired over two years. Five crop classes were considered (soybean, soybean + millet, soybean + maize, soybean + cotton, and cotton) with accurate detection of around 70% of the agricultural areas.
Based on the presented information, it can be concluded that very few works in the literature use descriptors to characterize textures for vegetation classification in images. To the best of our knowledge, no superpixel-based descriptors have been previously proposed for the classification of vegetation in multi- and hyperspectral images. Our classification results are comparable to those of other studies in the literature. However, an exact numerical comparison is difficult, as it depends on the nature of the datasets, the number of classes, and the number of samples used for training. Comparable results in the literature are only available for GID5 [47]. In particular, the accuracy obtained by the CNN-based technique used in this reference was 95.74%, which is similar to the 95.79% obtained in the experiments shown in Table 12. However, the experimental conditions in [47] were different, as disjoint sets of images were taken for training and testing. In our approach, the same percentages of training and testing samples were selected from each one of the images.

Conclusions
In this paper, different texture extraction schemes at the superpixel level for the classification of vegetation species using multi- and hyperspectral imagery are proposed. These schemes, based on the classical BoW approach, are called codebook-based, descriptor-based, and spectral-enhanced descriptor-based schemes. Some of the following stages are considered for each one: superpixel extraction, keypoint detection and description, codebook generation, feature encoding, and dimensionality reduction. The relevant contributions of this paper are the use of a superpixel segmentation algorithm as a way of dividing an image into homogeneous regions prior to the texture extraction, and the adequate exploitation of the spectral information available in all the bands of the image. Superpixels are used in the keypoint detection and description, codebook generation, feature encoding, and classification stages. Sixteen different texture-extraction techniques derived from the three proposed schemes are analyzed in detail in the paper and compared in terms of classification accuracy and execution time considering SVM, RF, and ELM as supervised classification algorithms.
Three datasets consisting of real multi- and hyperspectral images containing vegetation classes were employed to test the proposed schemes. As the standard dataset does not focus on vegetation, the Galicia and Gaofen datasets were also considered. The best classification results for each image range from 81.07% to 93.77% for the Galicia dataset and from 79.63% to 95.79% for the Gaofen dataset. The techniques and algorithms used in this work include several keypoint detectors and descriptors (HOG, LIOP, SIFT, and DSIFT), algorithms for codebook generation (k-means and GMM), algorithms for feature encoding (histogram-based, VLAD, and FV), and, finally, some algorithms for feature classification (SVM, RF, and ELM). Additionally, SLIC was used for superpixel generation and PCA for dimensionality reduction. The experimental results show that the best techniques are based on k-means as codebook generator. In particular, the highest OA values are offered by k-means + BoW, which is a representative of the codebook-based scheme and uses BoW for feature encoding. The second best results on average are provided by LIOP + k-means + VLAD, a representative of the descriptor-based scheme, which uses LIOP for keypoint detection and description and VLAD for feature encoding. Both techniques also present a reasonable computational cost according to our experiments.
As future work, we plan to analyze the performance of the best techniques with new multispectral images of vegetation. The desired properties of these images are an abundance of vegetation and high spatial resolution. Several future research lines that would benefit from the current proposal have also been considered, such as testing different algorithms for keypoint detection and description, for instance, robust and powerful techniques like SURF and KAZE. Moreover, the creation of schemes with a different structure from the three described is also planned as future work.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript.