Evaluating the Performance of a Random Forest Kernel for Land Cover Classification

Abstract: The production of land cover maps through satellite image classification is a frequent task in remote sensing. Random Forest (RF) and Support Vector Machine (SVM) are the two most well-known and recurrently used methods for this task. In this paper, we evaluate the pros and cons of using an RF-based kernel (RFK) in an SVM compared to using the conventional Radial Basis Function (RBF) kernel and the standard RF classifier. A time series of seven multispectral WorldView-2 images acquired over Sukumba (Mali) and a single hyperspectral AVIRIS image acquired over Salinas Valley (CA, USA) are used to illustrate the analyses. For each study area, SVM-RFK, RF, and SVM-RBF were trained and tested under different conditions over ten subsets. The spectral features for Sukumba were extended by obtaining vegetation indices (VIs) and grey-level co-occurrence matrices (GLCMs); the Salinas dataset is used as a benchmark with its original number of features. In Sukumba, the overall accuracies (OAs) based on the spectral features only are 81.34%, 81.08% and 82.08% for SVM-RFK, RF, and SVM-RBF, respectively. Adding VI and GLCM features results in OAs of 82.%, 80.82% and 77.96%. In Salinas, the OAs are 94.42%, 95.83% and 94.16%. These results show that SVM-RFK yields slightly higher OAs than RF in high dimensional and noisy experiments, and it provides competitive results in the rest of the experiments. They also show that SVM-RFK generates highly competitive results when compared to SVM-RBF while substantially reducing the time and computational cost associated with parametrizing the kernel. Moreover, SVM-RFK outperforms SVM-RBF in high dimensional and noisy problems. RF was also used to select the most important features for the extended dataset of Sukumba; the SVM-RFK derived from these features improved the OA of the previous SVM-RFK by 2%.
Thus, the proposed SVM-RFK classifier is at least as good as RF and SVM-RBF and can achieve considerable improvements when applied to high dimensional data and when combined with RF-based feature selection methods.


Introduction
Remote sensing (RS) researchers have created land cover maps from a variety of data sources, including panchromatic [1], multispectral [2], hyperspectral [3], and synthetic aperture radar [4], as well as from the fusion of two or more of these data sources [5]. Using these different data sources, a variety of approaches have also been developed to produce land cover maps. According to the literature, approaches that rely on supervised classifiers often outperform approaches based on unsupervised classifiers [6]. This is because the classes of interest may not present the clear spectral separability required by unsupervised classifiers [6]. Maximum Likelihood (ML), Neural Networks (NN) and fuzzy samples, and choice of parameters. These characteristics make RF an appropriate method to classify high-dimensional data. Moreover, the tree-based structure of the RF can be used to create partitions in the data and to generate an RFK that encodes similarities between samples based on the partitions [35]. However, RF is difficult to visualize and interpret in detail, and it has been observed to overfit for some noisy datasets. Hence, the motivation of this work is to introduce the use of SVM-RFK as a way to combine the two most prominent classifiers used by the RS community and to evaluate whether this combination can overcome the limitations of each single classifier while maintaining their strong points. Finally, it is worth mentioning that our evaluation is illustrated with a time series of very high spatial resolution data and with a hyperspectral image. Both datasets were acquired over agricultural lands. Hence, our case studies aim at mapping crop types.

Methods
This section introduces the background of the classifiers. As SVM and RF are well-known classifiers, only a summary of them is presented here. After that, we define the RFK and explain how it is generated from the RF classifier.

Random Forest
The basics of RF have been comprehensively discussed in several sources over the last decades [15,30,36]. Briefly, RF classifiers are composed of a set of classification trees trained using bootstrapped samples from the training data [30]. In each bootstrapped sample, about two-thirds of the training data (in-bag samples) are used to grow an unpruned classification (or regression) tree, and the rest of the samples (the out-of-bag samples) are used to estimate the out-of-bag (OOB) error. Each tree is grown by recursively partitioning the data into nodes until each of them contains very similar samples, or until a stopping condition is met [30]. Examples of the latter are reaching the maximum depth, or the number of samples at a node falling below a predefined threshold [30]. RF uses the Gini Index [37] to find the best feature and split point to separate the training samples into homogeneous groups (classes). A key characteristic of RF is that only a random subset of all the available features is evaluated when looking for the best split point. The number of features in the subset is controlled by the user and is typically called mtry. Hence, for large trees, which is what RFs use, it is at least conceivable that all features might be used at some point when searching for split points whilst growing the tree. The final classification results are obtained by considering the majority votes calculated from all trees, and that is why RF is called a bagging approach [30]. A general design of RF is shown in Figure 1.
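The split search described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`gini_impurity`, `best_split`), the midpoint thresholds, and the toy data are hypothetical and are not taken from any RF implementation used in the paper.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, mtry, rng):
    """Search a random subset of `mtry` features for the split that
    minimises the weighted Gini impurity of the two child nodes."""
    n_features = len(X[0])
    candidates = rng.sample(range(n_features), mtry)
    best = None  # (weighted impurity, feature index, threshold)
    for f in candidates:
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            score = (len(left) * gini_impurity(left)
                     + len(right) * gini_impurity(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.8, 2.0], [0.9, 5.5], [1.0, 1.5]]
y = ["cotton", "cotton", "cotton", "maize", "maize", "maize"]
score, feature, threshold = best_split(X, y, mtry=2, rng=random.Random(0))
```

With `mtry` equal to the number of features, every feature is inspected; smaller values of `mtry` restrict each split to a random subset, which is what decorrelates the trees of the forest.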
The operational use of RF classifiers requires setting two important parameters: first, the number of decision trees to be generated, $N_t$; second, the number of features to be randomly selected for defining the best split in each node, mtry. Studies show that the default values of 500 trees and the square root of the number of features stabilize the classification error in most applications [15,38]. Studies also show that classification results are most sensitive to the latter parameter. However, it is important to remark that several studies consistently observe that the differences in Overall Accuracies (OAs) between the best configurations and other configurations for RF are small [11,39,40]. Moreover, RF is known for being fast, stable against overfitting, and for requiring small sample sizes with high dimensional input compared to many classifiers [15,41]. Furthermore, RF is commonly used for feature selection by defining feature importance values based on the total decrease in node impurity from splitting on the features, averaged over all trees (mean decrease in Gini index). These characteristics, besides the tree-based structure, make RF a good choice to be used as a partitioning algorithm that allows for the extraction of the similarity between samples. This similarity can then be used to create an RFK. In Section 2.3, we discuss how to obtain the similarity values between samples based on the partitions created on the data by the trees in an RF.

Support Vector Machine
The basic strategy of an SVM is to find a hyperplane in a high-dimensional space that separates the training data into classes so that the class members are maximally apart [20]. In other words, SVM finds the hyperplane that maximizes the margin, where the margin is the sum of the distances to the hyperplane from the closest point of each class [42]. The points on the margin are called support vectors. Figure 2a illustrates a two-class separable classification problem in a two-dimensional input space. Remote sensing data are often not linearly separable in the original high dimensional space [42]. In that case, the original data are mapped into a reproducing kernel Hilbert space (RKHS), where the data are linearly separable [43]. Figure 2b illustrates a two-class nonlinearly separable classification problem in a two-dimensional input space.
Figure 2. Example of a linear (a) and a nonlinear (b) SVM for a two-class classification problem. The nonlinear SVM maps the data into a high dimensional space to separate the classes linearly.
Let $x_i \in \mathbb{R}^{N_f}$ be the training column vectors, where $N_f$ is the number of dimensions, and let $y_i \in \{-1, 1\}$ be the binary class labels, where $i$ denotes the $i$-th sample. The maximization of the margin can then be formulated as a convex quadratic programming problem. One way to solve the optimization problem is using the Lagrange multipliers (dual problem) as follows:

$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i y_i = 0. \qquad (1)$$

In Equation (1), $\alpha_i$ is a Lagrange multiplier, $C$ is a penalty (regularization) parameter, and $\langle x_i, x_j \rangle$ is the dot product between $x_i$ and $x_j$. When the data are not linearly separable in the original space (a characteristic of remote sensing data), the data are mapped into an RKHS through a mapping function $\Phi : x \to \phi(x)$. The dot product in the RKHS is defined by a nonlinear kernel function $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. When the kernel function is calculated for all $N$ samples, it generates a square matrix $K \in \mathbb{R}^{N \times N}$ that contains the pairwise similarities between the samples. Note that $K$ is a positive definite and symmetric matrix.
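As a small worked illustration of the standard soft-margin dual (maximize the sum of the multipliers minus half the label-weighted quadratic form of the kernel matrix, subject to the box and balance constraints), the sketch below evaluates the objective and checks feasibility for a candidate solution. The helper names `dual_objective` and `is_feasible` and the two-point toy problem are hypothetical; this is not a solver, and real SVM packages such as LibSVM solve the optimization internally.

```python
def dual_objective(alpha, y, K):
    """Value of sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha, y, C, tol=1e-9):
    """Check the constraints 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return in_box and balanced


# Two 1-D points on either side of the origin, with a linear kernel.
X = [[1.0], [-1.0]]
y = [1, -1]
K = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

alpha = [0.5, 0.5]          # candidate multipliers
value = dual_objective(alpha, y, K)
feasible = is_feasible(alpha, y, C=10.0)
```

For this toy problem the separating hyperplane has unit weight, so the dual value of 0.5 matches half the squared norm of the weight vector, as expected at the optimum of a separable problem.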
Among all types of kernel functions, the most well-known is the Radial Basis Function (RBF) kernel, $k(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$, where $\sigma$ is the bandwidth. Thus, the SVM using the RBF kernel requires fixing two parameters, $\sigma$ and $C$. These parameters are tuned by cross-validation over a grid of $(C, \sigma)$ values. For a comprehensive review of kernel methods, see [44].
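A minimal sketch of how an RBF kernel matrix can be built from a set of samples is given below. It assumes the common parametrization in which the squared Euclidean distance is divided by twice the squared bandwidth; other software may parametrize the bandwidth differently (for example, through an inverse-width parameter), so the function names and the scaling here are illustrative assumptions.

```python
import math


def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    n = len(X)
    return [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]


def rbf_kernel(X, sigma):
    """RBF kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    D = squared_distances(X)
    return [[math.exp(-d / (2.0 * sigma ** 2)) for d in row] for row in D]


# Two made-up samples at Euclidean distance 5 from each other.
X = [[0.0, 0.0], [3.0, 4.0]]
K_rbf = rbf_kernel(X, sigma=5.0)
```

The diagonal is always one, and the off-diagonal similarity decays with distance at a rate set by the bandwidth, which is why the bandwidth has to be tuned together with the penalty parameter.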

Random Forest Kernel
This section presents the RFK. The main idea of the RFK is to calculate the pairwise similarities directly from the data by means of a discriminative model (i.e., by learning the classification boundaries between classes) [45]. A discriminative approach divides the data into partitions through algorithms such as clustering or random forest [35]. In these cases, the fundamental idea is that samples that fall in the same partition are similar, and samples that fall in different partitions are dissimilar (e.g., the Random Partition kernel [29]).
Let $\rho$ be a random partition of the dataset. The Random Partition kernel is the average number of times that two samples ($x_i$ and $x_j$) fall in the same partition, that is:

$$K(x_i, x_j) = \frac{1}{G} \sum_{g=1}^{G} I\left[\rho_g(x_i) = \rho_g(x_j)\right],$$

where $I$ is the indicator function. $I$ is equal to one when $\rho_g(x_i) = \rho_g(x_j)$, which means that the samples $x_i$ and $x_j$ fall in the same partition; otherwise, it is zero [12]. In addition, $g$ indexes the partitions of the data created by the eligible algorithms, and $G$ is their total number. Following the idea of the Random Partition kernel, the RFK is generated by creating random partitions with the RF classifier. As stated before, RF is composed of trees. Each tree splits the data into homogeneous terminal nodes [29,46]. Thus, the RFK uses the partitions obtained by the terminal nodes to calculate the similarity among the data. In this instance, if two samples land in the same terminal node of a tree, the similarity is equal to one; otherwise, it is zero. The similarity of each tree, $K_{t_n}(x_i, x_j)$, is obtained by [29]:

$$K_{t_n}(x_i, x_j) = \sum_{t \in t_n} I\left[x_i \in t\right] \, I\left[x_j \in t\right],$$

where $t$ is a terminal node and $t_n$ is the $n$-th tree of the RF. Then, the RFK matrix is calculated as the average of the tree kernel matrices:

$$K_{RFK} = \frac{1}{N_t} \sum_{n=1}^{N_t} K_{t_n},$$

with $N_t$ being the number of trees used in the RF.
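The construction described above can be sketched in plain Python once the terminal node reached by every sample in every tree is known. The function names and the hard-coded terminal-node ids below are hypothetical; in practice the ids would come from a fitted forest rather than being typed in by hand.

```python
def tree_kernel(leaf_ids):
    """Kernel of one tree: 1 if two samples share a terminal node, else 0."""
    n = len(leaf_ids)
    return [[1.0 if leaf_ids[i] == leaf_ids[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_forest_kernel(leaves_per_tree):
    """Average the per-tree kernels over all trees of the forest."""
    n_trees = len(leaves_per_tree)
    n = len(leaves_per_tree[0])
    counts = [[0.0] * n for _ in range(n)]
    for leaf_ids in leaves_per_tree:
        Kt = tree_kernel(leaf_ids)
        for i in range(n):
            for j in range(n):
                counts[i][j] += Kt[i][j]
    return [[c / n_trees for c in row] for row in counts]


# Hypothetical terminal-node ids of 4 samples in a 3-tree forest
# (one inner list per tree, one id per sample).
leaves_per_tree = [
    [0, 0, 1, 1],
    [2, 2, 2, 3],
    [5, 4, 4, 4],
]
K_rfk = random_forest_kernel(leaves_per_tree)
```

Samples 0 and 1 share a terminal node in two of the three trees, so their similarity is 2/3, whereas samples 0 and 3 never meet and have similarity zero. With a fitted scikit-learn forest, the terminal-node ids could be obtained with the `apply` method of `RandomForestClassifier`, which returns one row per sample and would therefore need to be transposed for this sketch; this is an alternative to the R workflow mentioned in the paper, not the authors' code.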
Moreover, RF can also be used to identify the most important features (MIF) for high dimensional datasets, and an additional RFK can be derived from a subsequent RF model trained with those features only (RFK-MIF), which can be used in an SVM (SVM-RFK-MIF).
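The feature selection step behind the RFK-MIF can be illustrated with a short sketch: rank the features by an importance score and keep only the top-ranked columns before training the second RF. The importance values below are invented for the example, and the helper names are hypothetical.

```python
def top_k_features(importances, k):
    """Indices of the k features with the largest importance values."""
    order = sorted(range(len(importances)),
                   key=lambda f: importances[f], reverse=True)
    return order[:k]


def select_columns(X, columns):
    """Keep only the given feature columns of a row-major data matrix."""
    return [[row[c] for c in columns] for row in X]


# Made-up importance scores for five features and a tiny data matrix.
importances = [0.05, 0.40, 0.10, 0.30, 0.15]
X = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10]]

top = top_k_features(importances, k=2)
X_mif = select_columns(X, top)
```

A second forest trained on the reduced matrix would then provide the terminal nodes from which the RFK-MIF is computed.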
To assess the dependence of the applied kernels on an ideal kernel, we adopt the Hilbert-Schmidt Independence Criterion (HSIC) [47]. Given a kernel matrix $K_x$ for the training dataset $X$ and the ideal kernel matrix $K_y$ for the class vector $Y$, the HSIC is obtained as follows [47]:

$$\mathrm{HSIC}(K_x, K_y) = \frac{1}{m^2} \mathrm{Tr}\left(K_x H K_y H\right),$$

where $\mathrm{Tr}$ is the trace operator, $H$ is the centering matrix, and $m$ is the number of samples. It has been shown that lower values of HSIC indicate a poorer alignment of the kernel with the target (ideal) kernel and, consequently, a lower class separability.
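A minimal sketch of an empirical HSIC computation is shown below. It assumes the biased estimator that divides the trace by the squared number of samples (some references divide by the square of the number of samples minus one instead), and it takes the ideal kernel to be one for pairs of samples with the same label and zero otherwise. All function names are hypothetical.

```python
def ideal_kernel(y):
    """Ideal kernel: 1 where two samples share a class label, else 0."""
    return [[1.0 if a == b else 0.0 for b in y] for a in y]


def matmul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hsic(Kx, Ky):
    """Empirical HSIC: Tr(Kx H Ky H) / m^2 with H = I - (1/m) 1 1^T."""
    m = len(Kx)
    H = [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)]
         for i in range(m)]
    M = matmul(matmul(Kx, H), matmul(Ky, H))
    return sum(M[i][i] for i in range(m)) / m ** 2


y = [0, 0, 1, 1]
Ky = ideal_kernel(y)
uninformative = [[1.0] * 4 for _ in range(4)]   # every pair equally similar

hsic_perfect = hsic(Ky, Ky)          # kernel identical to the ideal one
hsic_flat = hsic(uninformative, Ky)  # kernel carrying no class information
```

A kernel that reproduces the class structure gets the largest value on this toy example, while a constant kernel is removed entirely by the centering and scores zero.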

Data and Ground Truth
Two high-dimensional datasets, a time series of multispectral WorldView-2 (WV2) images and one hyperspectral AVIRIS image, are used to evaluate the performance of the RFK. The first dataset was used to illustrate our work on a complex problem, namely that of classifying time series of VHR images to map crops. The second dataset was selected because it has been used as a benchmark dataset in several papers [48,49].

WorldView-2
A time series of WV2 images acquired over the Sukumba area in Mali, West Africa, in 2014 is used to illustrate this study. The WV2 sensor provides data for eight spectral features at a spatial resolution of 2 m. This dataset includes seven multispectral images that span the cropping season [50]. The acquisition dates include May, June, July, October, and November. Ground truth labels for five common crops in the test area, namely cotton, maize, millet, peanut, and sorghum, were collected through fieldwork. These images and the corresponding ground data are part of the STARS project. This project, supported by the Bill and Melinda Gates Foundation, aims to improve the livelihood of smallholder farmers. The Sukumba images are atmospherically corrected and co-registered, and the trees and clouds are masked [50]. Figure 3a,b shows the study area and the 45 fields contained within the database.

AVIRIS
A hyperspectral image acquired by the AVIRIS sensor over Salinas Valley (CA, USA) on 9 October 1998 [13] is used to illustrate this study. The Salinas dataset is atmospherically corrected, and although the image contains 224 bands, they were reduced to 204 by removing the water absorption bands (i.e., bands 104-108, 150-163, and 224). AVIRIS provides a spatial resolution of 3.7 m. Ground truth labels are available for all fields, and these labels contain 16 classes including vegetables, bare soils, and vineyard fields.

Preprocessing and Experimental Set-Up
In this section, we describe the preprocessing and main steps of our work, which are also outlined in Figure 4. The boxes with Sukumba dataset indicate steps that were only applied to this dataset, and the rest of the boxes indicate steps applied to both datasets.

Preprocessing
As shown in Figure 4, the accuracy of the classifiers was analyzed with respect to the number of features. Table 1 shows the number of samples, features, and classes for each dataset. Additional features were generated (Table 2) for the Sukumba dataset by obtaining Vegetation Indices (VIs) and Gray-Level Co-Occurrence Matrix (GLCM) features from the spectral bands. These additional features were concatenated with the original spectral features to form an extended dataset for Sukumba.
Table 2. List of VIs used in this study together with a short explanation of them.

Formula Description
NDVI is a proxy for the amount of vegetation, and helps to distinguish the vegetation from the soil while minimizing the topographic effects, though it does not eliminate the atmospheric effects [51].
DVI also helps to distinguish between soil and vegetation, yet it does not deal with the difference between the reflectance and radiance caused by the atmosphere or shadows [52].
SAVI is similar to the NDVI, yet it suppresses the soil effects by using an adjustment factor, L, which is a vegetation canopy background adjustment factor. L varies from 0 to 1 and often requires prior knowledge of vegetation densities to be set [53].
MSAVI is a developed version of SAVI where the L-factor is dynamically adjusted using the image data, and MSAVI2 is an iterated version of MSAVI [54].
TCARI indicates the relative abundance of chlorophyll using the reflectance at the wavelengths of 700 (i.e., R700), 670 and 550, and reduces the background (soil and non-photosynthetic components) effects compared to the initial versions of this index [55].
EVI was developed to improve the NDVI by optimizing the vegetation signal, using the blue reflectance to correct for soil background and atmospheric influences [56].
The Sukumba dataset, which originally contains 56 bands, was extended with the Normalized Difference Vegetation Index (NDVI), Difference Vegetation Index (DVI), Ratio Vegetation Index (RVI), Soil Adjusted Vegetation Index (SAVI), Modified Soil-Adjusted Vegetation Index (MSAVI), Transformed Chlorophyll Absorption Reflectance Index (TCARI), and Enhanced Vegetation Index (EVI), increasing the number of features to 105. Next, the number of features for the Sukumba dataset was extended by adding the GLCM textures to the spectral features and VIs. Texture analysis using the Gray-Level Co-Occurrence Matrix is a statistical method of examining texture that considers the spatial relationship of pixels [57]. The GLCM textures derived for the Sukumba dataset are presented and explained comprehensively in [58]. For each spectral feature, statistical textures including angular second moment, correlation, inverse difference moment, sum variance, entropy, difference entropy, information measures of correlation, dissimilarity, inertia, cluster shade, and cluster prominence are obtained [58]. Concatenating the spectral, VI and GLCM features increases the number of features to 1057. The Salinas dataset, with 204 features, is used as a benchmark dataset with its original number of features.
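To make the feature extension concrete, the sketch below computes a few of the listed indices from per-pixel reflectances and reproduces the feature count of the first extension. The formulas are the commonly cited textbook forms and may differ in detail from those used in the paper; the soil factor default `L=0.5`, the ratio index as near-infrared over red, and the reflectance values are assumptions made for illustration.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def dvi(nir, red):
    """Difference Vegetation Index."""
    return nir - red


def rvi(nir, red):
    """Ratio Vegetation Index (taken here as NIR over red)."""
    return nir / red


def savi(nir, red, L=0.5):
    """Soil Adjusted Vegetation Index with soil adjustment factor L."""
    return (1.0 + L) * (nir - red) / (nir + red + L)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with its usual coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Made-up surface reflectances of a vegetated pixel.
nir, red, blue = 0.5, 0.1, 0.05
indices = {
    "NDVI": ndvi(nir, red),
    "DVI": dvi(nir, red),
    "RVI": rvi(nir, red),
    "SAVI": savi(nir, red),
    "EVI": evi(nir, red, blue),
}

# Feature bookkeeping: 8 bands and 7 indices for each of the 7 images.
n_bands, n_dates, n_indices = 8, 7, 7
n_spectral = n_bands * n_dates                   # 56 spectral features
n_with_vis = n_spectral + n_indices * n_dates    # 105 features with VIs
```

The counts follow from computing every index once per acquisition date, which is how 56 spectral features grow to 105.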

Experimental Set-Up
First, the polygons of the Sukumba dataset were split into four sub-polygons of approximately the same size to extract the training and test samples. Unlike a random selection of training and test samples, this step avoids selecting nearby samples for the training and test sets, which would inflate the performance of the classifiers. Two sub-polygons were used to choose the training samples and the other two to choose the test samples. Both the training and test sets were split into ten random subsets, with a balanced number of samples per class (130 and 100 samples per class for training and test, respectively). Random sampling was used for the Salinas dataset (as in previous studies using this dataset). The samples were randomly split into training and test sets, and 10 subsets were selected randomly from the training and test sets separately, with the number of samples per class balanced (again, 130 and 100 samples per class for training and test).
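One way to draw class-balanced subsets like those described above is sketched here. The sketch assumes that the subsets are disjoint and that the pool of labelled pixels is large enough for every class; both are assumptions, since the text does not specify them, and the function name and the toy pool are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subsets(labels, n_subsets, n_per_class, seed=0):
    """Draw disjoint subsets of sample indices with exactly
    `n_per_class` samples of every class in each subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    subsets = [[] for _ in range(n_subsets)]
    for label, indices in by_class.items():
        if len(indices) < n_subsets * n_per_class:
            raise ValueError(f"not enough samples for class {label!r}")
        rng.shuffle(indices)
        for s in range(n_subsets):
            start = s * n_per_class
            subsets[s].extend(indices[start:start + n_per_class])
    return subsets


# Hypothetical pool: 3 classes with 40 labelled pixels each.
labels = [c for c in ("cotton", "maize", "millet") for _ in range(40)]
subsets = balanced_subsets(labels, n_subsets=10, n_per_class=4)
```

For the set-up in the text, the same call would use `n_subsets=10` with `n_per_class=130` on the training pool and `n_per_class=100` on the test pool.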
In all the experiments, the classifier parameters had to be optimized. The number of trees in RF was set to 500, according to the literature. The mtry parameter partially influences the classification results of RF [11,39]. Hence, we explored the influence of mtry on the SVM-RFK classification results. First, an RFK was obtained by training RF with the default value of this parameter. Next, an RFK was obtained by optimizing the mtry parameter for RF in the range $[N_f^{1/2} - 10, N_f^{1/2} + 10]$ in steps of two. Then, the RFKs were obtained from the corresponding RF classifiers.
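The mtry search grid can be sketched as follows, assuming the range is centred on the default mtry (the square root of the number of features, as stated in the Random Forest section). Rounding the centre to an integer and discarding candidates outside the valid range of one to the number of features are additional assumptions made here, because the text does not say how such cases were handled.

```python
import math


def mtry_candidates(n_features, half_width=10, step=2):
    """Candidate mtry values around sqrt(n_features), in steps of `step`,
    keeping only values between 1 and n_features."""
    centre = round(math.sqrt(n_features))
    grid = range(centre - half_width, centre + half_width + 1, step)
    return [m for m in grid if 1 <= m <= n_features]


salinas_grid = mtry_candidates(204)    # 204 spectral features
extended_grid = mtry_candidates(1057)  # extended Sukumba dataset
spectral_grid = mtry_candidates(56)    # Sukumba spectral features only
```

For 204 and 1057 features the grid contains the 11 candidates mentioned in the text; for 56 features part of the nominal range falls below one, so this sketch returns fewer candidates.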
Taking advantage of the ability of RF to select the most important features in high dimensional datasets, this method was used to select the top features in the extended dataset of Sukumba. The feature importance values provided by RF were used to select the 100 MIF, and an RFK was obtained using a subsequent RF model trained with these 100 features. When using RFKs in an SVM, a 5-fold cross-validation approach was used to find the optimal C value in the range [5,500]. For the RBF kernel, we used the same range for the C parameter, and the optimum bandwidth was found using the range [0.1, 0.9] of the quantiles of the pairwise Euclidean distances ($D = \|x_i - x_j\|^2$) between the training samples. In all the cases, the one-versus-one multiclass strategy implemented in LibSVM [59] was used. An equal number of 11 candidates was considered when optimizing mtry for RF, as well as the bandwidth parameter of SVM-RBF. Classification results are compared in terms of their Overall Accuracy (OA), their Cohen's kappa index (κ), the F-scores of each class, and the timing of the methods. The computational times for each classifier were estimated using the ksvm function in the kernlab package of R [60]. The built-in and custom kernels of this package were used to obtain the RBF and RFK classifications in an SVM, respectively. To obtain the RF models and RFKs, the randomForest package of R was used [61]. In addition, the generated RF-based and RBF kernels are compared through both visualization and HSIC measures. Finally, crop classification maps are provided for the best classifiers.
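The three accuracy measures named above can all be derived from a confusion matrix. The sketch below uses their standard definitions; the function names and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are reference classes, columns are predicted classes."""
    index = {c: k for k, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(map(sum, cm))
    return sum(cm[k][k] for k in range(len(cm))) / total


def cohen_kappa(cm):
    """Agreement corrected for the agreement expected by chance."""
    total = sum(map(sum, cm))
    po = overall_accuracy(cm)
    pe = sum(sum(cm[k]) * sum(row[k] for row in cm)
             for k in range(len(cm))) / total ** 2
    return (po - pe) / (1.0 - pe)


def f_scores(cm):
    """Per-class F-score: harmonic mean of precision and recall."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy example with two classes and one misclassified sample.
y_true = ["maize", "maize", "maize", "millet", "millet", "millet"]
y_pred = ["maize", "maize", "millet", "millet", "millet", "millet"]
cm = confusion_matrix(y_true, y_pred, classes=["maize", "millet"])
oa, kappa, f = overall_accuracy(cm), cohen_kappa(cm), f_scores(cm)
```

In the experiments described in the text, such values would be computed for each of the ten subsets and then averaged.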

Results and Discussion
This section presents the classification results obtained with the proposed RF-based kernels and with the standard RF and SVM classifiers. All results were obtained by averaging the results of the 10 subsets used in each experiment. Results obtained with the default value of mtry are denoted by RF_d and RFK_d, and those obtained with optimized mtry are denoted by RF and RFK.
The OA and κ index averages of the ten subsets are shown in Table 3 and Figure 5. In both cases, Sukumba and Salinas, the results show high accuracies for all the classifiers for the spectral features. The computational times for each classifier are depicted in Figure 6. Table 3 and Figure 5 show that the three classifiers compete closely in the experiments using only spectral features. Comparing SVM-RFK and RF, SVM-RFK improves on the results of RF in terms of OA and κ for both the Sukumba and Salinas datasets. When only the spectral features are used, the RFK improvement is marginal. Optimizing the mtry parameter also helps RF and SVM-RFK to marginally outperform the models with the default value of mtry. Although RF and RFK obtain better results when the mtry parameter is optimized, the marginal gain does not justify the higher optimization cost, so the optimization can be avoided (Figure 6). This also makes evident that optimizing the RF parameters is not crucial for obtaining an RFK.
Focusing on the spectral features, SVM-RBF yields slightly better results than SVM-RFK in terms of OA and κ, reaching a difference in OA of 1.41% and 0.74% for the Salinas and Sukumba datasets, respectively. However, considering the Standard Deviation (SD) of these OAs, the performances of the classifiers are virtually identical (Table 3). Moreover, Figure 6 shows that the computational time for the RFK is considerably lower than that for the RBF kernel for Salinas, specifically without the mtry optimization. For the spectral features of Sukumba, the RFK and RBF computational times are at about the same level.
A notable fact is that the SVM-RFK results improve considerably when the Sukumba dataset is extended from 56 to 1057 dimensions, whereas the RF and SVM-RBF classifiers become less accurate with the extended dataset. For the extended Sukumba dataset, SVM-RFK outperforms SVM-RBF and RF by 4.34% and 1.48% in OA, respectively. Furthermore, the RFK yields similar results with the default and the optimized mtry, whereas the computational time is three times higher when the parameter is optimized (Figure 6). Moreover, the time required to run SVM-RFK_d is about seven times lower than that of SVM-RBF (Figure 6). This could be seen as the first evidence of the potential of RFKs to deal with data coming from the latest generation of Earth observation sensors, which are able to acquire and deliver high dimensional data at global scales. More evidence for the advantages of the RFKs is presented in Table 4.
Moreover, the HSIC measures presented in Table 5 reveal the alignment of the kernels with an ideal kernel for the training datasets. A lower separability of the classes results in a poorer alignment between the input and the ideal kernel matrices, which leads to a lower value of HSIC [47]. Focusing on the spectral features, the RFKs slightly outperform the RBF kernel for both the Salinas and Sukumba datasets, while both show almost equal alignment with an ideal kernel. The higher value of the HSIC measure for the RFKs compared to the RBF kernel is noticeable when the number of features is increased for the Sukumba dataset.
The analysis of the classification results for each class is carried out by means of the F-scores. Tables 6 and 7 show the F-scores for each classifier, spectral case and dataset. In Sukumba (Table 6), F has little variability, with standard deviations smaller than or equal to 0.04. Furthermore, all classes have an F value larger than 0.75 (i.e., a good balance between precision and recall).
The classes Millet and Sorghum have the best F values, whereas the classes Maize and Peanut are harder to classify, irrespective of the chosen classifier. Focusing on the SVM-RBF and SVM-RFK classifiers, we see that the relative outperformance of SVM-RBF in terms of OA for the spectral features (Table 3 and Figure 5) is mainly caused by the Maize and Millet classes, whereas SVM-RFK and SVM-RBF show equal F values for the classes Peanut and Sorghum, and SVM-RFK slightly improves the F value for the class Cotton compared to SVM-RBF. Moreover, SVM-RFK_d competes closely with SVM-RFK and SVM-RBF while presenting slightly poorer F values.
Table 6. F-score average (F) and standard deviation (SD) of the different classifiers using 56 features (spectral features) and 1057 features (spectral, VI, and GLCM features) for the Sukumba dataset. Notation: RF and SVM-RFK denote classifiers created with an optimized mtry value, and RF_d and SVM-RFK_d denote classifiers created with the default mtry value.

Regarding Salinas, the F values are above 0.91 for all the classes except for Grapes untrained and Vineyard untrained. For the latter two classes, the F values are respectively around 0.69 and 0.71 for the RF-based classifiers. However, SVM-RFK improves the F values to 0.76 for both these classes. In this dataset, the F values also have little variability (as in Sukumba), with standard deviations smaller than or equal to 0.05. For the Salinas dataset, SVM-RFK_d also competes closely with SVM-RFK and SVM-RBF while presenting slightly poorer F values.
Table 7. F-score average (F) and standard deviation (SD) of the different classifiers using 204 features (spectral features). Notation: RF and SVM-RFK are respectively RF and SVM-RFK with optimized mtry, and RF_d and SVM-RFK_d are respectively RF and SVM-RFK with default mtry.
A deeper analysis of the SVM-based classifiers can be achieved by visualizing their kernels. Figure 7 shows the pairwise similarity of training and test samples sorted by class. Here, we only visualize the RFK (with optimized mtry) because of the similarity of the results to RFK_d. Focusing on the spectral features, this figure shows that the kernels obtained for Salinas are more "blocky" than those obtained for Sukumba. This makes it evident that a higher number of relevant features can improve the representation of the kernel. It also shows that the RFKs generated for Sukumba are less noisy than the RBF kernels. However, the similarity values of the RFKs are lower than those obtained for the RBF kernels. The visualization of the kernels confirms the higher F values found in the Salinas dataset. A detailed inspection of the RFKs obtained from this dataset shows low similarity values for classes 8 and 15, which correspond to Grapes untrained and Vineyard untrained. As stated before, these classes have the largest imbalance between precision and recall. Increasing the number of features to 1057 by extending the spectral features of the Sukumba dataset yields a blockier RFK, because only the intraclass similarity values increase. In contrast, the RBF kernel loses class separability, because both the intraclass and interclass similarity values increase with the number of features for the Sukumba dataset; this can be observed in the kernel visualizations in Figure 7 and the F-score values in Table 6.
Focusing on the RFK, some samples have low similarity values to the other samples in their class (gaps inside the blocks). These samples could be outliers, since the RFK is based on the classes and the features, while the RBF kernel is based on the Euclidean distances between the samples. Thus, removing outliers using RF can improve the representation of the RFK. Figure 8 shows the kernel visualization of the RFK based on the 100 most important features selected by RF. As can be observed in this figure, the similarity between the samples in the same class is increased, in particular for classes one and five, compared to the kernel using all 1057 features.

Finally, we present the classification maps obtained using the classifiers trained with the spectral features. For the Sukumba dataset, we also obtain the classification maps using SVM-RFK based on the top 100 features. For visibility reasons, we only present classified fields for Sukumba and classification maps for Salinas. In particular, Figure 9 shows two fields for each of the classes considered in Sukumba. These fields were classified using the best training subset of the ten subsets, and the percentage of correctly classified pixels is shown on top of each field. In general, the SVM classifiers perform better than the RF classifiers. Focusing on the various kernels, the RFKs outperform the RBF kernel for the majority of the polygons.
Moreover, we observe a great improvement in the OA for all polygons when using the SVM-RFK-MIF. This means that RF can be used intuitively to define an RFK based on only the top 100 features, and this kernel can improve the results significantly compared to RF, SVM-RBF, and SVM-RFK.
Classification maps for Salinas and their corresponding OAs are depicted in Figure 10. In this dataset, all classifiers have difficulties with fields where Brocoli_2 (class 2) and Soil_Vineyard (class 9) are grown. Moreover, it is worth mentioning that the performance of the three classifiers is at about the same level. However, the SVM-RFK classifier has a marginally higher OA than the RF classifier, and SVM-RBF slightly outperforms SVM-RFK. This can be explained by the relatively high number of training samples used to train the classifiers compared with the dimensionality of the Salinas image. However, the computational time of the classification for SVM-RBF is higher than for RF and SVM-RFK (Figure 6).
Figure 10. Ground truth and three classification maps (with the OA (%) calculated using all the pixels in the dataset shown on top) for the RF, SVM-RBF, and SVM-RFK classifiers using the AVIRIS spectral features.

Conclusions
In this work, we evaluate the added value of using an RF-based kernel in an SVM classifier (i.e., RFK) by comparing its performance against that of standard RF and SVM-RBF classifiers. This comparison is done using two datasets: a time series of WV2 images acquired over Sukumba (Mali), and a hyperspectral AVIRIS image over Salinas (CA, USA). The obtained OAs and their SD values indicate that the three classifiers perform at about the same level in most of the experiments. Our findings show that there are alternatives to the expensive tuning process of SVM-RBF classifiers. The proposed RFK led to competitive results for the datasets with a lower number of features while reducing the cost of the classification. Our findings show that optimizing mtry for RF leads to minor changes in the SVM-RFK. Thus, with a small trade-off in OA for the datasets with a low number of features, the cost of the classification can be reduced by skipping the mtry optimization. More importantly, our results show that RFKs created using high dimensional and noisy features considerably improve the classification accuracies obtained by the standard SVM-RBF while reducing the cost of classification. For the higher number of features, the SVM-RFK results are also slightly better than the ones obtained by the standard RF classifier. Moreover, by exploiting the RF characteristics to define the most important features, the results of the classification for SVM-RFK improve considerably, with an OA around 7% better than that obtained with an SVM-RBF classifier. In short, our results indicate that RFK can outperform standard RF and SVM-RBF classifiers in problems with high data dimensionality. Further work is required to evaluate this kernel in additional classification problems and against other land cover classification approaches (e.g., based on deep learning). Other characteristics of RF (e.g., outlier detection) can be exploited to estimate the RFK more accurately.
Furthermore, the proposed RFK is based on a rough estimation of the similarity between samples according to their terminal node. Future work is required to design and test more advanced and alternative estimations of similarity using RF classification results.