Dimensionality Reduction and Feature Selection for Object-Based Land Cover Classiﬁcation based on Sentinel-1 and Sentinel-2 Time Series Using Google Earth Engine

: Mapping Earth’s surface and its rapid changes with remotely sensed data is a crucial task to understand the impact of an increasingly urban world population on the environment. However, the impressive amount of available Earth observation data is only marginally exploited in common classiﬁcations. In this study, we use the computational power of Google Earth Engine and Google Cloud Platform to generate an oversized feature set in which we explore feature importance and analyze the inﬂuence of dimensionality reduction methods to object-based land cover classiﬁcation with Support Vector Machines. We propose a methodology to extract the most relevant features and optimize an SVM classiﬁer hyperparameters to achieve higher classiﬁcation accuracy. The proposed approach is evaluated in two di ﬀ erent urban study areas of Stockholm and Beijing. Despite di ﬀ erent training set sizes in the two study sites, the averaged feature importance ranking showed similar results for the top-ranking features. In particular, Sentinel-2 NDVI, NDWI, and Sentinel-1 VV temporal means are the highest ranked features and the experiment results strongly indicated that the fusion of these features improved the separability between urban land cover and land use classes. Overall classiﬁcation accuracies of 94% and 93% were achieved in Stockholm and Beijing study sites, respectively. The test demonstrated the viability of the methodology in a cloud-computing environment to incorporate dimensionality reduction as a key step in the land cover classiﬁcation process, which we consider essential for the exploitation of the growing Earth observation big data. To encourage further research and development of reliable workﬂows, we share our datasets and publish the developed Google Earth Engine and Python scripts as free and open-source software.


Introduction
Mapping Earth's surface and its rapid changes with remotely sensed data are a crucial task to help understand the impact of an increasingly urban world population on the environment.The information provided by urban scene classification and change maps are quite important for urban planners, environmental engineers, and decision makers in general.Land Use/Land Cover (LULC) classes are established categorical variables that represent the status of the Earth's surface in a viable way.In the last couple decades, several studies have investigated the classification of urban scenes using remote sensing data.Remotely sensed data provides cheap, comprehensive, and easy to use source to map the location and the spatial extent of different LULC classes.With its rich information content, multispectral optical images have been used intensively for urban scene classification [1][2][3][4][5][6].
Unlike optical, radar images have lower spectral information content.Nevertheless, these images are not affected by atmospheric and solar illumination conditions.Several studies have shown the potential of these images in urban classification [7][8][9][10][11][12][13]. The successful use of remote sensing data for urban scene classification depends on several considerations (e.g., image spatial resolution, acquisition time, spectral information, classification scheme, etc.).Another factor of crucial importance and that has strong impact on the classification accuracy, in general, is the input features and their quality.It is quite common to use the raw input spectral images (e.g., green, red, NIR, radar cross section, etc.), or/and product derived from these spectral bands (e.g., NDBI, NDVI, NDWI) as input features.Texture measures (e.g., GLCM texture) have also played important role in increasing the discrimination between different LULC classes [14,15].Features that can be used for scene classification are not restricted to raw image spectral bands or products derived from them.Several studies have shown that combining features extracted from images acquired at different times could help significantly improving the classification accuracy [11,[16][17][18][19].The basic idea is that despite the fact that some LULC classes would look very similar in one season/time, they tend to look quite different in another.By combining images acquired in different times, therefore, it is possible increase classes' separability.
Because of their different modalities, optical and SAR provide quite different information about earth surface.The response of earth surface materials to the short wavelength energy used by optical sensors depends to a high extent on the biochemical properties of the observed objects.Conversely, radar response is often associated with other factors such as objects geometrical properties (e.g., roughness), and its moisture content.It is therefore natural to think of SAR and optical images as complementary to each other [20].Several studies have investigated the possibility to improve classification accuracy through the fusion of SAR and optical images [21][22][23].Fusion can be carried out at the feature, pixel, or information level.The former approach consists of combining features extracted from different data sources into a one augmented feature vector; it can then be an input of any classification algorithm.Because of the simplicity involved, this approach has been the focus of several studies [24][25][26][27][28].In fact, the multitemporal image classification approach discussed in the previous section can be envisaged as a feature-level fusion technique; this approach was further extended by the fusion of multitemporal and multi-source features extracted from Radarsat-2 and QuickBird images for urban land-cover mapping [29].Despite the availability of a large selection of features that can be used for the classification task, most studies only consider a few of them.This could be attributed to the fact that in many cases features will be highly correlated, i.e., they contain redundant information.Including such features would serve nothing other than slowing down the classification process.Another reason is that features computation (e.g., GLCM textures) is time consuming and handling large dataset with many features is not easy in common image classification software [30].For this reason, the analyst faces the problem of selecting few out of hundredth of available features (e.g., spectral band, GLCM textures, and indices).Usually, expert knowledge is used to manually select a few, allegedly promising features, even though numerous algorithms exist that could perform the dimensionality reduction task automatically [31][32][33][34][35][36].
Dimensionality reduction is the process of simplifying the classification task by either transforming the original features to a representation in a set of lower dimensionality (feature extraction) or by removing redundant and irrelevant features (feature selection) [37].A well-known feature extraction method is principal component analysis (PCA) [38].It projects the input feature space to a new space in which feature will be uncorrelated.It also compresses (i.e., dimensionality reduction) the information into few output features that contain most of the information available in the original space.Unlike PCA, linear discriminant analysis (LDA) is a supervised feature extraction method that projects the input feature space in a way that maximizes the separation between the classes [39].With large representative training data, LDA is expected to outperform PCA since it takes samples classes into consideration [40].Feature selection methods can be grouped into three main categories: filters, wrappers, and embedded methods [37,41].While filters rank features according to some predefined statistic and remove lowest ranking features, wrappers utilize learning machines to find the best performing subset of features.Embedded methods find optimal feature subsets during training and are integral parts of certain learning machines.In particular, filters are independent of the learning machine and they evaluate feature importance based on data characteristics [42].Univariate filters compute a feature's quality individually, assuming independence between the features, while multivariate filters take possible dependencies into account and perform assessment in batches.
The frequent acquisition of satellite imagery leads to an impressive amount of data that is collected and published on a daily basis.An efficient means of access and an infrastructure that allows large-scale processing of this big EO data are needed to fully exploit it.Google Earth Engine (GEE) is a cloud-based platform for planetary-scale geospatial analysis that utilizes the massive computational capabilities of Google's servers.It facilitates access to geographical and remote sensing data in their "multi-petabyte analysis-ready data catalog" [43].GEE allows for effective processing through subdivision and distribution of computations.In general, GEE lowers the threshold for large-scale geospatial computation and makes it possible "for everyone" [43].Though, GEE offers valuable opportunities for EO data analysis it could be vastly enhanced with an integration with Google Cloud Platform (GCP) (https://cloud.google.com/).GCP provides Google Compute Engine, which offers scalable and customizable virtual machines for cloud computing and Google Cloud Storage, a service for storing and accessing data.It allows different types of storage based on the desired access-frequency and the intended use.On GCP, external libraries can be used for further processing and deeper analysis of data derived from GEE.One example is Scikit-learn, an open-source machine learning Python module [44].It contains a variety of machine learning models for classification, regression, and clustering [45].Besides implementations of machine learning models, it contains tools for model selection, cross-validation, hyperparameter tuning, and model evaluation as well as dimensionality reduction, feature decomposition, and feature selection.
In this study, we will utilize the computational power of Google Earth Engine (GEE) and Google Cloud Platform (GCP) to generate an oversized feature set extracted form Sentinel1 and Sentinel2 multitemporal images.We will explore feature importance and analyze the potential of different dimensionality reduction methods.A large feature set is evaluated to find the most relevant features which discriminate the classes well and thereby contribute most to achieve high classification accuracy.In doing so, we present an automated alternative to the sensitive knowledge-based but tedious and sometimes biased selection of input features.Three methods of dimensionality reduction, i.e., linear discriminant analysis (LDA), mutual information-based (MI), and Fisher-criterion-based (F-Score), will be tested and evaluated.LDA is a feature extraction method that transform the original feature space into a projected space of lower dimensionality.MI and F-score belong to the group of filter-based univariate feature selection methods that rank and filter the features according to some statistic.We evaluate the optimized feature sets against an untreated feature set in terms of classification accuracy, training and prediction performance, data compression, and sensitivity to training set sizes.
For features classification, a support vector machine (SVM) is chosen.SVM is a supervised non-parametric machine learning algorithm [46].Although it is commonly used for classification tasks, it has also been used for regression.SVM maps the input features space to a higher dimensional space using kernel functions (e.g., radial basis function).The training samples are separated in the new space by a hyperplane (defined by the support vectors) that guarantee the largest margin between the classes.It has been used successfully in different types of remote sensing applications-for example, classification [47][48][49], change detection [50][51][52], and in image fusion [53].Depending on the size of the training sample, SVM is known for being slow during the training phase.However, the classification phase is quite fast since it only depends on a few training samples known as the support vector.The spatial resolutions of Sentinel-1 (S1) SAR data and Sentinel-2 (S2) MSI sensors are moderate (e.g., 10 m) and the classification will often use the pixel-based approach.Geographic object-based image analysis (GEOBIA) is often applied when using high spatial resolution data [54,55].However, given the large extent of the study areas, which surpasses even the largest case studies compared in a recent review of object-based methods by a factor of a thousand [56], we adopted an object-based classification even though the analysis will be based on S1 and S2 images.The application of GEOBIA to moderate spatial resolution is not very common, however is not rare either [57].The object-based paradigm allows us to reduce the computational complexity immensely, and consequently, allow us to put more emphasis on the main goal of this paper, which is the investigation of feature selection for urban scenes classification.
In summary, we provide a framework that derives features from GEE's data catalogue.We deliver a prototype for GCP that applies dimensionality reduction and feature set optimization methods and tunes the classifier's hyperparameters in an exhaustive search.Performances on the different feature sets are compared to each other with statistical measures, different visualization methods, and a non-inferiority test for the highest dimensionality reduction satisfying the chosen performance requirements.The overall aim is the exploration of feature importance in the big EO data that are available today.Understanding the feature importance leads to an improved classification accuracy by the removal of irrelevant features and to an increased classification throughput by the reduction of the datasets.We demonstrate the applicability of our methodology in two different study sites (i.e., Stockholm and Beijing) based on a multitemporal stack of S1 and S2 imagery.For the given classification scheme, we find the best feature set that achieve the highest overall accuracy.A further analysis of the feature selection allows us to evaluate the importance of individual features to the classification task.

Study Areas and Data Description
We demonstrate our methodology on two study areas.One is based on a scheme of 10 LULC classes in the urban area of Stockholm, while the other is based on 8 LULC classes scheme in the urban area of Beijing.The study areas are shown in (Figure 1) and cover 450,000 ha in Stockholm and 518,000 ha in Beijing.The classes represent the dominant land cover and land use types in the respective areas as demonstrated in [58,59], and are derived from urban planning applications and urban sprawl monitoring.The classification schemas adopted for the study areas were defined in previous projects [58,59] to monitor the urbanization process and evaluate the corresponding environmental impact; Table 1 provides an overview of the selected classes.We used reference points that were manually collected by remote sensing experts not involved in this study [58,59] and we divided them in training and validation sets.In Stockholm, these reference points are approximately uniformly distributed over most classes (≈1000 samples per class, except 500 for bare rock and wetlands).In Beijing, there are in general fewer reference points with more imbalanced classes (70 samples for urban green space up to 175 for forests).As mentioned earlier, an object-based approach has been adopted in order to reduce the computational burden and focus more on the feature selection and classification.The segmented image-objects should aim to depict the entities of the urban land cover types.For the detection of low-density built-up areas, which are a dense mix of built-up and vegetated pixels, an object-based  As mentioned earlier, an object-based approach has been adopted in order to reduce the computational burden and focus more on the feature selection and classification.The segmented image-objects should aim to depict the entities of the urban land cover types.For the detection of low-density built-up areas, which are a dense mix of built-up and vegetated pixels, an object-based approach is beneficial.Additionally, the separation of high-density built-up areas and roads can benefit from the geometric properties of the objects (e.g., minimum enclosing circle, minimal area rotated rectangle, and least-squared fitted ellipse).
For both study areas, the segmentation is performed using a multiresolution module of eCognition software [60].The segmentation is based on the S2 10 m spectral bands.The segments are created using a scale parameter of 50, a shape vs. color ratio of 0.7:0.3 and a compactness vs. smoothness ratio of 0.7:0.3.The parameters were selected based on repeated tests for a subjectively good segmentation (compare Figures 2-4).The criteria of this selection were mainly the proper distinction of segments containing LDB (low density built-up) in comparison to UGS (urban green spaces); and R (roads and railroads) in comparison to HDB (high density built-up).The adequacy of a segmentation layout can only be assessed in the context of an application.Consequently, the traditional approach has been to use human expert to evaluate the quality of the segmentation [59].However, when facing big EO data, the segmentation process and determination of its parameters needs to be automated.Several approaches have been proposed [61][62][63] and should be evaluated for efficient integration in the workflow.As mentioned earlier, an object-based approach has been adopted in order to reduce the computational burden and focus more on the feature selection and classification.The segmented image-objects should aim to depict the entities of the urban land cover types.For the detection of low-density built-up areas, which are a dense mix of built-up and vegetated pixels, an object-based approach is beneficial.Additionally, the separation of high-density built-up areas and roads can benefit from the geometric properties of the objects (e.g., minimum enclosing circle, minimal area rotated rectangle, and least-squared fitted ellipse).For both study areas, the segmentation is performed using a multiresolution module of eCognition software [60].The segmentation is based on the S2 10 m spectral bands.The segments are created using a scale parameter of 50, a shape vs. color ratio of 0.7:0.3 and a compactness vs. smoothness ratio of 0.7:0.3.The parameters were selected based on repeated tests for a subjectively the classifier hyperparameters [64].The estimated hyperparameters are then used to train the classifier.The results are analyzed using a non-inferiority test to detect the best classifier using the least number of features.The generated reports and graphs give useful insights that can be used to refine the first step of feature computation and reduce the data load.If a satisfying classifier has been found, in (3) the prediction of the land cover classes is performed, and a resulting land cover map is produced in GEE

Feature Set and Classifier Optimization
Step two is the most important step in the abovementioned workflow.Figure 4 shows this step in more details.It mainly consists of features selection/extraction step intertwined with a crossvalidated exhaustive grid search for the optimum parameters of the machine learning model, i.e., the SVM.The outputs are a set of selected features together with the best cross-validated classifier as specified by a selected metric (i.e., overall accuracy) and a table of the full cross-validation results.The latter is used to find improved feature sets with the non-inferiority test.The workflow performs the following steps:

•
The first step is to scale the input features.Scaling ensures that all features have the similar ranges of value, which are beneficial or even essential for some dimensionality reduction methods as well as for the SVM classifier.We applied scaler that uses the second and third quantiles of the data to rescale the data linearly in the range [0,1].

•
The second step is a reduction of the feature set dimensionality either through the feature extraction or feature selection methods.

•
Given the selected feature set, the last step is to find, via grid search, the optimum set of hyperparameters for the SVM classifier.
This workflow is computationally excessive operation, but as it consists of many independent Using the reference data, the segments were labeled based on the class of the reference point located in the segment.Inside Stockholm lies a total of 5800 labeled segments and an additional 200,000 unlabeled segments (~3% labeled).In Beijing 1100 segments are labeled and 458,000 segments are unlabeled (~0.25% labeled).For Stockholm and Beijing, the analysis considers S1 and S2 images from the summer of 2015 and 2016, respectively (Table 2).For S1 images, two orbit directions (i.e., ascending and descending) with two polarizations (i.e., VV and VH) are used and treated as individual inputs.Four temporal stacks are created (one for each direction/polarization).Each temporal stack is reduced by computing its temporal mean and standard deviation.This way, the speckle noise is reduced while still capturing temporal variations of the different LULC classes.For each mean and standard deviation image, 17 GLCM texture measures (Table 3) estimated with kernel size of 9 × 9 are computed.Finally, for each segment, the mean, standard deviation, max, and min of the abovementioned features are computed (see Table 4).The S2 stack is filtered such that only images with less than 15% cloud cover is included.All S2 spectral bands were resampled to 10 m spatial resolution.In addition to the 12 spectral bands, normalized difference-vegetation and water indices (NDVI, and NDWI) are computed and added to the S2 stack.Additionally, 17 GLCM texture measures are computed for NIR (10 m) spectral band: we could have computed GLCM textures for all S2 spectral band but this would have increased the number of features tremendously (extra 748 features) without adding substantial new information considering the high correlation of texture between the available S2 bands.Moreover, since several S1 textures have been included, choosing only NIR texture will be enough for the objective of this paper.For each segment, four statistics (i.e., mean, standard deviation, min, and max) are computed for all the available images.Finally, 12 features describing the segments' geometric properties are computed (see Table 4 for details).In total, 712 features are computed in the GEE platform and exported into GCP (see Table 4).

Methodology
An overview of our workflow is presented in Figure 3.It consists of three main steps: (1) feature computation, (2) dimensionality reduction and classifier parameters estimation, and (3) classification.In the first step, segments and reference points are imported to GEE and input time series are chosen from GEE's data catalogue.The segments are labeled based on the reference points.Features are computed as statistics for each segment and exported to GCP.Please refer to the previous section for more details about data preparation and features computation.In the second step, different dimensionality reduction methods are performed, and an exhaustive grid search is used to optimize the classifier hyperparameters [64].The estimated hyperparameters are then used to train the classifier.The results are analyzed using a non-inferiority test to detect the best classifier using the least number of features.The generated reports and graphs give useful insights that can be used to refine the first step of feature computation and reduce the data load.If a satisfying classifier has been found, in (3) the prediction of the land cover classes is performed, and a resulting land cover map is produced in GEE.

Feature Set and Classifier Optimization
Step two is the most important step in the abovementioned workflow.Figure 4 shows this step in more details.It mainly consists of features selection/extraction step intertwined with a cross-validated exhaustive grid search for the optimum parameters of the machine learning model, i.e., the SVM.The outputs are a set of selected features together with the best cross-validated classifier as specified by a selected metric (i.e., overall accuracy) and a table of the full cross-validation results.The latter is used to find improved feature sets with the non-inferiority test.The workflow performs the following steps:

•
The first step is to scale the input features.Scaling ensures that all features have the similar ranges of value, which are beneficial or even essential for some dimensionality reduction methods as well as for the SVM classifier.We applied scaler that uses the second and third quantiles of the data to rescale the data linearly in the range [0,1].

•
The second step is a reduction of the feature set dimensionality either through the feature extraction or feature selection methods.

•
Given the selected feature set, the last step is to find, via grid search, the optimum set of hyperparameters for the SVM classifier.
This workflow is computationally excessive operation, but as it consists of many independent operations it can be parallelized and thus be run efficiently on GCP using a virtual machine with many CPUs.The next few subsections describe these steps in more details.

Data Sets and Cross-Validation
For the optimization, only the labeled segments are relevant.Unlabeled segments are left aside until creating a final land cover classification map.The labeled segments are split into a set of training and testing data with a specified ratio.The testing set is withheld from the optimization step and is only used for a final assessment.The training set serves as an input to feature selection/extraction and grid search.A repeated stratified k-fold of the training set is performed for cross-validation.It splits with n repetitions the training set into k equally large folds while preserving the distribution of the target classes.Cross-validation is then performed by leaving out one of the folds for validation of the classifier that is trained on the remaining folds.The number of folds k should be chosen in consideration of the training set size and especially the number of samples in the least represented classes.The number of repetitions n should be set according to the available amount of time and the required certainty in the cross-validations.

Dimensionality Reduction Step
Because of the involved simplicity and fast implementation, we decided to test and evaluate three different methods in the dimensionality reduction step.In particular, we tested the linear discriminant analysis (LDA), mutual information (MI), and Fisher's criterion (F-Score).LDA is a supervised feature extraction method that takes training sample categories into consideration [65,66].The input feature space is projected onto a linear subspace of directions that maximize the separation between class means while minimize interclass variance [67].The number of the output dimensions must necessarily be smaller than the number of classes, which usually results in high data compression.MI and F-Score are feature selection methods from the subgroup of univariate filters.In both techniques, a certain criterion is used to rank the features according to their relevance to the target classes, and the best K features are then selected.In MI the ranking criterion is the mutual information, which is a non-negative value, measuring the mutual dependency between two random variables.It is zero if and only if the two variables are strictly independent, higher values indicating higher dependency.MI utilizes a non-parametric entropy estimation from k-nearest neighbor distances [68].F-Score rank a feature based on the ratio of the between classes variance to the within classes variances.It assigns high rank to features that maximizes the distance between classes and minimizes the within class variance.Unlike the MI, the F-Score cannot model non-linear dependencies [69].

SVM Hyperparameters Estimation
In this step, cross-validated guided exhaustive grid search over the specified SVM parameter values is implemented.The grid search requires a learning machine, i.e., a classifier, to guide the process and a parameter grid describing the classifier's settings.For each combination of parameters, grid search fits the learning machine to the training data.It then performs a prediction on withheld data and assesses the accuracy using the overall accuracy.This is repeated for each combination of training sets as specified by the chosen cross-validation method.

Non-Inferiority Test
Non-inferiority testing is a modification of the traditional hypothesis testing framework.Primarily used in clinical studies to prove that new therapies do not perform significantly worse than established therapies, when they might offer advantages such as fewer side effects, lower cost, easier application, or fewer drug interactions.This concept can be transferred to the classification problem.There, the secondary benefits could be the reduction of data, of computational complexity or of the sensitivity to the training set size.We use the non-inferiority test to find a well-performing classifier that has the secondary benefit of achieving a higher reduction in dimensionality.In contrast to the traditional hypothesis test, the non-inferiority test formulates a null hypothesis that states that the efficacy of the new method is inferior to the efficacy of the old method within the chosen non-inferiority margin, while the alternative hypothesis states that the efficacy of the new method is greater or equal to the efficacy of the old method within the non-inferiority margin.In rejecting this null hypothesis, one can be certain that the new method does not perform worse than the old method.The implementation of the non-inferiority test uses a Student's t-distribution and applies the Welch's t-test for samples with unequal variances.This test is performed using the overall accuracy as the single metric.The two parameters for this test are the non-inferiority margin and the significance level α.The non-inferiority margin describes the acceptable deviation of efficacy from the best performing method.The significance level describes the certainty of the hypothesis test as probability of falsely rejecting the null hypothesis.Both should be carefully selected for the problem at hand.If the computed p-value < α, non-inferiority is established.

Classification
In the last step, after finding the best set of features (i.e., dimensionality reduction) and the optimum SVM hyperparameter values, the classifier is trained using the whole training data set and the land cover classes for all segments are predicted.The results are joined with the segment polygons in GEE and an appropriate color scheme is applied for the presentation of the final land cover map.Finally, the test set is used to independently evaluate the accuracy of the classification map.

Stockholm Study Area
The feature set is forwarded to a virtual machine on GCP, which is specified with 16 virtual Intel Haswell CPUs, 20 GB memory, and Ubuntu 16.04.4LTS as operating system.The labeled segments were split into a training set of 70% and a withheld testing set of 30%, stratified by classes.The repeated stratified k-fold was set to three folds with three repetitions.This enables still a relatively large number of samples in each fold while enabling nine validations on the leave-one-out cross-validation.
The grid search is run separately for each feature extraction and feature selection method using the parameters shown in Table 5.The feature range column indicates the range of numbers of features that have been tested (discriminants of the LDA are referred to as features for convenience).While Default SVM uses all features, the LDA can only generate maximum one component less than the number of classes, which is why the maximum range ends at 9. All other methods were tested for a set of 30 approximately logarithmically spaced numbers of features spanning from 1 to 712.The C-and γ-parameters are in a decadic space within the range of exponents indicated in the columns start and stop; the number of parameters to test is specified in the column num.The search space C and γ has been verified visually with heat maps, as shown in Figure 5.If the highest overall accuracy was achieved at the borders of the parameter ranges, the range was extended.
The best-performing classifiers identified by the grid search are presented in Table 6.It first shows the results without a non-inferiority test.Then the results with non-inferiority margins of 0.5%, 1%, 3%, and 5% respectively are shown.A significance level of 2.5% was chosen for the non-inferiority test.The mean overall accuracy and the standard deviation of overall accuracies are shown for the nine cross-validation results.Additionally, the overall accuracy of the prediction on the 30% withheld testing segments are shown.Furthermore, the table presents numbers of features used and hence the compression of dimensionality (i.e., total number of available features divided by number of features used).The mean training and prediction times-as measured during the grid search-are shown as decrease factors compared to the default SVM.The results without the non-inferiority test show that all methods achieved better accuracies and reduced computational costs compared to the Default SVM.In terms of data compression and training and prediction time, the LDA is unsurpassable.MI and F-Score achieve the highest accuracy values on the withheld testing data; however, both still use much more features.Inside a 0.5% non-inferiority margin, MI and F-Score achieve a higher dimensionality compression, while keeping a high accuracy on withheld testing data, or in the case of F-Score even surpassing the initial result.No LDA with fewer features was found inside the 0.5% margin.When the non-inferiority margin is extended, higher dimensionality compressions can be achieved, however the accuracy on withheld testing data slightly decreases.Even with a large margin of 5%, all methods outperform the Default SVM on the withheld testing data, while achieving a drastic reduction of the dimensionality.It can also be noted that all methods apart from the Default SVM achieve higher accuracies on the withheld testing data than on the training data in the cross-validation.Taking a closer look at the results for the individual methods, the influence of the number of features can be visualized: Figure 6 shows the classifier performances during the cross-validation in the grid search highlighting the mean overall accuracy with shaded areas representing ±3σ.It can be observed that LDA surpasses the default SVM's accuracy with less than five components and reaches its maximum with all possible components (one less than the number of classes).MI shows a large variance especially when few features are used.It surpasses the results of the Default SVM between 5 and 10 selected features and achieves more reliable results with less variance thereafter.Between 30 and 500 selected features it shows an almost constant behavior.It seems that more features neither improve nor harm the classifier's performance.F-Score is more stable than MI when few features are used, though it achieves lower accuracy scores.The default SVM is surpassed with fewer features than MI and in general there is less variance in the performance while obtaining similar accuracy scores.Only when more than 500 features are included, MI and F-Score, as expected, fall back to the performance of the default SVM.
Taking a closer look at the results for the individual methods, the influence of the number of features can be visualized: Figure 6 shows the classifier performances during the cross-validation in the grid search highlighting the mean overall accuracy with shaded areas representing ±3σ.It can be observed that LDA surpasses the default SVM's accuracy with less than five components and reaches its maximum with all possible components (one less than the number of classes).MI shows a large variance especially when few features are used.It surpasses the results of the default SVM between 5 and 10 selected features and achieves more reliable results with less variance thereafter.Between 30 and 500 selected features it shows an almost constant behavior.It seems that more features neither improve nor harm the classifier's performance.F-Score is more stable than MI when few features are used, though it achieves lower accuracy scores.The default SVM is surpassed with fewer features than MI and in general there is less variance in the performance while obtaining similar accuracy scores.Only when more than 500 features are included, MI and F-Score, as expected, fall back to the performance of the default SVM.Taking a closer look at the results for the individual methods, the influence of the number of features can be visualized: Figure 6 shows the classifier performances during the cross-validation in the grid search highlighting the mean overall accuracy with shaded areas representing ±3σ.It can be observed that LDA surpasses the default SVM's accuracy with less than five components and reaches its maximum with all possible components (one less than the number of classes).MI shows a large variance especially when few features are used.It surpasses the results of the Default SVM between 5 and 10 selected features and achieves more reliable results with less variance thereafter.Between 30 and 500 selected features it shows an almost constant behavior.It seems that more features neither improve nor harm the classifier's performance.F-Score is more stable than MI when few features are used, though it achieves lower accuracy scores.The default SVM is surpassed with fewer features than MI and in general there is less variance in the performance while obtaining similar accuracy scores.Only when more than 500 features are included, MI and F-Score, as expected, fall back to the performance of the default SVM.
Taking a closer look at the results for the individual methods, the influence of the number of features can be visualized: Figure 6 shows the classifier performances during the cross-validation in the grid search highlighting the mean overall accuracy with shaded areas representing ±3σ.It can be observed that LDA surpasses the default SVM's accuracy with less than five components and reaches its maximum with all possible components (one less than the number of classes).MI shows a large variance especially when few features are used.It surpasses the results of the default SVM between 5 and 10 selected features and achieves more reliable results with less variance thereafter.Between 30 and 500 selected features it shows an almost constant behavior.It seems that more features neither improve nor harm the classifier's performance.F-Score is more stable than MI when few features are used, though it achieves lower accuracy scores.The default SVM is surpassed with fewer features than MI and in general there is less variance in the performance while obtaining similar accuracy scores.Only when more than 500 features are included, MI and F-Score, as expected, fall back to the performance of the default SVM. Figure 7 shows the confusion matrices of the prediction of the withheld testing data for the different classifiers at 0% non-inferiority-margin.The default SVM shows a high shift from roads to HDB but also from UGS to LDB and HDB and from bare rock to forests and HDB.It is notable that the shift is dominant in the direction of well-represented classes (compare Table 1).
All methods, LDA, MI, and F-Score, can remove some of these confusions.LDA reduces the shift from roads to HDB best but cannot avoid it completely, F-Score just slightly improves it, while MI shows only a minor improvement.
Roads and HDB have a very similar signature response and they are the most challenging classes in the presented classification schema; moreover, the spatial resolution of Sentinels imagery (10-20 m) is not sufficient to detect small roads.These are the main reasons why there is a higher confusion among these two classes across the results.The shifts from UGS and bare rock are reduced by all methods and best by F-Score.The default SVM achieves high accuracies of more than 90% on the remaining classes, however every method improves these further, reaching up to 99% and 100% for forest, water, and wetland classes.Roads and HDB have a very similar signature response and they are the most challenging classes in the presented classification schema; moreover, the spatial resolution of Sentinels imagery (10-20 m) is not sufficient to detect small roads.These are the main reasons why there is a higher confusion among these two classes across the results.The shifts from UGS and bare rock are reduced by all methods and best by F-Score.The default SVM achieves high accuracies of more than 90% on the remaining classes, however every method improves these further, reaching up to 99% and 100% for forest, water, and wetland classes.In general, the main data processing advantage of the proposed framework is to use GEE and GCP to reduce the computational time and the download and preprocessing time to handle S1 and S2 dense time series.GEE is efficient in producing the feature data and is free of charge for non-commercial applications; GCP can be configured to have a good balance between computational power vs. operational costs.The main bottleneck for integration of GEE and GCP is the data transmission between the two platforms; however, it is the only technical solution that we found to combine the GEE platform with advanced machine learning libraries (i.e., scikit-learn).Moreover, we adopted an object-based approach and computed all the object statistics directly in the GEE platform.Hence, we only transfer a small amount of data between these two platforms (around 200,000 sample statistics in Stockholm case for example) limiting the side effect of data transfer.Figure 8 depicts the prediction times over the number of features for the different methods.For comparison, the times of the default SVM are is shown as dashed black line.It can be observed that LDA outperforms the default SVM drastically in prediction times.The figure clearly shows the expected outcome, i.e., the prediction time grows exponentially with number of features.
Hence, we only transfer a small amount of data between these two platforms (around 200,000 sample statistics in Stockholm case for example) limiting the side effect of data transfer.Figure 8 depicts the prediction times over the number of features for the different methods.For comparison, the times of the default SVM are is shown as dashed black line.It can be observed that LDA outperforms the default SVM drastically in prediction times.The figure clearly shows the expected outcome, i.e., the prediction time grows exponentially with number of features.LDA shows a remarkably unstable behavior for variable training set sizes.Though it achieves reasonable testing scores on very small training set sizes, its performance drops drastically for medium-sized training sets.Only with very large training sets does the accuracy reach the level seen in Figure 8.We assume that the between-class scatter matrix is becoming singular for this specific number of around 500 training samples, which is a known problem when there are less training samples than input dimensions [70].One explanation could be that LDA is more sensitive to the quality of the training samples.Note that, for a given size, the training samples are selected randomly and could therefore not be representative of all the classes.The remaining gap between training scores and testing scores shows that this classifier could further be improved with larger and more balanced training sets.In fact, the decline in the training score is just starting to become visible at the largest training sets.LDA shows a remarkably unstable behavior for variable training set sizes.Though it achieves reasonable testing scores on very small training set sizes, its performance drops drastically for medium-sized training sets.Only with very large training sets does the accuracy reach the level seen in Figure 8.We assume that the between-class scatter matrix is becoming singular for this specific number of around 500 training samples, which is a known problem when there are less training samples than input dimensions [70].One explanation could be that LDA is more sensitive to the quality of the training samples.Note that, for a given size, the training samples are selected randomly and could therefore not be representative of all the classes.The remaining gap between training scores and testing scores shows that this classifier could further be improved with larger and more balanced training sets.In fact, the decline in the training score is just starting to become visible at the largest training sets.
LDA shows a remarkably unstable behavior for variable training set sizes.Though it achieves reasonable testing scores on very small training set sizes, its performance drops drastically for medium-sized training sets.Only with very large training sets does the accuracy reach the level seen in Figure 8.We assume that the between-class scatter matrix is becoming singular for this specific number of around 500 training samples, which is a known problem when there are less training samples than input dimensions [70].One explanation could be that LDA is more sensitive to the quality of the training samples.Note that, for a given size, the training samples are selected randomly and could therefore not be representative of all the classes.The remaining gap between training scores and testing scores shows that this classifier could further be improved with larger and more balanced training sets.In fact, the decline in the training score is just starting to become visible at the largest training sets.
The testing scores for MI and F-Score already surpass the default SVM at the smallest training set size.The F-Score shows slightly larger variances on the testing score compared to MI, indicating its higher dependency on meaningful training samples.The accuracy of both methods can be increased with larger training set sizes.However, compared to the default SVM the gap between training and testing scores is smaller and the convergence can clearly be seen.Figure 10 shows the S1 temporal mean VV-VH-composite, the S2 false-color composite and one of the predicted land cover maps with F-Score feature selection using 102 features.medium-sized training sets.Only with very large training sets does the accuracy reach the level seen in Figure 8.We assume that the between-class scatter matrix is becoming singular for this specific number of around 500 training samples, which is a known problem when there are less training samples than input dimensions [70].One explanation could be that LDA is more sensitive to the quality of the training samples.Note that, for a given size, the training samples are selected randomly and could therefore not be representative of all the classes.The remaining gap between training scores and testing scores shows that this classifier could further be improved with larger and more balanced training sets.In fact, the decline in the training score is just starting to become visible at the largest training sets.The testing scores for MI and F-Score already surpass the default SVM at the smallest training set size.The F-Score shows slightly larger variances on the testing score compared to MI, indicating its higher dependency on meaningful training samples.The accuracy of both methods can be increased with larger training set sizes.However, compared to the default SVM the gap between training and testing scores is smaller and the convergence can clearly be seen.Figure 10 shows the S1 temporal mean VV-VH-composite, the S2 false-color composite and one of the predicted land cover maps with F-Score feature selection using 102 features.

Beijing Study Area
The split of 70% training and 30% testing set was performed in the same way as in the Stockholm study area and the same cross-validation method and parameters were chosen.However, the parameters needed to be adjusted.More specifically, the maximum number of features used for the LDA had to be reduced to 7 accordingly with the number of classes and the search ranges for the C-

Beijing Study Area
The split of 70% training and 30% testing set was performed in the same way as in the Stockholm study area and the same cross-validation method and parameters were chosen.However, the parameters needed to be adjusted.More specifically, the maximum number of features used for the LDA had to be reduced to 7 accordingly with the number of classes and the search ranges for the C-and γ-parameters had to be adjusted to find the optimal hyperparameters as indicated in Table 7. Table 8 shows the best found by the grid search, again without and with a non-inferiority margin.In contrast to the Stockholm study area, no better classifiers were found within non-inferiority margins of 0.5% or 1%, hence only 3% and 5% margins are presented.The performances of all classifiers are worse and less stable than in the Stockholm study area.Considering the learning curves in Figure 9, this behavior is not surprising considering the smaller training set of this study area.The Default SVM only reaches a mean overall accuracy of 76% on the training set and shows a very large standard deviation of almost 7%.On the withheld testing set, this performance is only marginally better.LDA achieves higher mean overall accuracies on the training set than the default SVM, it performs very poorly (77.5% overall accuracy) on the testing set, both with seven and six features.Considerably better, are MI's and F-Score's results.While both show a high instability on their performance with a standard deviation of around 2.5%, their overall accuracy surpasses 90%.Inside the 3% non-inferiority margin, MI can be reduced from 226 to 13 features, however the accuracy is reduced below 90%.F-Score still uses 110 features and achieves an accuracy of just 90% on the withheld data.With a 5% margin, MI is further reduced to 10 features, with a slightly increasing accuracy on withheld data compared to the 3% margin.F-Score is reduced to 21 features and achieves the same accuracy on the withheld testing data as without a non-inferiority margin.It can be observed by comparing Figures 8 and 11 that all three methods perform worse in Beijing than in Stockholm and show a higher instability.As expected, with the smaller training set size, it is more difficult to find a good classifier and a good subset of features.LDA surpasses the default SVM just slightly, when all seven features are used, but still shows a large variance at that point.MI and F-Score can both surpass the default SVM in a similar way as for the Stockholm study area.Once past the default SVM accuracy, MI reaches higher accuracy scores with fewer features than F-Score.Thus, F-Score appears to be more sensitive to the combination of few features and small training sets.The normalized confusion matrices in Figure 12 show that the default SVM has a high shift from LDB to HDB and roads, from UGS to agriculture, from golf courses to UGS, and from agriculture to forests and HDB.LDA cannot reduce these confusions and introduces new errors.Many segments were falsely predicted as forests.Overall, MI and F-Score improve these classification results and are able to reduce confusions.It can be observed by comparing Figures 8 and 11 that all three methods perform worse in Beijing than in Stockholm and show a higher instability.As expected, with the smaller training set size, it is more difficult to find a good classifier and a good subset of features.LDA surpasses the default SVM just slightly, when all seven features are used, but still shows a large variance at that point.MI and F-Score can both surpass the default SVM in a similar way as for the Stockholm study area.Once past the default SVM accuracy, MI reaches higher accuracy scores with fewer features than F-Score.Thus, F-Score appears to be more sensitive to the combination of few features and small training sets.The normalized confusion matrices in Figure 12 show that the default SVM has a high shift from LDB to HDB and roads, from UGS to agriculture, from golf courses to UGS, and from agriculture to forests and HDB.LDA cannot reduce these confusions and introduces new errors.Many segments were falsely predicted as forests.Overall, MI and F-Score improve these classification results and are able to reduce confusions.

Comparison of the Study Areas
Whereas all methods-LDA, MI, and F-Score-could improve the classification accuracies in the

Comparison of the Study Areas
Whereas all methods-LDA, MI, and F-Score-could improve the classification accuracies in the case study of Stockholm with a comparably large training set size, LDA appears to be unsuited for cases with small training set sizes as in Beijing.The estimation of the class means and variances is essential for LDA and thus a sufficiently large training set is required.MI and F-Score work reasonably well also with smaller training sets.However, all methods produced more stable results in Stockholm than in Beijing.
Figure 13 visualizes an excerpt from the feature ranking by MI and F-Score.The top 20 features averaged over both methods are presented grouped by S1 and S2 for Stockholm and Beijing.This ranking was created with cross-validation of five stratified folds of the whole training set.The averaged rank is indicated next to the feature name.MI and F-Score rank the features differently as they are using different measures.It is interesting, but not surprising, to see that the features from descending and ascending S1 passes are always paired up in this ranking.VV polarizations appear to be slightly more informative than VH polarizations, as the same features always appear in the same order for the respective polarizations.Another interesting observation is that the VV_mean has a very high rank in general.SAR response is very sensitive to geometric properties of the observed objects and could therefore, help discriminating between certain LULC classes especially in the urban environment.As an example, built-up areas are characterized by double bounce scattering, and consequently, appear very bright unlike roads with their smooth surface, which appear very dark in radar images.The highest-ranking features from S2 are the indices NDVI and NDWI.They are followed by red edge, near-and short-wave-infrared bands (B6-B11).The first feature from the visible spectrum was the mean of B4 (red) that held rank 54 in Stockholm and rank 30 in Beijing.The GLCM sum The highest-ranking features from S2 are the indices NDVI and NDWI.They are followed by red edge, near-and short-wave-infrared bands (B6-B11).The first feature from the visible spectrum was the mean of B4 (red) that held rank 54 in Stockholm and rank 30 in Beijing.The GLCM sum average (savg) delivers the highest ranked textural feature and appears for several bands in the top 20 ranking.Even though the rankings are not exactly the same, the results appear to be consistent for the different study areas-the same or similar features received the highest ranks.This is especially noteworthy as the accuracy scores in the study area of Beijing are way lower and more unstable than in Stockholm.Still the cross-validated, averaged feature importance ranking is quite similar.
The geometric features received low scores in this ranking and were hence excluded from Figure 13.In the Stockholm study area, the first geometric feature, the width of the fitted ellipse, was placed on rank 288.Two features, which were anticipated to reduce the confusion between roads and HDB, the aspect ratios of the minimum area rectangle and the fitted ellipse, were placed on rank 579 and 591 respectively.Inspecting the value distributions for these two features in Figure 14, it can be seen that they are useful to separate specific classes (roads, golf courses, and wetlands) but they do not contain useful information for the separation of the remaining classes.These features could be used rather for rule-based post-classification corrections to reduce the confusions between roads and HDB.What should be considered for this ranking is that MI and F-Score are univariate feature selection methods, which do not account for correlation between the features.Thus, they do not detect redundancy in the feature set.Plotting the training samples for the two highest-ranking S1 features, it becomes obvious that they have a positive correlation (Figure 15).The plot for the highestranking S2 features shows a strong negative correlation.However, it is interesting to plot the highestranked S1 and S2 features; the separation of the classes becomes clearly visible showing the great potential of merging multispectral and SAR data.What should be considered for this ranking is that MI and F-Score are univariate feature selection methods, which do not account for correlation between the features.Thus, they do not detect redundancy in the feature set.Plotting the training samples for the two highest-ranking S1 features, it becomes obvious that they have a positive correlation (Figure 15).The plot for the highest-ranking S2 features shows a strong negative correlation.However, it is interesting to plot the highest-ranked S1 and S2 features; the separation of the classes becomes clearly visible showing the great potential of merging multispectral and SAR data.What should be considered for this ranking is that MI and F-Score are univariate feature selection methods, which do not account for correlation between the features.Thus, they do not detect redundancy in the feature set.Plotting the training samples for the two highest-ranking S1 features, it becomes obvious that they have a positive correlation (Figure 15).The plot for the highestranking S2 features shows a strong negative correlation.However, it is interesting to plot the highestranked S1 and S2 features; the separation of the classes becomes clearly visible showing the great potential of merging multispectral and SAR data.

Conclusions
This study demonstrates the feasibility of GEE and GCP infrastructure to apply dimensionality reduction through feature extraction and feature selection methods for object-based land cover classifications with oversized feature sets.The presented workflow allows thorough assessments of different features as well as different dimensionality reduction methods for specific GEOBIA applications.It incorporates the dimensionality reduction as a key step in the land cover classification process, which we consider essential for the exploitation of the growing EO big data.
The LDA showed the highest compression of the initial feature space and can obtain remarkable results in comparison to the default SVM.One disadvantage, however, is that this method gives no intuitive indication about the contribution of individual features to the accuracy and is less reliable with small training sets.The feature selection methods appear very promising and provide exactly this insight into the features' quality.With a sensitive non-inferiority margin, both MI and F-Score allowed high compressions of the feature set and achieved notable improvements of the accuracy.This emphasizes the fact that dimensionality reduction should form a key step in land cover classification using SVM.Thanks to the availability of cloud computing, these dimensionality reduction processes are no longer limited by the lack of computational power and can easily be integrated into the classification workflow.
Despite the different training set sizes in the two study areas, the averaged feature importance ranking showed similar results for the top-ranking features.It strongly indicates that a feature-level fusion of SAR data from Sentinel-1 and multispectral imagery from Sentinel-2 allows for a better discrimination between different LULC classes.It should be acknowledged, however, that the optimal set of features is specific for each classification scheme.Different land cover classes require different features to be separable from one another.Future research should therefore expand this method to different classification schemes but also further investigate the importance of features for each individual class.To explore the relevance of features for a land cover classification more broadly, additional features need to be included in the analysis.More spectral indices should be included; thorough multi-temporal analyses of optical or SAR imagery are promising candidates to improve land cover; analysis of the phenology through harmonic or linear fitting of the NDVI, for example, help to distinguish between different vegetation classes.
Despite their limitation, LDA, MI, and F-Score serve as a demonstration of the integration in the workflow.In future work, different feature selection methods should be tested following the proposed methodology.Multivariate filter methods, in particular, should be explored, since the

Conclusions
This study demonstrates the feasibility of GEE and GCP infrastructure to apply dimensionality reduction through feature extraction and feature selection methods for object-based land cover classifications with oversized feature sets.The presented workflow allows thorough assessments of different features as well as different dimensionality reduction methods for specific GEOBIA applications.It incorporates the dimensionality reduction as a key step in the land cover classification process, which we consider essential for the exploitation of the growing EO big data.
The LDA showed the highest compression of the initial feature space and can obtain remarkable results in comparison to the default SVM.One disadvantage, however, is that this method gives no intuitive indication about the contribution of individual features to the accuracy and is less reliable with small training sets.The feature selection methods appear very promising and provide exactly this insight into the features' quality.With a sensitive non-inferiority margin, both MI and F-Score allowed high compressions of the feature set and achieved notable improvements of the accuracy.This emphasizes the fact that dimensionality reduction should form a key step in land cover classification using SVM.Thanks to the availability of cloud computing, these dimensionality reduction processes are no longer limited by the lack of computational power and can easily be integrated into the classification workflow.
Despite the different training set sizes in the two study areas, the averaged feature importance ranking showed similar results for the top-ranking features.It strongly indicates that a feature-level fusion of SAR data from Sentinel-1 and multispectral imagery from Sentinel-2 allows for a better discrimination between different LULC classes.It should be acknowledged, however, that the optimal set of features is specific for each classification scheme.Different land cover classes require different features to be separable from one another.Future research should therefore expand this method to different classification schemes but also further investigate the importance of features for each individual class.To explore the relevance of features for a land cover classification more broadly, additional features need to be included in the analysis.More spectral indices should be included; thorough multi-temporal analyses of optical or SAR imagery are promising candidates to improve land cover; analysis of the phenology through harmonic or linear fitting of the NDVI, for example, help to distinguish between different vegetation classes.
Despite their limitation, LDA, MI, and F-Score serve as a demonstration of the integration in the workflow.In future work, different feature selection methods should be tested following the proposed methodology.Multivariate filter methods, in particular, should be explored, since the applied univariate methods fail to identify the dependency between similar high-ranked features.Moreover, wrappers and embedded methods especially designed for the chosen classifier should be included in the analysis as well.

Figure 2 .
Figure 2. Generated segments: (left) high-density and low-density built-up, transportation, water, and urban green space land cover/land use classes in Stockholm's urban center; (right) agricultural, forest, water, and wetland land cover/land use classes in Stockholm's periphery.

Figure 2 .
Figure 2. Generated segments: (left) high-density and low-density built-up, transportation, water, and urban green space land cover/land use classes in Stockholm's urban center; (right) agricultural, forest, water, and wetland land cover/land use classes in Stockholm's periphery.

Figure 3 .
Figure 3. General workflow with the three-step approach of feature computation, classifier optimization, and classification.

Figure 3 .
Figure 3. General workflow with the three-step approach of feature computation, classifier optimization, and classification.Remote Sens. 2020, 12, x FOR PEER REVIEW 8 of 25

Figure 4 .
Figure 4. Features selection and classifier optimization workflow.

Figure 4 .
Figure 4. Features selection and classifier optimization workflow.

Figure 5 .
Figure 5. Heat map of the search space in the grid search.Left: Highest accuracy at the edge of the search space.Right: Adjusted search space.

Figure 5 .
Figure 5. Heat map of the search space in the grid search.Left: Highest accuracy at the edge of the search space.Right: Adjusted search space.

Figure 6 .
Figure 6.Overall accuracy as function of the number of features used (Stockholm).Figure 6. Overall accuracy as function of the number of features used (Stockholm).

Figure 6 .
Figure 6.Overall accuracy as function of the number of features used (Stockholm).Figure 6. Overall accuracy as function of the number of features used (Stockholm).

Figure 6 .
Figure 6.Overall accuracy as function of the number of features used (Stockholm).

Figure 7 .
Figure 7. Normalized confusion matrices of the predictions on testing data (Stockholm).

Figure 7 .
Figure 7. Normalized confusion matrices of the predictions on testing data (Stockholm).

Figure 8 .
Figure 8. Prediction times as function of the number of features used for different methods (Stockholm).
Figure 9 shows the learning curves for the different classifiers at 0% non-inferiority margin using 2% to 70% of the initial training set.All learning curves have been crossvalidated with six stratified shuffled splits, withholding 30% for testing.Shaded areas indicate ±3σ of the overall accuracy.As might be expected, the accuracy of the default SVM is worst with the smallest training set size.With an increasing training set size, the testing accuracy increases while showing less variance and the training score decreases slightly.The gap between both scores at the largest training set size indicates that this classifier could still be improved with the further addition of training samples.

Figure 8 .
Figure 8. Prediction times as function of the number of features used for different methods (Stockholm).To analyze sensitivity to the training set size, learning curves are generated.Plotting the training and testing accuracy over adjusted training set sizes can give an indication of how training set size affects the classifier's accuracy.Figure 9 shows the learning curves for the different classifiers at 0% non-inferiority margin using 2% to 70% of the initial training set.All learning curves have been cross-validated with six stratified shuffled splits, withholding 30% for testing.Shaded areas indicate ±3σ of the overall accuracy.As might be expected, the accuracy of the default SVM is worst with the smallest training set size.With an increasing training set size, the testing accuracy increases while showing less variance and the training score decreases slightly.The gap between both scores at the largest training set size indicates that this classifier could still be improved with the further addition of training samples.LDA shows a remarkably unstable behavior for variable training set sizes.Though it achieves reasonable testing scores on very small training set sizes, its performance drops drastically for medium-sized training sets.Only with very large training sets does the accuracy reach the level seen in Figure8.We assume that the between-class scatter matrix is becoming singular for this specific number of around 500 training samples, which is a known problem when there are less training samples than input dimensions[70].One explanation could be that LDA is more sensitive to the quality of the training samples.Note that, for a given size, the training samples are selected randomly and could therefore not be representative of all the classes.The remaining gap between training scores and testing scores shows that this classifier could further be improved with larger and more balanced training sets.In fact, the decline in the training score is just starting to become visible at the largest training sets.LDA shows a remarkably unstable behavior for variable training set sizes.Though it achieves reasonable testing scores on very small training set sizes, its performance drops drastically for medium-sized training sets.Only with very large training sets does the accuracy reach the level seen in Figure8.We assume that the between-class scatter matrix is becoming singular for this specific number of around 500 training samples, which is a known problem when there are less training samples than input dimensions[70].One explanation could be that LDA is more sensitive to the quality of the training samples.Note that, for a given size, the training samples are selected randomly and could therefore not be representative of all the classes.The remaining gap between training scores and testing scores shows that this classifier could further be improved with larger and more balanced training sets.In fact, the decline in the training score is just starting to become visible at the largest training sets.The testing scores for MI and F-Score already surpass the default SVM at the smallest training set size.The F-Score shows slightly larger variances on the testing score compared to MI, indicating its higher dependency on meaningful training samples.The accuracy of both methods can be increased with larger training set sizes.However, compared to the default SVM the gap between training and testing scores is smaller and the convergence can clearly be seen.Figure10shows the S1 temporal Figure 9 shows the learning curves for the different classifiers at 0% non-inferiority margin using 2% to 70% of the initial training set.All learning curves have been cross-validated with six stratified shuffled splits, withholding 30% for testing.Shaded areas indicate ±3σ of the overall accuracy.As might be expected, the accuracy of the default SVM is worst with the smallest training set size.With an increasing training set size, the testing accuracy increases while showing less variance and the training score decreases slightly.The gap between both scores at the largest training set size indicates that this classifier could still be improved with the further addition of training samples.

Figure 9 .
Figure 9. Learning curves showing the influence of training set sizes (Stockholm).

Figure 10 .
Figure 10.Display of input images and a resulting land cover map.

Figure 11 .
Figure 11.Overall accuracy as function of the numbers of features (Beijing).Figure 11.Overall accuracy as function of the numbers of features (Beijing).

Figure 11 .
Figure 11.Overall accuracy as function of the numbers of features (Beijing).Figure 11.Overall accuracy as function of the numbers of features (Beijing).

25 Figure 12 .
Figure 12.Normalized confusion matrices of the predictions on testing data (Beijing).

Figure 12 .
Figure 12.Normalized confusion matrices of the predictions on testing data (Beijing).
Remote Sens. 2020, 12, x FOR PEER REVIEW 19 of 25 consequently, appear very bright unlike roads with their smooth surface, which appear very dark in radar images.
Remote Sens. 2020, 12, x FOR PEER REVIEW 20 of 25 contain useful information for the separation of the remaining classes.These features could be used rather for rule-based post-classification corrections to reduce the confusions between roads and HDB.
Remote Sens. 2020, 12, x FOR PEER REVIEW 20 of 25 contain useful information for the separation of the remaining classes.These features could be used rather for rule-based post-classification corrections to reduce the confusions between roads and HDB.

Table 1 .
Classification schema in Stockholm and Beijing.

Table 2 .
Overview of the acquired satellite imagery

Table 4 .
Overview of all generated features

Table 5 .
Parameter grid for the exhaustive grid search (Stockholm)

Table 6 .
Overview of the best-performing classifiers (Stockholm)

Table 7 .
Parameter grid for the exhaustive grid search (Beijing).

Table 8 .
Overview of the best performing classifiers (Beijing)