Imbalanced Learning in Land Cover Classification: Improving Minority Classes' Prediction Accuracy Using the Geometric SMOTE Algorithm

Abstract: The automatic production of land use/land cover maps continues to be a challenging problem, with important impacts on the ability to promote sustainability and good resource management. The ability to build robust automatic classifiers and produce accurate maps can have a significant impact on the way we manage and optimize natural resources. The difficulty in achieving these results comes from many different factors, such as data quality and uncertainty. In this paper, we address the imbalanced learning problem, a common and difficult conundrum in remote sensing that affects the quality of classification results, by proposing Geometric-SMOTE, a novel oversampling method, as a tool for handling imbalanced data. Geometric-SMOTE is a sophisticated oversampling algorithm that increases the quality of the instances generated by previous methods, such as the synthetic minority oversampling technique. The performance of Geometric-SMOTE on the LUCAS (Land Use/Cover Area Frame Survey) dataset is compared to that of other oversamplers using a variety of classifiers. The results show that Geometric-SMOTE significantly outperforms all the other oversamplers and improves the robustness of the classifiers. These results indicate that, when using imbalanced datasets, remote sensing researchers should consider the use of this new generation of oversamplers to increase the quality of the classification results.


Introduction
The production of accurate land use/land cover (LULC) maps offers unique monitoring capabilities within the remote sensing domain [1]. LULC maps are used for a variety of applications, ranging from environmental monitoring, land change detection, and natural hazard assessment to agriculture and water/wetland monitoring [2]; therefore, accurate and timely production of LULC maps is of great significance. LULC maps are usually produced by two main procedures: first, photo-interpretation by the human eye, which is time and resource consuming and is not suitable for operational LULC mapping over large areas; and second, automatic mapping using remotely sensed data and different classification algorithms.
The availability and swift updating of high-quality satellite remote sensing data have brought tremendous progress in providing up-to-date and accurate land cover information. Multispectral images, in particular, are an essential resource for building LULC maps, allowing classification algorithms to automate their production. Although significant progress has been made, the imbalanced class distributions that are typical of LULC reference data still degrade classification performance. The solutions proposed for the imbalanced learning problem can be divided into three categories:
1. Cost-sensitive solutions. They introduce a cost matrix that applies higher misclassification costs to the examples of the minority class.
2. Algorithmic-level solutions. They modify the algorithmic procedure to reinforce the learning of the minority class.
3. Resampling solutions. They rebalance the class distribution either by removing instances from the majority class or by generating artificial data for the minority class(es).
The latter method constitutes a more general approach, since it can be used for any classification algorithm and it does not require any type of domain knowledge in order to construct a cost matrix.
There are several resampling solutions to deal with the imbalanced learning problem, which can also be divided into three categories:
1. Undersampling algorithms reduce the size of the majority class.
2. Oversampling algorithms attempt to even out the distributions by generating artificial data for the minority class(es).
3. Hybrid approaches use both oversampling and undersampling techniques to ensure a balanced dataset.
In this paper, we compare the performance of various oversampling algorithms on EUROSTAT's publicly available Land Use/Cover Area Statistical Survey (LUCAS) dataset [14] combined with Landsat 8 data. The experimental procedure included a comparison of five oversamplers using five classifiers and three evaluation metrics. Specifically, the oversampling algorithms were Geometric-SMOTE (G-SMOTE) [15], the synthetic minority oversampling technique (SMOTE) [16], Borderline-SMOTE (B-SMOTE) [17], the adaptive synthetic sampling technique (ADASYN) [18], and random oversampling (ROS), while no oversampling was included as a baseline method. Results show that G-SMOTE outperforms every other oversampling technique for the selected evaluation metrics. This paper is organized into five sections: Section 2 analyzes the resampling methods, Section 3 describes the proposed methodology, Section 4 presents the results and discussion, and Section 5 draws the conclusions of this study.

Resampling Methods
Data modification through resampling has been the most popular approach to deal with the imbalanced learning problem in machine learning in general and remote sensing in particular [5]. As mentioned above, by decoupling the imbalance problem from the classification algorithms, resampling allows users to apply any standard algorithm once the resampling preprocessing step is done. This stratagem is especially convenient for users who are not machine learning experts and want to use several classifiers. Additionally, resampling methods can be naturally applied to multiclass imbalanced data, which is relevant for LULC classification. In this section, we present the most relevant applications of resampling methods for imbalanced remote sensing data classification.

Random Resampling
Random resampling refers to non-informed strategies that remove instances from the majority class or replicate instances from the minority class. As such, the selection of the data occurs randomly without exploiting any additional information.
Some of the existing remote sensing studies implement the random undersampling (RUS) method [19], which randomly reduces the number of majority class training samples. However, this method has the disadvantage of information loss, as it discards samples from the majority class [5]. Contrary to RUS, ROS avoids information loss and can be considered equivalent to bootstrapping. However, ROS simply replicates randomly selected instances of the minority class, increasing the risk of overfitting [20]. Reference [21] reports that balancing data with ROS affects classification performance differently for various classifiers. In their study, land cover classification with highly imbalanced data was carried out with six different models. The application of ROS slightly improved the performance of the random forest (RF) and support vector machine (SVM) classifiers. On the other hand, it reduced the classification accuracy for the decision tree (DT), artificial neural network (ANN), k-nearest neighbors (KNN), and boosted DT classifiers.

Informed Resampling
In the section above, the disadvantages of RUS and ROS were pointed out. Informed resampling methods aim to overcome these shortcomings. More specifically, they use the local or global information of the class distribution to remove or generate instances. Our focus is on oversampling algorithms, since the size of the LUCAS dataset does not favor the use of undersampling approaches. Additionally, reference [22] carried out a comparative analysis of undersamplers' and oversamplers' performance for land cover classification with the rotation forest ensemble classifier, showing that oversampling methods outperform undersampling methods.
SMOTE is the most popular informed oversampling method, and it has been used to successfully deal with the class imbalance problem in land cover classification [23]. In this approach, the minority class is oversampled by randomly selecting a minority class instance and generating synthetic examples along the line segment joining it with one of its minority class neighbors. A number of studies report significant improvements in LULC mapping accuracy with the use of SMOTE oversampling. For instance, the variational semi-supervised learning (VSSL) framework proposed by [23] aims to deal with the imbalance problem in LULC mapping. VSSL is a semi-supervised learning framework consisting of a deep generative model; it allows learning successfully from both labeled and unlabeled samples while using SMOTE to balance the data. Reference [24] used OpenStreetMap crowdsourced data and Landsat time series for LULC classification; similarly, the application of SMOTE improved the classification results. Other examples of the successful application of SMOTE in remote sensing can be found in [25,26].
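The linear interpolation mechanism described above can be sketched in a few lines of NumPy. This is a simplified illustration of SMOTE's data generation step (minority class only), not a reference implementation of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_interpolate(X_min, k=5, n_new=10):
    """Generate n_new synthetic minority samples, SMOTE-style.

    Each synthetic point lies on the line segment between a randomly
    selected minority instance and one of its k nearest minority
    neighbours.
    """
    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)        # exclude self-matches
    nn = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbour indices

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # random minority instance
        j = rng.choice(nn[i])              # one of its k neighbours
        u = rng.random()                   # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two existing minority points, the generated data never leave the segments joining neighbours, which is the limited-diversity behaviour that motivates the SMOTE variants discussed next.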
Although recent studies demonstrate the usefulness of SMOTE for remote sensing applications, it still has some drawbacks. The SMOTE algorithm has the disadvantage of generating noisy data [27]. In order to mitigate this problem, many variations of SMOTE have been developed. B-SMOTE is one of the most popular SMOTE-based oversamplers. Similarly to SMOTE, it uses the k-nearest neighbors selection strategy. The main difference from the original algorithm is that it modifies the data generation mechanism by generating samples closer to the decision boundary. B-SMOTE has also been reported to perform better than SMOTE in a number of studies [28,29]. ADASYN is another well-known variation of SMOTE. It is based on the idea of adaptively generating minority class instances according to their weighted distribution: more instances are generated for the minority class instances that are harder to learn compared to those that are easier to learn [18].
The SMOTE algorithm can be decomposed into two parts: the selection strategy for the minority class instances and the data generation mechanism. The first part is related to the generation of noisy instances since the SMOTE selection strategy considers all the minority samples as equivalent. The above-mentioned SMOTE variations (B-SMOTE and ADASYN) aim to deal with this problem. On the other hand, the second part is responsible for the diversity of the artificial instances. There are scenarios where the linear interpolation mechanism used in SMOTE generates nearly duplicate instances that may lead to overfitting. The G-SMOTE algorithm is an extension of SMOTE that aims to deal with both problems. G-SMOTE defines a flexible geometric region around each minority class instance for synthetic data generation. The shape of this area is controlled by a set of hyperparameters. This element significantly increases the diversity of instances generated. Furthermore, G-SMOTE is designed to avoid noisy sample generation since it modifies the SMOTE selection strategy. G-SMOTE has been shown to outperform SMOTE and its above-mentioned variations across 69 imbalanced datasets for various classifiers and evaluation metrics. Figure 1 depicts the data generation mechanisms of both SMOTE and G-SMOTE using a deformed geometric region.
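To make the geometric idea concrete, the sketch below generates a point inside a hypersphere around a minority instance and then flattens ("deforms") it toward the line to the selected neighbour. This is a simplified illustration of the geometric region concept only, not the full published G-SMOTE algorithm: the truncation step and the modified selection strategy are omitted, and the function name is invented for this sketch:

```python
import numpy as np

def geometric_sample(center, neighbor, rng, deformation=0.5):
    """Draw one synthetic point from a deformed hypersphere (simplified).

    The sphere is centred on a minority instance with radius equal to the
    distance to the selected neighbour; `deformation` in [0, 1] shrinks
    the component orthogonal to the center->neighbor axis, so that 1
    collapses the sphere onto the line through the neighbour.
    """
    radius = np.linalg.norm(neighbor - center)
    direction = rng.normal(size=center.shape)
    direction /= np.linalg.norm(direction)           # uniform on unit sphere
    r = radius * rng.random() ** (1 / center.size)   # uniform inside the ball
    point = r * direction
    axis = (neighbor - center) / radius
    parallel = point @ axis * axis                   # component along the axis
    point = parallel + (1 - deformation) * (point - parallel)
    return center + point
```

Compared with linear interpolation, sampling from this region spreads the synthetic instances over a volume rather than a segment, which is the source of the increased diversity mentioned above.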

Methodology
This section describes the evaluation process of G-SMOTE's performance. A description of the study area, dataset, oversamplers, classifiers, evaluation metrics, and the experimental procedure is provided. Figure 2 represents the flowchart of the steps applied in this experiment.

Study Area
The area of study was within north-western Portugal, corresponding to the area covered by the Landsat 8 image from track 204 and row 32, shown in Figure 3. The area contains all eight main land cover types defined by LUCAS 2015: artificial land, cropland, woodland, shrubland, grassland, bare land, water, and wetlands.

Remote Sensing Data
The remotely sensed data includes eight images from the moderate-resolution Landsat 8 multi-spectral sensor. The images are Level-2 surface reflectance products (OLI/TIRS); one image was acquired each month from February to September 2015. The acquisition mode was descending. Data were pre-processed in order to remove pixels with cloud cover. Only bands 2, 3, 4, 5, 6, and 7 were used from each image. Accordingly, each reference point from the LUCAS dataset had 48 features, representing pixel values from each spectral band from each image.
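The feature construction described above amounts to flattening an (images × bands) cube per reference point. A minimal sketch with placeholder pixel values (random numbers standing in for the real surface reflectances):

```python
import numpy as np

n_points, n_images, n_bands = 1694, 8, 6  # LUCAS points, monthly images, bands 2-7

# Placeholder reflectance cube: one slice of pixel values per reference point.
cube = np.random.default_rng(0).random((n_points, n_images, n_bands))

# Stack the 8 images x 6 bands into 48 features per point.
X = cube.reshape(n_points, n_images * n_bands)
print(X.shape)  # (1694, 48)
```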

LUCAS Dataset
The 2015 LUCAS data were used as reference data for both model training and validation. The LUCAS point label represents the corresponding land cover/use type within a radius of 1.5 m for homogeneous classes and a 20 m radius extent ("extended window") for heterogeneous classes (e.g., shrubland), gathered by field observation and very high-resolution photo interpretation [6]. In order to reduce the risk of having Landsat pixel information represented wrongly in the field, we only kept points observed in situ from a close distance (<100 m). With the same objective, we removed the points which had linear features in the observation (e.g., roads). This procedure was not applied to the "artificial land" class, as it would have removed most of its samples. Furthermore, points with cloudy pixels in the Landsat data were also excluded. In this way, 1694 out of 2060 LUCAS points were retained. This dataset contains eight classes that represent the main land cover types for the study area.
This pixel selection excluded a large number of unacceptable reference points, and we assumed the remaining ones to be suitable enough to represent the land cover type in a Landsat pixel coverage area of 30 × 30 m. Further, we surmised that classifiers are capable of overcoming the noise caused by pixels having mixed land cover representation if such pixels are still available in the dataset.
The number of samples per class and the imbalance ratio (IR), defined as the ratio of the number of samples of the majority class over the number of samples of any of the minority classes, are presented in Table 1. Table 2 presents a description of the LUCAS dataset, including information about the majority class C and the smallest minority class H, to emphasize the imbalanced character of the dataset.
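As a minimal illustration of the IR definition above (the per-class counts here are hypothetical placeholders; the real values are those in Table 1):

```python
# Hypothetical class counts for the eight LUCAS land cover types.
counts = {
    "woodland": 700, "cropland": 400, "artificial land": 200,
    "grassland": 150, "shrubland": 120, "water": 60,
    "bare land": 40, "wetlands": 24,
}

majority = max(counts.values())
# IR = majority class size / class size; equal to 1.0 for the majority class.
ir = {label: majority / n for label, n in counts.items()}
print(ir["woodland"], round(ir["wetlands"], 2))
```

The smallest minority class always has the largest IR, which is what makes it hardest to learn.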

Evaluation Metrics
Amongst the possible choices for a classifier's performance evaluation, Accuracy, user's accuracy (or Precision), and producer's accuracy (or Recall) are the most common in LULC classification [30,31]. For a binary classification task, their calculation is given in terms of the true positives TP, true negatives TN, false positives FP, and false negatives FN [30]. More specifically, Precision = TP / (TP + FP) and Recall = TP / (TP + FN). For the multiclass case, the average value across classes is used, as explained below.
The LUCAS dataset is highly imbalanced, having a wide range of IRs for the different minority classes. Therefore, the use of the metrics above is not an appropriate choice, since they are mainly determined by the majority class contribution [32]. An appropriate evaluation metric should consider the classification accuracy of all classes. A simple approach for the multiclass case is to select a binary class evaluation metric; apply it to each binary sub-task of the multiclass problem, i.e., consider each class versus the rest; and finally, average its values. For this purpose, the F-score and G-mean metrics were used as the primary evaluation methods, while Accuracy is provided for discussion:
- The Accuracy is the number of correctly classified samples divided by the total number of samples. Assuming that the various classes are labeled by the index c, Accuracy = (∑_c TP_c) / N, where TP_c is the number of correctly classified samples of class c and N is the total number of samples.
- The F-score is the harmonic mean of Precision and Recall, F-score = 2 × Precision × Recall / (Precision + Recall). The F-score for the multiclass case can be calculated using the average per-class values of Precision and Recall [32].
- The G-mean is the geometric mean of Sensitivity and Specificity, where Sensitivity is identical to Recall and Specificity = TN / (TN + FP); therefore, they are equal to the true positive and true negative rates, respectively. The G-mean for the multiclass case can be calculated using the average per-class values: G-mean = √(Sensitivity × Specificity).
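The per-class averaging described above can be written directly with scikit-learn's building blocks. The multiclass G-mean below follows the per-class sqrt-then-average reading of the definition, and the toy labels are purely illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def macro_g_mean(y_true, y_pred, labels):
    """Average the one-vs-rest G-mean = sqrt(Sensitivity x Specificity)."""
    scores = []
    for c in labels:
        t = np.asarray(y_true) == c
        p = np.asarray(y_pred) == c
        tn, fp, fn, tp = confusion_matrix(t, p, labels=[False, True]).ravel()
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        scores.append(np.sqrt(sensitivity * specificity))
    return float(np.mean(scores))

y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
print(accuracy_score(y_true, y_pred))             # 5 of 6 samples correct
print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F-score
print(macro_g_mean(y_true, y_pred, ["A", "B", "C"]))
```

Note that the misclassified "A" barely moves Accuracy but lowers the per-class scores for both "A" and "B", which is why the averaged metrics are more sensitive to minority class errors.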

Machine Learning Algorithms
The main objective of the paper is to show the effectiveness of G-SMOTE when it is used on multiclass, highly imbalanced data of a remote sensing application and to compare its performance to other oversampling methods. Four oversampling algorithms were used in the experiment along with G-SMOTE. ROS was chosen for its simplicity. SMOTE was selected for being the most widely used oversampler. ADASYN and B-SMOTE were selected for representing popular modifications of the original SMOTE algorithm. Finally, no oversampling was applied as an additional baseline method.
For the evaluation of the oversampling methods, the classifiers logistic regression (LR) [33], k-nearest neighbors (KNN) [34], decision tree (DT) [35], gradient Boosting classifier (GBC) [36], and random forest (RF) [37] were selected. The choice of classifiers was made according to the following criteria: learning type, training time, and popularity within the remote sensing community. All these algorithms were found to be computationally efficient and commonly used for the proposed task, with the exception of LR, which is rarely used in remote sensing applications [2,21].
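The five selected classifiers can be instantiated as scikit-learn estimators as sketched below; the parameters shown are library defaults, while the hyperparameter ranges actually searched are those reported in Table 3:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# One entry per classifier evaluated in the experiment.
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
```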

Experimental Settings
In order to evaluate the performance of each oversampler, every possible combination of oversampler, classifier, and metric was formed. The evaluation score for each of these combinations was generated through an n-fold cross-validation procedure with n = 3. Before the training of each classifier, and in each stage i ∈ {1, 2, ..., n} of the n-fold cross-validation procedure, synthetic data S_i were generated with the oversampler, based on the training data T_i of the n − 1 folds, such that the resulting training set S_i ∪ T_i became perfectly balanced. This enhanced training set, in turn, was used to train the classifier. The performance evaluation of the classifiers was done on the validation data V_i of the remaining fold, where V_i ∪ T_i = D and V_i ∩ T_i = ∅, with D representing the dataset. The process above was repeated three times, and the results were averaged.
The ranges of hyperparameters used for each classifier and oversampler are presented in Table 3:

Software Implementation
The implementation of the experimental procedure was based on the Python programming language, using the Scikit-Learn [38], Imbalanced-Learn [39], and Geometric-SMOTE libraries. All functions, algorithms, experiments, and results reported are provided at the GitHub repository of the project. Additionally, the Research-Learn library provides a framework to implement comparative experiments, also being fully integrated with the Scikit-Learn ecosystem.

Results and Discussion
This section presents the results and analyses of oversamplers' comparisons on the LUCAS dataset. The classification results are shown for all combinations of oversamplers and classifiers used in the experiment. The next subsection covers their interpretation in detail.

Results
For each combination of classifier and metric, the cross-validation scores of all oversamplers are provided in Table 4, with the highest score for each row highlighted. A ranking score was assigned to each oversampling method, with the best and worst performing methods receiving scores of 1 and 6, respectively. Table 5 presents the ranking scores per classifier and evaluation metric, with the highest ranking for each row highlighted.

The percentage difference between G-SMOTE and NONE, ROS, and SMOTE, respectively, for every combination of metric and classifier, was calculated from the following formula: Percentage Difference = 100 × (Score_G-SMOTE − Score_oversampler) / Score_oversampler. For each combination of an oversampler, classifier, and metric, a positive (negative) value of the above formula indicates G-SMOTE's relative performance gain (loss) compared to the oversampler. Table 6 presents the results of this calculation.

The Wilcoxon signed-rank test was used as an alternative to the paired Student's t-test, since the distribution of the differences between the two samples cannot be assumed to be normally distributed. In our case, it was applied to test the null hypothesis that the pairwise difference between G-SMOTE's scores and the scores of each remaining oversampling method follows a symmetric distribution around zero, i.e., that G-SMOTE performs similarly to them. The values for the Accuracy metric are excluded in the NONE case, while for the remaining oversampling methods all metrics are used; this choice is justified in the next section. Table 7 presents the p-values of the Wilcoxon tests:
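The percentage difference and the Wilcoxon signed-rank test can be reproduced with SciPy as sketched below; the paired scores are made-up illustrative numbers, not the values from Tables 4-7:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired cross-validation scores over eight
# classifier/metric combinations (illustrative values only).
gsmote = np.array([0.71, 0.69, 0.74, 0.70, 0.73, 0.72, 0.75, 0.68])
smote = np.array([0.66, 0.65, 0.68, 0.63, 0.70, 0.64, 0.73, 0.67])

# Percentage difference of G-SMOTE relative to the compared oversampler:
# positive values indicate a relative performance gain for G-SMOTE.
pct_diff = 100 * (gsmote - smote) / smote

# Wilcoxon signed-rank test on the paired differences; a small p-value
# rejects the hypothesis of a symmetric distribution around zero.
stat, p_value = wilcoxon(gsmote, smote)
print(pct_diff.round(2))
print(p_value)
```

Because the test only uses the signs and ranks of the paired differences, it makes no normality assumption, which is what motivates its use over the paired t-test here.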

Discussion
From Table 4, we can observe that G-SMOTE outperforms all other oversampling methods for both the F-score and G-mean metrics on all classifiers. The absolute best results are achieved when G-SMOTE is combined with LR and RF. It is important to note that the Accuracy scores show the well-known bias towards the majority class, as discussed in Section 3.4. In a multiclass classification problem with an imbalanced dataset, where the predictions of all the classes are of equal importance, as in many remote sensing applications, Accuracy should be of secondary importance compared to more robust metrics, such as the F-score and G-mean. Nevertheless, even for the Accuracy metric, G-SMOTE shows the best performance among the oversamplers.
In Table 5, the rankings of the oversamplers are presented and show the superiority of G-SMOTE. Although ROS and SMOTE are the most popular oversampling methods in remote sensing applications, it is clear from the tables that they produce suboptimal results. Table 6 directly compares the performance of G-SMOTE with that of ROS and SMOTE, also including NONE as a baseline method. Table 7 provides a statistical confirmation of the previous conclusions. Using the Wilcoxon signed-rank test, the null hypothesis that the pairwise difference of scores between G-SMOTE and any of the remaining oversampling methods follows a symmetric distribution around zero is rejected at a significance level of α = 0.01.
This study is the first to present a systematic comparison of oversampling algorithms in remote sensing. However, several previous studies reported results consistent with our findings. Reference [25] reported an increase in F-score and G-mean when oversampling was applied, while Accuracy did not improve. Similarly, results obtained in [5] demonstrated increased classification performance when using SMOTE. According to our experiment, performance can be further increased by using G-SMOTE. A number of other studies [21,23] did not use specific imbalanced metrics; therefore, they cannot be directly compared to our results.

Conclusions
In this paper we applied G-SMOTE, a novel oversampling algorithm, on a LULC classification problem, using a highly imbalanced, multiclass dataset (LUCAS). G-SMOTE's performance was evaluated and compared with other oversampling methods. More specifically, ROS, SMOTE, B-SMOTE, and ADASYN were the selected oversamplers, while LR, KNN, DT, GBC, and RF were used as classifiers.
The experimental results show that using G-SMOTE can significantly improve the classification performance, resulting in higher values of F-score and G-mean. Therefore, readers should consider using G-SMOTE when accurately predicting the minority classes is of equal or higher importance compared to the accurate prediction of the majority class. Examples of the above case include the detection of land cover change and rare land cover type classification.
G-SMOTE can be a useful tool for remote sensing researchers and practitioners, as it systematically outperforms the currently widely used oversamplers. G-SMOTE is easily accessible to users through an open source implementation.

Acknowledgments:
The authors would like to thank Direção Geral do Território (DGT) for supporting the data used in this study.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: