Machine Learning Classification Ensemble of Multitemporal Sentinel-2 Images: The Case of a Mixed Mediterranean Ecosystem

Abstract: Land cover type classification still remains an active research topic, while new sensors and methods become available. Applications such as environmental monitoring, natural resource management, and change detection require more accurate, detailed, and constantly updated land cover type mapping. These needs are fulfilled by newer sensors with high spatial and spectral resolution along with modern data processing algorithms. The Sentinel-2 sensor provides data with high spatial, spectral, and temporal resolution, suitable for the classification of highly fragmented landscapes. This study applies six traditional data classifiers and nine ensemble methods on multitemporal Sentinel-2 image datasets for identifying land cover types in the heterogeneous Mediterranean landscape of Lesvos Island, Greece. Support vector machine, random forest, artificial neural network, decision tree, linear discriminant analysis, and k-nearest neighbor classifiers are applied and compared with nine ensemble classifiers based on different voting methods. The kappa statistic, F1-score, and Matthews correlation coefficient metrics were used in the assembly of the voting methods. The support vector machine outperformed the base classifiers with a kappa of 0.91 and also outperformed the ensemble classifiers on an unseen dataset. Five voting methods performed better than the rest of the classifiers. A diversity study based on four different metrics revealed that an ensemble can be avoided if a base classifier shows an identifiable superiority. Therefore, ensemble approaches should include a careful selection of base classifiers based on a diversity analysis.


Introduction
Remote sensing image classification is considered among the main topics of remote sensing; it aims to extract land cover types on the basis of the spectral and spatial properties of targets in a study area [1]. Land cover/land use mapping is essential for many applications from a local to a global scale, e.g., environmental monitoring and management, detection of global change, desertification evaluation, decision-making support, urban change detection, landscape fragmentation, and tropical deforestation [2][3][4]. A vast amount of remote sensing data is archived and can be accessed freely or at low cost, while new data become available every day for the whole planet. The rapid growth of computational approaches, the evolution of sensors' characteristics, and the availability of satellite data have fueled the development of novel methods in image classification. The most widely used methods are the supervised ones, including the traditional approaches of maximum likelihood and minimum distance, as well as, more recently, the modern machine learning classifiers, especially as pixel-based classifiers [5][6][7][8][9][10].
despite the spectral similarities between classes [31]. Thus, the multitemporal data are essential for discriminating the vegetation types, resulting in higher classification accuracy results [32].
Most remote sensing classification studies have relied on a single classifier or on a comparison of a number of them [33][34][35]. Since all classifiers perform within an accuracy range, an ensemble approach may show improved accuracy levels and increased reliability in remote sensing image classification [36]. To this end, several methods are reported in the literature to address the issue of how to develop an ensemble classifier that combines the decisions from multiple base classifiers [37][38][39][40][41], applicable to either hard or soft classifications [42]. Three categories of methods can be identified in the literature [36,43]: (i) algorithms that are based on training data manipulation, including the well-known "bagging" and "boosting" [44,45], applied on a single base classifier, e.g., SVM or DT [46,47], (ii) algorithms that are based on a chain of classifiers performing in a sequential mode, i.e., the output of one classifier is the input for the next one in the chain, and (iii) algorithms that are based on parallel processing of the base classifiers and the combination of their outputs. The main method to combine the decisions of the base classifiers is a weighted or unweighted vote [48,49]. The weights usually depend on the majority, the estimated probability, and the accuracy metrics of the base classifiers. Shen et al. [1] compared the producer's accuracy and the overall accuracy and concluded that the overall accuracy had stability issues, while the producer's accuracy performed better in the classification of different land cover types.
This paper aims to apply a number of machine learning approaches, i.e., DT, Linear Discriminant Analysis (DIS), SVM, KNN, RF, and ANN, to classify multitemporal Sentinel-2 images and to assess whether an ensemble of these base classifiers can further enhance the output accuracy. The classification is applied to an insular environment in the Mediterranean coastal region. Although various studies have been conducted for Mediterranean environments, an ensemble classification on multitemporal Sentinel-2 data has, to the best of our knowledge, not been examined for this type of ecosystem. Previous studies focused on specific ecosystem types, i.e., applying machine learning on forested areas [50] or wetlands [51]. Our implementation is somewhat different: each base classifier uses its own validation dataset rather than a common one, while the final evaluation of the ensemble against the base classifiers uses a common, unseen testing dataset.

Materials and Methods
This section presents a detailed description of our study area of Lesvos Island, Greece. A thorough description of the input data and the classification methods is followed by the accuracy metrics. Finally, the ensemble voting methods, the diversity measures, and the accuracy metrics are analytically presented.

Study Area and Data
The island of Lesvos is located in the northeastern Aegean Sea of Greece, covers an area of 1636 km², and has a total shoreline length of 382 km. The island has a variety of geological formations, climatic conditions, and vegetation types (Figure 1). The climate is categorized as "Mediterranean", with warm and dry summers and mild and moderately rainy winters. The annual precipitation averages 710 mm; the average annual air temperature is 17 °C, with high oscillations between maximum and minimum daily temperatures. The terrain is rather hilly and rough, with the highest peak at 960 m a.s.l. Slopes greater than 20% are dominant, covering almost two-thirds of the island. The soils of Lesvos are widely cultivated, mainly with rain-fed crops such as cereals, vines, and olives.
Due to low productivity, many sites were abandoned 50-60 years ago; after abandonment, these areas were moderately grazed, and the shrub regeneration has been occasionally cleared by illegal burning to improve forage production [52]. The vegetation of these areas, defined on the basis of the dominant species, includes phrygana or garrigue-type shrubs in grasslands, evergreen-sclerophyllous or maquis-type shrubs, pine forests, deciduous oaks, olive groves, and other agricultural lands. In order to perform the classification, three cloud-free satellite images were retrieved in JPEG 2000 format from the Copernicus Open Access Hub [53], acquired by the Sentinel-2A (S2A) and Sentinel-2B (S2B) MSI satellites. The dataset consists of images acquired on 28/04/2018 (S2A), 12/07/2018 (S2B), and 04/11/2018 (S2A); the product type is Level-2A. We selected three images of spring, summer, and autumn for our multitemporal approach. According to previous works, a combination of spring, summer, and autumn images provides the highest classification accuracy and high class separability [27,54]. Level-2A products are radiometrically, atmospherically, and geometrically corrected, providing bottom-of-atmosphere (BOA) reflectance in the Universal Transverse Mercator (UTM)/WGS84 projection. We used 10 bands (Table 1) out of the 13 available. The final image composition includes 30 bands in total.

Methodology
The methodology consists of three stages: the ground truth data collection, the classification by applying six base classifiers including the estimation of accuracy and diversity metrics, and finally the application of nine ensemble voting methods.

Ground Truth Data Collection
As in several other studies [55][56][57], the input dataset was created by visual interpretation of Google Earth's very high resolution images, along with auxiliary data collected during field trips and a land cover map previously produced on the basis of a Worldview-2 image. A total of 1,119 homogenous polygons were identified and outlined, with a total area of 127.4 km² across the island (Table 2). Within these polygons, random points were created to extract the values from the 30 layers of the image composition. The next step was to randomly split the dataset into training and testing partitions of 80% and 20% of the initial cases, respectively. The training dataset was used to train all base classifiers and was further randomly split into a secondary training and a validation dataset of 75% and 25%, respectively. Figure 2 presents the multitemporal spectral responses per class. The data of Figure 2 reveal significant differences in the spectral signatures within the date range, especially in the near infrared (NIR) region, except for pine forest. Furthermore, these data also reflect the phenological stages of deciduous species, i.e., chestnut trees, and agricultural lands. The variation of chlorophyll content in vegetation results in a significant variation of reflectance, especially in the infrared bands. These variations cause different phenological patterns for each vegetation cover type. According to previous studies, the different phenologies described by these spectral responses are expected to improve the classification accuracy compared to a single-date image, especially in study areas where crops and vegetation are the dominant land cover types [58].
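The two-stage random split described above (80/20 into training and testing, then 75/25 of the training partition into secondary training and validation) can be sketched as follows. The sample counts, random seeds, and arrays here are illustrative stand-ins, not the study's actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the pixel samples: each case has 30 band
# values (3 dates x 10 bands) and a land cover class label.
rng = np.random.default_rng(0)
X = rng.random((1000, 30))
y = rng.integers(0, 12, size=1000)

# 80/20 split into a training and an unseen testing partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# The training partition is further split 75/25 into a secondary
# training set and a validation set for each base classifier.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)
```

Stratifying both splits keeps the class proportions similar across partitions, which matters for the imbalanced dataset discussed later in the paper.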

Base Classifiers
On the basis of the literature, six base classifiers were selected for the case study. A widely used non-parametric approach is the decision tree (DT) classifier, characterized also by its intuitive simplicity [59]. Within a DT, the input data is recursively split, based on a set of rules, into smaller and more homogenous groups forming the branches of the tree down to the end nodes, which hold the target values. In our case, these target nodes, forming the leaves, represent the classes. One of the major advantages of DT is that it has no prerequisites about the input data distribution. Moreover, good generalization can be achieved by pruning the DT, i.e., removing some branches or turning branches into leaves. Therefore, pruning increases accuracy by avoiding overfitting [11,60].
Another widely used classification approach, based on Fisher's score optimization, is DIS [61]. DIS approaches have been extensively used in the classification of hyperspectral data, either in their initial or in modified forms [62][63][64]. One of the major drawbacks of DIS in hyperspectral classification is that the problem becomes ill-posed when the number of samples is smaller than the number of bands. In our case, the data are more than sufficient to avoid this phenomenon during the classification of a 30-dimensional space. A third method that we applied is the Support Vector Machine (SVM), which aims to find a hyperplane that separates categorical data in a high-dimensional space with the maximum possible margin between the hyperplane and the cases [65]. The cases closest to the hyperplane are called support vectors. However, in most cases, the classes are not linearly separable; hence, a slack variable is included and a kernel function is used to perform a non-linear mapping into the feature space. The most widely used kernels in remote sensing applications are polynomials and radial basis functions (RBF) [11,66].
The fourth approach was a k-Nearest Neighbor (KNN) classifier. This algorithm calculates the distances between an unclassified case and the k nearest training cases and assigns the unclassified case to the majority class among them. The user can choose from a variety of distance metrics; however, the most widely used is the Euclidean distance, which can be applied either unweighted or weighted [67].
The next applied classification method was the random forest (RF), which expands and enhances the concept of DTs. Multiple DTs are trained, each on its own random sample of the training data, and a majority vote over all the DTs defines the final class of each case. One of the advantages is that RF does not make any assumption about the probability distribution of the input data [13]. A more detailed description of the RF algorithm in remote sensing applications can be found in [68][69][70].
Finally, we developed and applied an artificial neural network (ANN) classifier. ANNs have been very popular and extensively used in pattern recognition and in modeling complex problems. In the last 30 years, ANNs have played a fundamental role in remote sensing land cover classification applications [71][72][73][74], while the new trend in the classification of very high resolution images is convolutional neural networks (CNNs) [75,76]. CNNs have proven in recent years to be very powerful classifiers in image recognition, object detection, image segmentation, and instance segmentation [77]. However, CNNs are fundamentally based on spatial-contextual dependencies of the input data, with the majority of them being trained on high resolution RGB images [78,79]. In contrast to patch-based CNNs, pixel-based CNNs have been developed; however, according to the literature, the common problems of pixel-based classification, i.e., the salt-and-pepper effect and the boundary fuzziness effect in the classification result, are quite severe in CNN implementations [80]. Other disadvantages of CNNs are the higher processing time and resource demands. We believe that, for the pixel-based hard classification of the present multispectral and multitemporal approach, ANNs are more suitable, given also the nature of the other base classifiers. ANNs are characterized by their architecture, their training algorithm, and their activation function. The most well-known and efficient type is the Multilayer Perceptron (MLP) with three layers (input, hidden, and output), and ANNs with one hidden layer are able to map any nonlinear function. Various gradient descent learning methods have been proposed.
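As an illustration only, rough scikit-learn analogues of the six base classifiers might be instantiated as follows. The study itself was carried out in MATLAB, so the names and hyperparameters below are assumptions for sketching purposes, not the tuned settings:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

base_classifiers = {
    "DT":  DecisionTreeClassifier(criterion="gini"),
    "DIS": LinearDiscriminantAnalysis(),
    # degree-3 polynomial kernel with a one-vs-one coding scheme
    "SVM": SVC(kernel="poly", degree=3, decision_function_shape="ovo"),
    "KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),
    "RF":  RandomForestClassifier(n_estimators=30),
    # scikit-learn offers no scaled conjugate gradient; 'adam' stands in
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=300),
}
```

Each model in the dictionary exposes the same `fit`/`predict` interface, which makes it straightforward to train all six on the same secondary training set and evaluate each on its validation set.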
The evaluation, comparison, and voting during the ensemble of the applied methods were based on the following metrics, where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively:

OA = (TP + TN) / (TP + TN + FP + FN)
UA = TP / (TP + FP)
PA = TP / (TP + FN)
kappa = (po - pe) / (1 - pe), where po is the observed agreement and pe the agreement expected by chance
F1 = 2 × (UA × PA) / (UA + PA)
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The overall accuracy (OA) is a single and very basic summary measure of the probability of a case being correctly classified and is based on the sum of the diagonal elements of the confusion matrix. User's accuracy (UA) and producer's accuracy (PA) provide an accuracy performance for each class. UA is a performance measure of the credibility of the output map, that is, how well the map represents the actual cover types.
On the other hand, PA measures how well the reference data is represented by the map. UA and PA are related to the commission and the omission errors, respectively. The kappa coefficient, in turn, is a more advanced metric, which compares the observed accuracy against random chance. Unlike OA, the kappa coefficient also takes into consideration the non-diagonal elements. Furthermore, the F1-score is a rather different measure of accuracy, defined as the weighted harmonic mean of the classification's precision and recall; it balances precision and recall and provides a more realistic measure of performance. Finally, the Matthews correlation coefficient (MCC) is a more balanced metric that takes into account all parts of the confusion matrix and can handle under-represented classes. Each classifier produced a contingency matrix presenting the classification results on its validation dataset and the corresponding accuracy metrics. It should be noted that for the calculation of kappa, F1, and MCC for each class, each confusion matrix was converted to multiple binary matrices based on the 'one-vs-all' scheme. Moreover, four different diversity statistics were calculated on the basis of the results of the base classifiers on the testing datasets, as depicted by the following 2 × 2 table of the relationship between a pair of classifiers Ci and Ck (Table 3) [81]. Table 3. Relational table between a pair of classifiers.
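The 'one-vs-all' conversion described above can be sketched as follows; `per_class_metrics` is a hypothetical helper written for illustration, not code from the study:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, matthews_corrcoef

def per_class_metrics(y_true, y_pred, classes):
    """Binarize a multiclass result ('one-vs-all') and compute the
    kappa, F1, and MCC for each class separately."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = {}
    for c in classes:
        t = (y_true == c).astype(int)   # 1 = actually belongs to class c
        p = (y_pred == c).astype(int)   # 1 = predicted as class c
        scores[c] = {"kappa": cohen_kappa_score(t, p),
                     "F1": f1_score(t, p),
                     "MCC": matthews_corrcoef(t, p)}
    return scores
```

Each binary vector pair corresponds to one of the binary confusion matrices mentioned in the text, so the per-class scores can then feed the ensemble's voting weights.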
where N11 is the number of cases correctly classified by both classifiers, N10 the cases correctly classified only by Ci, N01 the cases correctly classified only by Ck, and N00 the cases incorrectly classified by both classifiers. The diversity measures were the Q-statistic, the disagreement measure, the double-fault measure, and the inter-kappa statistic, given by [36,81-83]:

Q(i,k) = (N11 × N00 - N01 × N10) / (N11 × N00 + N01 × N10)
D(i,k) = (N01 + N10) / N
DF(i,k) = N00 / N
kappa(i,k) = 2 × (N11 × N00 - N01 × N10) / ((N11 + N10)(N10 + N00) + (N11 + N01)(N01 + N00))

where N = N11 + N10 + N01 + N00 is the total number of cases.
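Under these definitions, the four pairwise diversity statistics can be sketched as below. This is an illustrative helper (assuming no denominator is zero), not code from the study:

```python
import numpy as np

def pairwise_diversity(correct_i, correct_k):
    """Diversity statistics for a pair of classifiers, given boolean
    vectors marking which testing cases each one classified correctly."""
    ci = np.asarray(correct_i, dtype=bool)
    ck = np.asarray(correct_k, dtype=bool)
    n11 = int(np.sum(ci & ck))     # both correct
    n10 = int(np.sum(ci & ~ck))    # only C_i correct
    n01 = int(np.sum(~ci & ck))    # only C_k correct
    n00 = int(np.sum(~ci & ~ck))   # both incorrect
    n = n11 + n10 + n01 + n00
    return {
        "Q": (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10),
        "disagreement": (n01 + n10) / n,
        "double_fault": n00 / n,
        "kappa": 2 * (n11 * n00 - n01 * n10)
                 / ((n11 + n10) * (n10 + n00) + (n11 + n01) * (n01 + n00)),
    }
```

Low double-fault values mean the pair rarely fails on the same cases, while Q and the pairwise kappa approach 1 when the two classifiers agree almost everywhere.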

Ensemble Voting Methods
After obtaining results from the six classifiers, an ensemble classifier was constructed in order to classify the 4,910 cases of the testing dataset. For each testing case, all the base classifiers provided a prediction, and nine different voting schemes were tested. For all the voting methods, the kappa, F1, and MCC values are the corresponding metrics calculated for each class of the validation dataset during the training phase. The final comparison and selection of the best voting method was based on the kappa value. It should be noted that, although the OA was computed, it was not used during the ensemble phase: due to the imbalanced input dataset, the overall accuracy does not perform adequately, so we used the kappa, the F1-score, and the MCC instead.
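A single weighted hard-voting step of the kind used here might be sketched as follows. `weighted_vote` is a hypothetical helper, and the nine actual schemes of the study are not reproduced; per-class validation metrics serve as the weights:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine hard predictions from several base classifiers for one
    case.  `predictions` maps classifier name -> predicted class;
    `weights` maps (classifier, class) -> a validation-phase score such
    as the per-class kappa, F1, or MCC.  The class with the largest
    summed weight wins (a 'sum of metrics' scheme); an unweighted
    majority vote corresponds to all weights being equal to 1."""
    totals = defaultdict(float)
    for clf, cls in predictions.items():
        totals[cls] += weights[(clf, cls)]
    return max(totals, key=totals.get)
```

Replacing the sum with a maximum over the individual weights would give a 'greatest metric' variant, in the spirit of the 'MaxK' method discussed in the Results.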
The nine ensemble models were further statistically compared by applying McNemar's test [84], which has been widely used in comparisons of classifier performance [85]. All models were compared in pairs, and McNemar's value was given by:

χ2 = (n01 - n10)² / (n01 + n10)

where n01 is the number of samples misclassified only by algorithm A and n10 is the number of samples misclassified only by algorithm B. The null hypothesis is that both classification methods have the same error rate. McNemar's test is based on a χ2 test with one degree of freedom, where the critical χ2 value at a 5% level of significance (95% confidence) is 3.841. If the computed McNemar's value for a pair exceeds 3.841, the null hypothesis is rejected and the two classification methods are significantly different. ArcGIS 10.2 [86] was used for the spatial processing and visualization of the data and MATLAB 2018a [87] for the base classifications, while the ensemble of the classifiers through the voting methods was carried out in R with the packages caret, dplyr, and magrittr [88][89][90][91]. Figure 3 presents the overall workflow of the current research.
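A minimal sketch of McNemar's statistic (without continuity correction, consistent with the 3.841 critical value for a χ2 test with one degree of freedom) could look like this:

```python
def mcnemar_chi2(n01, n10):
    """McNemar's chi-square statistic: n01 and n10 are the counts of
    samples misclassified only by algorithm A and only by algorithm B."""
    return (n01 - n10) ** 2 / (n01 + n10)

# Example: with 10 vs. 2 discordant errors the statistic exceeds 3.841,
# so the two classifiers would differ significantly at the 5% level.
```

Only the discordant cases (those where exactly one of the two classifiers errs) enter the statistic; cases where both succeed or both fail carry no information about which classifier is better.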

Results and Discussion
Each base classifier was carefully designed and trained under different settings. This section presents the training results of the base classifiers and the classification ensemble. A diversity analysis of the base classifiers and a significance test of the voting methods provide a more comprehensive view of the results.

Base Classifiers Training
For the training of the DT classifier, we tested 4, 10, and 100 as the maximum number of splits. The best results were obtained with 100 maximum splits based on Gini's diversity index. The SVM classifier was trained with four different kernels: linear, quadratic, cubic, and RBF. The cubic kernel provided the best accuracy and was further analyzed. We are aware that our approach is a multiclass imbalanced problem. It was decided that the best approach was to apply a "one-vs-one" instead of a "one-vs-all" coding scheme in order to reduce the effect of the imbalance problem [92,93]. At the same time, the kappa, UA, PA, F1, and MCC provide a better interpretation of accuracy in an imbalanced dataset than the overall accuracy does. For the KNN, we tested 1, 10, and 100 nearest neighbors, and the best results were provided with 10 neighbors. During the training, we applied the Euclidean distance both unweighted and weighted; the overall accuracy increased when we used the inverse square of the distance for each case as the weight. The RF model was tested with 30, 40, 50, 70, and 100 trees. The final model included 30 trees, based on overall accuracy. Finally, during ANN training, we applied different architectures with one hidden layer of 16 to 35 hidden nodes. We also tested two gradient descent learning methods: the Levenberg-Marquardt [94] and the scaled conjugate gradient [95]. Each network was trained 10 times with different random initial weights. The model with the best performance had 16 hidden nodes and was trained with scaled conjugate gradient for 272 epochs.
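The neighbor-count sweep for KNN described above can be sketched as follows. Synthetic data stands in for the actual Sentinel-2 samples, and scikit-learn's 'distance' weighting (inverse distance rather than the study's inverse squared distance) is used as an approximation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 30-band training/validation data.
X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           n_classes=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

best_k, best_acc = None, -1.0
for k in (1, 10, 100):  # candidate neighbor counts from the text
    model = KNeighborsClassifier(n_neighbors=k, weights="distance")
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc
```

The same hold-out loop generalizes to the other tuned settings (maximum splits for DT, tree counts for RF, hidden node counts for the ANN), with each candidate evaluated on the classifier's own validation set.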
The classification accuracy of each classifier was evaluated based on the confusion matrix of the validation dataset, presented in Tables A1-A6. Table 4 shows the user's accuracy (UA) and producer's accuracy (PA) per class and the OA for each classifier, while Figure 4 presents the heat map of UA and PA, where the colors are normalized for each classifier. According to the results, the SVM outperformed all the classifiers in terms of OA and kappa (Figure 5). In most of the land cover classes, SVM presented the lowest omission and commission errors, while the DT and the DIS had the poorest performance, with kappa values of 0.79 and 0.83, respectively. Aquatic bodies were almost perfectly classified by all classifiers, while the brushwood, built-up, Pinus brutia, and agricultural land classes also showed high accuracy. Figure 6 presents the diversity of UA and PA among all classifiers for each class. The base classifiers had significantly different performances in omission error for the other broadleaves, barren land, and grassland classes, and different performances in commission error for oak forest, Pinus nigra, and other broadleaves. It should be noted that the UA of SVM for other broadleaves is an outlier, i.e., its value is more than 1.5 times the interquartile range above the upper quartile. The results are consistent with previous studies. Shang et al. [96] applied SVM, RF, and AdaBoost for classification over an Australian eucalyptus forest; according to their results, all three machine learning algorithms outperformed DIS. DT and DIS methods have also shown poor performance in a comparison on Sentinel-2 data, where RF performed best, followed by SVM and ANN [97]. However, the diversity of results in the literature reveals that the applied methods are data-driven and depend on the classification scheme, the number of training data, and the type of input data, i.e., whether only the bands are taken into account or vegetation indices and other auxiliary data are also used [19].

Classification Ensemble
During the ensemble phase, we tested the nine voting methods and the base classifiers on the testing dataset. Figure 7 shows the kappa coefficient for the base classifiers and the voting methods applied to the testing dataset. According to the results, the SVM outperforms not only the base classifiers but also all the voting methods. However, all the voting methods based on sums, as well as the method based on the majority of the votes, performed better than all the remaining base classifiers. It is worth mentioning that DT presents the lowest kappa among all classifiers, while DIS has the second to last performance. Table 5 presents the kappa coefficient per class for each classifier on the testing dataset, while Figure 8 presents the corresponding heatmap. SVM shows a better performance in almost all classes. The voting methods based on sums and the majority of the votes performed slightly better in the built-up class. The confusion matrices of the testing dataset for the best classifier and the best ensemble methods are presented in Tables A7-A9. According to the results, it is evident that the combination of the classifiers does not always provide a better performance compared to the base classifiers. In a crop classification study in a fragmented arable landscape, Salas et al. [98] concluded that when no classifier clearly performs better than the others, an ensemble approach can be the best alternative. In our case, SVM, RF, ANN, and KNN show a similar performance; however, the SVM method performs better than all the applied voting methods. Therefore, a diversity study was applied in order to identify any potential dissimilar performance between SVM and the other base classifiers. Table 6 presents the results of the four diversity statistics for all possible pairs of the base classifiers on the testing dataset.
According to the inter-kappa measure (Table 6a), all classifiers show a moderate agreement between them, except for SVM, which shows a fair agreement with DT and DIS and a moderate agreement with the rest of the classifiers. The high values of the Q-statistic (Table 6b) and the low values of the disagreement measure (Table 6c) suggest that there is no significant diversity among the classifiers. The same conclusion follows from the double-fault measure (Table 6d). However, from Figure 9, it is evident that SVM presents a diverse performance, especially according to the double-fault and the inter-kappa measures (Figure 9a,d). SVM's double-fault measures are tightly grouped, and their values are quite low. Furthermore, the group of SVM's inter-kappa measures is lower, while the rest of the classifiers have a similar performance. Therefore, combining the classification performance of the base classifiers with the diversity results reveals that a voting method does not provide a better performance when a base classifier has a small but identifiable advantage over the rest of the classifiers. On the other hand, McNemar's test of the ensemble methods showed that the voting method based on the greatest kappa is significantly different from the rest of the voting methods (Table 7). More specifically, the χ2 value exceeds the critical value of 3.84, and thus 'MaxK' is statistically significant at a 95% confidence level for all the pair comparisons. Interestingly, the rest of the comparisons revealed that the null hypothesis cannot be rejected according to McNemar's test; hence, the differences in accuracy between the ensemble methods are not statistically significant.

Conclusions
This work illustrates the potential of a number of classifiers for identifying land cover types. Land cover type mapping is essential for the land management of Mediterranean ecosystems. Long-term human activities, along with geographic and climatic conditions, have created a heterogeneous, fragmented ecosystem that changes rapidly [99]. One of the main disturbances of Mediterranean ecosystems is wildfire. Land cover type mapping provides valuable information, i.e., vegetative fuel and socioeconomic inputs, for wildfire risk assessment [100,101]. Furthermore, remote sensing classification supports change detection, the identification of possible land degradation, and ecosystem monitoring [102,103].
An ensemble approach with nine voting methods has been developed to increase accuracy over individual classification algorithms, using multitemporal Sentinel-2 data from a mixed Mediterranean ecosystem. Each base classifier was trained with its own dataset in order to create the accuracy metrics that were used within the voting methods. All the base classifiers and the ensemble methods were applied to an unseen testing dataset. The results show that the combination of multiple classifiers based on the examined voting schemes does not always provide a better performance in land cover classification. The SVM algorithm outperformed all the classifiers and proved to be the most accurate approach, especially for this quite unbalanced dataset.
The diversity measures can explain the outperformance of SVM. The double-fault measure clearly shows that SVM significantly differs from the rest of the classifiers. Therefore, diversity measures should be thoroughly examined before building an ensemble method. The diversity metrics can provide evidence for identifying possible overperformance among base classifiers; hence, an ensemble may not always be necessary. On the other hand, possible underperformances can be identified, leading to the exclusion of some base classifiers. To sum up, our voting methods were influenced by the number of classifiers with a lower performance compared to SVM. Hence, the accuracy of the ensembles is lower than that of the best base classifier, probably due to the 'curse of conflict' problem [104].
Potential further improvements of this methodology include the incorporation of additional base algorithms and more ensemble methods. Moreover, as opposed to pixel-based approaches, ensembles of segmentation approaches can be explored, including traditional segmentation algorithms and CNNs. Another interesting extension of this work would be the comparison of multiple Mediterranean areas based on the very same ensemble of algorithms. Nevertheless, this work has shown that contemporary computational approaches, along with advanced algorithmic measures, show potential for land cover classification of unbalanced data.

Acknowledgments:
We thank the four anonymous reviewers for providing constructive comments for improving the overall manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A Table A1. Confusion matrix for the validation dataset during the training phase of the base classifiers for the decision tree (DT) model.