A Cloud-Based Multi-Temporal Ensemble Classifier to Map Smallholder Farming Systems

Smallholder farmers cultivate more than 80% of the cropland area available in Africa. The intrinsic characteristics of such farms include complex crop-planting patterns, and small fields that are vaguely delineated. These characteristics pose challenges to mapping crops and fields from space. In this study, we evaluate the use of a cloud-based multi-temporal ensemble classifier to map smallholder farming systems in a case study for southern Mali. The ensemble combines a selection of spatial and spectral features derived from multi-spectral Worldview-2 images, field data, and five machine learning classifiers to produce a map of the most prevalent crops in our study area. Different ensemble sizes were evaluated using two combination rules, namely majority voting and weighted majority voting. Both strategies outperform any of the tested single classifiers. The ensemble based on the weighted majority voting strategy obtained the higher overall accuracy (75.9%). This means an accuracy improvement of 4.65% in comparison with the average overall accuracy of the best individual classifier tested in this study. The maximum ensemble accuracy is reached with 75 classifiers in the ensemble. This indicates that the addition of more classifiers does not help to continuously improve classification results. Our results demonstrate the potential of ensemble classifiers to map crops grown by West African smallholders. The use of ensembles demands high computational capability, but the increasing availability of cloud computing solutions allows their efficient implementation and even opens the door to the data processing needs of local organizations.


Introduction
Smallholder farmers cultivate more than 80% of the cropland area available in Africa [1], where the agricultural sector provides about 60% of the total employment [2]. However, the inherent characteristics of smallholder farms, such as their small size (frequently less than 1 ha and with vaguely delineated boundaries), their location in areas with extreme environmental variability in space and time, and the use of mixed cropping systems, have prevented a sustainable improvement of smallholder agriculture in terms of volume and quality [3]. Yet, an increase in African agricultural productivity is imperative because the continent will experience substantial population growth in the coming decades [4]. Because croplands are scarce, the productivity increase should have the lowest reasonable environmental impact and should be as sustainable as possible [5]. A robust agricultural monitoring system is therefore a prerequisite to promote informed decisions not only at executive or policy levels but also at the level of daily field management. Such a system could, for example, help to reduce price fluctuations by deciding on import and export needs for each crop [6], to establish agricultural insurance mechanisms, or to estimate the demand for agricultural inputs [6,7].
Crop maps are a basic but essential layer of any agricultural monitoring system and are critical to achieve food security [8,9]. Most African countries, however, lack reliable crop maps. Remote sensing image classification is a convenient approach for producing these maps due to advantages in terms of cost, revisit time, and spatial coverage [10]. Indeed, remotely sensed image classification has been successfully applied to produce crop maps in homogeneous areas [11–14].
Smallholder farms, which shape the predominant crop production systems in Africa, present significant mapping challenges compared to homogeneous agricultural areas (i.e., with intensive or commercial farms) [8]. The difficulties lie not only in the need for very high spatial resolution data, but also in the spectral identification of farm fields and crops, because smallholder fields are irregularly shaped and their seasonal variation in surface reflectance is strongly influenced by irregular and variable farm practices in environmentally diverse areas. Because of these peculiarities, the production of reliable crop maps from remotely sensed images is not an easy task [15].
In general, a low level of accuracy in image classification is tackled by using more informative features, or by developing new algorithms or approaches to combine existing ones [16]. Indeed, several studies have shown that classification accuracy improves when combining spectral (e.g., vegetation indices), spatial (e.g., textures), and temporal (e.g., multiple images during the cropping season) features [17]. Compared to single bands, spectral indices are less affected by atmospheric conditions, illumination differences, and soil background, and thus bring forward an enhanced vegetation signal that is normally easier to classify [18]. Spatial features benefit crop discrimination [19], especially in heterogeneous areas where high local variance is more relevant when very high spatial resolution images are applied [20,21]. Regarding temporal features, multi-temporal spectral indices have been exploited in crop identification because they provide information about the seasonal variation in surface reflectance caused by crop phenology [13,22–24].
The second approach to increase classification accuracy (i.e., by developing new algorithms) has been extensively used by the remote sensing community, which has rapidly adopted and adapted novel machine learning image classification approaches [25–27]. The combination of existing classifiers (ensemble of classifiers) has, however, received comparatively little attention, although it is known that ensemble classifiers increase classification accuracy because no single classifier outperforms the others [28]. A common approach to implement a classifier ensemble, also known as a multi-classifier, consists of training several "base classifiers", which are subsequently applied to unseen data to create a set of classification outputs that are then combined using various rules to obtain a final classification output [28,29]. At the expense of increased computational complexity, ensemble classifiers can handle complex feature spaces and reduce misclassifications caused by using non-optimal, overfitted, or undertrained classifiers and, hence, they improve classification accuracy. Given the increasing availability of computing resources, various studies have shown that ensemble classifiers outperform individual classifiers [30–32]. Yet, the use of ensemble classifiers remains scarce in the context of remote sensing [33] and is limited to image subsets, mono-temporal studies, or to the combination of only a few classifiers [34–36].
Ensemble classifiers produce more accurate classification results because they can capture and model complex decision boundaries [37]. The use of ensembles for agricultural purposes as reported in various studies has shown that they outperform individual classifiers [34,35,38]. Any classifier that provides a higher accuracy than one obtained by chance is suitable for integration in an ensemble [39], and may contribute to shaping the final decision boundaries [29]. In other words, the strength of ensembles comes from the fact that the base classifiers misclassify different instances. Several techniques can be applied to obtain such diverse classifiers, for example, selecting classifiers that rely on different algorithms, applying different training sets, training on different feature subsets, or using different parameters [40,41].
In this study, we evaluate the use of a cloud-based ensemble classifier to map African smallholder farming systems. Thanks to the use of cloud computing, various base classifiers and combination rules were efficiently tested. Moreover, it allowed training of the ensemble with a wide array of spectral, spatial, and temporal features extracted from the available set of very high spatial resolution images.

Materials and Methods
This section provides a description of the available images and the approach used to develop our ensemble classifiers.

Study Area and Data
The study area covers a square of 10 × 10 km located near Koutiala, southern Mali, West Africa. This site is also an ICRISAT-led site contributing to the Joint Experiment for Crop Assessment and Monitoring (JECAM) [42]. For this area, a time series of seven multi-spectral Worldview-2 images was acquired for the cropping season of 2014. Acquisition dates of the images range from May to November, covering both the beginning and the end of the crop growing season [42]. The exact acquisition dates are: 22 May, 30 May, 26 June, 29 July, 18 October, 1 November, and 14 November. The images have a pixel size of about 2 m and contain eight spectral bands in the visible, red-edge, and near-infrared parts of the electromagnetic spectrum. Figure 1 illustrates the study area and a zoomed-in view of the area with agricultural fields. All the images were preprocessed using the STARS project workflow, which uses the 6S radiative transfer model for atmospheric correction [43]. The images were atmospherically and radiometrically corrected and co-registered, and trees and clouds were masked. Crop labels for five main crops, namely maize, millet, peanut, sorghum, and cotton, were collected in the field. A total of 45 fields were labeled in situ in the study area indicated in Figure 1b. This ground truth data was used to train base classifiers and to assess the accuracy of both base classifiers and ensembles.

Methods
Figure 2 presents a high-level view of the developed workflow. First, in the data preparation step (described more fully in Section 2.2.1), we extract a suite of spatial and spectral features from the available images and select the most relevant ones for image classification. Then, multiple classifiers are trained, tested, and applied to the images (Section 2.2.2). Finally, we test various approaches to create ensembles from the available classifiers and assess their classification accuracy using an independent test set (Section 2.2.3).

Data Preparation
A comprehensive set of spectral and spatial features is generated from the (multi-spectral) time series of Worldview-2 images. The spectral features include the vegetation indices listed in Table 1.

Vegetation Index (VI) | Formula
Normalized Difference Vegetation Index (NDVI) [44] | (NIR − R)/(NIR + R)
Green Leaf Index (GLI) [45] | (2 …

Spatial features are based on the Gray Level Co-occurrence Matrix (GLCM). Fifteen features proposed by [51] and three features from [52] are derived. This selection matches the functions available in the Google Earth Engine (GEE) [53]. Formulas of these features are shown in Tables A1 and A2.
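As a small illustration of the spectral features, the NDVI formula from Table 1 can be sketched per pixel and per acquisition date as follows. This is a minimal sketch in plain Python and not the GEE implementation used in the study; the function name, the NaN convention for a zero denominator, and the reflectance values are assumptions made for the example.

```python
import math


def ndvi(nir, red):
    """NDVI = (NIR - R) / (NIR + R) for a single pixel."""
    denominator = nir + red
    if denominator == 0:
        # Assumption: undefined pixels are flagged as NaN instead of a guessed value.
        return math.nan
    return (nir - red) / denominator


# Hypothetical (NIR, red) reflectances of one pixel on three acquisition dates.
pixel_series = [(0.25, 0.20), (0.40, 0.10), (0.50, 0.10)]

# One NDVI value per date; stacking these over all dates gives a temporal feature.
ndvi_series = [ndvi(nir, red) for nir, red in pixel_series]
```

Stacking such per-date values is what turns a single spectral index into a multi-temporal feature for the classifiers.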
Textural features are calculated as the average of their values in four directions (0°, 45°, 90°, 135°), applying a window of 3 × 3 pixels to the original spectral bands of each image. This configuration corresponds to the default setup in GEE and is deemed appropriate for our study since our goal is to create an efficient ensemble and not to optimize the configuration to extract spatial features.
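The directional averaging described above can be sketched for one GLCM feature, the sum average, on a single 3 × 3 window. This is an illustrative sketch only: it assumes gray levels already quantized to integers starting at 0, a symmetric GLCM with a pixel distance of 1, and hypothetical function names; the GEE texture functions used in the study may differ in these details.

```python
# Pixel offsets (row, column) for the four directions, at a distance of one pixel.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(patch, offset, levels):
    """Symmetric, normalized gray level co-occurrence matrix of a small patch."""
    d_row, d_col = offset
    rows, cols = len(patch), len(patch[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = patch[r][c], patch[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1  # count the pair in both orders (symmetric GLCM)
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("patch is too small to contain a pixel pair")
    return [[value / total for value in row] for row in counts]


def sum_average(p):
    """Sum average: expected value of (i + j) under the GLCM, with 0-based levels."""
    n = len(p)
    return sum((i + j) * p[i][j] for i in range(n) for j in range(n))


def mean_sum_average(patch, levels):
    """Average the sum-average feature over the four directions."""
    values = [sum_average(glcm(patch, off, levels)) for off in OFFSETS.values()]
    return sum(values) / len(values)


window = [[0, 0, 0],
          [1, 1, 1],
          [2, 2, 2]]
texture_value = mean_sum_average(window, levels=3)
```

In a full workflow this computation would be repeated for a window centered on every pixel of every band, which is the kind of workload that motivates running it on a cloud platform.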
The extraction of spectral and spatial features, computed for each pixel, results in 140 features for a single image and in 980 features for the complete time series (Table 2). Although GEE is a scalable, cloud-based platform, a timely execution of the classifiers is not possible without reducing the number of features used. Moreover, we know and empirically see (results not shown) that many features contain similar information and are highly correlated. Thus, a guided regularized random forest (GRRF) [54] is applied to identify the most relevant features. This feature selection step helps to make our classification problem both more tractable in GEE and more interpretable. GRRF requires the optimization of two regularization parameters. The most relevant features are those with a regularized gain higher than zero. This optimization is done for ten subsets of training data generated by randomly splitting 2129 training samples. Each subset is fed to the GRRF to select the most relevant spectral and spatial features after optimizing the two regularization parameters. The selected features are then used to train an RF classifier using all the training samples. The best set of spatial and spectral features is determined by ranking the resulting RF classifiers according to their overall accuracy (OA) for 1258 test samples.

Several classifiers are used to create our ensembles, after performing an exploratory analysis with the available classifiers in GEE. Five classifiers are selected based on their algorithmic approach and OA: Random Forest (RF; [55]), Maximum Entropy Model (MaxEnt; [56]), and Support Vector Machine (SVM; [57]) with linear, polynomial, and Gaussian kernels. Other types of classifier, e.g., a deep learning algorithm, could easily be added once they become available in GEE (with the inclusion of TensorFlow). This is expected to happen given the active research being performed in this field. The following paragraphs briefly describe our chosen classifiers and explain how they are used in this study.
RF is a well-known machine learning algorithm [58–61] created by combining a set of decision trees. A typical characteristic of RF is that each tree is created with a random selection of training instances and features. Once the trees are created, classification results are obtained by majority voting. RF has reached around 85% OA in crop type classification using a multi-spectral time series of RapidEye images [62], and higher than 80% for a time series of Landsat 7 images in homogeneous regions [13]. RF has two user-defined parameters: the number of trees and the number of features available to build each decision tree. In our study, an RF with 300 trees is created and we set the number of features per split to the square root of the total number of features. These are standard settings [63].
MaxEnt computes an approximated probability distribution that is consistent with the constraints (facts) observed in the data (predictor values) and is as uniform as possible [64]. This provides maximum entropy while avoiding assumptions on the unknown, hence the name of the classifier. MaxEnt was proposed to estimate geographic species distribution and potential habitat [56], and has also been used to classify vegetation from remote sensing images [65] and to map groundwater potential [66]. In our study, MaxEnt was applied with the default parameter values in GEE: weight for L1 regularization set to 0, weight for L2 regularization set to 0.00001, epsilon set to 0.00001, minimum number of iterations set to 0, and maximum number of iterations set to 100.
SVM is another well-known machine learning algorithm that has been widely applied for crop classification [11,67]. SVM has demonstrated its robustness to outliers and is an excellent classifier when the number of input features is high [12]. The original binary version of SVM aims to find the optimal plane that separates the available data into two classes by maximizing the distance (margins) between the so-called support vectors (i.e., the closest training samples to the optimal hyperplane). Multiple binary SVMs can be combined to tackle a multi-class problem. When the training data cannot be separated by a plane, it is mapped to a multidimensional feature space in which the samples are separated by a hyperplane. This leads to a non-linear classification algorithm that, thanks to the so-called kernel trick, only needs the definition of the dot products among the training data [68]. Linear, radial, and polynomial kernels are commonly used to define these dot products. The linear SVM only requires fixing the so-called C parameter, which represents the cost of misclassifying samples, whereas the radial and polynomial kernels require the optimization of an additional parameter, called gamma and the polynomial degree, respectively. In this work, all SVM parameters were obtained by 5-fold cross validation [69].

All classifiers are trained and applied separately using a modified leave-one-out method in which the training set is stratified and randomly partitioned into k (10) equally sized subsamples. Each base classifier is trained with k − 1 subsamples, leaving one subsample out [40]. Using ten different seeds to generate the subsamples, this method allows us to generate 100 subsets of training data that, in turn, allow 20 versions of each base classifier to be generated and a total of 100 classification models when combining the five classifiers, as presented in Figure 3. This training method prevents overfitting of the base classifiers because 10% of the data is discarded each time. Overfitting prevention is desirable because the ensemble is not trainable. Metrics reported are OA and kappa coefficient. Producer accuracy (PA) per class is also computed and is used to contrast the performance of individual classifiers versus ensemble classifiers.
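The partitioning scheme described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy labels, the function names, and in particular the round-robin assignment of the 100 subsets to the five classifier types are assumptions, since the text does not specify how subsets are allocated.

```python
import random
from collections import defaultdict

CLASSIFIERS = ["RF", "MaxEnt", "SVML", "SVMP", "SVMR"]
CROPS = ["maize", "millet", "peanut", "sorghum", "cotton"]


def stratified_folds(labels, k, seed):
    """Split sample indices into k folds, spreading each class evenly over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def training_subsets(labels, k=10, seeds=range(10)):
    """For every seed, leave each fold out once: len(seeds) * k training subsets."""
    subsets = []
    for seed in seeds:
        folds = stratified_folds(labels, k, seed)
        for held_out in range(k):
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            subsets.append(sorted(train))
    return subsets


def assign_subsets(subsets, classifiers=CLASSIFIERS):
    """Assumption: subsets are handed out round-robin to the classifier types."""
    plan = defaultdict(list)
    for number, subset in enumerate(subsets):
        plan[classifiers[number % len(classifiers)]].append(subset)
    return plan


# Toy data: 20 labeled samples per crop (hypothetical, far fewer than in the study).
labels = [crop for crop in CROPS for _ in range(20)]
subsets = training_subsets(labels)
plan = assign_subsets(subsets)
```

With ten seeds and ten folds this yields 100 training subsets, each holding 90% of the samples, and 20 subsets per classifier type, which matches the counts given in the text.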

Ensemble Classifiers
Two combination rules, namely majority and weighted majority voting, are tested in this study to create ensemble classifiers. In the case of majority voting, the output of the ensemble is the class most frequently assigned by the classifiers, whereas in the weighted majority voting rule, a weight is assigned to each classifier to favor those classifiers with better performance in the voting decision. Both rules are easily implemented and produce results comparable to more complicated combination schemes [30,36,70]. Moreover, these rules do not require additional training data because they are not trainable [40], which means that the required parameters for the ensemble are available as soon as the classifiers are generated and their accuracy assessed.
Majority voting works as follows. Let x denote one of the decision problem instances, let L be the number of base classifiers used, and let C be the number of possible classes. The decision (output) of classifier i on x is represented as a binary vector $d_{x,i}$ of the form (0, ..., 0, 1, 0, ..., 0), where $d_{x,i,j} = 1$ if and only if the classifier labels instance x with class $C_j$. Further, we denote vector summation by ∑ and define the function idx@max as the index at which a maximum value is found in a vector. This function resolves ties as follows: if multiple maximal values are found, the index of the first occurrence is picked and returned. The majority voting rule of an ensemble classifier on decision problem x defines the class number $D_x$ as:
$$D_x = \mathrm{idx@max}\left(\sum_{i=1}^{L} d_{x,i}\right)$$
following [29].
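The majority voting rule defined above can be sketched directly from its definition. The function names are hypothetical, and classes are represented by 0-based integer indices for convenience.

```python
def idx_at_max(vector):
    """Index of the maximum value; ties resolve to the first occurrence."""
    best = 0
    for j, value in enumerate(vector):
        if value > vector[best]:
            best = j
    return best


def one_hot(label, n_classes):
    """Binary decision vector with a single 1 at the predicted class index."""
    return [1 if j == label else 0 for j in range(n_classes)]


def majority_vote(predicted_labels, n_classes):
    """Sum the decision vectors of all base classifiers and take idx@max."""
    totals = [0] * n_classes
    for label in predicted_labels:
        for j, value in enumerate(one_hot(label, n_classes)):
            totals[j] += value
    return idx_at_max(totals)


# Five base classifiers label one pixel; class 2 receives three of the five votes.
winner = majority_vote([2, 0, 2, 1, 2], n_classes=5)
# Classes 1 and 3 tie with two votes each, so the lower index (1) is returned.
tie_winner = majority_vote([3, 1, 1, 3], n_classes=5)
```

Because ties fall to the first index, the ordering of the class list has a small influence on the result whenever votes are evenly split.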
Weighted majority voting is an extension of the above and uses a weight $w_i$ per base classifier i:
$$D_x = \mathrm{idx@max}\left(\sum_{i=1}^{L} w_i \, d_{x,i}\right)$$
Here, $w_i$ is derived from $k_i$, the kappa coefficient of base classifier i over an independent sample set [29].
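A sketch of the weighted rule is given below. The mapping from kappa to weight is an assumption: the log-odds form log(k / (1 − k)) is a common choice in the ensemble literature for accuracy-based weights and is used here only for illustration; it should not be read as the formula applied in this study. Function names and kappa values are hypothetical.

```python
import math


def log_odds_weight(kappa):
    """Assumed weight: log(k / (1 - k)). Requires 0 < k < 1."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    return math.log(kappa / (1.0 - kappa))


def weighted_majority_vote(predicted_labels, kappas, n_classes):
    """Add each classifier's weight to the class it predicts, then take idx@max."""
    totals = [0.0] * n_classes
    for label, kappa in zip(predicted_labels, kappas):
        totals[label] += log_odds_weight(kappa)
    # list.index returns the first occurrence, which reproduces the tie rule.
    return totals.index(max(totals))


# Two weaker classifiers vote for class 0 and one stronger classifier for class 1.
labels = [0, 0, 1]
kappas = [0.55, 0.55, 0.80]
weighted_winner = weighted_majority_vote(labels, kappas, n_classes=2)
```

In this toy case plain majority voting would select class 0, while the weighted rule selects class 1 because the single stronger classifier outweighs the two weaker ones. Note that under the assumed log-odds form a classifier with kappa below 0.5 receives a negative weight.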
As mentioned in Section 2.2.2, our training procedure yields 20 instances of each base classifier. This allows creating two 100-classifier ensembles as well as a larger number of ensembles formed by 5, 10, 15, ..., 95 classifiers. The latter ensembles serve to evaluate the impact of the size of the ensemble. To avoid biased results, we combine the base classifiers while keeping the proportion of each type of classifier. For example, the 10-classifier ensemble is created by combining two randomly chosen base classifiers of each type. This experiment means that we evaluate the classification accuracy of 191 ensembles. Classification accuracy is assessed by means of their OA, their kappa coefficient, and the producer's accuracy of each class. In addition, results of the most effective ensemble configuration and of the individual classifier with the highest accuracy are compared to gain insight into their performance. Examples of their output are analyzed by visual inspection.
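The proportional composition of the ensembles can be sketched as follows. The identifiers of the base classifier instances and the function name are hypothetical; the sketch only illustrates how an ensemble of a given size draws the same number of randomly chosen instances from each classifier type.

```python
import random

CLASSIFIER_TYPES = ["RF", "MaxEnt", "SVML", "SVMP", "SVMR"]

# Hypothetical pool: 20 trained instances of each base classifier type.
pool = {name: [f"{name}_{i:02d}" for i in range(20)] for name in CLASSIFIER_TYPES}


def build_ensemble(pool, size, rng):
    """Pick size / (number of types) random instances of every classifier type."""
    n_types = len(pool)
    if size % n_types != 0:
        raise ValueError("ensemble size must be a multiple of the number of types")
    per_type = size // n_types
    members = []
    for name in sorted(pool):
        members.extend(rng.sample(pool[name], per_type))
    return members


rng = random.Random(42)
sizes = list(range(5, 100, 5))  # 5, 10, 15, ..., 95
ensembles = {size: build_ensemble(pool, size, rng) for size in sizes}
```

Each ensemble built this way contains every classifier type in equal proportion, so differences between ensemble sizes are not confounded with differences in composition.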

Data Preparation
A feature selection method is applied before the classification to reduce the dimensionality of data without losing classification efficiency. In our study, we selected the GRRF method because it selects the features in a transparent and understandable way. The application of the GRRF to the expanded time series (i.e., the original bands plus spectral and spatial features) leads to the selection of 45 features as shown in Table 3; bands, spectral, and spatial features were selected. In general, spatial features were predominantly selected in almost all the images, whereas vegetation indices were selected in only five images. Vegetation indices have more influence when taken from images acquired when the crop has grown than when the field is bare.
A more detailed analysis of Table 3 shows that the selected multi-spectral bands and vegetation indices respectively represent 24.44% and 26.66% of the most relevant features. Textural features represent 48.88% of the most relevant features, which emphasizes the relevance of considering spatial context when analyzing very high spatial resolution images. As an example, Figures 4 and 5 show the temporal evolution of a vegetation index and of one of the GLCM-based spatial features. In Figure 4, changes in TCARI are presented. Figure 4a shows a low vegetation signal since the crop is at an initial stage. In Figure 4b,c, a higher vegetation signal is shown, which relates to a more advanced growth stage. TCARI was selected for three different dates, underlining the importance of changes in vegetation index for crop discrimination. Similarly, Figure 5 displays a textural feature (sum average of band 8) for a specific parcel, which points at variation in spatial patterns as the growing season goes by.

Base Classifiers and Ensembles
The accuracy of the 20 base classifiers created for each classification method was assessed using ground truth data. Table 4 lists the number of pixels per crop class used for the training and testing phases. Figure 6 shows the performance of all base classifiers as a boxplot. The mean OA of each classification method ranges between 59% and 72%. SVMR obtained higher accuracy than SVMP and SVML [26,71]. The lower accuracy of SVML suggests that linear decision boundaries are not suitable for classifying patterns in these data [72]. RF performed slightly better than SVMR, which is consistent with [58]. MaxEnt showed the lowest performance, confirming the need for more research before it can be used operationally in multi-class classification contexts [73].
We then compared the performance of base classifiers and ensembles. Table 5 summarizes the minimum, mean, and maximum overall accuracy and kappa coefficient for both base classifiers and ensembles. The ensemble classifiers outperform the base classifiers in all cases, with a rate of improvement ranging from 5.15% to 29.50%. On average, majority voting provides an accuracy that is 2.45% higher than that of the best base classifier (RF). The improvement is higher, at 4.65%, when a weighted voting rule is applied, because more effective base classifiers have more influence (weight) in the rule that combines their outputs. Table 5 also reports the associated statistics for kappa, but these values should be interpreted with care [74].
The number of classifiers used to build an ensemble was also analyzed. Figure 7 presents the mean and standard deviation of the OA for each ensemble size. The weighted voting scheme outperforms simple majority voting. The accuracy of the ensembles increases as the number of classifiers grows, but maximum accuracy is reached at 75 classifiers for weighted voting and at 45 for majority voting. The majority voting approach thus tends to saturate with fewer classifiers than the weighted majority voting approach. The standard deviation shows a decreasing trend because results become more stable as the size of the ensemble increases. These results are congruent with the theoretical basis of ensemble learning [29,39].
We contrast the results of an ensemble of 75 classifiers (hereafter called ensemble-75) with those of an instance of RF, because RF had the best performance among the base classifiers. We also compared ensemble-75 with the ensemble composed of 100 classifiers (hereafter called ensemble-100). The OA is 0.7591 for ensemble-75, 0.7170 for our chosen RF, and 0.7543 for ensemble-100. Table 6 presents the confusion matrix obtained for the selected RF. Comparing ensemble-75 with ensemble-100, we notice that ensemble-100 has a slightly lower OA and that ensemble-75 produces better results in four of five crops. The improvement of ensemble-100 in Millet is only 0.82%, and there is no difference in Cotton, whereas Sorghum, Maize, and Peanut display lower performance, by 0.43%, 2.39%, and 2.5% respectively. This means that the maximum accuracy is obtained when 75 classifiers are combined, and that adding more classifiers does not improve the performance of the ensembles.
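The two combination rules compared above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: it assumes that each base classifier outputs one integer class label per pixel and that the weight of a classifier is its kappa value (Figure 2 names kappa as the per-classifier score, but the exact weighting formula is an assumption here). The function names and the toy labels are hypothetical.

```python
import numpy as np

def weighted_majority_vote(predictions, weights, n_classes):
    """Combine label predictions of shape (n_classifiers, n_samples).

    Each classifier adds its weight to the class it predicts; the class
    with the largest accumulated weight wins for each sample.
    """
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    n_samples = predictions.shape[1]
    scores = np.zeros((n_samples, n_classes))
    for pred, w in zip(predictions, weights):
        # Add this classifier's weight to its chosen class, sample by sample.
        scores[np.arange(n_samples), pred] += w
    # Ties are resolved toward the lowest class index.
    return scores.argmax(axis=1)

def majority_vote(predictions, n_classes):
    """Plain majority voting: every classifier has the same weight."""
    predictions = np.asarray(predictions)
    equal_weights = np.ones(predictions.shape[0])
    return weighted_majority_vote(predictions, equal_weights, n_classes)

# Toy example: three classifiers, four pixels, three classes (made-up labels).
preds = np.array([
    [0, 1, 2, 0],   # classifier A
    [1, 1, 2, 1],   # classifier B
    [1, 0, 2, 1],   # classifier C
])
kappas = np.array([0.70, 0.30, 0.30])  # hypothetical kappa per classifier

print(majority_vote(preds, 3))                   # [1 1 2 1]
print(weighted_majority_vote(preds, kappas, 3))  # [0 1 2 0]
```

In the toy example, classifier A is outvoted under simple majority voting on the first and last pixel, but its larger weight (0.70 against 0.30 + 0.30) lets it prevail under the weighted rule.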
Figure 8 presents example fields to illustrate the classification results produced by ensemble-75, ensemble-100, and the selected RF. We extracted only the fields for which ground truth data were available. We observe that in both ensembles, millet is less confused with peanut and cotton than in the RF classification. Cotton is less confused with sorghum as well. Besides, confusion between maize and sorghum is lower in the ensembles than in RF. This is also true for millet. Misclassifications could be due to differences in management activities in those fields (i.e., weeding), because multiple visits by various teams confirmed that a single crop was grown. Moreover, visual analysis shows that a map produced by an ensemble is less heterogeneous than the map produced by a base classifier (RF). Differences between the maps produced by ensemble-75 and ensemble-100 are hardly noticeable visually.
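The accuracy measures used throughout this section (overall accuracy, kappa, and producer accuracy per class) can all be derived from a confusion matrix. The sketch below assumes that rows hold the reference (ground truth) classes and columns the predicted classes; the orientation of Tables 6 to 8 is not restated here, so this is an assumption. The counts are toy values, not taken from the study.

```python
import numpy as np

def accuracy_metrics(cm):
    """Return (overall accuracy, kappa, producer accuracy per class).

    cm[i, j] is the number of test pixels of reference class i that were
    predicted as class j (assumed orientation).
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    overall_accuracy = np.trace(cm) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    kappa = (overall_accuracy - expected) / (1.0 - expected)
    # Producer accuracy: correctly classified pixels / reference pixels per class.
    producer_accuracy = np.diag(cm) / cm.sum(axis=1)
    return overall_accuracy, kappa, producer_accuracy

# Toy two-class confusion matrix (made-up counts).
toy_cm = [[40, 10],
          [20, 30]]
oa, kappa, pa = accuracy_metrics(toy_cm)
print(oa, kappa, pa)
```

For the toy matrix, 70 of 100 pixels lie on the diagonal (OA = 0.70), chance agreement is 0.50, and kappa is therefore (0.70 - 0.50) / (1 - 0.50) = 0.40.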

Conclusions and Future Work
Reliable crop maps are fundamental to address current and future resource requirements. They support better agricultural management and consequently lead to enhanced food security. In a smallholder farming context, the production of reliable crop maps remains highly relevant because methods and techniques that have been applied successfully to medium and lower spatial resolution images do not necessarily achieve the same success in heterogeneous environments. In this study, we introduced and tested a novel, cloud-based ensemble method to map crops using a wide array of spectral and spatial features extracted from time series of very high spatial resolution images. The experiments demonstrated the potential of ensemble classifiers to map crops grown by West African smallholders. The proposed ensemble obtained a higher overall accuracy (75.9%) than any individual classifier. This represents an improvement of 4.65% in comparison with the average overall accuracy (71.7%) of the best base classifier tested in this study (random forest). The improvements over other tested classifiers, such as linear support vector machines and maximum entropy, are larger, at 21.5% and 25.6% respectively. As theoretically expected, the weighted majority voting approach outperformed majority voting. Maximum performance was reached when the number of classifiers was 75. This indicates that beyond a certain point the addition of more classifiers does not improve the classification results.
From a technical point of view, it is important to note that the generation of spectral and spatial features, as well as the optimal use of ensemble learning, demands high computational capabilities. Today's big data and cloud-based approaches to image processing allow this concern to be overcome and hold promise for practitioners (whether academic or industrial) in developing nations, who have historically faced technical barriers that were hard to overcome. Limited data availability, computer hardware, software, or internet bandwidth have often stood in the way of a more prominent uptake of remote sensing-based solutions. These barriers are slowly eroding, and opportunities are arising as a consequence. In our case, GEE was helpful in providing computational capability for data preparation and allowed the systematic creation and training of up to 100 classifiers and their combinations. Further work to extend this study includes the classification of other smallholder areas in sub-Saharan Africa, and the addition of new image sources, such as Sentinel-1 and -2, as time series.
Table A1. Textural feature formulas from the Gray Level Co-occurrence Matrix (GLCM), as described in [51].

Name / Formula
Angular Second Moment: $f_1 = \sum_i \sum_j p(i,j)^2$

Table A2. Textural features included in the classification, as described in [52].

Figure 1. Study area. (a) Location of the study area in Mali; (b) The study's field plots overlapping a WorldView-2 image of the study area on 18 October 2014 using natural color composite.

Figure 2. Overview of the ensemble classifier system. X represents the features extracted during pre-processing, Y and Y_test represent the ground truth of the training and test data, Ŷ_class is the prediction of a classifier, and Ŷ is the ensemble prediction. K_class is the kappa obtained by a classifier.

Figure 3. Leave-one-out strategy using ten seeds for generating 100 training datasets to train base classifiers (BC).

displays a textural feature (sum average of band 8) for a specific parcel, which points at variation in spatial patterns as the growing season goes by.

Figure 7. Mean and standard deviation of the overall accuracy using majority voting and weighted majority voting.
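A curve of this kind (mean and standard deviation of OA per ensemble size) could be produced along the following lines. This is only a sketch under stated assumptions: ensembles of each size are formed here by drawing random subsets of the base classifiers, which may differ from the procedure used in the study, and the predictions are synthetic (100 simulated classifiers that are each right 60% of the time, with independent errors). Real base classifiers make correlated errors, so the synthetic curve rises much faster than the one in Figure 7. All names are hypothetical.

```python
import numpy as np

def majority_vote(predictions, n_classes):
    """Majority vote over label predictions of shape (n_classifiers, n_samples)."""
    n_samples = predictions.shape[1]
    votes = np.zeros((n_samples, n_classes))
    for pred in predictions:
        votes[np.arange(n_samples), pred] += 1
    return votes.argmax(axis=1)

def accuracy_by_ensemble_size(predictions, y_true, sizes, n_classes,
                              n_repeats=30, seed=0):
    """Mean and std of overall accuracy for random ensembles of each size."""
    rng = np.random.default_rng(seed)
    curve = {}
    for size in sizes:
        accuracies = []
        for _ in range(n_repeats):
            # Draw a random subset of base classifiers without replacement.
            members = rng.choice(len(predictions), size=size, replace=False)
            y_hat = majority_vote(predictions[members], n_classes)
            accuracies.append(np.mean(y_hat == y_true))
        curve[size] = (float(np.mean(accuracies)), float(np.std(accuracies)))
    return curve

# Synthetic stand-in for the outputs of 100 base classifiers (not study data).
rng = np.random.default_rng(42)
n_classifiers, n_samples, n_classes = 100, 500, 5
y_true = rng.integers(0, n_classes, size=n_samples)
# A wrong label is the true label shifted by 1..4 classes (mod 5).
wrong = (y_true + rng.integers(1, n_classes, size=(n_classifiers, n_samples))) % n_classes
is_correct = rng.random((n_classifiers, n_samples)) < 0.6
predictions = np.where(is_correct, y_true, wrong)

curve = accuracy_by_ensemble_size(predictions, y_true, [1, 5, 25, 75, 100], n_classes)
for size, (mean_oa, std_oa) in curve.items():
    print(f"size={size:3d}  mean OA={mean_oa:.3f}  std={std_oa:.3f}")
```

With all 100 classifiers selected there is only one possible ensemble, so the standard deviation at that size is zero; this mirrors the decreasing spread described for growing ensembles.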

Figure 8. Comparison between field classifications produced by the 75-classifier ensemble (E75), the 100-classifier ensemble (E100), and a random forest classifier (RF). PA: the accuracy per class is listed below each crop. Mask corresponds to trees inside fields or clouds. VHR: overlapping area in a WorldView-2 image on 7 July 2014 using natural color composite.

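Sum Variance: $f_7 = \sum_{i=2}^{2N_g} (i - f_8)^2 \, p_{x+y}(i)$
Sum Entropy: $f_8 = -\sum_{i=2}^{2N_g} p_{x+y}(i) \log p_{x+y}(i)$
Difference Variance: $f_{10} = \text{variance of } p_{x-y}$
Difference Entropy: $f_{11} = -\sum_{i=0}^{N_g-1} p_{x-y}(i) \log p_{x-y}(i)$
Information Measures of Correlation 1: $f_{12} = \dfrac{HXY - HXY1}{\max\{HX, HY\}}$
Information Measures of Correlation 2: $f_{13} = \left(1 - e^{-2.0\,(HXY2 - HXY)}\right)^{1/2}$
where $HXY1 = -\sum_i \sum_j p(i,j) \log\{p_x(i)\,p_y(j)\}$, $HXY2 = -\sum_i \sum_j p_x(i)\,p_y(j) \log\{p_x(i)\,p_y(j)\}$, $HXY$ is the entropy of $p(i,j)$, and $HX$ and $HY$ are the entropies of $p_x$ and $p_y$.
Maximal Correlation Coefficient: $f_{14} = (\text{second largest eigenvalue of } Q)^{1/2}$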
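As a rough illustration of how GLCM texture measures of this kind are computed, the sketch below builds a normalized, symmetric co-occurrence matrix for a single pixel offset and derives three measures from it: angular second moment, sum average (the feature mentioned for band 8), and sum entropy. It is not the implementation used in the study. The function names and the tiny image are hypothetical, and gray levels are indexed from 0, so the sum index runs from 0 to 2(N_g - 1) rather than from 2 to 2N_g; this shifts the sum average by a constant of 2 relative to the 1-based convention.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray level co-occurrence matrix for one offset."""
    image = np.asarray(image)
    counts = np.zeros((levels, levels))
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r, c], image[r2, c2]
                counts[i, j] += 1
                counts[j, i] += 1  # count each pair in both directions
    return counts / counts.sum()

def sum_distribution(p):
    """p_{x+y}(k): probability that the two gray levels of a pair sum to k."""
    levels = p.shape[0]
    p_sum = np.zeros(2 * levels - 1)
    for i in range(levels):
        for j in range(levels):
            p_sum[i + j] += p[i, j]
    return p_sum

def angular_second_moment(p):
    return float((p ** 2).sum())

def sum_average(p):
    p_sum = sum_distribution(p)
    return float((np.arange(p_sum.size) * p_sum).sum())

def sum_entropy(p):
    p_sum = sum_distribution(p)
    nonzero = p_sum[p_sum > 0]  # skip empty bins to avoid log(0)
    return float(-(nonzero * np.log(nonzero)).sum())

# Tiny made-up image with three gray levels (0, 1, 2).
image = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [2, 2, 2]])
P = glcm(image, levels=3)
print(angular_second_moment(P), sum_average(P), sum_entropy(P))
```

In practice such measures are computed in a moving window around every pixel and for several offsets, which is part of why feature generation is computationally demanding.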

Table 2. Type and number of features extracted from a single multi-spectral WorldView-2 image, and from the time series of seven images. GLCM: Gray Level Co-occurrence Matrix.

Table 4. Number of pixels per crop class for training base classifiers and assessing accuracy (testing).

Table 5. Summary statistics for the overall accuracy (OA) and kappa coefficient of base classifiers and ensembles. MaxEnt: Maximum Entropy Model; RF: Random Forest; SVML: Support Vector Machine (SVM) with linear kernel; SVMP: SVM with polynomial kernel; SVMR: SVM with Gaussian kernel; Voting: majority voting; WVoting: weighted majority voting. The maximum mean OA and the maximum mean kappa for base classifiers and ensembles are shown in bold.

Table 6. Confusion matrix applying a base RF classifier. PA: Producer accuracy per class.

Table 7 shows the confusion matrix of the selected ensemble, and Table 8 presents the results of applying 100 classifiers.

Table 7. Confusion matrix applying an ensemble of 75 classifiers. PA: Producer accuracy per class.

Table 8. Confusion matrix applying an ensemble of 100 classifiers. PA: Producer accuracy per class.