Multiple Classifier System for Remote Sensing Image Classification: A Review

Over the last two decades, multiple classifier system (MCS) or classifier ensemble has shown great potential to improve the accuracy and reliability of remote sensing image classification. Although there are lots of literatures covering the MCS approaches, there is a lack of a comprehensive literature review which presents an overall architecture of the basic principles and trends behind the design of remote sensing classifier ensemble. Therefore, in order to give a reference point for MCS approaches, this paper attempts to explicitly review the remote sensing implementations of MCS and proposes some modified approaches. The effectiveness of existing and improved algorithms are analyzed and evaluated by multi-source remotely sensed images, including high spatial resolution image (QuickBird), hyperspectral image (OMISII) and multi-spectral image (Landsat ETM+). Experimental results demonstrate that MCS can effectively improve the accuracy and stability of remote sensing image classification, and diversity measures play an active role for the combination of multiple classifiers. Furthermore, this survey provides a roadmap to guide future research, algorithm enhancement and facilitate knowledge accumulation of MCS in remote sensing community.


Introduction
Remote sensing images are widely used for land cover classification, target identification and thematic mapping from local to global scales owing to its technical advantages such as multi-resolution, wide coverage, repeatable observation and multi/hyperspectral-spectral records [1]. With the advancement of remote sensing data acquisition technology, remote sensing images can be acquired by various sensors, for example, hyperspectral imaging spectrometer, high resolution sensors, polarimetric synthetic aperture radar, etc. [2]. Under this situation, image classification techniques are exposed with new challenges to process the multi-source data and serve to multidisciplinary applications [2,3]. One effective solution is to generate a classifier ensemble by combining some individual classifiers, which is named as multiple classifier system (MCS) or classifier ensemble [4][5][6][7][8][9][10]. In the past twenty years, MCS has developed rapidly and been widely used in various fields such as pattern recognition, image processing and target identification. Besides, MCS has become a hot topic in the attractive series international workshops, mainly because of its capability to improve accuracy and efficiency [11][12][13][14].
MCS, which has been utilized by the remote sensing society recently, is viewed as one effective way to improve classification performance of remotely sensed imagery [3,4]. Many researchers investigated the scheme of combining several classifiers to generate a single result. Both theoretical and empirical studies indicate that a good MCS is one where individual classifiers are accurate and at the same time disagree on some different parts of the input space [15]. Two popular approaches for creating accurate ensemble are Bagging and Boosting [16,17]. Bagging uses bootstrap sampling to generate accurate ensemble [16]. Boosting is a general method of producing a very accurate prediction rule by combining rough and moderately inaccurate learner [17].
Some previous studies demonstrated that the simple majority vote of the classifier prediction is an effective strategy. Within this scheme, a pixel is classified as the class that was chosen by the majority of the individual classifiers. Foody et al. used simple majority vote rule to integrate multiple binary classifiers for the mapping of a specific class [7]. Benediktsson and Swain proposed a consensus theoretic classification approach and designed a hybrid consensus theoretic classification scheme [18,19]. The experimental results using hybrid consensus theoretic classification scheme outperform other statistical/neural network and consensus theoretic algorithms. Doan and Foody investigated the combination of soft classification methods for remote sensing image and found that classification combination could improve the accuracy [8]. Fauvel et al fused the classification results derived from a neural network and a fuzzy classifier using fuzzy sets theory, and the proposed method improved the classification accuracy compared to the separate use of different classifiers [20]. Smits focused on the family of MCSs based on dynamic classifier selection (DCS) and proposed a modification to dynamic classifier selection by local accuracy (DCS-LA) [9]. Experiments had shown that DCS-LA is a valid alternative method to majority vote.
In addition, MCS was also used for multi-source and hyperspectral remote sensing images. Briem et al. applied multiple classifiers (minimum Euclidean distance classifier, maximum likelihood classifier, conjugate-gradient back propagation, decision table, J48, 1R) to multisource remote sensing data (multispectral, synthetic aperture radar and topographical data sets) and demonstrated that Bagging, Boosting and consensus theory obtained higher overall user and producer accuracies than traditional classifiers [4]. Debeir et al. integrated spectral features, spatial and contextual information and experimented Bagging and subset selection algorithms with high resolution remotely sensed images, and the accuracy was higher than neural network and decision tree classifier [21]. Kumar et al. designed a hierarchical fusion strategy for classifying hyperspectral remote sensing image, and the classifier fusion method outperformed other traditional approaches, including maximum likelihood classifier (MLC) etc. [11]. Waske et al. fused Support Vector Machines (SVM) for classifying SAR and multispectral imagery from agricultural areas [22]. The presented approaches are superior to all other techniques, such as a standard SVM. Ceamanos et al. designed a classifier ensemble for classifying hyperspectral data [15]. Firstly, hyperspectral image is divided into several parts according to the similarity of spectral bands. Then, each part is used as the input of SVM classifier. Finally, all outputs are applied to be combined together by an additional SVM classifier. The results clearly demonstrated that the proposed SVM fusion scheme outperforms a standard SVM classifier. Owing to its technical superiorities, MCS is also applied to change detection from multi-temporal remote sensing images [23,24].
As mentioned above, the classification results to be combined should be diverse. If the classification results are identical or similar, it would not improve the accuracy by combining them. Therefore, diversity is a vital requirement for the success of ensemble, demonstrated by theoretical and empirical studies [25,26]. Kuncheva and Whitaker summarized the diversity measures in classifier ensembles [27]. A special issue called "Diversity Measure in Multiple Classifier System" published in Information Fusion journal indicates that diversity measure is an important research direction in MCS [28]. Petrakos et al. applied agreement measure in decision fusion level combination [29]. Foody compared the different classification results from three aspects (similarity, non-inferiority, difference) using hypothesis tests and confidence interval algorithms [30]. Brown et al. indicated that the diversity for classification tasks is still an ill-defined concept, and defining an appropriate diversity measures for MCS is still an open question [28]. So far, there are many existing diversity measures, but in the remote sensing literatures, they are rarely used and compared for remote sensing image classification by MCS. In conclusion, a successful MCS system depends to a large extent on the proper selection of diverse classifiers for incorporation [31]. Kuncheva claims that the choice of an appropriate ensemble strategy can improve the performance of MCS [32]. More details about diversity measures and classifier ensemble approaches can be seen in Section 2.
The objective of this paper is to give a review on the uses of MCS in remote sensing and the ideas of improving MCS approaches, including classifier ensemble approaches and diversity measures, and some modified algorithms. In order to demonstrate the effectiveness of MCS, the mentioned MCS approaches are experimented with three typical optical remotely sensed images: high spatial resolution image (QuickBird image), hyperspectral image (OMISII image) and multi-spectral image (Landsat ETM+ image).
The remainder of this paper is organized as follows. In Section 2, we introduce some common approaches of MCS, summarize the advances of MCS to remote sensing image classification, and propose some improved algorithms. Experimental results are presented in Section 3. Section 4 is the conclusion of this paper.

MCS: A Primer
For remote sensing image classification, Giacinto et al. compared the performance of remote sensing classification approaches in different applications and found that no classifier could always perform well [33]. To alleviate this problem, MCS can provide the complementary information of the pattern classifiers and integrate the outputs of these pattern classifiers according to a certain combination approaches [5,[34][35][36]. Therefore, it can outperform any individual classifier. In addition to MCS in the domain of classification, some concepts are similar to MCS, including multiple classifier combination [37,38], classifier ensemble [10,27,39], decision level combination [40,41], ensemble learning [42,43], etc.
We assume M classes exist on a remote sensing image, represented by For hard classification, each pixel is allocated a specific class label J, J ∈ {1, 2, . . . , M }. For soft classification, each pixel is depicted by a vector {P (1), P (2), . . . , P (M )}, where P (i) is the probability or membership of the pixel belonging to the i th class, or the abundance of the i th endmember within the pixel. When multiple classifiers are combined, both hard and soft classification results can be integrated.
In order to introduce the MCS approaches clearly, three categories are adopted to summarize the classifier ensemble approaches: (1) Algorithms based on training sample manipulation. The most popular methods are Bagging and Boosting [16,17]. (2) Concatenation combination. In a concatenation combination, the classification result generated by a classifier is used as the input into the next classifier. The results achieved through each classifier are similarly passed onto the next classifier until a result is obtained through the final classifier in the chain [31,44,45] (Figure 1(a)). (3) Parallel combination. In a parallel combination, multiple classifiers are designed independently without any mutual interaction and their outputs are combined according to certain strategies [44,45] (Figure 1(b)). The main design decision is the selection of a reprehensive ensemble approach. If the approach is well designed, the MCS can obtain the accurate result [31]. Some of the popular and successful ensemble approaches are majority voting, fuzzy integral, D-S evidence theory, etc.
Besides concatenation and parallel combination, hierarchical combination that combines both concatenation and parallel combination is also used [31]. But in this paper, we just focus on the concatenation and parallel combination. As mentioned in Section 1, diversity of the classifier outputs is a vital requirement for the success of the ensemble. Diversity measures are often divided into parts: pairwise and non-pairwise [27]. Pairwise measures include kappa statistics, double fault, disagreement, etc., and non-pairwise measures consist of entropy, weighted count of errors and correct results (WCEC), etc. [27,46]. When pairwise measures are used to calculate the diversity among more than two classification results, it adopts the average of the pairwise measures between each two classification results. The non-pairwise diversity measures consider all the classifiers together and calculate directly one diversity value for the ensemble [12]. In this paper, both pairwise and non-pairwise diversity measures, including kappa statistics, double fault, entropy, WCEC and a novel measure are reviewed.
The following sections describe the classifier ensemble approaches and diversity measures. Section 2.2 introduces Bagging and AdaBoost. Section 2.3 lists out a collection of parallel combination methods. Section 2.4 gives the basic idea of concatenation combination schemes. We proposed three modified concatenation combination approaches in Section 2.5. Section 2.6 presents five diversity measures, three of them are pairwise and the rest are non-pairwise.

Bagging and AdaBoost
Unlike statistical voting theory which is based on the assumption of independent data sources and uses all training samples only one time, Boosting and Bagging are exerted by manipulating training samples [16,17].
Bagging is the abbreviation of bootstrap aggregating. In this algorithm, n samples are selected at random from a training set with k samples, and instructive iteration is exerted to create some different bags, and each bag is classified by vote to predict its class [16]. Boosting can process data with weights, and the weights of misclassified samples are increased to concentrate the learning algorithm on specific samples. Bagging has been shown to reduce the variance of the classification, while Boosting reduces both the variance and the bias of the classification. So in most cases, Boosting can produce more accurate classification results than Bagging. However, the computation time of Boosting is more than Bagging, and Boosting is sensitive to noise [47]. AdaBoost is the popular used approach of Boosting algorithms. The pseudo-code of Bagging and AdaBoost can be seen in Figure 2 [16,17,47]. In addition, there is a great variety of approaches drawn upon the basic idea of Bagging and Boosting. Detailed descriptions can be found from Wagging [48], Random Forest [49], Random Subspace [50], Logistic Boosting [51], MultiBoost [52], Rotation Forest [39], Rotboost [53] , etc.

Parallel Combination
Majority or weighted vote. The most popular ensemble approach is the majority vote by which each individual classifier "votes" for the specific class, and the class that collects the majority votes is predicted by the ensemble [12]. By generalizing this approach, Xu et al. proposed a threshold plurality vote, which imposed a threshold on the number of votes to select the class [44]. The major problem of majority vote is that all the classifiers have the same "authority" regardless of their respective abilities to classify properly [54]. Therefore, the weighting methods are best suited for problems where the individual classifiers perform the same task. The simple one is that the decision of each classifier is weighted according to the estimated accuracy on the training set. Furthermore, Moreno-Seco et al. proposed three variant of weighted voting approaches: re-scaled weighted vote (RSMV), best-worst weighted vote (BWWV) and quadratic best-worst weighted vote (QBWWV) [54]. Among all voting algorithm tested in [54], each method has shown the different performance in different datasets. In this paper, we adopted the specific class accuracy generated by the individual classifier as the weights.
Bayesian average method. Bayesian combiners are used to linearly combine the probabilistic predictions of multiple classifiers by weighting their posterior probabilities [55]. It can be defined as follows: Then, the class label is decided: where P k (X ∈ C i |X) represents the probability of class i of pixel X in k th classifier, M is the total number of classes, N is the total number of classifiers. L(X) represents the class label of pixel X.
Other simple operators such as Maximum, Minimum and Product have been applied to combine the posterior probabilities [56]. On the basis of Bayesian framework, Kittler et al. provided a theoretical underpinning of many existing combination schemes based on the product and sum rule, showing that the sum rule is less sensitive to the errors of subsets of base classifiers [57].
Fuzzy Integral. Fuzzy integral is an effective information fusion algorithm based on the fuzzy measure [58,59]. Sugeno integral is a widely used fuzzy integral algorithm, in which fuzzy measure g is defined as a function in measurable space: g : Θ → [0, 1], meeting the conditions as follows: Fuzzy measure g λ was introduced in Sugeno integral.
Suppose that (S, R, g) is the fuzzy measurement space, and f : S → [0, 1] is a measurable function on R. When S = s 1 , s 2 , . . . , s N is a finite set and meets the condition of 1 ≥ f (s 1 ) ≥ f (s 2 ) . . . ≥ f (s N ) ≥ 0, the fuzzy measurement can be calculated as the follows: where A i = {s 1 , s 2 , . . . , s i } is the classifier ensemble, and N is the amounts of classifiers. Sugeno integral is calculated in the following: When the member classifiers are more than two, the values of λ corresponding to each class is calculated as the following equation: In multiple classifier combination, f j (s i ) is defined as the probability of the i th classifier to the j th class in soft classification, or the accuracy of the i th classifier to the j th class in hard classification. We utilized the latter one, hard classification in this paper.
The final classification result is the class label corresponding to the maximum fuzzy integral value: Moreover, Ben Adballah et al. proposed a local fuzzy integral approach, called context extraction for local fusion with fuzzy integral (CELF-FI) for fusing different classifiers outputs [60]. This method based on a novel objective function that combines context identification and multi-algorithm fusion criteria into a joint objective function. This objective function consists of two terms: the first is designed to produce compact clusters, called contexts, and the second to produce Sugeno measures for fuzzy integral fusion for each context [60].
Consensus Theory. Consensus theory is suitable for the combination of classifiers with probability to specific classes as their outputs [18,19]. Usually logarithmic and linear consensus models are used in the following two ways: Each classifier is viewed as an expert with the membership or probability of the pixel belonging to every class as its output. T j (X) is the probability the pixel X being assigned to the jth class, and p i (C j |X) is the probability of pixel X being assigned to the j th class by the i th classifier, and λ ij is the weight. The simplest approach of the weighting scheme consists in giving all the individual outputs equal weights. Furthermore, the weights can be defined to not only weight the individual sources but also the individual classifiers [5]. Thus, we select the specific class accuracy as the weights. It means that λ ij represents classification accuracy of the i th classifier to the j th class in this paper.
The final classification result is the class label corresponding to the maximum value of T j (X), j = 1, . . . , M : D-S Evidence Theory and its improvement. D-S evidence theory, which assigns probability to the sets, is able to handle the uncertainty caused by unknown factors [12]. It uses discrimination framework, confidence function, likelihood function and probability allocation function to represent and process knowledge. Suppose that Θ = C 1 , C 2 , . . . , C i . . . C M is the discrimination framework and M is the number of classes, therefore basic probability allocation function m is a function from 2 Θ to [0, 1] and it meets the requirement of: If there are two or more different evidences, orthogonal sum can be used to combine those evidences. Assume that Z 1 , Z 2 , . . . , Z n are the probability allocation functions corresponding to evidence F 1 , F 2 , . . . , F n , and their orthogonal sum Z = Z 1 ⊕ Z 2 . . . ⊕ Z n is: When various evidences are inconsistent or contradictory from each other, the result may be unreasonable. In order to alleviate this problem, an improved evidence theory algorithm was proposed and experimented in [61], and it proved that the modified method was superior to traditional method while the evidences with high contradiction and inconsistency. For remote sensing image, different classifiers may generate different classified labels, resulting in the generation of evidence with high contradiction, so the improved evidence theory ensemble approach may improve the performance. The detailed equations are detailed as follows [61]: where:k where is the confidence of evidence,k is the average of contradiction level between two evidences, and K is the total contradiction level of all evidences. This evidence combination method can reduce the limitations caused by evidence inconsistency. For multiple classifier combination of remote sensing, the result of each classifier can be viewed as a piece of evidence. Probability allocation function can be represented by the classification accuracy of specific class. For example, if a pixel belongs to the i th class in a classifier, the basic probability is m(C i ) = P i , m(Θ) = 1 − P i , where P i is the accuracy of the i th class by the specific classifier. After the completion of evidence combination, the class with maximum evidence is selected as the final result. Dynamic Classifier Selection [9,62]. Smits applied dynamic classifier selection and improved models to multiple classifier combination for remote sensing images [9]. In this method, MCS are addressed that use estimates of each individual classifier's local accuracy in small regions of feature space surrounding an unknown test sample [9]. The local regions are defined by the k−nearest neighbors (KN N ). The drawback of this approach is the section of k, which is rather heuristic and subjective an exercise, and the classification result is sensitive to the choice of k [9,63,64].
The concept of local accuracy assumes that classes maintain a certain continuity in the feature space and neighboring pixels are expected to maintain a stronger relationship than pixels that are further away [9]. Dudani suggested that using distance-weighted k−NN can improve performance, especially with small training sample [65]. Suppose that d k , d t , d m are the distances of the k th , t th and m th samples to the pixel in the decreasing sort, the weight of the t th sample is: The local classification accuracy of the j th classifier is: where θ(X t ) is the class label of the t th neighboring sample, C j (X t ) is the assigned class label of the j th classifier to the t th sample. Finally, the label of the classifier with the biggest local accuracy is selected as the final classification result. Differences between these methods: In addition, according to the individual output, classifier ensemble approach in a parallel combination can be divided into three levels: abstract level, rank level and measurement level [66]. The abstract level combination methods are applied when each classifiers outputs a unique label [66]. Rank level makes use of a ranked list of classes where ranking is based on decreasing likelihood. In the measurement level, probability values of the classes provided by the each classifier are used in the combination. Therefore, majority/weighted vote, fuzzy integral, evidence theory and dynamic classifier selection belong to the abstract level combination methods, while others are measurement level methods. Table 1 summarizes the classifier ensemble approaches in parallel combination. Weighted vote, fuzzy integral, Dempster-Shafer evidence theory and consensus theory require validation set to calculate the weights while other methods do not. Dynamic classifier selection calculates the distance between the samples so it requires the original image. And the computation time of dynamic classifier selection is more expensive than other approaches.

Concatenation Combination
In the original concatenation combination approach, the classification result generated by a classifier is used as an input into the next classifier [31]. The results obtained through each classifier are similarly passed onto the next classifier until a result is obtained through the final classifier in the chain, which can be clearly shown in Figure 1(a). In this structure, decisions could be first made by a subset of the committee classifiers and their conclusions are then passed to another subset of classifiers. The construction of this algorithm is usually required to improve the accuracy of the entire system and to incorporate error checks [31]. Two potential ways to improve concatenation combination are enhancing the input diversity of the final combiner, and selecting an advanced combination algorithm.

Modified Concatenation Combination Approaches
On the basis of the previous improvement ideas, we propose three modified concatenation combination approaches of MCS in this section. The overall architecture is given in Figure 3. All the proposed schemes contain two training stages. Among the three methods, SVM classifier is selected as the second classifier because of its satisfactory capability to deal with the classification problem [67,68].

Diversity Measures
Kappa statistics, double fault, entropy and WCEC are used in this paper, and a novel measure is proposed for remote sensing based on the improvements in general MCS.
Kappa statistic. This measure is calculated by the following equation [29]: where N 11 represents the samples correctly classified by both classifiers, N 10 is the samples correctly classified by the first classifier and incorrectly by the second classifier, N 01 represents the samples incorrectly classified by the first classifier and correctly by the second classifier, N 00 is the samples incorrectly classified by both classifier. p 1 represents the samples correctly classified by the first classifier (it equals N 11 plus N 10 ), p 2 represents the samples correctly classified by the second classifier (it equals N 11 plus N 01 ), q 1 represents the samples incorrectly classified by the first classifier (it equals N 01 plus N 00 ), q 2 represents the samples incorrectly classified by the second classifier (it equals N 10 plus N 00 ).
It not only gives a measure of the degree of agreement but also has a test associated with itself to check if the apparent agreement can be attributed to chance only. In addition, we also use a new kappa statistic k and the final value is the average of the class' kappa statistic.
Double fault: Gacinto and Roli proposed the double-fault measure to select classifiers that are least related from a pool of classifiers. This diversity decreases when the value of the double-fault measure increases [55]. This index is calculated by: where N represents the number of sample. Entropy: Entropy is calculated in the following [27]: where L is the number of classifiers, N is the number of sample, rounds a number to the nearest integer. l(z m ) is the number of samples correctly classified by the classifiers z.
This measure is based on the assumption: if the results of L/2 is the same, the diversity has the biggest value. The range of this measure is 0 to 1, in which 1 represents the biggest diversity value and 0 represents the lowest diversity value.
Weighted count of errors and correct (WCEC). W CEC is also proposed by [46]. Compared to kappa statistics, WCEC not only gives a measure of the correctness of classifiers but also consider the diversity between two classifiers.
where N 00 dif f represents the different errors occurring in both classifiers, N 00 same represents the same errors occurring in both classifiers.
Disagreement-accuracy measure: Combining similar classifiers does not make too much sense as the results will not be improved [31]. If the sample is classified correctly or incorrectly in the classifiers, the final result is not improved. In order to consider both the classification accuracy and disagreement among the classification results comprehensively, we propose a new diversity measure, namely, disagreement and accuracy measure (DA). Firstly, we have made the classification results in descending order according to the overall accuracy. Then DA was calculated by the contribution of the classifiers to each sample. It can be defined as the following: where N is total number of samples, L is the number of classifiers, P i (z j ) is the specific class accuracy of correctly classified sample i by the classifier z j . This measure highlighted the contribution of classifier with high overall accuracy to the sample. That means if a sample is classified correctly by the classifier which has the highest overall accuracy, the contribution of the subsequent classifier to this sample should be decreased.

Experiments and Discussions
In order to evaluate the effectiveness and performance of MCS to remote sensing image classification, the aforementioned existing and modified methods are experimented in the three types of remote sensing images: high spatial resolution QuickBird image, airborne hyperspectral image, and medium-resolution multispectral Landsat ETM+ image.
The classifiers used in an ensemble should generally be accurate but different. Based on the survey to existing pixel-based classification algorithms by [3] and many suggestions on the selection of classifiers by [7], In this paper, we mainly focus on the final results of MCS, so we adopt the default parameters in the above classifiers. The iterations of AdaBoost and Bagging are 10. The different classifier ensemble algorithms are experimented in the three typical images, but the diversity measures are only experimented on QuickBird image.

Experiment 1: High Spatial Resolution Image Classification
QuickBird multispectral image (four bands of multi-spectral, spatial resolution: 2.44 m) is used as the example of high spatial resolution remote sensing image. The image size is 500 × 500. The study area is located in the northern suburb of Xuzhou city along the urban and rural connecting areas. Through the image analysis and field works, the image is classified into five land cover types: water, built-up area, green areas, vegetation and barren soil.  In addition to original multispectral image, textural features extracted from the Band 4 which has the richest information are also added for classification. Mean and variance extracted by gray level co-occurrence matrix (GLCM) are used to describe the textural features. Two schemes are used for classification. The first scheme is only using the spectral features of four multispectral bands as the inputs, and the second scheme is using both four spectral features and two textural features as the inputs. Seven classifiers are used, including: MMDC, MLC, SVM, MLPNN, RBFNN, J48 DT, and SFH-ARTMAP. The accuracies of all classifiers using only spectral features and spectral together with textural features are listed in Tables 2 and 3. The classification result can be seen in Figures 5 and 6.    Comparing the test areas, the accuracy assessment demonstrates that all the classifiers perform very accurately and achieve overall accuracies of 85.92% (MMDC), 92.78% (MLC), 93.41% (SVM), 93.49% (MLP), 92.83% (RBFNN), 92.42% (J48 DTC) and 93.15% (SFH-ARTMAP) using spectral and texture features (seen Table 3). As expected, the kappa values of the classification results using spectral and texture features are slightly higher than the ones only using spectral features. The specific accuracies of built-up areas of MMDC, MLC, MLPNN and SFH-ARTMAP are significantly lower (less than 79%). Using spectral and texture features together, most obvious is that the accuracies of built-up areas are increased by up to 7.8%, 8.14%, 8.72%, 9.44% of MMDC, MLC, MLPNN and SFH-ARTMAP, respectively. Furthermore, some other findings can be summarized as follows: (1) When textural features are introduced as the input of the classifiers, both the overall and class accuracies are enhanced in the most cases, especially for built-up areas, demonstrating the superiority of incorporating textural features in the classification process. (2) The classifiers have shown the different performance on the specific classes, indicating that the classifier performing well for one class may be poor for other classes. For instance, when only using spectral feature, MMDC performs weaker on the land cover type of "built-up areas" (48.48%) but SVM achieve the accuracy of 97.51% ( Table 2). The situation for spectral and texture features is similar to the one for spectral feature. The highlights in Tables 2 and 3 are the highest accuracy of class among the classifiers. From the above analysis, it is necessary to combine multiple classifiers to find a better result than any individual classifiers. In the following experiments on QuickBird image, the classification results derived from both spectral and texture features are adopted as the individual classification results.
For Bagging and AdaBoost, J48 DTC, RBFNN and MLPNN are used as the base classifiers, and the accuracies are summarized in Table 4. Both Bagging and AdaBoost obtain very accurately results and achieve very high overall accuracies (94.93% and 95.67% for J48 DTC, 95.09% and 95.43% for MLPNN, 93.36% and 95.35% for RBFNN). On the other hand, the accuracies of Bagging and AdaBoost increased by 0.89% to 1.96% and 1.94% to 2.25% for the three base classifiers. The classification maps can be seen in Figures 7-9. The theories of Bagging illustrate that it shows excellent performance when the base classifier is an unstable classifier such as decision tree, neural networks et al. [16]. In most cases, the performance of AdaBoost is superior to Bagging. Our experimental results are compatible with the theoretical analysis. In addition, we also studied the influence of bootstrapped sample size to classification accuracy. Here, we choose four sample sizes: 25%, 50%, 75% and 100%. The general trend is that the overall accuracies increase slightly (less than 1%) when the sampling rate rise, although the accuracy of RBFNN is a bit unsteady under different sampling rate. Perhaps the reason is that the iterations of RBFNN are not enough (only 10 in the experiment). The results indicate that Bagging and AdaBoost outperform the base classifier in terms of overall accuracy.     Figure 9 presents the classification results of MCS on parallel combination, and their accuracy statistics are shown in Table 5. Among the individual classifiers, SVM classifier achieves the highest accuracy of 93.49%. Compared to SVM classifier, Bayesian average achieves the accuracy of 94.99%, with the improvement of 1.5%; Majority vote and fuzzy integral improve the accuracy from 93.49% to 94.74% and 93.49% to 94.37%, respectively. Other parallel combination approaches improve the accuracy slightly with the improvement of 0.28% to 1%. The "local" ensemble method (DWDCS) results in 93.98% accuracy with the improvement of 0.49%. Compared with the overall accuracies, the kappa coefficients show similar characteristics. Bayesian average obtains the highest accuracy for all parallel combination schemes.  Unlike other combination algorithms, DWDCS pays attention to the local optimum while other combination algorithms achieve the global optimum. The computation time of DWDCS is proportional to the test sample size and the nearest neighbors. In our experiments, the test sample size is more than 500 and the nearest neighbor is selected to be 15. The computation time of DWDCS is more than 10 min and other parallel combination strategies are less than 20 s. Figure 10 presents the classification result of MCS based on concatenation combination and Table 6 shows the accuracy statistics. The accuracy assessment indicated that all the proposed concatenation combination algorithms perform very accurately and obtain overall accuracies of 95.21% (method 1), 97.06% (method 2) and 95.05% (method 3). Among all the classifier ensemble approaches in MCS, the proposed method 2 obtains the highest accuracy (97.06%) due to the following reasons: (1) This proposed method adopts the probabilities as the input of the second level and there is obvious diversity among these probabilities; (2) The output distributions of different classifier are different, for instance, the probability distribution of SVM and MLP are focused on 0 or 1, while the probability distribution of MEDC is ranged from 0.45 to 0.55. The linear combination rules such as linear consensus are not suited, and SVM can tackle non-linear problems to a good extent. Therefore, it can achieve the highest accuracy compared with others. Figure 10. Classification results based on proposed concatenation combination algorithms. In the experiment of diversity measures, we choose five member classifiers from the classifier pool to be combined based on diversity measure. The classifiers in the pool are MLPNN (1), SVM (2), SFH-ARTMAP (3), RBFNN (4), MLC (5), J48DT (6), MMDC (7). Thus, there are twenty-one selection schemes of five classifiers. We have calculated the diversity of the all combination. The optimum combination corresponding to the diversity measure can be seen in Table 7. Then, we adopted the different parallel combination approaches (majority voting, weighted voting, fuzzy integral, DS evidence theory) to evaluate the performance of the combination of classifiers selected by diversity measures.  Table 7 Comparison of these results with the accuracy assessment indicates that the combination selected by DA outperform the ones selected by other diversity measures in terms of overall accuracy. It demonstrates that the proposed diversity measure is a positive guidance for classifier combination. Furthermore, we ranked all the combination according to the DA measure in Table 8 and the highlights are the first three ones, whose accuracy produced by the parallel combination approaches can be seen in Table 7. Original airborne OMISII hyperspectral remote sensing image has 64 bands with spectral range: 0.45-1.09 µm. The spectral resolution is 10 nm. In this paper, Zhongguangcun, a high-tech zone of Beijing City is chosen as the study area. The five noisy bands are removed so it is remaining 59 bands for classification. The image size is 400 × 400. Figure 11(a) is the false color composite of the image by using Band 27, 25 and 2 as R, G and B components. Training samples and test samples are selected independently from the image. The land cover is classified into five classes: water, building, vegetation, forest and bare soil. From Table 9, SVM classifier has the highest classification accuracy of forest (92.69%) and bare soil (92.97%); MLPNN classifier produces the highest classification accuracy of vegetation (98.10%) and building (98.34%); RBFNN generates the highest classification accuracy of water (95.72%). These results indicate the diversity among classifiers, which is similar to the experiment on QuickBird image. MLPNN achieves the highest overall accuracy of 92.53%. Figure 11(b-e) shows the classification results of SVM, J48 DTC, RBFNN and MLPNN respectively. Based on diversity measure among all classifier combination schemes, MMDC, MLPNN and SVM classifier are selected to generate the final classification.
From Table 10, both AdaBoost and Bagging can enhance the classification accuracy of the base classifier. Bagging can improve the accuracy from 87.86% to 90.44% for J48 DTC, 89.46% to 89.88% for RBFNN and 92.53% to 93.09% for MLPNN. In contrast, AdaBoost performs a bit better (91% for J48 DTC, 90.65% for RBFNN and 93.27 for MLPNN) than bagging when the same base classifier is used. Figure 12 are the classification results based on AdaBoost and Bagging using RBFNN, MLPNN, J48 DTC as the base classifiers. In classification ensemble based on parallel/concatenation combination, both improved D-S evidence theory and concatenation combination (proposed method 1) can obtain very accurately classification result with the accuracies of 92.65% and 93.58% than the best individual classifier (92.53%). This demonstrates that these methods are effectively to improve the accuracy. Figure 13(a) is the result of improved D-S evidence theory using MMDC, MLPNN and SVM as member classifiers. Figure 13(b) is the result of concatenation combination (proposed method 1).

Figure 11.
False color composite of OMIS image and classification results of individual classifier.

Experiment 3: Medium Resolution Multi-Spectral Landsat ETM+ Image
A Landsat ETM+ image acquired on September 17th 2004 is used to validate the above methods. The used bands of Landsat ETM+ images are as follows: band 1: Blue band; band 2: Green band; band 3: Red band; band 4: NIR band; band 5: MIR band, band 7: LIR band. The image size is 500 × 500. The study area is the central part of Xuzhou City, China. Through the image analysis and field works, the land cover is classified into five classes: water, farmland, woodland, built-up land, and public green space. We use five member classifiers: RBFNN, J48DT, MMDC, MLC and MEDC. The adopted classifier combination methods are majority voting, weighted voting, D-S evidence theory, and fuzzy integral. Table 11 summarizes the accuracies of those classification results. MLC achieves the highest accuracy (91.19%). All the four classifier ensemble methods obtain higher overall accuracies than MLC classifier. The accuracy of majority vote increases from 91.19% to 93.55%. Weighted vote achieves the accuracy of 93.55%, with the improvement 2.36%. The accuracy is increased by up to 2.23% and 1.8% by DS evidence theory and fuzzy integral approaches. This indicates that these methods effectively improve the accuracy. Figure 14 presents the classification results of different classifiers and classifier combination strategies.

Discussion
Based on the above analysis, we identified that the incorporation of diversity among the classifiers is of great importance in order to obtain better classification results via a MCS [31]. Combining the similar classification results would not further improve the overall accuracy. Bagging uses bootstrap sampling to generate diversity [16]. In order to enhance the diversity, concatenation combination constructs a proper dataflow so that diverse decisions could be first made by a subset of the committee classifiers and their conclusions are then set to another subset of classifiers. It is worthy to point out that the most important issue to combine the classifiers in parallel combination method is to examine the difference among them using diversity measures. In addition, the selection of an appropriate ensemble strategy is also an important issue of MCS. This paper presents a comprehensive list of the most common parallel combination approaches and also analyzes the situation in which each of them are most useful. We also experimentally prove the effectiveness of the three modified concatenation combination algorithms within certain images. Furthermore, we studied the diversity measures to select the optimum combination. Compared to other diversity measures, the proposed diversity measure considers both the difference and accuracy among the classifiers. Thus, it can select the optimum combination whose accuracy is higher than the selected ones by other measures.

Conclusions
This review discusses the intensive contributions of MCS-based work in remote sensing. We present a comprehensive review of the most popular approaches and analyze the situations in which each of them is most useful. Those algorithms are experimented with three typical remotely sensed images, high resolution QuickBird image, airborne hyperspectral OMISII image, and medium Landsat ETM+ image. Some modified algorithms are proposed based on the ideas of improving MCS performance, and we experimentally proved the usefulness of these proposed methods.
Most of the findings from this paper show that both concatenation and parallel combination can enhance classification accuracy, but their performances are affected by different factors such as selected member classifiers, classifier combination criterion, etc. Furthermore, according to our experimental results, diversity measures can play active guidance for the selection of multiple classifiers combination.