Comparison of Classification Algorithms and Training Sample Sizes in Urban Land Classification with Landsat Thematic Mapper Imagery

Although a large number of new image classification algorithms have been developed, they are rarely tested on the same classification task. In this research, using the same Landsat Thematic Mapper (TM) data set and the same classification scheme over Guangzhou City, China, we tested two unsupervised and 13 supervised classification algorithms, including a number of machine learning algorithms that became popular in remote sensing during the past 20 years. Our analysis focused primarily on the spectral information provided by the TM data. We assessed all algorithms in a per-pixel classification decision experiment and all supervised algorithms in a segment-based experiment. We found that when sufficiently representative training samples were used, most algorithms performed reasonably well. Lack of training samples led to greater classification accuracy discrepancies than the choice of classification algorithm itself. Some algorithms were more tolerant of insufficient (less representative) training samples than others. Many algorithms improved the overall accuracy marginally with per-segment decision making.
Remote Sens. 2014, 6. Open Access.


Introduction
Since the launch of the first land observation satellite, the Earth Resources Technology Satellite (later renamed Landsat-1), in 1972, substantial improvements have been made in sensor technologies. The spatial resolution has increased over 100 times, from the 80 m of Landsat-1 to the 0.41 m of the GeoEye-1 (Orbview-5) satellite. The spectral sampling frequency has increased nearly 100 times, from a few spectral bands to a few hundred spectral bands. Classification of land cover and land use types has been one of the most widely adopted applications of satellite data. Although a large number of algorithms have been developed and applied to map land cover from satellite imagery, and the proposers of new algorithms have reported accuracy improvements in their mapping experiments (see [1] for a substantial review of classification algorithms), it is difficult to find a systematic comparison of the performance of newly proposed algorithms. This is particularly true for machine learning algorithms, as many of them were introduced to the field of remote sensing less than 10 years ago. Instead, classifier performance comparison has been limited to comparing a new algorithm with a conventional classifier such as the maximum likelihood classifier [2][3][4], or to comparing a small number of two to three new algorithms [5]. Through a meta-analysis of a large body of published literature on land cover and land use mapping, Wilkinson [6] found that accuracy improvements in land cover and land use mapping achieved by new algorithms are hardly observable. However, this kind of analysis compares classification accuracies reported in different publications for applications over different study areas and/or with different types of satellite data.
As the number of machine learning algorithms increases, it is beneficial for the user community of these algorithms to gain a better knowledge of the performance of each algorithm. In the field of remote sensing image classification, a more comprehensive comparison of major machine learning algorithms is needed. This must be done with the same land cover and land use classification scheme and the same satellite image. It is generally believed that final image classification results depend on a number of factors: classification scheme, image data available, training sample selection, pre-processing of the data including feature selection and extraction, classification algorithm, post-processing techniques, test sample collection, and validation methods [7]. The purpose of this research is to compare the performance of 15 classification algorithms when applied to the same Landsat Thematic Mapper (TM) image acquired over Guangzhou City, China, while keeping the other factors the same. The urban area of Guangzhou was selected for this purpose as it includes relatively complex land cover and land use patterns that are suitable for classification algorithm comparison. In addition to applying the algorithms on a pixel-by-pixel basis, we also tested the algorithms on a per-segment basis to assess the effect of including object-based image analysis as a preprocessing step in the image classification process.

Study Site and Data
Our study site is located in the north of the Pearl River Delta (23°2'-23°25'N, 113°8'-113°35'E). As the capital city of Guangdong, Guangzhou is one of the fastest growing cities in China. The study site contains the core part of Guangzhou Municipality and its rural-urban fringe (Figure 1). It can be divided into three regions: forest in the northeast, farmland in the northwest, and settlement in the south. As Guangzhou has been among the first group of cities to undergo rapid development for over 20 years, it has been studied extensively for land use and land cover mapping and change detection (e.g., [8][9][10]). The Landsat Thematic Mapper (TM) image used here was acquired on 2 January 2009, in the dry season of this subtropical area. For classification of a single-date image, there is no need to perform atmospheric correction if the sky is clear, which is the case in this study [7,11]. Geometric correction was applied to the raw imagery by co-registering this image with a previously georeferenced TM image acquired in 2005. A total of 153 ground control points were selected from the image. A second-order polynomial resulted in a root mean squared error of 0.44 pixels. The original image was radiometrically resampled with a cubic convolution algorithm (for classification purposes, nearest neighbor or bilinear resampling would work as well). Because the thermal band has a coarser resolution, we excluded it and experimented with a 6-band set of the TM data. In order to estimate the potential of satellite data of similar resolution but with only visible and near-infrared bands (e.g., the 32 m resolution multispectral camera on board the Disaster Monitoring Constellation satellites, or the 30 m multispectral sensor on board China's Huanjing-1A satellite), we also experimented with a 4-band set of the TM data by further excluding the two middle infrared bands.
At the time of image acquisition, some fruit trees (such as litchi) and several vegetables were in their blooming stage, and some fruit trees (such as citrus) were in their fruit-bearing stage. The elevation is high in the northeast mountains and low in the southwest farmlands. Newly developed industrial areas are in the southeast.

Classification System
The land cover and land use classification system was developed to reflect the major land types in this area with reference to Gong and Howarth [2,12] and Gong et al. [13]. The meanings of the classes are self-explanatory (Table 1). On this basis, there were 14 subclasses in total for training, which were divided according to their spectral characteristics [2]. For example, the Industrial/commercial class was subdivided into 4 types because of the spectral differences among roofing materials.

Training Samples
Training samples were collected primarily on a per-pixel basis to reduce redundancy and spatial autocorrelation [7]. They were selected through image interpretation supported by intensive field visits over this area. Although more training samples are usually beneficial, as they tend to be more representative of the class population, a small number of training samples is obviously attractive for logistical reasons [14]. It is often recommended that the training sample size for each class should not be fewer than 10-30 times the number of bands [15][16][17]. This is usually adequate for classifiers that require few parameters to be estimated, like the maximum likelihood classifier, when applied to a handful of bands. For many classification algorithms, no previous study has reported an optimal number of training samples. To test the sensitivity of an algorithm to the size of the training set, we selected training samples uniformly from the image so that each subclass had 240 samples for later experiments. We then sampled the training data to construct 12 sets of training samples with 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, and 240 pixels per subclass. For the object-based method, we selected the segments containing the training pixels (240 pixels per subclass) as training objects.
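The construction of the 12 training sets can be illustrated with a short sketch. It is only a plausible reading of the procedure: the function names, the pool layout, and the choice of independent (non-nested) random draws per size are assumptions, not details given in the text.

```python
import random

# Training-set sizes per subclass used in the experiments: 20, 40, ..., 240.
SIZES = list(range(20, 241, 20))


def rule_of_thumb_range(n_bands):
    """Return the (low, high) per-class sample sizes implied by 10-30 x bands."""
    return 10 * n_bands, 30 * n_bands


def build_training_sets(pool_by_subclass, sizes=SIZES, seed=0):
    """Draw one training set per size, sampling each subclass without replacement.

    pool_by_subclass maps a subclass name to its full list of labelled pixels
    (240 per subclass in the study). Independent draws per size are an
    assumption; the text does not say whether the sets are nested.
    """
    rng = random.Random(seed)
    training_sets = {}
    for size in sizes:
        training_sets[size] = {}
        for name, pool in pool_by_subclass.items():
            if size > len(pool):
                raise ValueError(f"subclass {name!r} has only {len(pool)} samples")
            training_sets[size][name] = rng.sample(pool, size)
    return training_sets


# Hypothetical pool: 14 subclasses, each with 240 pixel identifiers.
pool = {f"subclass_{i}": list(range(240)) for i in range(14)}
sets = build_training_sets(pool)
```

Under the 10-30 times rule of thumb, `rule_of_thumb_range(6)` gives 60 to 180 samples per class for the 6-band set and `rule_of_thumb_range(4)` gives 40 to 120 for the 4-band set, so the tested sizes span both below and above the recommended range.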

Test Samples
We separately collected 500 pixels as test data, 138 of which were from field visits (conducted in April and December 2009, and June 2010); the remainder were selected according to prior knowledge. The test sample for each land class contained more than 40 pixels (Figure 2). We used the Kappa coefficient as the evaluation criterion [18].
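For readers unfamiliar with the Kappa coefficient, the following minimal sketch computes Cohen's Kappa from a confusion matrix. The function name and the small two-class matrix are illustrative assumptions and are not data from this study.

```python
def kappa_coefficient(confusion):
    """Cohen's Kappa for a square confusion matrix given as a list of rows.

    Rows are reference classes and columns are predicted classes (the
    statistic is symmetric, so the orientation does not change the result).
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class example: 85 of 100 test pixels are correct.
example = [[45, 5],
           [10, 40]]
kappa = kappa_coefficient(example)
```

In this toy example the observed agreement is 0.85 and the chance agreement is 0.50, so Kappa is 0.70, which is lower than the raw proportion correct because agreement expected by chance is discounted.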

Classification Process
We tested 15 algorithms [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33], all from easily accessible sources [34][35][36][37]. These algorithms were selected because they are openly accessible and easy to use. As the number of algorithms is large, and they are clearly documented elsewhere, references to the codes and documentation of the algorithms are provided in Table 2. Most algorithms require certain parameterizations. While choosing an optimal parameter set is desirable, doing so is extremely difficult even for the original algorithm developer, as application conditions vary so widely from one environment to another and from one data type to another. However, it is generally safe to adopt the range recommended by the algorithm developers. In practice this is usually what has been done. Therefore, we designed experiments to cover a majority of parameter combinations for each algorithm (Table 2) while adopting the parameter ranges recommended in the original references. These algorithms were tested using both pixel-based and segment-based methods. Of the two unsupervised classifiers, the Iterative Self-Organizing Data Analysis Technique (ISODATA) is a popular advanced clustering algorithm [19], while Clustering based on Eigenspace Transformation (CBEST) is an efficient k-means algorithm [20].
The clusters obtained were grouped into informational classes by the same analyst who selected the training and test samples. It is assumed that the analyst is most familiar with the study area given sufficient field visits and consultation with local experts.
For segment extraction, we used BerkeleyImageSeg (http://berkenviro.com/berkeleyimgseg/) to perform image segmentation and then classified the segments with each algorithm. For the segmentation, the threshold is the most important parameter, as it determines the size of the objects [38]. Here, four threshold values {5, 10, 15, 20} were examined. The shape parameter and compactness parameter were set to 0.2 and 0.7, respectively. The statistical spectral properties of the segments were then used in the segment-based classification. The features [39] are listed in Table 3. There are a total of 24 features: the maximum, mean, minimum, and standard deviation of the pixel values in each segment for each of the 6 spectral bands. The parameters used for this method were selected according to the empirical values from the pixel-based classification, taking the number of features used into consideration.
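The per-segment spectral statistics can be sketched as follows. This is an illustrative implementation only: the function name and input layout are assumptions, the segment labels are taken as given (the segmentation itself is not reproduced), and the use of the population standard deviation is an assumption because the text does not specify which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def segment_features(pixels, segment_ids):
    """Compute max, mean, min, and standard deviation per band for each segment.

    pixels      -- list of per-pixel band tuples, e.g. 6 values for the 6-band set
    segment_ids -- list of segment labels, one per pixel (hypothetical input)
    Returns {segment_id: [max per band..., mean per band..., min..., std...]}.
    """
    members = defaultdict(list)
    for values, seg in zip(pixels, segment_ids):
        members[seg].append(values)

    features = {}
    for seg, rows in members.items():
        bands = list(zip(*rows))  # regroup the values band by band
        features[seg] = (
            [max(b) for b in bands]
            + [mean(b) for b in bands]
            + [min(b) for b in bands]
            + [pstdev(b) for b in bands]  # population std: an assumption
        )
    return features


# Hypothetical input: three 6-band pixels falling in two segments.
toy_pixels = [(1, 2, 3, 4, 5, 6), (3, 4, 5, 6, 7, 8), (10, 10, 10, 10, 10, 10)]
toy_segments = [0, 0, 1]
toy_features = segment_features(toy_pixels, toy_segments)
```

With 6 bands, each segment receives 4 x 6 = 24 features, which matches the feature count stated above.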

Active Learning
Active learning is an approach for selecting effective training samples. Algorithms of this kind add unlabeled samples from the sample pool to the training set through human-machine interaction [40]. In this research, we used a margin-sampling algorithm [41,42], which takes advantage of SVM. It selects candidate samples lying within the margin of the model, as these samples are the most conducive to improving the classifier's performance. At the beginning, we randomly selected 20 samples for each class, and then added 10 samples at a time from the training set using margin sampling. The best SVM parameters were selected using a simple grid search.
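The selection step of margin sampling can be sketched in a few lines. The sketch assumes a one-against-rest SVM whose signed distances to each class hyperplane are already available for every candidate; the function name, the input dictionary, and the toy values are hypothetical, and the training and grid-search steps are not shown.

```python
def margin_sampling(decision_values, batch_size=10):
    """Pick the candidates closest to any class hyperplane.

    decision_values -- {sample_id: [signed distance to each one-vs-rest
                       hyperplane]} (hypothetical output of a trained SVM)
    batch_size      -- number of samples to add per iteration (10 in the text)
    A candidate's score is the smallest absolute distance over all classes;
    the candidates with the lowest scores lie deepest inside the margin.
    """
    ranked = sorted(
        decision_values.items(),
        key=lambda item: min(abs(d) for d in item[1]),
    )
    return [sample_id for sample_id, _ in ranked[:batch_size]]


# Hypothetical decision values for three candidates and two classes.
candidates = {
    "a": [2.0, -1.5],
    "b": [0.1, -0.9],
    "c": [-0.4, 0.6],
}
selected = margin_sampling(candidates, batch_size=2)
```

In a full loop, the selected samples would be labelled, moved into the training set, and the SVM retrained before the next batch is chosen.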

Pixel-Based Classification
Table 4 shows the best pixel-based classification accuracies of the algorithms. The two unsupervised algorithms could produce results as good as those of some of the supervised algorithms when 150 spectral clusters were generated. This is usually a very large number of clusters for an image analyst to label; thus, we did not experiment with more clusters. Most supervised algorithms produce satisfactory results when the training samples are sufficient (more than 200 samples per class).
However, MLC only requires 60 pixels to reach its highest accuracy. This indicates its high level of robustness and capability of generalization. A small value of K (K = 3) for KNN is the better choice in this study, and distance-based weighting improves the KNN results. For the simple classification tree algorithms (CART, C4.5, and QUEST), minNumObj is the minimum number of samples at a leaf node, which determines when to stop growing the tree. All three simple tree algorithms achieve high accuracies when this value is less than 10. In other words, they all grow big trees and then prune them. However, LMT needs a large minNumInstances to build the tree. For RF, numFeatures is the number of features randomly selected at each node and numTrees is the number of trees generated. Usually, the suggested value of numFeatures is √N, where N is the number of features [43]. However, in this research, we find that a value smaller than √N is more suitable. For SVM, we used the radial basis function (RBF) kernel; the space affected by each support vector shrinks as the kernel parameter gamma increases. A slightly large gamma (2^3, 2^4) is the best choice for this research, which means that more support vectors are used to divide the feature space. MinStdDev in RBFN is the minimum number of standard deviations for the clusters, controlling the width of the Gaussian kernel function as gamma does in SVM. numCluster is the number of clusters, determining the data centers of the hidden nodes. In this research, we found that a numCluster equal to or slightly greater than the number of classes is a better choice. BagSizePercent in Bagging controls the percentage of training samples randomly sampled from the training set with replacement. The results show that using 60%-80% of the training set achieved better results. It is similar to weightThreshold in Adaboost, but the latter resamples the training set according to the weights of the last iteration. Adaboost achieves good classification results using only 10 iterations. For SGB, bag.fraction controls the fraction of the training set randomly selected without replacement. The best value of the sampling fraction is 0.2. This reduces the correlations between models at each iteration. The best shrinkage value, which is the learning rate, is 0.1.

Table 4. Best classification accuracy for each algorithm using the pixel-based approach (columns: Algorithm, Parameter Choice, Accuracy).
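The remark that a larger gamma narrows the region influenced by each support vector follows from the usual form of the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2). The short sketch below evaluates this kernel at a fixed distance for the two gamma values mentioned for SVM; the function name and the sample points are illustrative assumptions, and actual kernel values depend on how the band values are scaled.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical feature vectors half a unit apart (scaled features assumed).
support_vector = (0.0, 0.0)
neighbor = (0.5, 0.0)

# Kernel value at the same distance for gamma = 2^3 and gamma = 2^4.
influence_gamma_8 = rbf_kernel(support_vector, neighbor, gamma=2 ** 3)
influence_gamma_16 = rbf_kernel(support_vector, neighbor, gamma=2 ** 4)
```

At the same distance the kernel value is exp(-2) for gamma = 2^3 and exp(-4) for gamma = 2^4, so doubling gamma squares the attenuation and each support vector affects a smaller neighborhood, which is consistent with more support vectors being needed to partition the feature space.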
From Table 4 we can see that the best classification accuracy for the 6-band case is achieved by logistic regression, followed closely by the maximum likelihood classifier, neural network, support vector machine, and logistic model tree algorithms. In contrast, CBEST and KNN produced the lowest accuracies. The range of the Kappa coefficient from the lowest to the highest is 0.049. For the 4-band case, in general, there is a 0.02 to 0.04 difference in Kappa for each algorithm, confirming that there is indeed accuracy loss with fewer spectral bands. However, in this experiment, the accuracy drop is quite small, implying that the inclusion of the two middle infrared bands of the TM does not add much separability to the classification of our classes. The maximum likelihood classifier produced the highest accuracy of 0.873 for the 4-band case, only 0.026 lower than the highest accuracy in the 6-band case. The accuracy range for the 4-band case is between 0.818 and 0.873.

Object-Oriented Classification
Table 5 shows the best classification accuracies using the object-oriented method. The results of this kind of classification depend largely on the segmentation [44]. The classification accuracies are the highest when the segmentation scale is set to 5 (the smallest). The best performer is SGB, with an accuracy improvement of 0.025 over the best pixel-based classification result. This is followed closely by RF. The accuracies decrease as the threshold increases. A higher threshold produces larger objects. For the TM image, which has a resolution of 30 m, fragmentation is relatively high in this urban area. A high threshold brings more mixed information into the segments under this classification system. As small segments are relatively homogeneous, the classifiers utilizing statistical properties of the segments rather than individual pixel values improved the results.
Comparing Table 5 with Table 4, we can see that all results are improved by the object-oriented approach using spectral features only. Among them, SGB produced the best results, followed by RF, C4.5, LMT, LR, and MLC. From another perspective, these algorithms could deal with high-dimensional data.

Most Common Errors among the Classifiers
Figure 3 shows the test pixels of different classes that have been misclassified at least once. The clusters of repeatedly misclassified pixels are mainly found in the urban areas. The red circled area is the center of Guangzhou, where residential areas, forest, commercial buildings, and construction land are mixed. The blue circled area is the new urban district, where bare land and industrial parks are mixed. Most of the algorithms perform poorly in these complex areas because a class in such areas can easily exhibit a wider range of spectral characteristics than it normally would in a natural environment. In addition, in the green circled area, there are black greenhouses and an iron and steel enterprise. They are misclassified as residential area or water. Figure 4 shows that the following groups of classes are more easily mixed in the feature space: residential area and water; natural forest and orchard; industrial/commercial with cleared land/land under construction, residential, and water; and farmland and idle land.
Tables 6 and 7 show that residential area and water, and residential area and industrial/commercial, are well distinguished by ISODATA, CBEST, MLC, and LR. RBFN is good at distinguishing forest from orchard, while SVM is good at classifying industrial/commercial and cleared land/land under construction. LR, SVM, and LMT can better distinguish farmland and idle land.

The Comparison of the Algorithms Using Pixel-Based Method
Figure 5 summarizes classification accuracies from all the parameter combinations listed in Table 2 and with different-sized training sets. The two unsupervised classifiers are not included as they were only tested with two parameter settings. We can see that all algorithms tested in this research could achieve high accuracies with sufficient training samples and proper parameters. MLC and logistic regression perform better than the other algorithms, as their accuracy range is narrow and they can be easily set up to produce a high accuracy. Another traditional algorithm, K-nearest neighbor, does not reach accuracies as high as these two. Judging from the ranges of the boxes, MLC, LR, and LMT are the most stable of all the algorithms. For the tree classifiers, the box ranges are large. They are sensitive to the selection of parameters and training samples. C4.5 and CART as tested in this research both select only one feature to split on at each node, while QUEST uses a linear combination of features to split classes. The latter divides the feature space more reasonably and flexibly when the spectral distribution is complex. RF, as an advanced tree algorithm, uses the Bagging algorithm to generate different training sample sets and ensembles the different trees created from these training sets. As shown in the figure, RF is superior to Bagging (C4.5) and the other simple trees. Compared with Bagging, RF splits each node using randomly selected features. This can reduce the correlation between the trees, and thus improve the stability of the classification results. SVM and RBFN show similar performance in our experiments, but the parameters of these two classifiers are difficult to set. In general, users may not get the most out of these two algorithms because of the difficulties in parameter setting.
Adaboost shows better results than Bagging. It focuses on the samples wrongly classified in the previous iteration rather than on randomly selected samples. Bagging and Adaboost classifiers are both built on different training sample sets. Their maximum accuracies are not higher than that of C4.5, indicating a larger variability of single tree classifiers like C4.5 but better stability of ensemble classifiers built through Bagging and Boosting. SGB is another boosting algorithm, and it outperformed Adaboost. It fits an additive function to minimize the residuals at each iteration. It relies on a small, randomly selected data set, while Adaboost relies on the incorrectly classified samples. LMT is built on different classifiers. The algorithm is a tree classifier that builds logistic regression models at its leaves. It takes advantage of the decision tree by building logistic regression models in small and relatively pure subspaces.
The training sets have already been divided into smaller subclasses according to their spectral characteristics. Therefore, LMT does not fully show its advantage.
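The two resampling schemes contrasted in this comparison, Bagging's draw with replacement and SGB's draw without replacement, can be sketched as follows. The function names are hypothetical, the parameter names only mirror the BagSizePercent and bag.fraction settings discussed in the text, and the base learners themselves are not shown.

```python
import random


def bootstrap_sample(indices, bag_size_percent, rng):
    """Bagging-style draw: sample WITH replacement, so duplicates can occur."""
    n = round(len(indices) * bag_size_percent / 100)
    return [rng.choice(indices) for _ in range(n)]


def stochastic_subsample(indices, bag_fraction, rng):
    """SGB-style draw: sample WITHOUT replacement, so every index is unique."""
    n = round(len(indices) * bag_fraction)
    return rng.sample(indices, n)


rng = random.Random(0)
training_indices = list(range(100))  # hypothetical training set of 100 samples

bag = bootstrap_sample(training_indices, bag_size_percent=70, rng=rng)
subset = stochastic_subsample(training_indices, bag_fraction=0.2, rng=rng)
```

Each base model in an ensemble would be trained on a fresh draw. A small fraction such as 0.2 means that consecutive models see largely different samples, which is one way to read the statement that subsampling reduces the correlation between models.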

Algorithm Performances with Low-Quality Training Samples
The quality of training samples is reduced as the segmentation threshold increases, because segment statistics become increasingly contaminated by information from other classes. We could use this to assess algorithm performance. An algorithm showing good results at different thresholds performs well with low-quality training sets. Figure 8 shows that the MLC, LR, RF, SVM, LMT, and SGB algorithms could deal with the contaminated information better than the others. They could build more robust classifiers from weak training sets. In comparison, the Adaboost algorithm shows good results only when the segmentation scale is smaller than 15. KNN results decrease sharply when the segmentation scale is greater than 10. As described above, MLC, LR, SVM, RBFN, and LMT could perform well using a small training set. With deteriorating training samples, however, only MLC, LR, SVM, and LMT could still perform well.

Summary
In this study, we compared 15 classification algorithms in the classification of the same Landsat TM image acquired over Guangzhou City, China, using the same classification scheme. The algorithms were tested in pixel-based and segment-based classification. In the pixel-based decision making, the algorithms were tested with two band sets: a 4-band set including only the visible and near infrared TM bands and a 6-band set with all TM bands excluding the thermal band. All supervised classifiers were tested with 12 sets of different sized training samples. In the segment-based decision making, the algorithms were tested with different segment sizes determined by different scale factors. All tests were evaluated with the same set of test samples, with the overall accuracy measured by the Kappa coefficient. The results can be summarized as follows:
(1) The 4-band set of TM data, which excludes the two middle infrared bands, resulted in Kappa accuracies between 0.818 and 0.873. The inclusion of the two middle infrared bands in the 6-band case raised this range to between 0.850 and 0.899. This indicates that the potential loss of overall accuracy in urban and rural-urban fringe environments due to the lack of middle infrared bands could be within 3%-5%.
(2) Unsupervised algorithms could produce classification results as good as those of some of the supervised ones when a sufficient number of clusters are produced and the clusters can be identified by an image analyst who is familiar with the study area. The unsupervised algorithms produced Kappa accuracies better than 0.841 for the eight land cover and land use classes.
With the number of new algorithms increasing rapidly, there is a need to assess their performance and sensitivities in various kinds of environments. This need is best addressed by developing standard image sets with adequate classification schemes and sufficiently representative training and testing samples. This research represents one such attempt. However, more datasets containing high quality training and test samples should be established for different types of remotely sensed data over typical environments in the world to support more objective assessment of new algorithms. Only when more comprehensive test data sets covering the major environmental types of the world are available can we make more appropriate selections of algorithms for a particular application of remote sensing classification. Another important aspect that has not been assessed in this research is feature extraction and the use of non-spectral features, whose effectiveness has been demonstrated in the literature [45][46][47][48][49][50][51]. Furthermore, the use of multisource data, including optical, thermal, and microwave data, in urban land classification should be systematically evaluated [52,53]. Lastly, more analysis of the representativeness of training samples should be done in developing algorithm test image sample sets [54]. These will be evaluated in future research.

Figure 1. The study area. The image displays the green, red, and near infrared bands of the TM data with blue, green, and red color guns.

Figure 2. The distribution of test samples.

Figure 3. The distribution of misclassified test samples.

Figure 4. The distribution of the test samples in the feature space (principal component analysis (PCA) is used to reduce the dimension of the data space).

Figure 5. Comparison of the pixel-based supervised classification.

Figure 6 shows the impact of training sample sizes on different classifiers. When the number of training samples is very small (e.g., 20 or 40 samples), no algorithm performs well. The algorithms most affected by training sample size are the classification tree algorithms except for RF and Adaboost. They need sufficient samples to build the trees. MLC, LR, SVM, and LMT are the least affected algorithms. They could produce relatively high accuracies using a smaller training set, and achieve stable results when there are 60 or more samples per class. All the algorithms except for MLC are improved to varying degrees by adding training samples. Generally speaking, MLC, LR, SVM, RBFN, and LMT could produce good results with small sample sets.

Figure 6. Pixel-based supervised classification with different sizes of training sets.
Figure 7 shows that with active learning, the classification results are satisfactory once the total number of training samples reaches 560-840 (40-60 samples for each subclass), about 1/6-1/4 of the entire training set. The results are relatively stable when the training sample size increases further. In other words, only 1/6-1/4 of the whole training set carries useful information. In contrast, when we added training samples randomly without any rules, the results improved slowly and became stable only when the training samples were representative enough (at about 2,800 samples, 200 samples for each subclass). That is why most of the algorithms achieve their highest accuracies when there are more than 200 samples per class. Under such circumstances, using an active learning algorithm to select training samples is an efficient way to achieve optimal results without a large amount of trial and error. Active learning should be applied for representative sample selection to feed subsequent classification algorithms.
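The sample counts and fractions quoted above follow directly from the 14 subclasses and the 240 samples collected per subclass. The short check below reproduces the arithmetic; the constant and function names are chosen here for illustration only.

```python
from fractions import Fraction

N_SUBCLASSES = 14          # training subclasses in the classification system
FULL_PER_SUBCLASS = 240    # samples collected per subclass
FULL_SET = N_SUBCLASSES * FULL_PER_SUBCLASS  # size of the entire training set


def total_samples(per_subclass):
    """Total training samples when every subclass contributes per_subclass."""
    return N_SUBCLASSES * per_subclass


def share_of_full_set(per_subclass):
    """Exact fraction of the entire training set that this total represents."""
    return Fraction(total_samples(per_subclass), FULL_SET)


active_low = total_samples(40)     # lower bound reported for active learning
active_high = total_samples(60)    # upper bound reported for active learning
random_stable = total_samples(200) # point where random selection stabilizes
```

With 40 and 60 samples per subclass the totals are 560 and 840, exactly 1/6 and 1/4 of the 3,360-sample set, while random selection stabilizes only at 2,800 samples, or 5/6 of the set.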

Figure 7. Active learning result and results based on randomly selected training samples.

(3) Most supervised algorithms could produce high classification accuracies if the parameters are properly set and the training samples are sufficiently representative. Under this condition, the MLC, LR, and LMT algorithms are more suitable for users. These algorithms are easy to use and deliver relatively stable performance.
(4) Insufficient (less representative) training samples caused large accuracy drops (0.06-0.15) in all supervised algorithms. Among all the algorithms tested, MLC, LR, SVM, and LMT are the least affected by the size of the training set. When using a small training set, MLC, LR, SVM, RBFN, and LMT performed well.
(5) In segment-based classification experiments, most algorithms performed better when the segment size was the smallest (with a scale factor of 5). At the scale of 5, SGB outperformed all other algorithms by producing the highest Kappa value of 0.924, followed by RF. All algorithms are less sensitive to the large increase of data dimensionality. The MLC, LR, RF, SVM, LMT, and SGB algorithms are the best choices for the classification. They could produce relatively good accuracy at different scales.

Table 1. Land use classification system.

Table 2. Algorithm parameter set up and source of codes.

Table 3. The features used in the object-based classification method.
Feature
Maximum values of the segments for each spectral band (6 bands)
Mean or average values of the segments for each spectral band (6 bands)
Minimum values of the segments for each spectral band (6 bands)
Standard deviations of the pixels in the segments for each spectral band (6 bands)

Table 5. Best classification accuracy for each algorithm using the object-oriented approach.

Table 6. The number of misclassified pixels by different algorithms.

Table 7. The confusion matrix of the best result.