An Evaluation of Different Training Sample Allocation Schemes for Discrete and Continuous Land Cover Classification Using Decision Tree-Based Algorithms

Land cover mapping for large regions often employs satellite images of medium to coarse spatial resolution, which complicates mapping of discrete classes. Class memberships, which estimate the proportion of each class for every pixel, have been suggested as an alternative. This paper compares different strategies of training data allocation for discrete and continuous land cover mapping using classification and regression tree algorithms. In addition to measures of discrete and continuous map accuracy, the correct estimation of the area is another important criterion. A subset of the 30 m National Land Cover Dataset of 2006 (NLCD2006) of the United States was used as the reference set to classify nadir BRDF-adjusted surface reflectance time series of MODIS at 900 m spatial resolution. Results show that sampling of heterogeneous pixels and sample allocation according to the expected area of each class is best for classification trees. Regression trees for continuous land cover mapping should be trained with random allocation, and predictions should be normalized with a linear scaling function to correctly estimate the total area. Of the tested algorithms, random forest classification yields lower errors than boosted trees of C5.0, and Cubist shows higher accuracies than random forest regression.


Introduction
Land cover classification from satellite images is one of the primary fields in remote sensing. Finer spatial resolution data (10-30 m), in particular from Landsat, has been widely used for regional studies of land cover and change, and very fine spatial resolution imagery (<1 to 5 m) plays an important role in local studies. Wall-to-wall mapping of large areas with 10-30 m data is expensive in terms of financial and computational resources, and there are only a few efforts for large areas, such as the National Land Cover Dataset (NLCD) of the United States [1], the National Land Cover of South Africa (NLC) [2], or the pan-European maps of the European Coordination of Information on the Environment (CORINE) [3]. Recently, global forest cover [4,5] and global land cover maps [6] were derived from 30 m Landsat data. Most macro-regional, continental, and global applications, however, employ data of relatively coarse spatial resolution (250-1000 m) from Terra-Aqua/MODIS, SPOT/VEGETATION, NOAA/AVHRR, and ENVISAT/MERIS. Besides fewer difficulties in handling data volumes, the increased number of available cloud-free images allows for generation of data composites, and the dense temporal information helps to discern classes by their distinct phenological patterns. The latter is advantageous for mapping across various ecoregions where classes are likely to be represented by multiple clusters in feature space [7,8].
The lack of spatial detail of coarse resolution data imposes limitations for accurate land cover characterization [9-11]. The assignment of discrete classes to coarse resolution cells cannot adequately describe spatially complex areas [12]. The likelihood for mixed pixels is a function of the spatial resolution, the thematic detail to be mapped, and the size and spatial pattern of land cover patches [13]. However, discrete class assignment of mixed pixels not only imposes serious difficulties on coarse image data classification but also alters the area estimation. Several studies have noted that at coarser spatial resolution dominating classes with large patches yield higher area proportions than expected at the expense of dispersed, small-patch classes [7,13,14]. Studies have postulated that area calculations from fractional estimates are more accurate than from discrete classifications [7,15].
Several algorithms have been explored for large area mapping with coarse resolution data. For instance, Fernandes et al. [10] compared a hard classifier, artificial neural networks (ANN), linear spectral unmixing, clustering, and linear regression for fractional class estimation and found differences of approximately 20% compared to fine resolution reference data. Studies focusing on urban land cover compared advanced regression algorithms [16] or various discrete classifiers [17]. Several studies for the same global 1° spatial resolution AVHRR Normalized Difference Vegetation Index (NDVI) dataset have shown that classification of 11 land cover classes with decision trees (DT) performs best with 93% overall accuracy [18] compared to Maximum Likelihood classification (78%) [19] and ANN (85%) [20]. Most automated processing systems for macro-regional to global land cover characterization employ DT approaches [1,12,21-25]. There are two general types of DT: classification trees (CT) with a discrete target value and regression trees (RT) with a continuous result.
Besides the classification algorithm, features and training data for supervised image classification have to be defined. Several studies address feature generation and selection processes [26-29] and various aspects of training data selection [17,30-32]. However, only a few studies have focused on training data allocation schemes, such as between-class sample balance or the structure of heterogeneous samples. In particular, classification trees may suffer from an unbalanced sample size between classes because the number of samples in each leaf defines the class [33,34], and several allocation schemes have been recommended [24,26,27,32]. A few studies recommend heterogeneous training data for discrete classification [35,36], but most large-area mapping projects select homogeneous areas for training [7,22,27]. For regression techniques, the impact of non-random selection of heterogeneous training data is unknown, and the impact of combined tree models for several classes on correct area estimation has been widely overlooked.
The objective of this study is to compare the accuracy and area estimations of several decision tree approaches trained with specific sample allocation schemes from an existing higher spatial resolution map for discrete and continuous land cover mapping. Specific goals are to: (1) evaluate the performance of DT algorithms using two common approaches of classification and regression trees; (2) for classification trees, compare (a) heterogeneous training pixels with different allocation schemes against homogeneous pixels and (b) schemes of sample allocation between classes; and (3) for regression trees, assess (a) sample allocations for heterogeneous samples and (b) normalized and non-normalized results to combine multiple models.

National Land Cover Data of the United States from Landsat Images
The National Land Cover Data set (NLCD) of the United States is a 30 m Landsat TM/ETM+-based classification with 16 classes produced by the United States Geological Survey (USGS). There are two maps (1992, 2001) [1] and map updates for 2006 and 2011 [37-39]; the 2006 update was used in this study. NLCD2006 has an overall accuracy of 78% [40] and a small class-specific minimum mapping unit [37]. NLCD data are provided in Albers Equal Area (AEA) projection with NAD83 datum, standard parallels at 23.5°N and 45.5°N, and an origin of latitude and longitude at 23°N and 96°W, which is also the map projection of this study. In this study, a subset of 60,000 × 30,000 pixels, extending from western Kansas (101°W, 39°N) to Jacksonville, FL (80°W, 30°N), was extracted (see Figure 1). In addition, ten Landsat images (Figure 1) were downloaded for accuracy assessment and evaluation of spatial co-registration between MODIS and Landsat from which NLCD2006 was derived.

MODIS Data
The MODIS nadir bidirectional reflectance distribution function (BRDF)-adjusted surface reflectance (NBAR) product with 926.6 m spatial resolution (MCD43B4) and the corresponding quality assessment science dataset (MCD43B2) were downloaded from the Land Processes Distributed Active Archive Center (LP DAAC) for the period of October 2005 to March 2007. The NBAR product applies the BRDF parameters to cloud-free and atmospherically corrected surface reflectance data (bands 1 to 7) with a solar angle at local solar noontime. This mimics a nadir-viewing instrument and results in a stable and consistent dataset [41,42]. MCD43 products combine images of Terra and Aqua acquisitions over a 16-day period but are produced every eight days by rolling compositing. Three tiles (h09v05, h10v05, h11v05; Figure 1) were mosaicked, resampled to 900 m using nearest neighbor (NN) resampling, subset to 2000 × 1000 pixels, and projected to the AEA projection with projection parameters equal to NLCD. The cell size of 900 m, as compared to the more commonly used 1000 m, was chosen to nest the grid with 30 m cells from NLCD; thus a block of 30 × 30 cells of NLCD2006 corresponds to one MODIS cell at 900 m.

Methods
A common classification process similar to Blanco et al. [7], Clark et al. [21], and Colditz et al. [43] was used. Figure 2 illustrates this process, which can be divided into five general blocks for (1) feature generation; (2) reference data processing; (3) training data sampling; (4) classification/regression; and (5) accuracy assessment and area calculation.

Feature Sets
The product quality for each pixel was analyzed using the Time Series Generator (TiSeG) [44]. Only best observations for each band with a generally good quality and no snow cover were selected, and data gaps were temporally interpolated with a linear function. Additionally, the NDVI was computed from red and near infrared bands.
The usefulness of metrics, which are univariate statistics computed over a defined period, for land cover mapping has been demonstrated in several other studies [7,21,33,43]. The mean, standard deviation, minimum and maximum value, and range between the minimum and maximum for the period of the entire year, two six-month, three four-month, and four three-month periods were computed from time series of each spectral band and the NDVI. This results in a feature set of 400 variables (seven spectral bands + NDVI, five univariate statistics, 10 periods).
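The metric computation described above can be illustrated with a minimal sketch. It assumes 46 eight-day composites per year and splits the year into periods of nearly equal length; the function names (`period_slices`, `temporal_metrics`) and the random input are hypothetical and only serve to show how 8 layers × 5 statistics × 10 periods yield 400 features.

```python
import numpy as np

# Assumption: 46 eight-day composites per year; 8 layers = 7 bands + NDVI.
N_STEPS, N_LAYERS = 46, 8

def period_slices(n_steps):
    """Slices for 1 annual, 2 six-month, 3 four-month and 4 three-month periods."""
    slices = []
    for parts in (1, 2, 3, 4):
        edges = np.linspace(0, n_steps, parts + 1).round().astype(int)
        slices += [slice(a, b) for a, b in zip(edges[:-1], edges[1:])]
    return slices

def temporal_metrics(series):
    """series: array (time, layer). Returns one flat feature vector per pixel."""
    feats = []
    for s in period_slices(series.shape[0]):
        chunk = series[s]
        lo, hi = chunk.min(axis=0), chunk.max(axis=0)
        # mean, standard deviation, minimum, maximum, range
        feats += [chunk.mean(axis=0), chunk.std(axis=0), lo, hi, hi - lo]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
pixel_series = rng.random((N_STEPS, N_LAYERS))   # synthetic time series of one pixel
features = temporal_metrics(pixel_series)
print(features.shape)
```

The exact period boundaries used in the study are not stated, so the rounding of period edges here is an illustrative choice.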

Figure 2. Process flow for data processing and map assessment. OA: overall accuracy, MAD: mean absolute difference, CT: classification tree, RT: regression tree, RF-C: Random Forest Classification, RF-R: Random Forest Regression, prop: proportion.

Spatial Co-Registration
A prerequisite of this study is near-to-perfect spatial co-registration between MODIS and the Landsat images from which NLCD2006 was mapped. Spatial co-registration errors were estimated with an iterative two-step approach: (1) coarsening Landsat data to the MODIS grid cell size and (2) correlation. This process was repeated within a defined window, displacing Landsat data by a specified interval in x- and y-direction, and the offset with the highest correlation coefficient indicates the displacement between both images [45]. In this study, the correlations are based on the NDVI from downloaded Landsat images and the closest available MODIS composite.
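The two-step search (coarsen, then correlate, repeated over a window of displacements) might look roughly like the following sketch. It works on synthetic arrays, assumes block averaging for the coarsening step, and uses hypothetical names (`coarsen`, `best_offset`); the actual implementation in [45] may differ.

```python
import numpy as np

def coarsen(fine, factor):
    """Average non-overlapping factor x factor blocks of a fine-resolution array."""
    rows, cols = fine.shape[0] // factor, fine.shape[1] // factor
    trimmed = fine[:rows * factor, :cols * factor]
    return trimmed.reshape(rows, factor, cols, factor).mean(axis=(1, 3))

def best_offset(fine, coarse, factor, max_shift, step=1):
    """Return the shift (dx, dy) in fine pixels with the highest correlation.

    fine must carry a margin of max_shift fine pixels around the coarse extent.
    """
    height, width = coarse.shape[0] * factor, coarse.shape[1] * factor
    best_shift, best_r = None, -np.inf
    for dy in range(-max_shift, max_shift + 1, step):
        for dx in range(-max_shift, max_shift + 1, step):
            r0, c0 = max_shift + dy, max_shift + dx
            window = fine[r0:r0 + height, c0:c0 + width]
            r = np.corrcoef(coarsen(window, factor).ravel(), coarse.ravel())[0, 1]
            if r > best_r:
                best_shift, best_r = (dx, dy), r
    return best_shift, best_r

# Synthetic demonstration: the coarse image is built with a known displacement.
rng = np.random.default_rng(42)
factor, pad = 30, 3
fine = rng.random((10 * factor + 2 * pad, 10 * factor + 2 * pad))
true_dx, true_dy = -1, 2
coarse = coarsen(fine[pad + true_dy:pad + true_dy + 300,
                      pad + true_dx:pad + true_dx + 300], factor)
shift, r = best_offset(fine, coarse, factor, max_shift=pad)
print(shift, round(r, 3))
```

With 30 m fine pixels, a shift of one pixel corresponds to 30 m, which matches the order of magnitude of the offsets reported later for the Landsat images.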

NLCD2006 Data
For this study, 15 classes present in the subset of 30 m NLCD2006 in the southeastern United States were combined into a final set of nine classes (Table 1). Next, blocks of 30 × 30 cells that spatially match one MODIS pixel were aggregated. Homogeneity H describes for each 900 m MODIS pixel the area proportion of each land cover class from corresponding NLCD2006 data:

$$H_i = \frac{100}{30 \times 30} \sum_{x=1}^{30} \sum_{y=1}^{30} \left[ c(x,y) = i \right] \qquad (1)$$

In Equation (1), x and y refer to individual pixels in NLCD2006, and the expression c(x,y) = i counts all pixels in that block that correspond to class i. As a result, homogeneity, expressed in percentage, represents for each class the area proportion at the coarse grid.
The argument maximum, argmax(H), also known as the majority rule, extracts the dominant class for each MODIS pixel. The corresponding area proportion, the homogeneity value of that dominant class, max(H), indicates the level of dominance in percent.
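As a rough illustration of the aggregation, the following sketch computes homogeneity, argmax(H), and max(H) for a synthetic label raster. Class codes 0 to 8 and the function name `homogeneity` are assumptions for this example only.

```python
import numpy as np

def homogeneity(fine_classes, n_classes, factor=30):
    """Per-class area proportion (%) of each coarse cell.

    Returns an array of shape (coarse_rows, coarse_cols, n_classes).
    """
    rows, cols = fine_classes.shape[0] // factor, fine_classes.shape[1] // factor
    blocks = fine_classes[:rows * factor, :cols * factor]
    blocks = blocks.reshape(rows, factor, cols, factor)
    share = [(blocks == i).mean(axis=(1, 3)) for i in range(n_classes)]
    return 100.0 * np.stack(share, axis=-1)

rng = np.random.default_rng(1)
nlcd = rng.integers(0, 9, size=(60, 90))   # synthetic 30 m labels, classes 0..8
H = homogeneity(nlcd, n_classes=9)

dominant_class = H.argmax(axis=-1)   # majority rule, argmax(H)
dominance = H.max(axis=-1)           # level of dominance in percent, max(H)
print(H.shape, dominant_class.shape, float(dominance.min()))
```

Because the nine proportions of a cell always sum to 100%, the dominant class can never fall below 100/9 ≈ 11.1%.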

Training of Classification Trees
A total of 5400 training samples (0.25% of the study area) were allocated from the homogeneity of NLCD2006 with a minimum distance of five pixels apart. For homogeneous training data, the required number for each class was allocated with H = 100%, which was decreased if this number could not be achieved [27]. Heterogeneous training data were allocated uniformly across six bins, with one bin for H = 100% and five bins for 100% > H ≥ 50% with 10% intervals. An alternative heterogeneous training set used random allocation.
With respect to between-class sample balance, this study compares (1) random sampling; (2) allocation proportional to the expected area as obtained from NLCD2006; and (3) equal number of samples for all classes. Since random and area-proportional allocation can lead to a very low number of samples for scarce classes, a minimum of 50 samples per class (1% of all samples) was required.
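One possible way to combine area-proportional allocation with a per-class minimum is sketched below. How the study redistributed samples after raising scarce classes to the minimum is not stated; this sketch assumes the remaining budget is shared proportionally among the other classes and rounded with the largest-remainder method. The class labels and area shares are hypothetical.

```python
def allocate(total, shares, minimum=50):
    """Allocate `total` samples proportionally to `shares` with a per-class minimum."""
    fixed = {}
    while True:
        free = {c: s for c, s in shares.items() if c not in fixed}
        budget = total - minimum * len(fixed)
        scale = budget / sum(free.values())
        low = [c for c, s in free.items() if s * scale < minimum]
        if not low:
            break
        for c in low:            # scarce classes are raised to the minimum
            fixed[c] = minimum
    exact = {c: s * scale for c, s in free.items()}
    alloc = {c: int(v) for c, v in exact.items()}
    leftover = budget - sum(alloc.values())
    # largest-remainder rounding so that the total is met exactly
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    alloc.update(fixed)
    return alloc

# Hypothetical area shares of nine classes (not values from this study).
shares = {"A": 0.004, "B": 0.03, "C": 0.06, "D": 0.10, "E": 0.12,
          "F": 0.15, "G": 0.18, "H": 0.156, "I": 0.20}
proportional = allocate(5400, shares)
equal = {c: 5400 // len(shares) for c in shares}   # 600 samples per class
print(proportional)
```

In this example class "A" would receive only about 22 samples under strict proportionality and is therefore raised to 50.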

Training of Regression Trees
As a separate regression tree has to be trained for each class, the issue of between-class allocation becomes irrelevant. More important are allocation schemes across different levels of homogeneity, which were divided into 12 bins (H = 0%, ten bins with 0% < H < 100% with 10% intervals, and H = 100%). For each class, 5400 samples were allocated, testing three schemes: (1) random allocation with a minimum of 50 samples per bin (Random-50); (2) random allocation with no minimum per bin (Random-0); and (3) a uniform allocation with 450 samples per bin.
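The 12-bin structure and the uniform scheme can be sketched as follows. The bin numbering (0 for H = 0%, 1 to 10 for the ten intermediate bins, 11 for H = 100%) and the function names are assumptions made for illustration; the spacing constraint between samples is ignored here.

```python
import numpy as np

def homogeneity_bin(h):
    """Map H (%) to 12 bins: 0 -> H = 0, 1..10 -> 10%-wide bins in (0, 100), 11 -> H = 100."""
    h = np.asarray(h, dtype=float)
    inner = 1 + np.minimum(np.floor(h / 10).astype(int), 9)
    return np.where(h == 0, 0, np.where(h == 100, 11, inner))

def sample_uniform(h, per_bin, rng):
    """Draw up to `per_bin` pixel indices from every homogeneity bin."""
    bins = homogeneity_bin(h)
    picks = []
    for b in range(12):
        idx = np.flatnonzero(bins == b)
        picks.append(rng.choice(idx, size=min(per_bin, idx.size), replace=False))
    return np.concatenate(picks)

rng = np.random.default_rng(7)
# Synthetic homogeneity values of one class for 20,000 pixels.
h_class = np.concatenate([np.zeros(8000), rng.uniform(0.1, 99.9, 11000), np.full(1000, 100.0)])
chosen = sample_uniform(h_class, per_bin=450, rng=rng)
print(chosen.size)
```

With 12 bins and 450 samples per bin the uniform scheme reaches the 5400 samples per class mentioned above, provided every bin holds enough pixels.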

Classification Trees
Classification trees (CT) apply recursive partitioning to a set of discrete (categorical) training data with the goal of reducing the impurity among classes by selecting an appropriate discriminating feature and threshold [46,47]. Commonly, classification trees generate discrete maps in which the class is defined by the highest proportion of samples in each terminal node. However, additional strategies such as randomization [48] or the use of class frequency at the leaf level together with boosting [27,42] can be used to derive class memberships and thus continuous classifications.
C5.0 decision trees (www.rulequest.com) [49] in the tree mode together with 10-fold boosting were used as the simple model. For each tree, the proportion of each class in every leaf was calculated, and for each class the trees were combined to estimate class memberships [43].
Random Forest Classification (RF-C) [48] uses the classification and regression tree (CART) algorithm [46] as base classifier. The version provided for R [50] was employed with default options, i.e., for each tree a set of 63.2% of the samples is extracted, the number of features is limited to 20 (the floored square root of the total features), and trees are grown with the Gini index until each leaf is pure. Class memberships were derived by the combination of 1000 trees.
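The default values quoted for Random Forest follow from simple arithmetic, as the short sketch below shows. It assumes that the 63.2% refers to the expected share of distinct samples in a bootstrap draw with replacement, which is the usual reading of that number; the variable names are illustrative.

```python
import math

n_features = 400   # size of the metric feature set
n_samples = 5400   # training samples

# Features tried at each split for classification: floor(sqrt(p)).
mtry_classification = math.floor(math.sqrt(n_features))

# Features tried at each split for regression: floor(p / 3).
mtry_regression = n_features // 3

# Expected share of distinct samples when drawing n samples with replacement:
# 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n.
unique_share = 1 - (1 - 1 / n_samples) ** n_samples

print(mtry_classification, mtry_regression, round(unique_share, 3))
```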

Regression Trees
In contrast to classification trees, regression trees apply recursive partitioning to a set of continuous training data. They largely follow the same logic but use the reduction in standard deviation as the criterion for feature and threshold selection. Regression models were generated for membership estimation of each class. An equation of Xu et al. [24] normalizes the regression value RV for class i among all classes J. This linear scaling function (Equation (2)) ensures that the membership total for each pixel will be 100%. The majority rule was used for transformation of regression results to discrete maps.

$$RV_i^{\mathrm{norm}} = \frac{RV_i}{\sum_{j=1}^{J} RV_j} \times 100 \qquad (2)$$

Cubist, a rule-based classifier (www.rulequest.com) [51,52], was employed as the simple regression model. Initially, a regression tree similar to regression in CART is generated. Subsequently, the tree is simplified and transformed into a set of rules with multiple conditions, and a multivariate regression equation estimates the numeric value. Thus, models from Cubist are not regression trees in a strict sense; however, they yield promising results and were successfully employed in remote sensing [1,25,34,53,54]. The options for unbiased value estimation, no extrapolation of data values beyond the training data range, and five committee models (similar to five-fold boosting) were selected.
Random Forest Regression (RF-R) uses the regression tree option of CART as base classifier [46]. Unlike Cubist, CART regression trees use the value estimate at the leaf level. The version provided for R [49] was executed with default options, i.e., for each tree a set of 63.2% of the samples is extracted, the number of features is 133 (total features/3, floored), and trees are grown with the standard deviation as splitting criterion and a minimum node size of 5 samples; the numeric value is the mean of all samples in a leaf. Class memberships were derived as the average of 1000 trees.
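The linear scaling of per-class regression values and the majority rule can be sketched in a few lines. The sketch assumes non-negative regression values with a positive total per pixel; the array contents and function names are hypothetical.

```python
import numpy as np

def normalize_memberships(rv):
    """Scale per-class regression values so that each pixel totals 100%.

    rv: array (n_pixels, n_classes); assumes a positive row total.
    """
    return 100.0 * rv / rv.sum(axis=1, keepdims=True)

def majority_rule(memberships):
    """Discrete label per pixel: class with the highest membership."""
    return memberships.argmax(axis=1)

# Hypothetical outputs of three independent per-class regression models.
rv = np.array([[60.0, 30.0, 30.0],    # totals 120%
               [10.0, 45.0, 25.0]])   # totals 80%
normalized = normalize_memberships(rv)
print(normalized.sum(axis=1))                       # every pixel totals 100
print(majority_rule(rv), majority_rule(normalized))
```

Because each pixel is divided by a single positive constant, the ranking of classes within a pixel is unchanged, so the discrete map from the majority rule is identical before and after normalization.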

Discrete Map Assessment
From a set of potential samples, constrained to the location of Landsat path rows (Figure 1) and having a homogeneity greater than 50% (H > 50%), 150 samples per stratum, i.e., the class as defined in NLCD2006, were extracted. Landsat imagery from the year 2006 and very high resolution Google Earth data as close as possible to the year 2006 served as response data. It is important to note that NLCD2006 only served for stratification to ensure that some samples will correspond to scarce and scattered classes, but it was not consulted by the analyst to assign the reference label. This approach also allowed for a better comparison among all classifications using the same reference set.
Due to ambiguity in interpretation of coarse cells, either uncertainty in the interpretation or presence of more than one land cover type in a coarse resolution cell of 900 m (mixed pixel issue), it is recommended to assign two labels [7,23]. The primary class is the most likely call, i.e., the most certain class or the class with the largest area proportion, and the alternative label indicates the potential presence of another class. In case of high certainty or presence of only one class, both labels are the same.
Discrete maps were assessed (1) using only the primary reference label or (2) using the primary + alternative reference label. In the latter case, the land cover map was considered correctly classified if it corresponded to the primary or the alternative reference label; in case of disagreement with both, the primary class was used to assign the error in the confusion matrix. The overall accuracy (OA, sum of the diagonal against the total of the error matrix) as well as user's and producer's accuracies were employed using standard formulas for confidence estimation [55].
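The two assessment variants can be expressed as one confusion-matrix routine, sketched here on a tiny hypothetical sample. Rows are assumed to hold the mapped class and columns the reference class; the stratified estimators and confidence intervals of [55] are not reproduced.

```python
import numpy as np

def confusion_matrix(mapped, primary, alternative=None, n_classes=9):
    """Rows: mapped class, columns: reference class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for k, m in enumerate(mapped):
        ref = primary[k]
        if alternative is not None and m == alternative[k]:
            ref = m   # agreement with the alternative label counts as correct
        cm[m, ref] += 1   # otherwise the primary label defines the error cell
    return cm

def overall_accuracy(cm):
    return np.trace(cm) / cm.sum()

def users_accuracy(cm):
    return np.diag(cm) / np.maximum(cm.sum(axis=1), 1)

def producers_accuracy(cm):
    return np.diag(cm) / np.maximum(cm.sum(axis=0), 1)

# Hypothetical labels for four reference samples and three classes.
mapped      = [0, 1, 2, 2]
primary     = [0, 2, 2, 1]
alternative = [0, 1, 2, 0]

oa_p = overall_accuracy(confusion_matrix(mapped, primary, n_classes=3))
oa_pa = overall_accuracy(confusion_matrix(mapped, primary, alternative, n_classes=3))
print(oa_p, oa_pa)
```

In this toy case the second sample is wrong against the primary label but matches the alternative label, which raises the overall accuracy from 0.5 to 0.75.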
Pair-wise comparisons of accuracies were performed in two ways. The McNemar test aims at the difference between correctly and incorrectly classified class allocations [56]. This test is recommended as the reference set was identical for the assessment of all maps [57], and the z-test form (Equation (3)) was implemented, where fAB and fBA indicate the frequency of samples correctly classified in map A but incorrectly in map B and vice versa.

$$z = \frac{f_{AB} - f_{BA}}{\sqrt{f_{AB} + f_{BA}}} \qquad (3)$$

The overall accuracy was tested with the standard z-test statistic (Equation (4)), where OAA and OAB are the overall accuracies of map A and map B and SDA and SDB are their respective standard deviations.

$$z = \frac{OA_A - OA_B}{\sqrt{SD_A^2 + SD_B^2}} \qquad (4)$$
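Both test statistics are short enough to sketch directly. The z-test form of the McNemar test and the z-test for two independent accuracies are written here in their commonly used forms; the input numbers are hypothetical and not results of this study.

```python
import math

def mcnemar_z(f_ab, f_ba):
    """z-form of the McNemar test.

    f_ab: samples correct in map A but incorrect in map B; f_ba: the reverse.
    """
    return (f_ab - f_ba) / math.sqrt(f_ab + f_ba)

def accuracy_z(oa_a, sd_a, oa_b, sd_b):
    """z statistic for the difference between two overall accuracies."""
    return (oa_a - oa_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)

# Hypothetical inputs for illustration only.
z_mcnemar = mcnemar_z(f_ab=40, f_ba=20)
z_oa = accuracy_z(oa_a=0.70, sd_a=0.013, oa_b=0.64, sd_b=0.013)

# |z| > 1.96 indicates a significant difference at the 5% level (two-tailed).
print(round(z_mcnemar, 3), round(z_oa, 3))
```

The McNemar statistic only uses the discordant samples, which is why two maps with similar overall accuracies can still differ significantly.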

Continuous Map Assessment
Class memberships were assessed with four measures against the homogeneity from NLCD2006 as the continuous reference set. The coefficient of correlation r is a means to evaluate the strength of agreement between the membership and reference set. The mean absolute difference (MAD, Equation (5)) addresses the absolute error in percent between the membership estimates (M) and reference set (R) for all pixels (K).

$$MAD = \frac{1}{K} \sum_{k=1}^{K} \left| M_k - R_k \right| \qquad (5)$$

For a spatial representation of the error, MAD was computed for each pixel with K being the total of classes for one pixel. The slope and intercept of the linear regression function between reference and membership indicate the dynamic range of the predicted values and the bias.
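The four measures can be computed with a few NumPy calls, as in the sketch below. It assumes the regression line predicts membership from reference (reference on the x-axis); the function names and the toy values are hypothetical.

```python
import numpy as np

def continuous_metrics(membership, reference):
    """Correlation, MAD (%), slope and intercept of membership against reference."""
    m = np.ravel(membership).astype(float)
    r = np.ravel(reference).astype(float)
    corr = np.corrcoef(r, m)[0, 1]
    mad = np.mean(np.abs(m - r))
    slope, intercept = np.polyfit(r, m, 1)   # membership = slope * reference + intercept
    return corr, mad, slope, intercept

def per_pixel_mad(membership, reference):
    """MAD across classes for each pixel; inputs of shape (n_pixels, n_classes)."""
    return np.abs(np.asarray(membership) - np.asarray(reference)).mean(axis=1)

# Toy example: predictions compress the dynamic range and add a positive bias.
reference = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
membership = 10.0 + 0.8 * reference
corr, mad, slope, intercept = continuous_metrics(membership, reference)
print(round(corr, 3), round(mad, 2), round(slope, 2), round(intercept, 2))
```

A slope below 1 together with a positive intercept, as in this toy case, means that low proportions are overestimated and high proportions underestimated.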

Area Estimation
Area estimation from discrete maps is a straightforward pixel count for class i multiplied by the area of each pixel. The area of each class from memberships is the total of all membership values times their pixel area [15]. The total absolute difference in area (AD) between reference R (NLCD2006) and classification or membership CM, with K being the total of all pixels, was calculated using Equation (6) and expressed in area and in percent against the total of the study area.
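A minimal sketch of both area estimates and their comparison follows. It assumes that the total absolute difference sums the absolute per-class area differences, which is one plausible reading of the description; a 900 m cell covers 81 ha. All names and toy values are hypothetical.

```python
import numpy as np

PIXEL_AREA_HA = 900 * 900 / 10_000   # 81 ha per 900 m cell

def area_from_discrete(labels, n_classes):
    """Pixel count per class times pixel area."""
    return np.bincount(np.ravel(labels), minlength=n_classes) * PIXEL_AREA_HA

def area_from_memberships(memberships):
    """Sum of membership values (%) per class times pixel area."""
    return np.asarray(memberships, dtype=float).sum(axis=0) / 100.0 * PIXEL_AREA_HA

def total_absolute_difference(area_reference, area_estimate):
    """Assumed form of AD: sum of absolute per-class area differences."""
    return np.abs(area_reference - area_estimate).sum()

# Two pixels, two classes: one pure pixel and one evenly mixed pixel.
reference = np.array([[100.0, 0.0],
                      [50.0, 50.0]])
area_ref = area_from_memberships(reference)
labels = reference.argmax(axis=1)                 # majority rule
area_discrete = area_from_discrete(labels, n_classes=2)

ad = total_absolute_difference(area_ref, area_discrete)
ad_percent = 100.0 * ad / (reference.shape[0] * PIXEL_AREA_HA)
print(area_ref, area_discrete, ad, ad_percent)
```

The toy case reproduces the effect noted in the introduction: the discrete map assigns the mixed pixel entirely to the dominant class and thereby inflates its area at the expense of the other class.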

Spatial Co-Registration
Table 2 shows near-to-perfect spatial co-registration between NDVI from ten Landsat images and corresponding dates of MODIS composites. The offsets are negligible, with averages of x = −3 m, y = −3 m and extremes within ±30 m. The correlation coefficients themselves are all positive and indicate a sufficiently high correlation, i.e., the spatial patterns in Landsat and MODIS NDVI are closely related to each other. This finding is an important prerequisite for the following analysis as it permits a direct relation between Landsat-based NLCD2006 maps and MODIS.
Table 2. Spatial offset between Landsat images (for their spatial location see Figure 1) and temporally corresponding composites of MODIS data using the NDVI.

Figure 3A shows the NLCD2006 map recoded to nine classes (Table 1) at 30 m spatial resolution. The map illustrates some spatial details such as the road network in Kansas that disappear in Figure 3B, which shows the spatial distribution of the dominant class at 900 m spatial resolution derived with the majority rule argmax(H). Figure 3C indicates the corresponding area proportion of the dominating class, max(H). There are distinct regional patterns with homogeneous areas in the western portion (Shrubland, Grassland, Cultivated crops), the Mississippi valley (Cultivated crops), the southern Ozark and Appalachian mountains (Deciduous forest), the Okefenokee Swamp in southern Georgia (Wetlands), and large metropolitan areas like Atlanta, Dallas-Fort Worth, and St. Louis (Developed). In particular, the southeastern region is highly heterogeneous with area proportions of the dominating class below 50%; similar heterogeneous patterns exist in eastern Texas, Oklahoma, Louisiana, and Arkansas. Table 3 shows for each class the percentage of homogeneity in 12 bins. It is evident that there are more pixels with low homogeneity, but the magnitude is different for each class. For instance, class Water only exists in selected parts of the map and thus H = 0% makes up 76.7% of the study area. Class Deciduous forest is rather ubiquitous with a proportion of 37.6% for 10% ≤ H < 60%. Due to many roads that cause a homogeneity slightly above 0%, class Developed is an interesting example with only 21.3% for H = 0% but 61.5% for 0% < H < 10%.

Sample Allocation of Training Data
This section demonstrates the training sample allocation schemes by example. Each of the following tables shows the expected sample frequency, which is calculated from the number of samples that fulfill the specific allocation criteria, the corresponding expected number of samples, in many cases considering a minimum of 50 samples per class or sample bin, followed by the actual sample allocation. All numbers are specific for this study and are meant to demonstrate the sample allocation process in practice.
Table 4 presents the random sample allocation for homogeneous pixels. The expected frequency and thus the expected number of samples is relative to the class proportion of H = 100% in Table 3. Actual sampling starts at H = 100% and decreases until the expected number of samples per class is reached. Sufficient samples of fully homogeneous pixels (H = 100%) were available for classes Deciduous forest, Grassland, and Cultivated crops. To reach the expected number of 358 samples for class Shrubland, homogeneity had to be decreased to 96%.
Table 5 shows the allocation proportional to the expected area from NLCD2006 (Table 1). Heterogeneous pixels were allocated uniformly across six bins of H ≥ 50%. Sampling should start at the bin with the highest homogeneity (H = 100%) because in some cases the expected sample size may not be available and will be allocated from the next bin. For instance, for class Evergreen forest with an expected total of 715 samples each bin should contain 119.17 samples (rounded to 119 or 120 samples), but only 29 samples could be selected for H = 100% and the remaining 90 samples were allocated from bin 90% ≤ H < 100%.
Table 6 presents equal allocation between classes of heterogeneous pixels with random allocation across sample bins of argmax(H). Although the lowest potential level of dominance could be as low as 11.1% (1/9 classes), in reality the lowest homogeneity was above 20%. For most classes pixels are highly heterogeneous with 40% ≤ H < 70%, i.e., the area of the dominating class makes up approximately half of the pixel. Only classes Grassland and Cropland indicated more homogeneous pixels with 70% ≤ H < 100%.
Table 7 shows sample allocation schemes for regression trees for class Evergreen forest. Heterogeneous pixels were allocated randomly or uniformly across all bins of homogeneity (see also Table 3). In case of insufficient available samples for a bin starting at H = 100%, the remaining samples were added to the next bin.
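The rule of passing a shortfall on to the next bin can be sketched as follows. Only the first value (29 available samples at H = 100%) and the target of 119 samples per bin are taken from the Evergreen forest example; the remaining availability counts are hypothetical, as is the function name.

```python
def allocate_across_bins(available, per_bin):
    """Take `per_bin` samples from each bin, ordered from most to least homogeneous.

    A shortfall in one bin is added to the demand of the next bin.
    Returns the samples taken per bin and any shortfall left at the end.
    """
    taken, shortfall = [], 0
    for count in available:
        wanted = per_bin + shortfall
        got = min(count, wanted)
        taken.append(got)
        shortfall = wanted - got
    return taken, shortfall

# Bins ordered H = 100%, 90-100%, 80-90%, 70-80%, 60-70%, 50-60%.
available = [29, 400, 350, 300, 250, 200]   # only the first value is from the text
taken, shortfall = allocate_across_bins(available, per_bin=119)
print(taken, shortfall)
```

Here the second bin supplies its own 119 samples plus the 90 that were missing at H = 100%.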

Reference Data for Discrete Map Assessment
Table 8 provides details on the reference sample allocation process and reference label assignment. For each class, the number of potential samples (each sample corresponds to one 900 m MODIS pixel) meets the following conditions: (1) its homogeneity in NLCD2006 is higher than 50% and (2) it is located within the extent of Landsat images (Figure 1). The average of the homogeneity shows that, although all samples belong in majority to one class (H > 50%), the level of dominance is moderate and most samples are not pure. For each stratum in NLCD2006, 150 samples were extracted. Out of 1350 samples, four were excluded from analysis because response data were obscured by clouds or class assignment was too uncertain. The columns for primary and alternative label indicate for each class the number of assigned reference samples. For instance, there are 120 samples with primary label of class Water and another 84 samples labeled as Water by the alternative call. For 73 samples, the primary and alternative calls agree, i.e., class assignment is quite certain. On the other hand, there are 47 samples for which the alternative class was not Water and 11 samples for which the primary call was not Water. As the assignment of Water in image interpretation is quite simple, these samples were likely located along the edge of a water body and contain a mixture of land cover types. There are extreme cases of ambiguity such as Grassland and Pasture, both indicating a specific land use difficult to classify only using satellite imagery, or frequently mixed pixels, e.g., Wetland. Less than half of the samples (48.8%) had corresponding class labels in the primary and alternative call.

Table 8. Sample allocation from NLCD2006 (H > 50%) and location in Landsat path-rows (Figure 1) and primary and alternative reference label assignment from Landsat and Google Earth image interpretation. Agreement shows the number of samples with equal primary and alternative label. Homogeneity (H) in percent.

The overall accuracies (OA) of all classifications for discrete maps are shown in Table 9 for the primary reference label (P) or the primary and alternative label (P + A) as correctly classified. Confidence intervals (p < 5%, two-tailed z-test) range between 2.5% and 2.7% and are therefore not presented.

Overall accuracies of RF-C are, on average, 1% higher than from C5.0, and Cubist yields about 0.5% higher accuracies than RF-R. Assessing discrete maps from classification trees (C5.0, RF-C), heterogeneous training pixels show, on average, 6% better accuracy than homogeneous training data. There are no notable differences between uniform or random allocation of heterogeneous training samples. Area-proportional between-class sample allocations show 1% higher overall accuracies than equalized sampling, and accuracy for random training sampling decreases another 0.5%. Discrete maps from regression trees show a consistent pattern of 2%-3% higher accuracies for uniform allocation. Random allocation with no minimum sample size per bin resulted in 1%-2% lower accuracies than when allocating at least 50 samples for each bin. Note that normalization has no effect on discrete maps obtained with the majority rule. Best results for classification trees were obtained with RF-C, uniform allocation, and area-balanced between-class sample allocation, and for regression trees with uniform sampling but negligible differences between Cubist and RF-R (highlighted cells in Table 9).
Assessments using the primary and alternative reference label as correctly classified result in, on average, 14% higher accuracies, which indicates ambiguity in reference label assignment of some classes. In terms of class accuracies (see supplemental material), Water and Developed are well classified (on average 75% or better in user's and producer's accuracy). Shrubland and Cropland form a second group with above 50% in both class accuracies. There is confusion between Evergreen and Deciduous forest, and between both classes and Wetland, as many forests in the southeastern US are interconnected with wetlands either as riparian vegetation or along estuaries at the coast. It should be considered that Wetland was the class with lowest accuracies in NLCD [40]. Other classes with below 50% class accuracy are Pasture and Grassland as both indicate land use forms of herbaceous areas.
The differences in classification accuracies were statistically tested using the McNemar test, and Figure 4A depicts the statistically significant differences. In contrast to using only the primary reference label (lower-left triangle), there are fewer significant differences for assessments with the primary and alternative reference label (upper-right triangle). Most obvious is that sampling homogeneous training data for classification trees almost always performs significantly worse (for actual accuracies see Table 9 and supplemental material). There are statistically significant differences between classification trees using heterogeneous training samples and regression trees, even though the differences in overall accuracies are low. This is due to the nature of the test, which aims at the difference in the number of correctly and incorrectly classified reference samples between two classifications. The statistically significant differences in overall accuracies are shown in Figure 4B. Again, most notable is that classifications with homogeneous training samples perform significantly worse than all others. The main difference to the McNemar test is that there are more statistically significant differences for the reference set using primary + alternative calls as correctly classified due to the higher range in overall accuracies (see also Table 9).

Class-Membership Assessment
The continuous reference derived from NLCD2006 is used for assessing class memberships across all classes using four statistics (Table 9): correlation coefficient (r), mean absolute difference (MAD), slope, and intercept (Int). Memberships from C5.0 show in general inferior results with lowest r and highest MAD compared to other tested algorithms. Homogeneous training data for classification trees are clearly inferior compared to heterogeneous training pixels (∆r = 0.11 and ∆MAD = 1.94%). Equal allocation between classes results in slightly lower correlation and higher MAD than random or area-proportional allocation. For regression trees, random allocation shows 0.07 higher correlations and a notably (4.9%) lower MAD than uniform sampling. Normalization only marginally improves correlation coefficients, but the MAD decreases by 1.5%. Best results of classification trees were obtained for RF-C with area-proportional between-class sample allocation and randomly allocated heterogeneous samples (r = 0.83, MAD = 6.67%), which is almost as good as the best results from regression trees with Cubist, random allocation of heterogeneous pixels with no minimum set, and normalization (r = 0.86 and MAD = 5.93%).
Figure 5 shows the spatial distribution of the MAD, which was computed for each pixel individually. The figure only displays results for RF-C and Cubist; the spatial distribution of the error was similar for C5.0 and RF-R, respectively. For classification trees there are no spatial differences between among-class allocations (area-proportional allocation is shown), and there are no differences between allocations of heterogeneous pixels (uniform is displayed), which corresponds to the spatial patterns shown in Figure 3C. Allocating only homogeneous pixels for training shows notably higher errors in general and in particular for transitional zones from Deciduous forest to Evergreen forest in Mississippi, Alabama, and Georgia as well as transitions from Shrubland to Cultivated crops to Grassland in Texas and Oklahoma. Regression tree results with random allocation of heterogeneous pixels show no differences among each other (Random-0 is depicted), and normalization has no impact on the spatial distribution of errors. There are isolated areas with high errors, e.g., the Okefenokee Swamp in southeastern Georgia, for which the membership values for class Wetland were underestimated. Uniform allocation depicts high MAD throughout the entire image, which decreases when normalization is applied. Parameters of the regression line between reference and predicted values show generally better results for classification trees with higher slopes and lower intercepts compared to regression trees. RF-C with uniform allocation of heterogeneous training pixels shows the highest slope (0.92) and lowest intercept (0.82%). For regression trees, uniform allocation of heterogeneous pixels and no normalization indicates the highest slope (0.86) for Cubist at the expense of a very high intercept with 11.71%; the lowest intercept of 2.56% was found for random sampling and no normalization.

Area Analysis
A second criterion for classifier performance and analysis of different sampling schemes is the similarity of area estimates. Table 9 depicts the total absolute difference between area proportions of the NLCD2006 map as reference and class membership or discrete maps, expressed in million hectares and in percent of the total study area. For instance, class memberships of the C5.0 classification tree with homogeneous training pixels and random sample allocation between classes (first line in Table 9) show a difference of 85.47 Mio ha or 52.76% to the NLCD2006 as reference.
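The area measure can be sketched under the assumption that the total absolute difference is the sum of per-class absolute differences between estimated and reference area, and that the area of a class in a membership map is the sum of its membership fractions times the pixel area (81 ha for a 900 m cell). The function names are hypothetical.

```python
def class_areas_from_memberships(pixels, pixel_area_ha=81.0):
    """Sum class areas over pixels; each pixel is a dict that maps a
    class name to its membership in percent."""
    areas = {}
    for memberships in pixels:
        for cls, percent in memberships.items():
            areas[cls] = areas.get(cls, 0.0) + percent / 100.0 * pixel_area_ha
    return areas

def total_absolute_area_difference(reference, estimate, study_area):
    """Return the summed absolute per-class area difference, once in the
    input area unit and once in percent of the study area."""
    classes = set(reference) | set(estimate)
    diff = sum(abs(estimate.get(c, 0.0) - reference.get(c, 0.0))
               for c in classes)
    return diff, 100.0 * diff / study_area
```

Under this reading, the reported pair of 85.47 Mio ha and 52.76% implies a total study area of roughly 162 Mio ha.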
Area differences from discrete maps for classification trees show no notable differences among algorithms (average of 24.30% for C5.0, 24.79% for RF-C) and a clearly better performance of heterogeneous pixel allocation (16.02%) compared to 41.59% for homogeneous pixels. For heterogeneous training pixels, equal allocation of samples between classes shows up to 2% lower differences than area-proportional allocation and 5%-8% lower than random allocation. For this sample allocation the C5.0 algorithm shows a slightly better result than RF-C. For regression trees, Cubist shows slightly lower differences (average of 16.42%) than RF-R (17.78%). Uniform allocation using Cubist shows the lowest difference (9.46%), which is in line with better overall accuracies when measured with homogeneous test data (H = 100).
For membership estimates from classification trees, on average, there is no notable difference between C5.0 (20.83%) and RF-C (20.24%). Homogeneous training data show clearly inferior results with on average 39.90% difference compared to heterogeneous training pixels (10.85%). Area-proportional sample allocation of randomly sampled heterogeneous pixels yields the best results with 3.93% total difference for RF-C. Memberships from regression trees show lower differences for Cubist (average 23.37%) than RF-R (29.45%). Random allocation (6.21%) clearly outperforms uniform sampling (66.82%). The table also indicates the importance of normalization (13.27%), because non-normalized results on average do not yield correct total area estimates (39.56%). For Cubist (RF-R), sampling with Random-50 estimated 110.0% (118.4%), Random-0 99.1% (105.6%), and Uniform 191.9% (209.8%) of the total area as compared to NLCD2006. Total areas of non-normalized results for random sampling are relatively close to the true total area, which is also the best result using Cubist with an absolute difference of 1.55%.
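The effect of normalization on the total area can be illustrated with a minimal sketch. It assumes that linear scaling means rescaling the class memberships of every pixel so that they sum to 100%; clipping negative predictions to zero beforehand is an additional assumption, and the function name is hypothetical.

```python
def normalize_memberships(memberships, total=100.0):
    """Linearly rescale the per-class memberships of one pixel so that
    they sum to `total` percent (assumed form of the normalization)."""
    # Assumption: negative regression outputs are set to zero first.
    clipped = {cls: max(value, 0.0) for cls, value in memberships.items()}
    current = sum(clipped.values())
    if current == 0.0:
        return clipped  # nothing to scale
    factor = total / current
    return {cls: value * factor for cls, value in clipped.items()}
```

A pixel with independent per-class predictions of 60%, 45%, and 15% (sum 120%) is rescaled to 50%, 37.5%, and 12.5%, so that summed class areas again match the pixel area.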

Reference Data
The NLCD data set provides a unique opportunity for this study because it maps common land cover classes at 30 m spatial resolution in a consistent manner over a large region. The area chosen in this study includes various semi-natural and human-controlled landscapes with large and small patches as well as transitional environmental zones from dry to moist and temperate to sub-tropical climate, which is useful to test the effectiveness of different sampling schemes with discrete and continuous classifications.
This study retained the projection and spatial registration of NLCD2006 [37,38,40] and re-projected the MODIS data instead, because NLCD2006 was considered the reference for this study and should not be altered. During re-projection, MODIS cells were resampled from 926 m to 900 m and aligned to the cell locations of NLCD so that each MODIS cell contains exactly 30 × 30 NLCD pixels. The quantitative analysis of spatial co-registration for ten selected Landsat images, used for NLCD mapping in 2006, showed near-perfect spatial correspondence with MODIS image composites, which allows direct comparison between the 30 m cells in NLCD2006 and their corresponding 900 m MODIS pixel.
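The nesting of 30 × 30 fine cells in one coarse cell can be sketched as a block aggregation that returns the majority class, argmax(H), and its area proportion, max(H). This is a plain-Python illustration with hypothetical names; ties between classes are resolved arbitrarily here, which the text does not specify.

```python
from collections import Counter

def coarsen(fine, block=30):
    """Aggregate a 2-D grid of class labels (list of rows) into blocks of
    block x block cells. Returns two grids: the majority label per block
    and the homogeneity H of that label in percent."""
    rows, cols = len(fine), len(fine[0])
    if rows % block or cols % block:
        raise ValueError("grid size must be a multiple of the block size")
    labels, homogeneity = [], []
    for r0 in range(0, rows, block):
        label_row, h_row = [], []
        for c0 in range(0, cols, block):
            counts = Counter(
                fine[r][c]
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
            )
            cls, n = counts.most_common(1)[0]
            label_row.append(cls)
            h_row.append(100.0 * n / (block * block))
        labels.append(label_row)
        homogeneity.append(h_row)
    return labels, homogeneity
```

With the default block size of 30, a coarse cell with H = 100 corresponds to a homogeneous pixel in the sense used for training sample allocation.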
The classification accuracy of NLCD2006 with 16 original classes is 78%, and class aggregation to level I with eight classes yields 84% overall accuracy [40]. In comparison, assessment of NLCD2006 for the southeastern United States with nine classes (Table 1) and coarsened to 900 m (Figure 3B) using the primary or the primary and alternative reference samples resulted in 59.3% and 72.9% overall accuracy, respectively. Major sources of error are Wetland, which was frequently confused with forests and also has the lowest accuracies in NLCD2006 [40], and Pasture versus Grassland, as two land-use forms of herbaceous cover. Despite the coarse resolution, developed areas were classified well.
The potentially smallest unit to be mapped is the pixel area [58]. Applying a minimum mapping unit (MMU), that is, the smallest area of contiguous pixels in the map, will remove isolated pixels. A "smart-eliminate algorithm" with eight-neighbor rule is applied to publicly released 30 m NLCD2006 maps with MMUs of 5 pixels (0.45 ha) for developed classes, 32 pixels (2.88 ha) for classes pasture/hay and cultivated crops, and 12 pixels (1.08 ha) for all other classes [37]. The corresponding potential errors for each MODIS pixel (MMU (ha) × 100/81 ha) introduced by these minimum object sizes are 0.55%, 3.55%, and 1.33%, respectively. The error is directly proportional to the ratio between MMUs of NLCD2006 and a MODIS cell with 81 ha (900 m spatial resolution), which was the main reason for choosing the MCD43B4 data instead of MCD43A4 data with 463 m (resampled to 450 m, 20.25 ha pixel area). It should be noted that NLCD is the best regional fine resolution data source available for the analysis performed in this study. Therefore, NLCD2006 was used as a high spatial resolution source to allocate training samples for MODIS image classification, for assessment of class memberships, and as reference of areas for each land cover class. As decision tree classifiers can deal with some level of error in the training data, the impact of error on the classification is considered to be low, but the impact on error statistics for continuous map assessment cannot be estimated.
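The ratio MMU (ha) × 100 / pixel area (ha) can be reproduced with a few lines. The function name and default arguments are hypothetical; the computed values agree with the reported figures to within 0.01 percentage points.

```python
def mmu_error_percent(mmu_pixels, fine_res_m=30.0, coarse_res_m=900.0):
    """Potential error in percent that a minimum mapping unit of
    `mmu_pixels` fine-resolution cells introduces in one coarse cell."""
    mmu_ha = mmu_pixels * fine_res_m ** 2 / 10_000.0   # m^2 to ha
    coarse_ha = coarse_res_m ** 2 / 10_000.0           # 81 ha at 900 m
    return 100.0 * mmu_ha / coarse_ha

for name, pixels in [("developed", 5), ("pasture/crops", 32), ("other", 12)]:
    at_900 = mmu_error_percent(pixels)
    at_450 = mmu_error_percent(pixels, coarse_res_m=450.0)
    print(f"{name}: {at_900:.2f}% at 900 m, {at_450:.2f}% at 450 m")
```

At 450 m the coarse cell covers a quarter of the area, so the same MMUs produce a fourfold potential error, which is consistent with the stated preference for the 900 m product.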

Sample Allocation for Training Data
This study assessed different allocation schemes for training decision trees from a high spatial resolution map (NLCD2006) employing various accuracy measures for discrete and continuous maps and difference in area. Other issues such as sample size or feature set dimensionality were not addressed because there is ample literature on this subject [28,30,59]. The total size of 5400 samples for classification and regression was deemed sufficient; a training sample size of approximately 0.25% of the study area is realistic for many applied remote sensing studies. In addition, various sample allocations ensured a minimum size of 50 samples per class or sample bin, which corresponds to approximately 1% of all samples. The actual number of 5400 training samples should not be generalized to other studies, but was deemed useful here because it is divisible without remainder by the number of classes (9) and by the number of bins (6 and 12), which eased sample allocation and data processing in this study.
For classification trees, heterogeneous training pixels are recommended and, if possible, uniform allocation should be preferred because of slopes closer to 1 and intercepts closer to 0. Even though previous studies for very small areas suggested that heterogeneous training data could improve discrete classifications [24,35,37], results of this study demonstrate for the first time that they have a better performance for membership estimates over a large and diverse area.
Sample size balance between classes is a controversial topic in image classification [7,26,27,30,32]. In particular, classification trees may suffer from unbalanced sample sizes [33], because in their standard form the class with the highest number of samples determines the class label. On the other hand, it could be argued that classes with multimodal frequency distributions, e.g., cropland with different crop types and growing cycles, should have more samples to be accurately represented in the classifier than a spectrally and temporally well-defined class such as water. This study tested random, area-proportional, and equal sample allocation between all classes. Area-proportional allocation is recommended because of best area estimations and similarly high accuracy measures as random allocation. This result corresponds with the hypothesis that classes with a large area proportion and thus a higher probability of multiple modes in the feature space require more samples than classes with a small area proportion [7,27].
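Equal and area-proportional allocation between classes can be sketched as follows. How the minimum of 50 samples per class was enforced is not described in this section; the sketch assumes that every class first receives the minimum and that the remaining samples are distributed in proportion to class area with largest-remainder rounding. All names are hypothetical.

```python
def equal_allocation(classes, total=5400):
    """Split `total` samples as evenly as possible among the classes."""
    base, remainder = divmod(total, len(classes))
    return {c: base + (1 if i < remainder else 0)
            for i, c in enumerate(classes)}

def area_proportional_allocation(area, total=5400, minimum=50):
    """Allocate `total` samples in proportion to class area while
    guaranteeing `minimum` samples per class (assumed procedure)."""
    if total < minimum * len(area):
        raise ValueError("total too small for the requested minimum")
    remaining = total - minimum * len(area)
    area_sum = sum(area.values())
    quotas = {c: remaining * a / area_sum for c, a in area.items()}
    alloc = {c: minimum + int(q) for c, q in quotas.items()}
    # Hand out samples lost to truncation by largest fractional remainder.
    leftover = total - sum(alloc.values())
    order = sorted(quotas, key=lambda c: quotas[c] - int(quotas[c]),
                   reverse=True)
    for c in order[:leftover]:
        alloc[c] += 1
    return alloc
```

With nine classes and 5400 samples, equal allocation yields 600 samples per class without remainder, which matches the divisibility argument made for the total sample size.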
At first sight, regression methods seem more suitable for estimating memberships because their predictions intrinsically derive fractional estimates [24,34,53,60], but the additional step of normalization is necessary to obtain correct area totals. In general, random allocation of heterogeneous samples is recommended, for which normalization may not even be necessary; it is still recommended for all regression results as differing area totals may complicate further data analysis. Uniform allocation may be useful for discrete maps with better area estimates and accuracies similar to random allocation. The reason for testing uniform allocation was the hypothesis that a higher number of training pixels with high homogeneity may improve the prediction of high membership values, which are commonly underrepresented in random allocation (e.g., Table 7).
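Uniform allocation across homogeneity bins can be sketched as drawing the same number of training pixels from every bin. The 12-bin layout follows the description of Table 3 (one bin each for 0% and 100% plus ten 10-percent bins); the exact bin boundaries, the handling of sparsely populated bins, and all names are assumptions.

```python
import random

def homogeneity_bin(h):
    """Map homogeneity H (percent) to one of 12 bins: bin 0 for H = 0,
    bins 1 to 10 for the 10-percent steps below 100, bin 11 for H = 100."""
    if not 0 <= h <= 100:
        raise ValueError("H must lie between 0 and 100 percent")
    if h == 0:
        return 0
    if h == 100:
        return 11
    return 1 + int(h // 10)

def uniform_allocation(pixels, per_bin, seed=0):
    """Draw up to `per_bin` pixel ids from every occupied bin.
    pixels: iterable of (pixel_id, H) pairs for one class."""
    rng = random.Random(seed)
    by_bin = {}
    for pixel_id, h in pixels:
        by_bin.setdefault(homogeneity_bin(h), []).append(pixel_id)
    # Bins with fewer candidates than requested contribute all of them.
    return {b: rng.sample(ids, min(per_bin, len(ids)))
            for b, ids in sorted(by_bin.items())}
```

Random allocation, in contrast, would draw from all pixels at once, so that bins with few pixels, such as those with high membership values, receive correspondingly few samples.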

Classification Methods
Membership estimation from classification trees requires multiple iterations. For random forest classification this was realized with 1000 iterations and randomized selection of features and samples [48]. For C5.0, a process described in Colditz et al. [43] computes the class proportion from samples of each leaf and averages the memberships of all boosted trees. Alternative possibilities to estimate class memberships are suggested in McIver and Friedl [61].
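Both ways of deriving memberships from an ensemble can be illustrated with a small sketch. It assumes that random forest memberships are the share of trees voting for each class, and that the leaf-based memberships are averaged without boosting weights; neither detail is confirmed in this section, and the function names are hypothetical.

```python
from collections import Counter

def memberships_from_votes(tree_votes, classes):
    """Percent of trees that vote for each class for one pixel."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {c: 100.0 * counts.get(c, 0) / n_trees for c in classes}

def memberships_from_leaves(leaf_counts_per_tree, classes):
    """Average the class proportions of the training samples in the leaf
    that one pixel reaches in each tree (unweighted mean, in percent).
    leaf_counts_per_tree: list of dicts mapping class to sample count."""
    n_trees = len(leaf_counts_per_tree)
    result = {c: 0.0 for c in classes}
    for leaf in leaf_counts_per_tree:
        leaf_total = sum(leaf.values())
        for c in classes:
            result[c] += 100.0 * leaf.get(c, 0) / leaf_total / n_trees
    return result
```

In both variants the memberships of a pixel sum to 100% by construction, which is why classification trees need no separate normalization step.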
There are notable differences between results from classification trees (C5.0, RF-C) versus regression trees (Cubist, RF-R), and the performance of each algorithm depends on appropriate training data allocation. Regression trees show higher accuracies and lower area differences than classification tree results. This, however, comes at the expense of lower slopes and higher intercepts, which affects the dynamic range of predicted membership values. The selection of a decision tree type should also consider the computational costs such as time and storage. Classification trees only require one sampling process for all classes with one tree model, which may be iterated to derive class memberships. The complexity for regression trees increases with the number of classes, because each class requires a separate sampling process and tree model.
With respect to the actual algorithm the differences are small, but from the tested algorithms RF-C should be preferred for classification trees and Cubist for regression. The main reason for the better performance of RF-C is likely related to the higher number of iterations (1000) as compared to 10-fold boosting with C5.0. It should also be noted that RF-C was executed in the most basic way and there are multiple options to improve results, e.g., by outlier removal and a priori sample stratification [50]. The better performance of Cubist as compared to RF-R likely relates to the generation of a rule set and formulation of linear equations with specific weights for each input variable. In addition, Cubist results can be improved, e.g., by extrapolation beyond the training data range, which could increase the slopes of regression models and thus better estimate the full dynamic range.
Many studies compare classification results among different conceptual approaches to classify remote sensing data such as Maximum Likelihood classification (MLC), Artificial Neural Networks, Decision trees, and Support Vectors [10,16,17]. This study only focused on decision trees, because each conceptual approach has certain needs with respect to feature sets and training data and thus results are likely biased towards one algorithm. For instance, multivariate statistics for MLC should include statistical tests for Gaussian frequency distributions and, in case of multimodal frequency distributions, training samples should be separated into different groups. Even in this study different sample allocation strategies had to be used to train classification trees, e.g., with respect to between-class sample balance, as compared to regression trees, for which this is not an issue. Therefore, results of this study can only be generalized for decision-tree models.

Conclusions
This study tested several sampling methods for discrete classification and class membership estimation (i.e., continuous land cover) using decision-tree methods. It employed an annual time series of spectral bands of MODIS data at 900 m spatial resolution and a subset of the 2006 National Land Cover Database as wall-to-wall finer resolution reference map from which training samples were allocated. Spatial co-registration was ensured with baseline Landsat data that also served as response data for discrete map assessment. There are three main conclusions: (1) Regression trees show higher accuracies and lower differences in expected area, but classification trees better predict the full dynamic range of values. For tested regression tree methods, results of Cubist are better than random forest regression. Random forest classification performs better than C5.0 with boosted trees. (2) For classification trees, heterogeneous training data perform clearly better than homogeneous pixels for both discrete and continuous land cover mapping. Uniform allocation of heterogeneous pixels is slightly better than random allocation. For between-class sample allocation, area-proportional training data allocation is recommended. (3) For regression trees, normalization is imperative to correctly estimate the total area of class memberships. Random allocation is very important for estimating class memberships. A uniform sampling structure can be recommended for deriving discrete maps.
This study only focused on one study area, the southeastern United States. Further tests in other regions of the world and with different data sets and scales, e.g., 30 m image classification trained with 1 m reference data, will be needed to confirm and generalize its results.

Figure 1. Study area in the southeastern United States, showing MODIS tiles and Landsat path-rows.

Figure 3. (A) Reference map at 30 m spatial resolution; (B) coarsened map at 900 m using majority rule, argmax(H); and (C) area proportion of that class, max(H).

Figure 4. Statistical significance of difference in accuracies between (A) image classifications using McNemar test and (B) overall accuracies. Lower-left triangle shows results for the primary reference label, upper-right triangle for the primary and alternative reference label as correctly classified.

Figure 5. Spatial mean absolute difference (MAD) of selected image sets of class memberships. Random forest classification (RF-C) with area-proportional sample allocation between classes and homogeneous and heterogeneous, uniformly allocated training pixels. Cubist with uniform and random allocation of heterogeneous training pixels and with and without normalization.

Table 3. Homogeneity (H) in 10-percent bins and bins for 0 and 100 percent derived from NLCD2006. For abbreviations of class names see Table 1.

Table 4. Random allocation with a minimum of 50 samples per class using homogeneous pixels H = 100%. Homogeneity (H) in percent. See Table 1 for abbreviations of class names.

Table 5. Allocation proportional to expected area with a minimum of 50 samples per class using heterogeneous pixels with uniform allocation across six bins of H ≥ 50%. Homogeneity (H) in percent. For abbreviations of class names see Table 1.

Table 6. Equal class allocation of heterogeneous pixels with random allocation across bins of argmax(H). Homogeneity (H) in percent. See Table 1 for abbreviations of class names.

Table 7. Random and uniform allocation of heterogeneous pixels for regression trees for class Evergreen forest. Homogeneity (H) in percent.

Table 9. Accuracy measures and absolute difference in area for discrete and continuous (class memberships) classifications. OA: overall accuracy using primary (P) or primary and alternative (P + A) label of reference data as correctly classified. r: correlation coefficient. MAD: mean absolute difference. Int: Intercept. AD: absolute difference in million hectares and percent. Classification trees C5.0 and Random forest classification (RF-C) with homogeneous samples (H = 100) or heterogeneous samples allocated uniformly for H ≥ 50% or randomly (argmax(H)). Sample allocation between classes with random, area-proportional, equal allocation. Regression trees Cubist and Random Forest Regression (RF-R) with uniform and random allocation with no minimum or at least 50 samples per bin. NN: no normalization. Norm: normalization. Highlighted cells indicate best results for classification and regression trees.