Article Supervised Classification of Agricultural Land Cover Using a Modified k-NN Technique (MNN) and Landsat Remote Sensing Imagery

Nearest neighbor techniques are commonly used in remote sensing, pattern recognition and statistics to classify objects into a predefined number of categories based on a given set of predictors. These techniques are especially useful when the relationships between the variables are highly nonlinear. In most studies the distance measure is adopted a priori. In contrast, we propose a general procedure to find an adaptive metric that combines a local variance-reducing technique and a linear embedding of the observation space into an appropriate Euclidean space. To illustrate the application of this technique, two agricultural land cover classifications using mono-temporal and multi-temporal Landsat scenes are presented. The results of the study, compared with standard approaches used in remote sensing such as maximum likelihood (ML) or k-Nearest Neighbor (k-NN), indicate substantial improvement with regard to the overall accuracy and the cardinality of the calibration data set. Moreover, using MNN in a soft/fuzzy classification framework proved to be a very useful tool for identifying critical areas that require further attention and investment in additional calibration data.


Introduction
Remote sensing has become an important tool in wide areas of environmental research and planning. In particular, the classification of spectral images has been a successful application used for deriving land cover maps [1,2], assessing deforestation and burned forest areas [3], real-time fire detection [4], estimating crop acreage and production [5], and monitoring environmental pollution [6]. Image classification is also commonly applied in other contexts, such as optical pattern and object recognition in medical or industrial processes [7]. As a result, a number of algorithms for supervised classification have been developed over the past decades to cope with both the increasing demand for these products and the specific characteristics of a variety of scientific problems. Examples include the maximum likelihood method [8], fuzzy-rule based techniques [9,10], Bayesian and artificial neural networks [11,12], support vector machines [13,14] and the k-nearest neighbor (k-NN) technique [15,16].
In general, a supervised classification algorithm can be subdivided into two phases: (i) the learning or "calibration" phase, in which the algorithm identifies a classification scheme based on spectral signatures of different bands obtained from "training" sites having known class labels (e.g., land cover or crop types), and (ii) the prediction phase, in which the classification scheme is applied to other locations with unknown class membership. The main distinction between the algorithms mentioned above is the procedure used to find relationships, e.g., "rules", "networks", or "likelihood" measures, between the input (here, spectral reflectance at different bands, also called the "predictor space") and the output (here, the land use class label), so that either an appropriate discriminant function is maximized or a cost function accounting for misclassified observations is minimized [17-21]. In other words, they follow the traditional modeling paradigm that attempts to find an "optimal" parameter set which closes the distance between the observed attributes and the classification response [9,22].
A somewhat different approach has recently been proposed by [23]: the modified nearest-neighbor (MNN) technique, introduced within the context of land cover classification in a meso-scale hydrological modeling project. Their MNN algorithm is a hybrid that combines features of a dimensionality reduction algorithm with the advantages of the standard k-NN, namely: (i) it is an extremely flexible and parsimonious (perhaps the simplest [24]) supervised classification technique, requiring only one parameter (k, the number of nearest neighbors); (ii) it is very attractive for operational purposes because it requires neither preprocessing of the data nor assumptions about the distribution of the training data [25]; and (iii) it is quite robust, because the single 1-NN rule guarantees an asymptotic error rate of at most twice the Bayes probability of error [24].
However, k-NN also has several shortcomings that have already been addressed in the literature. For instance: (i) its performance highly depends on the selection of k; (ii) pooling nearest neighbors from training data that contain overlapping classes is considered unsuitable; (iii) the so-called curse of dimensionality can severely hurt its performance in finite samples [25,26]; and finally (iv) the selection of the distance metric is crucial in determining the outcome of the nearest neighbor classification [26].
To overcome at least some of these shortcomings, the MNN technique abandons the traditional modeling approach of finding an "optimal" parameter set that minimizes the distance between the observed attributes and the classification response. The basic criterion of MNN is to find an embedding space, via a (non-)linear transformation of the original feature space, such that the cumulative variance of a given class label over a given portion of the closest pairs of observations is minimized. In simple words, a feature space transformation is chosen so that observations sharing a given class label (crop type) are very likely to be grouped together, while the rest tend to be farther away.
In [23], the MNN method was introduced within a meso-scale, hydrologically motivated application in which a Landsat scene was used to derive a 3-class (forest, pervious, impervious) land cover map of the Upper Neckar catchment, Germany. MNN was also extensively compared with other classification methods and showed superior performance. In the following sections we extend the analysis of the MNN method and demonstrate its performance on a smaller-scale agricultural land use classification problem at a German test site, using mono-temporal and multi-temporal Landsat scenes. Section 2 briefly reviews the main features of the MNN method. In Section 3, the application, including the test site and data acquisition, is described, followed by an extensive illustration of the MNN performance in Section 4. Results and future research directions are discussed in Section 5.

Motivation
A very important characteristic of the proposed MNN technique, compared to conventional NN approaches, is the way distances between observations are calculated. NN methods require the a priori definition of a distance in the feature space (or x-space). Because the selection of the distance affects the performance of the classification algorithm [25,27], a variety of reasonable distances is usually tested, for instance: the L1 norm (or Manhattan distance), the L2 norm (or Euclidean distance), the L∞ norm (or Chebyshev distance), the Mahalanobis distance [28], and similarity measures such as the Pearson and Spearman rank correlations.
In MNN, by contrast, we search for a metric in a lower-dimensional space, or embedding space, which is suitable for separating a given class. In essence, MNN finds a nonlinear transformation (or mapping) that minimizes the variability of a given class membership over all those pairs of points whose cumulative distance is less than a predefined value (D). It is worth noting that this condition does not imply finding clusters in the predictor space that minimize the total intra-cluster variance, which is the usual premise in cluster analysis (e.g., in the k-means algorithm) [8]. A lower-dimensional embedding space is preferred in MNN, if possible, because of the empirical evidence that the intrinsic manifold dimension of multivariate data is often substantially smaller than that of the inputs [25]. A straightforward way to verify this hypothesis is to calculate the covariance matrix of the standardized predictors and count the number of dominant eigenvectors [29].
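As a quick illustration of this check, the following sketch (Python/NumPy) counts the dominant eigenvalues of the covariance matrix of standardized predictors; the toy data, the 5% variance threshold, and all names are illustrative assumptions, not part of the original study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "multispectral" sample: 6 bands driven by only 2 latent factors.
latent = rng.normal(size=(500, 2))
mixing = np.array([[1., 1., 1., 0., 0., 0.],
                   [0., 0., 0., 1., 1., 1.]])
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Standardize the predictors and inspect the eigenvalue spectrum
# of their covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()

# Count "dominant" eigenvectors, here those explaining > 5% of the variance.
intrinsic_dim = int(np.sum(explained > 0.05))
print(intrinsic_dim)  # -> 2: the intrinsic dimension is far below 6
```

Here two eigenvalues account for nearly all the variance, confirming that the six bands effectively live on a two-dimensional manifold.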
An important issue to be considered during the derivation of an appropriate transformation or embedding space is the effect of the intrinsic nonlinearities present in multivariate data sets (e.g., multi-temporal and/or multi-/hyperspectral imagery). It is reported in the literature that various sources of nonlinearity severely affect land cover classification products [30]. This in turn implies that dissimilar class labels might depend differently upon the input vectors. Consequently, it is better to find class-specific embeddings and their respective metrics rather than a global embedding.

Problem Formulation
Supervised classification algorithms aim at predicting the class label h ∈ {1, ..., l}, from among l predetermined classes, that corresponds to a query x_0 composed of m characteristics, i.e., x_0 = {x_1, ..., x_m}_0 ∈ R^m. To perform this task, a classification scheme c "ascertained" from the training set T = {(x_i, e_i) : i = 1, ..., n} is required. Here, to avoid any influence from the ordering, the membership of observation i in class h is denoted by a unit vector e_i ∈ {e_h : h = 1, ..., l} (e.g., class 3 is coded in a three-class case as e_3 = {0, 0, 1}). Moreover, each class should have at least one observation, i.e., n_h > 0, and Σ_{h=1}^{l} n_h = n, where n is the sample size of the training set. In general, the proposed MNN algorithm estimates the classification response y_0 = c(x_0) = {y_1, ..., y_l}_0 as a combination of selected membership vectors e_i = {e_i1, ..., e_il} whose respective "transformed" inputs u_h,i = B_h(x_i) lie close to the transformed query. Since there are l possible embedding spaces in this case, l possible neighborhoods of the query have to be taken into account in the estimation. Here B_h denotes a class-specific transformation function. A stepwise meta-algorithm for the estimation of the classification response can be formulated as follows:
1. An attribute (crop/land use class) h is selected.
2. An optimal distance d_h for the binary classification problem is identified.
3. The binary classification is performed using k-NN with the distance d_h, leading, for each vector x_0, to a partial membership vector {y}_0h, h = 1, ..., l.
4. Steps 1 to 3 are repeated for each attribute h.
5. The partial memberships are aggregated into the final classification response y_0.
6. The class exhibiting the highest class membership value is assigned to x_0.
In total, l optimal metrics have to be identified, subject to certain conditions. A distance measure d_h is considered suitable for the binary classification of a given attribute h if the following discriminating conditions are met: (a) the distances in the transformed space between pairs of points that both belong to the selected attribute are small, and (b) the distances in the transformed space between pairs of points of which only one belongs to the selected attribute are large.
These conditions ensure that the k-NN classifier works well at least for a given attribute in its corresponding embedded space. The distance between two points in MNN is defined as the Euclidean distance between their transformed coordinates,

d_h(x_i, x_j) = ||B_h(x_i) − B_h(x_j)||,

where B_h is a transformation (possibly nonlinear) of the m-dimensional space x into another κ-dimensional space u_h, usually with κ ≤ m. As a result, MNN requires finding a set of transformations B_h such that the corresponding distance d_h performs well according to conditions (a) and (b) above. In the present study, this was attained by a local variance function G_{B_h}(p) [31] that accounts, in a nonparametric form, for the increase in the variability of the class membership e_h with respect to the increase in the distance of the nearest neighbors.
Formally, G_{B_h}(p) can be defined as

G_{B_h}(p) = (1 / (2 N(p))) Σ_{(i,j) ∈ C_h : d_{B_h}(i,j) ≤ D_{B_h}(p)} (e_{ih} − e_{jh})²,

where p is the proportion defined as the ratio of N(p) to N_h, d_{B_h}(i, j) is the distance between points i and j in the transformed space, D_{B_h}(p) is the p-percentile of the d_{B_h} distribution, and C_h is the set of all possible pairs having at least one element in class h,

C_h = {(i, j) : (e_i = e_h ∨ e_j = e_h) ∧ j > i}.

Here N_h denotes the cardinality of the set C_h, i.e., N_h = |C_h|, and N(p) refers to the number of pairs (i, j) ∈ C_h whose distance is smaller than the limiting distance D_{B_h}(p),

N(p) = |{(i, j) ∈ C_h : d_{B_h}(i, j) ≤ D_{B_h}(p)}|,

where |{.}| denotes the cardinality of a given set. The transformation B_h is identified as the one that minimizes the integral of the curve G_{B_h} vs. p up to a given threshold proportion p*, subject to the condition that each G_{B_h} is a monotonically increasing curve (see Figure 4 in Section 4 for an illustration of this minimization problem). In simple words, this procedure aims at close neighbors having the same class membership, whereas those farther away should belong to different classes.
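To make this definition concrete, the following sketch (Python/NumPy) estimates G(p) for one class, assuming a variogram-style form for G (half the mean squared membership difference over the closest pairs in C_h); the toy data, the evaluation proportions, and all names are illustrative assumptions.

```python
import numpy as np

def local_variance_curve(U, y, ps):
    """Estimate G_{B_h}(p) for one class in a transformed space.

    U  : (n, kappa) coordinates after applying the transformation B_h
    y  : (n,) binary membership in the selected class h (1 or 0)
    ps : increasing proportions p at which G is evaluated
    """
    n = len(y)
    d, g = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == 1 or y[j] == 1:            # pair belongs to C_h
                d.append(np.linalg.norm(U[i] - U[j]))
                g.append(0.5 * (y[i] - y[j]) ** 2)
    d, g = np.array(d), np.array(g)
    out = []
    for p in ps:
        D_p = np.quantile(d, p)                   # limiting distance D(p)
        out.append(g[d <= D_p].mean())            # variance of closest pairs
    return np.array(out)

# Toy check: two well-separated clusters, class h = first cluster.
rng = np.random.default_rng(1)
U = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(3.0, 0.3, size=(30, 2))])
y = np.r_[np.ones(30), np.zeros(30)]
G = local_variance_curve(U, y, [0.2, 0.9])
# In a good embedding G is near zero for close pairs and grows with p.
```

In this toy case the closest pairs all share class h, so G(0.2) is zero, while G(0.9) picks up mixed pairs and is clearly larger, exactly the monotone behavior the optimization rewards.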
Formally, this can be written as follows. Find B_h, h = 1, ..., l, so that

∫_0^{p*} G_{B_h}(p) dp → min,     (7)

subject to

G_{B_h}(p) ≤ G_{B_h}(p + δ),

where δ is a given interval along the axis of p. It is worth noting that the objective function (7) is barely affected by outliers (i.e., points whose closest neighbors are relatively far away), which often appear in unevenly distributed data. For computational reasons, Equation (7) can be further simplified to

Σ_{q=1}^{Q} G_{B_h}(p_q) → min,     (9)

where {p_q : q = 1, ..., Q} is a set of Q proportions such that p_{q−1} < p_q < 1, ∀q, and p_Q = p*. The total number of pairs within class h can be defined as N*_h = |{(i, j) : (e_i = e_h) ∧ (e_j = e_h) ∧ j > i}|, ∀h; thus the threshold for class h can be estimated by p*_h = N*_h / N_h. In summary, this method seeks different metrics, one for each class h, in their respective embedding spaces defined by the transformations B_h, so that the cumulative variance of class attributes over a given portion of the closest pairs of observations is minimized. Consequently these metrics are not global. The approach of finding class-specific metrics in different embedding spaces is what clearly differentiates MNN from other algorithms such as the standard k-NN or AQK. In contrast to MNN, k-NN employs a global metric defined in the input space, whereas AQK uses a kernel function to find ellipsoidal neighborhoods [25]. The kernel, in turn, is applied to implicitly define the distance between points in a feature space whose dimension is larger than that of the input space x [26].

Deriving an Optimal Transformed/Embedding Space
There are infinitely many possibilities for selecting a transformation B_h. Among them, the most parsimonious are the linear transformations [22]. Here, for the sake of simplicity, l matrices of size [κ × m] are used, so that u_h = B_h x. It is worth mentioning that the standard k-NN method is identical to the proposed MNN if B_h = diag(1, ..., 1). Appropriate transformations B_h have to be found using an optimization method. Due to the high dimensionality of the resulting optimization problem (see Equation (9)), the following global optimization algorithm based on simulated annealing [32] is proposed:
1. Select a set of threshold proportions p_1 < p_2 < ... < p_Q < 1. The value of p_Q can be chosen equal to k/n, with k being the number of nearest neighbors used for the estimation. For the other proportions, p_q = (q/Q) p_Q is a reasonable choice.
2. Select a class h.
3. Select the dimension κ of the u_h space.
4. Fix an initial annealing temperature t.
5. Randomly fill the matrix B_h.
6. Check the monotonicity condition G_{B_h}(p_{q−1}) ≤ G_{B_h}(p_q) for all q.
7. Calculate the objective function O.
8. Randomly select a pair of indices (i_1, i_2) with 1 ≤ i_1 ≤ κ and 1 ≤ i_2 ≤ m.
9. Randomly modify the element b_{i_1,i_2} of the matrix B_h and formulate a new matrix B*_h.
10. Calculate the objective function O* corresponding to B*_h.
11. If O* < O, accept B*_h as the new solution.
12. Otherwise, accept B*_h with a probability that decreases with the annealing temperature t (Metropolis criterion).
13. Repeat steps 8 to 12 a given number of times at the current temperature.
14. Reduce the annealing temperature t and repeat steps 8 to 13 until the objective function O reaches a minimum.
It should be noted that if a nonlinear function B_h were required, the proposed algorithm would not, in general, be affected, and the generation of new solutions could still be carried out as in steps 8 and 9 after some modification. Moreover, any other global optimization method, such as Genetic Algorithms (GA) [33] or the Dynamically Dimensioned Search (DDS) algorithm [34], among many others, could in principle have been used here, but good experience with simulated annealing in earlier applications motivated that choice.
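Under the simplified sum-of-G objective, the annealing loop can be sketched as follows (Python/NumPy). The toy data, the perturbation size, the cooling schedule, and the penalty weight for the monotonicity condition are illustrative assumptions, not the settings of the original study.

```python
import numpy as np

def objective(B, X, y, ps):
    """Sum of G_{B_h}(p_q) over the threshold proportions, plus a penalty
    for violating the monotonicity condition (an assumed implementation)."""
    U = X @ B.T
    iu = np.triu_indices(len(y), k=1)
    in_Ch = (y[iu[0]] == 1) | (y[iu[1]] == 1)     # pairs in C_h
    D = np.linalg.norm(U[iu[0]] - U[iu[1]], axis=1)[in_Ch]
    g = 0.5 * (y[iu[0]][in_Ch] - y[iu[1]][in_Ch]) ** 2
    G = [g[D <= np.quantile(D, p)].mean() for p in ps]
    penalty = sum(max(G[q] - G[q + 1], 0.0) for q in range(len(G) - 1))
    return sum(G) + 10.0 * penalty

def anneal_B(X, y, kappa, ps, t=1.0, cooling=0.9, sweeps=30, seed=0):
    """Simulated-annealing search for a [kappa x m] matrix B_h
    (steps 4-14, with the standard Metropolis acceptance rule)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    B = rng.normal(size=(kappa, m))               # step 5: random initial B_h
    O = objective(B, X, y, ps)
    best_B, best_O = B.copy(), O
    for _ in range(sweeps):                       # step 14: cooling loop
        for _ in range(10):                       # steps 8-13 at fixed t
            i1, i2 = rng.integers(kappa), rng.integers(m)
            B_new = B.copy()
            B_new[i1, i2] += rng.normal(0.0, 0.5) # step 9: perturb one entry
            O_new = objective(B_new, X, y, ps)
            # steps 11-12: accept improvements, or worse moves with
            # probability exp(-(O_new - O) / t)
            if O_new < O or rng.random() < np.exp((O - O_new) / t):
                B, O = B_new, O_new
            if O < best_O:
                best_B, best_O = B.copy(), O
        t *= cooling
    return best_B, best_O

# Toy usage: class membership driven by a 1-D projection of a 4-D input.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
B_opt, O_opt = anneal_B(X, y, kappa=1, ps=[0.1, 0.2, 0.3])
```

Tracking the best-so-far solution guards against ending on a worse state accepted late in the cooling schedule, a common practical refinement of plain Metropolis annealing.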

Selection of an Estimator
The prediction of the classification vector y_0 based on a given vector x_0 is done with an ensemble of predictions (one with each transformation B_h) to ensure interdependency among the l classes. This characteristic of the system was already assumed during the selection of the matrices B_h by means of their simultaneous optimization in Equation (7). Consequently, the local estimator should also account for this fact.
There are many types of local estimators that can be employed, for instance: the nearest neighbor, the mean of close neighbors, local linear regression, and local kriging, among others, as presented in [22]. For the classification problem stated above, however, the mean of the k closest neighbors is the most appropriate because it considers several observations. Thus, the expected value for observation 0 can be estimated as

y_0 = (1/l) Σ_{h=1}^{l} (1/k) Σ_{i : d_h(x_i, x_0) ≤ D_h(k)} e_i,     (12)

where D_h(k) is the distance of the k-th nearest neighbor of observation x_0 in the transformed space u_h. The advantage of this approach is that one can always find an estimate for a given observation. Its main disadvantage, however, occurs for points that do not have close enough neighbors (i.e., false neighbors), which in turn might lead to an erroneous estimation.
As a result of Equation (12), the classification vector y_0 has values in the interval 0 ≤ y_{h0} ≤ 1, h = 1, ..., l, with Σ_{h=1}^{l} y_{h0} = 1, i.e., it is a soft classification. Since these values have been derived from a sample, they can be understood as the empirical probability that observation 0 belongs to class h. If a hard classification is required, the class having the highest y value is assigned to the given observation x_0. This procedure for obtaining a crisp classification also allows the estimation of a measure of ambiguity b(y_0),

b(y_0) = 1 − max_h y_{h0},     (13)

which indicates whether or not the classifier c has yielded a clear response [9].
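A minimal sketch of this soft classification and the ambiguity index is given below (Python/NumPy). The equal-weight aggregation over the l embedding spaces, the toy data, and all names are assumptions for illustration only.

```python
import numpy as np

def mnn_classify(x0, X, E, Bs, k):
    """Soft MNN estimate for a query x0.

    X  : (n, m) training predictors
    E  : (n, l) one-hot class membership vectors e_i
    Bs : list of l class-specific linear transformations B_h
    k  : number of nearest neighbors
    """
    partials = []
    for B in Bs:
        u0, U = B @ x0, X @ B.T                  # transformed query and data
        idx = np.argsort(np.linalg.norm(U - u0, axis=1))[:k]
        partials.append(E[idx].mean(axis=0))     # partial membership vector
    y0 = np.mean(partials, axis=0)               # aggregate; sums to 1
    ambiguity = 1.0 - y0.max()                   # b(y0) of Equation (13)
    return y0, ambiguity

# Toy usage: two well-separated classes, identity transformations.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
E = np.repeat(np.eye(2), 20, axis=0)
y0, b = mnn_classify(np.array([0.1, -0.1]), X, E, [np.eye(2), np.eye(2)], k=5)
# A query deep inside the first cluster gets y0 ~ [1, 0] and b ~ 0.
```

Because each partial membership is a mean of one-hot vectors, every component of y0 lies in [0, 1] and the components sum to one, which is exactly what the soft-classification interpretation requires.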

Validation
Because the transformation B_h is derived from observations, it must be validated. Possible validation methods are cross-validation and split-sample testing. If the latter is used, then x_0 ∈ V and T ∩ V ≡ ∅. V is commonly known as the validation set and should be similar in composition to the calibration set; its cardinality, however, may vary. Here, the sample size of V is denoted by v = |V|.
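A split-sample test of this kind can be sketched as follows (Python/NumPy; the random split, the toy arrays, and the names are illustrative assumptions):

```python
import numpy as np

def split_sample(X, E, v, seed=0):
    """Draw a validation set V of size v and a disjoint calibration
    set T (T ∩ V = ∅), sampling without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    val, cal = idx[:v], idx[v:]
    return (X[cal], E[cal]), (X[val], E[val])

# Usage: hold out 270 observations for validation, keep the rest for T.
X = np.arange(1000.0).reshape(500, 2)
E = np.tile(np.eye(2), (250, 1))
(T_X, T_E), (V_X, V_E) = split_sample(X, E, v=270)
```

Permuting the indices once and slicing guarantees the two sets are disjoint by construction, so no observation can leak from calibration into validation.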

Study Area and Data Availability
The proposed method was applied to two different land cover classification examples using (i) single and (ii) multi-temporal Landsat TM/ETM+ images for the year 1999. Both examples use data from agricultural test sites located in the Parthe catchment, southeast of Leipzig, Germany (see Figure 1). The study area is part of the German "loess band", consisting of highly productive soils with land use dominated by agricultural activities. The elevation above sea level ranges from 150 to 190 m, with only very mild slopes. The yearly mean air temperature is 9.8 °C, and the mean annual precipitation is 610 mm.
The test sites comprise 23 fields with a total area of 426 ha. The main soil types are Cambisol, Luvisol, and Stagnic Gleysol; the dominating crops are summer and winter cereals, with a total fraction of over 70%. Table 1 summarizes the considered crops and land use classes as well as their spatial fractions. These data, together with detailed tillage and management information for the year 1999, were collected and provided by the farm cooperative Landwirtschafts GmbH KÖG Kleinbardau.
In addition, three Landsat images of the study region were available for the growing season of 1999: two Landsat 5 TM scenes from 30 April and 3 July (path/row 193/025) and a Landsat 7 ETM+ scene from 13 September 1999 (path/row 193/024). All three images were geo-referenced in the ERDAS Imagine software using a digital topographic map at the scale 1:25,000 (Landesvermessungsamt Sachsen) and 40 significant reference points, resulting in a maximum RMSE of 7.8 m. For both examples we used only the six bands in the visible and near-infrared region, thus ignoring the thermal bands of TM and ETM+ as well as the panchromatic channel of ETM+. Figure 2 summarizes the vegetation periods of the considered crops, the time schedule of soil and crop management activities, and the dates of the available remote sensing images. An atmospheric correction was not applied.

Variable Definition
For the calibration and validation phases of this study, several disjoint sets were sampled without replacement from the images mentioned above. The sample size of the validation set was fixed ad hoc at 270 observations, distributed equally among all nine classes (i.e., l = 9). Additionally, six calibration sets were sampled to perform a sensitivity analysis of the impact of the sample size on the efficiency of the classifier; the number of observations per class was fixed at n_h ∈ {5, 10, 15, 20, 25, 30}. For all these calibration sets, the variables listed in Table 2 were extracted from the original images and tabulated as indicated in the problem formulation. It is convenient to visualize the relationships that might exist between predictors (see Figure 3) before calibration starts, as the main task of a supervised classifier is finding patterns and rules within the predictor space that form disjoint classes. In fact, as shown in panel (c) of this figure, these relationships can be so intertwined that most classifiers produce a high number of false positives. In this example, for instance, classes 1, 2, 3, 4, 7, and 9 are heavily clustered, as can be seen in each pairwise scatterplot of the variables x_7, x_8, and x_9. This highlights the main motivation behind this paper, stated in Section 2: to find an embedding space u in which close points in the x space correspond to a similar class.
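The equal-allocation sampling design described above can be sketched as follows (Python/NumPy; the toy label vector and all names are illustrative assumptions):

```python
import numpy as np

def sample_per_class(labels, n_h, rng):
    """Sample n_h observations per class without replacement,
    returning the selected indices (equal allocation per class)."""
    picked = [rng.choice(np.flatnonzero(labels == h), size=n_h, replace=False)
              for h in np.unique(labels)]
    return np.concatenate(picked)

# Usage: a set with n_h = 30 observations for each of 9 classes.
rng = np.random.default_rng(7)
labels = np.repeat(np.arange(9), 100)          # 9 classes, 100 pixels each
cal_idx = sample_per_class(labels, 30, rng)    # 270 indices, 30 per class
```

Sampling each class separately guarantees the equal per-class allocation regardless of how unbalanced the class frequencies are in the image.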

Results
To find the embedding spaces u_h, h = 1, ..., l, the algorithm presented in Section 2.3 was applied sequentially. In each optimization, the upper limit of the integration was p_Q = 0.30 (see Equation (9)). The result of the optimization can be seen in Figure 4. This graph shows that the variance function G exhibits a completely different behavior depending on whether the distance is calculated in the original predictor space x or in the embedding space u. In the former, the variance of the closest neighbors remains almost constant for different values of p, whereas in the latter it is small for the closest pairs and increases with p. This indicates that k-NN might be sensitive to artifacts present in the data (e.g., related to climate or the vegetative cycle) that may severely affect the efficiency of the classifier. This shortcoming was not observed in the several hundred tests carried out with MNN. Another result derived from these experiments is that MNN performs much better than ML when the sample size per class (n_h) is small (see Figure 7). Furthermore, the influence of n_h on the efficiency is small. This is an advantage because it reduces the cost of data collection without jeopardizing the quality of the results. The dimension κ of the embedding space plays a very important role in the efficiency of MNN, as can be seen in Figure 8. Because there are no rules for the determination of κ, a trial-and-error approach was followed. The figure shows that embedding a high-dimensional space into a scalar (i.e., κ = 1) is not a good solution. As κ approaches m, a similar effect is observed, i.e., large values of κ do not perform much better than the (κ−1)-case. It should be noted that increasing κ by one increases the number of degrees of freedom l times. In the present case the best results were obtained for κ = 3 in the mono-temporal experiment. A modification of the Mallows Cp statistic [35] was suggested in [22] to help determine the most efficient embedding dimension, and it can also be used in this context.
Based on the results of the sensitivity analysis, the final classification of the whole area of the test sites was carried out using multi-temporal data (bands 3-4-5 from the three scenes) embedded into a three-dimensional space, i.e., B[3 × 9], and using N = 5 neighbors. As a result, the land cover map depicted in Figure 9 was obtained. Additionally, the ambiguity index b(y_0) proposed in Equation (13) was estimated. With the help of this index, it is possible to identify places where the classification is not clear (say, b(y_0) > 0.5). Consequently, one should interpret the results in those areas carefully and try to determine why the classifier performs ambiguously there (e.g., a very mixed class or errors in the determination of the ground truth). Furthermore, by plotting the density distribution of the ambiguity index and estimating statistics such as the mean ambiguity index or the portion of pixels exceeding a given threshold, one obtains a very reliable measure of the uncertainty of the classification (see Figure 10). Taking, for example, a threshold value of b = 0.2, we observe that a training set consisting of only n_h = 5 observations per class yields a large number of false positives, whereas with n_h = 25 we obtain only a very small number, indicating a more reliable classification. The slight increase of the classification uncertainty between n_h = 25 and n_h = 30 suggests that large training sets might be less efficient due to possible data artifacts introduced during the sampling process. We thus find an optimum sample size of n_h = 25 observations per class.

Figure 1. Distribution and location of the test sites.

Figure 3. Scatterplots depicting the location of the land use classes in the predictor space x at three different points in time, as indicated in each panel. The reflectance measurements represent bands 3-4-5 of the training set, whose sample size is n = 270.

Figure 5. Scatterplots depicting the location of the land use classes in the embedded space u_1. Class 1 (encircled by a continuous line) is completely isolated from the others. Sample size n = 270.

Figure 6. Graph showing the variation of the 90% confidence interval depending on the number of nearest neighbors. For reference, the confidence intervals for ML and k-NN (B not optimized) are also shown.

Figure 7. Graph showing the influence of the sample size per class n_h on the overall accuracy of the classification. The dimension of the transformation matrix B in the mono- and multi-temporal classifications is [3 × 6] and [3 × 9], respectively.

Figure 8. Graph showing the influence of the type of transformation and the number of nearest neighbors on the overall accuracy of the classification.

Figure 9. Ambiguity index [panel (a)] and land cover map obtained with the MNN classifier using N = 5 neighbors [panel (b)].

Table 1. Summary of the land use classes and their distribution within the test area.

Figure 2. Chronogram of the vegetation periods and management activities. The dates of the Landsat scenes are also depicted.

Table 2. Definition and notation of the selected predictors at different points in time.