Land Cover Classification Using Integrated Spectral , Temporal , and Spatial Features Derived from Remotely Sensed Images

Obtaining accurate and timely land cover information is an important topic in many remote sensing applications. Using satellite image time series data should achieve high-accuracy land cover classification. However, most satellite image time-series classification methods do not fully exploit the available data for mining the effective features to identify different land cover types. Therefore, a classification method that can take full advantage of the rich information provided by time-series data to improve the accuracy of land cover classification is needed. In this paper, a novel method for time-series land cover classification using spectral, temporal, and spatial information at an annual scale was introduced. Based on all the available data from time-series remote sensing images, a refined nonlinear dimensionality reduction method was used to extract the spectral and temporal features, and a modified graph segmentation method was used to extract the spatial features. The proposed classification method was applied in three study areas with land cover complexity, including Illinois, South Dakota, and Texas. All the Landsat time series data in 2014 were used, and different study areas have different amounts of invalid data. A series of comparative experiments were conducted on the annual time-series images using training data generated from Cropland Data Layer. The results demonstrated higher overall and per-class classification accuracies and kappa index values using the proposed spectral-temporal-spatial method compared to spectral-temporal classification methods. We also discuss the implications of this study and possibilities for future applications and developments of the method.


Introduction
Mapping land cover distribution and monitoring its dynamics have been identified as an important goal in environmental studies [1][2][3][4][5].Land cover maps provide fundamental information for many applications, including global change analysis, crop yield estimation, and forest management [6][7][8].Land cover maps can be easily generated using remote sensing images, but ensuring their accuracy is much more difficult [9].To improve the accuracy of classification, most land cover products use multi-temporal images as their inputs [10][11][12][13][14]. Currently, the primary method of land cover classification using multi-temporal data involves obtaining metrics from time series (i.e., the phenological characteristics of different vegetation), and the slope, elevation, maximum, minimum, mean, standard deviation values and tasseled cap transformation of spectral-temporal features and spectral indices [15][16][17][18][19][20][21].Then the metrics are classified in a supervised approach.While these metrics problem of multi-spectral time series using all the available data, and to determine how to extract and combine the spatial characteristics of time-series land cover classifications.To achieve these objectives, we developed a new automated time-series land cover classification method based on extracting spectral-temporal and spatial features using all the available data.This methodology is applicable to all of satellite time series but is illustrated using Landsat time series data in this study.Specifically, the dynamic time warping (DTW) similarity measure was used on satellite image time series to mine all the available data.Then, a modification of a nonlinear DR method developed for hyperspectral data was used to extract spectral-temporal features from multi-spectral satellite time series, and an image segmentation method was modified to extract spatial features from multi-spectral satellite time series.Finally, a classification system based on spatial regularization is established to generate land cover map using spectral-temporal and spatial features.

Study Area
Our study focused on agricultural states including Illinois, South Dakota, and Texas, which cover the latitudinal range of the United States.Due to climatic differences among different regions, the main land cover classes in these three areas differed.Major crop types from the South Dakota study area in the northern United States include soybeans, corn, and alfalfa.Major crop types from the Illinois study area in the central United States include soybeans, corn, and pasture and grassland.The main crop types from the Texas study area in the southern United States include corn, cotton, pasture and grassland, and winter wheat.Since the major crop types used in this study were planted once a year, such as corn and soybeans, we assume that land cover classes will not change within 1 year in the study areas.

Satellite Data
Landsat 8 Operational Land Imager (OLI) data were obtained from the Unites States Geological Survey web site (http://earthexplorer.usgs.gov/).The repeat cycle of Landsat 8 is 16 days, and thus each WRS-2 path/row can be obtained up to 22 or 23 scenes per year [55][56][57].In this study, all the 23 images acquired by Landsat 8 OLI L1T in 2014 of the study areas were used.We used the Landsat OLI reflective bands with 30 m spatial resolution, which included bands 1 (coastal), 2 (blue), 3 (green), 4 (red), 5 (near-infrared), 6 (middle-infrared), and 7 (middle-infrared), along with two cloud masks.Band 9 (cirrus) was not used because it is sensitive to water absorption [58].Three study areas were comprised of 100,000, 125,000 and 160,000 30 m pixels were cut from the Landsat 8 OLI Path 23/Row 32, Path 29/Row 30 and Path 30/Row 36 images, respectively.Figure 1a-c show the acquisition date and cloud coverage of each scene.The corresponding reference data were extracted from the Cropland Data Layer (CDL).Due to the computation of the Laplacian Eigenmaps (LE) DR algorithm increases rapidly with spatial dimension, larger regions were not used in this study.

Reference Data
The CDL for 2014 provided by the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) was obtained from the CDL web site (http://nassgeodata.gmu.edu/CropScape/) [59].The CDL is a raster-formatted, georeferenced, crop-specific, land cover map.The CDL was widely used as a training and testing data source for supervised land cover classification [12].The CDL is generated annually by a decision tree-supervised non-parametric classification method using moderate spatial resolution satellite imagery and ground-truth observations collected by the USDA.In total, 110 land cover classes were defined in the CDL product with a 30 m spatial resolution [60].

Image Preprocessing
In this study, only Landsat L1T images were used.The raw DN values are converted into surface reflectance using atmospheric correction tool from the Landsat Ecosystem Disturbance Adaptive Processing System [61].To effectively use the characteristics of the spectral variation, we arranged all the temporal data for each band together to build a time series.Each image had a corresponding cloud mask file defining the geographic areas that were affected by cloud cover.Values were removed from the time series if the cloud mask file indicated they were affected by clouds.As a result, the number of pixels varied among time series.We often encounter images with cloud cover in remote sensing time-series data.Fully exploiting the available observations requires a method that can be used with unequal time series.In addition, many phenomena of interest, such as vegetation phenology, exhibit periodic behavior that can be affected by weather variation.These adjustments result in the extension or compression of the temporal profiles.This phenomenon indicates that observed ground objects may exhibit irregular temporal behavior.Therefore, methods that are stable to temporal distortions are needed.
The DTW similarity measure [62,63] makes it possible to use all of the available data to analyze the temporal features of remote sensing time series.DTW uses the optimal alignment of radiometric profiles to describe the similarity between two temporal profiles with shifted or distorted evolution and irregular sampling.When there are invalid data in a remote sensing time series, the ideal solution is to remove only the invalid data, while preserving all the valid data.This solution requires a similarity measure that can compare time series with unequal lengths (representing different temporal sampling intervals).In contrast to classic similarity measures used to compare time series, DTW is not restricted to the comparison of time series with equal lengths.DTW can realign time

Image Preprocessing
In this study, only Landsat L1T images were used.The raw DN values are converted into surface reflectance using atmospheric correction tool from the Landsat Ecosystem Disturbance Adaptive Processing System [61].To effectively use the characteristics of the spectral variation, we arranged all the temporal data for each band together to build a time series.Each image had a corresponding cloud mask file defining the geographic areas that were affected by cloud cover.Values were removed from the time series if the cloud mask file indicated they were affected by clouds.As a result, the number of pixels varied among time series.We often encounter images with cloud cover in remote sensing time-series data.Fully exploiting the available observations requires a method that can be used with unequal time series.In addition, many phenomena of interest, such as vegetation phenology, exhibit periodic behavior that can be affected by weather variation.These adjustments result in the extension or compression of the temporal profiles.This phenomenon indicates that observed ground objects may exhibit irregular temporal behavior.Therefore, methods that are stable to temporal distortions are needed.
The DTW similarity measure [62,63] makes it possible to use all of the available data to analyze the temporal features of remote sensing time series.DTW uses the optimal alignment of radiometric profiles to describe the similarity between two temporal profiles with shifted or distorted evolution and irregular sampling.When there are invalid data in a remote sensing time series, the ideal solution is to remove only the invalid data, while preserving all the valid data.This solution requires a similarity measure that can compare time series with unequal lengths (representing different temporal sampling intervals).In contrast to classic similarity measures used to compare time series, DTW is not restricted to the comparison of time series with equal lengths.DTW can realign time series to assess nonlinear distortions on the temporal axis [64].Therefore, DTW cannot only compare time series with unequal lengths, but can find the optimal warping path for the two time series [65].
The methodology of DTW is as follows.Assume a time series T of length n, and a time series S of length m.T = t 1 , t 2 , . . ., t i , . . ., t n S = s 1 , s 2 , . . ., s j , . . ., s m The time series T and S can be ordered to form an n-by-m path matrix where every element of matrix (i, j), corresponds to a queue between the points t i and s j .The warping path is subject to the following constraints [63]: Endpoint constraint: the warping path must start from the first point (t 1 , s 1 ) of the path matrix and end at the last point (t n , s m ) of the path matrix.
Continuity constraint: the steps in the path matrix are confined to neighboring points, t i − t i−1 ≤ 1 and s j − s j−1 ≤ 1.
Monotonicity: the path must advance monotonically with respect to time, t i−1 ≤ t i and s j−1 ≤ s j .DTW is used to calculate the cumulative distance γ(i, j) to each point based on the following dynamic programming formulation: where d(i, j) denotes the distance measure between t i and s j , for example, the square of the difference or the magnitude of the difference.
In the search space determined by two time series, the cumulative distance of the warping path found by DTW is minimum among all possible paths.Figure 5 shows the path matrix and optimal warping path (indicated by red squares) between time series T and S using DTW.series to assess nonlinear distortions on the temporal axis [64].Therefore, DTW cannot only compare time series with unequal lengths, but can find the optimal warping path for the two time series [65].
The methodology of DTW is as follows.Assume a time series T of length n, and a time series S of length m.
The time series T and S can be ordered to form an n-by-m path matrix where every element of matrix (i, j), corresponds to a queue between the points and .The warping path is subject to the following constraints [63]: Endpoint constraint: the warping path must start from the first point ( , ) of the path matrix and end at the last point ( , ) of the path matrix.
Continuity constraint: the steps in the path matrix are confined to neighboring points, − ≤ 1 and − ≤ 1.
Monotonicity: the path must advance monotonically with respect to time, ≤ and ≤ .
DTW is used to calculate the cumulative distance γ( , ) to each point based on the following dynamic programming formulation: where ( , ) denotes the distance measure between and , for example, the square of the difference or the magnitude of the difference.
In the search space determined by two time series, the cumulative distance of the warping path found by DTW is minimum among all possible paths.Figure 5 shows the path matrix and optimal warping path (indicated by red squares) between time series T and S using DTW.

Time Series Dimensionality Reduction
In this study, the LE DR algorithm is refined.The reasons for choosing the LE DR algorithm are as follows.First, LE is a nonlinear manifold learning method, which has been proven to perform better than linear DR methods when applied to hyperspectral data [43, [66][67][68][69].Second, LE is a local manifold learning (LML) method.The advantage of LML method is that the local attributes of the original data can be preserved by constructing a neighborhood graph when projected from high-dimensionality to low-dimensionality space.Using a neighborhood graph implicitly emphasizes the natural clusters in the data, in contrast to global methods, for example, ISOMAP [70].Third, LE is a graph-based method, which only requires similarity between data points to complete DR, thereby making it easy to combine different distance metrics compared to other localization-preserving methods [71][72][73][74][75].
In this study, an improved version of LE called LE-DTW was proposed.In the LE-DTW DR method, DTW is used instead of Euclidean distance to find the k nearest neighbors for each sample in the original feature space.If specific dimensions in the feature space for some pixels are invalid (e.g., covered by cloud or snow), these values will be removed during the construction of sequences (Section 3.1) before DR.Therefore, all the available data are used in the process of building an adjacency graph using DTW.For example, given a dataset X = [x 1 , x 2 , . . . ,x n ] and x i ∈ R t (i = 1, 2, . . ., n), where n denotes the number of data points and t denotes the dimensionality of the data, the adjacency graph W can be constructed with an edge between nodes i and j if x i and x j are the nearest neighbors of each other.If nodes i and j are connected, then where the parameter q is a predetermined constant; otherwise, set W ij = 0. Let D represent a diagonal matrix and set D ii = ∑ j W ij .Now the matrix L = D − W is the so-called Laplacian matrix, it is symmetrical and positive semi-definite.Next, solve the generalized eigenvector problem Because the smallest eigenvalue λ 0 = 0, we discard f 0 and use the next S eigenvectors for embedding in S-dimensional Euclidian space using the map There are two parameters that need to be manually set in the LE DR method.One is the number of nearest neighbors selected for each pixel, abbreviated k.Another is the number of bands retained after DR.In this study, the parameter k is automatically determined in the LE-DTW DR method using a "graph growing" strategy [76].This strategy increases the probability that any pixel and its neighbors belong to the same ground object and selects a different number of nearest neighbors for each pixel.The 10 DR bands corresponding to the 10 smallest non-zero eigenvalues were selected for classification.The number of DR bands is greater than the maximum number of classes in the image (six classes in the Texas study area) to ensure that adequate information was retained in the DR bands [39].A reasonable minimum number of bands after DR cannot be determined using the existing approaches.Because these existing approaches cannot handle unequal time series [77,78].For simplicity, all the DR methods employed in this study reduced the data to 10 dimensions.

Time Series Spatial Feature Extraction
To perform image segmentation is the main approach to obtain spatial information for classification.Segmentation is a comprehensive partitioning of the input image into objects, each of which consists of a set of pixels that are homogeneous with respect to some criterion [79].These objects form a segmentation map that provides spatial structures and features for object-oriented classification [80] or post-classification optimization [47].In many segmentation methods, the image elements are mapped onto a graph.Then, the problem of segmenting an image becomes the problem of partitioning a graph.No discretization is the advantage of graph segmentation because it has completely combinatorial operator, and therefore it does not accumulate discretization errors [81].
The minimum spanning tree (MST) is a classic method in graph theory [82].The graph vertices are connected to satisfy the minimum cumulative edge weights, and partitioning of the graph is performed by removing the edges to obtain non-overlapping sub-graphs.The criterion of segmentation methods based on MST is to remove the edges with the greatest weights.The edge weights are determined by a similarity measure between pixels [83].Two commonly used methods of computing the MST are Kruskal's algorithm [84] and Prim's algorithm [85].In this study, we presented a method called MST-DTW for automatic segmentation of multi-spectral time-series imagery.This method sets the edges and their weights based on the DTW measure using all the valid data in the time series to build a graph.It is worth noting that each edge on a graph can only connect the center node and its eight nearest neighbors in a 3 × 3 window.We chose Prim's algorithm, which constructs the MST by iteratively adding the frontier edge with the smallest weight.Given a multi-spectral time series, we constructed the input data using the segmentation method described in Section 3.1.Assuming Q × R total pixels in the input data, the eight-connection graph G 8 is defined as follows.Assume that the set of all vertices in G 8 is where the p i,j are the image pixels.Then the set of all edges in G 8 is denoted by where s i,j;k,l are the edges that connect node p i,j and one of its eight adjacent nodes p k,l .For each edge s i,j;k,l , we set a weight w i,j;k,l that is the result of DTW measurement between two node vectors Then the weighted graph is where W is the set of w values corresponding to the weight of each edge s ∈ S.
Then, according to the undirected graph G 8 , we constructed an MST using the tree spanning given by T = P, S , W , S ⊆ S such that ∑ s∈S w(s) is a minimum.For a multi-spectral time series, the MST represents a connection of adjacent pixels that is globally optimized according to regions of spectral-temporal similarity based on DTW.The weights of edges indicate the degree of similarity between nodes.The edges with high weights can be removed to split the MST into a set of sub-trees that represent the different segments of the image.The removed edges in this study were identified using the spectral averaging method proposed by Lersch [86]. Figure 6 illustrates the principle of segmentation based on MST.
Remote Sens. 2018, 10, x FOR PEER REVIEW 9 of 26 of partitioning a graph.No discretization is the advantage of graph segmentation because it has completely combinatorial operator, and therefore it does not accumulate discretization errors [81].
The minimum spanning tree (MST) is a classic method in graph theory [82].The graph vertices are connected to satisfy the minimum cumulative edge weights, and partitioning of the graph is performed by removing the edges to obtain non-overlapping sub-graphs.The criterion of segmentation methods based on MST is to remove the edges with the greatest weights.The edge weights are determined by a similarity measure between pixels [83].Two commonly used methods of computing the MST are Kruskal's algorithm [84] and Prim's algorithm [85].In this study, we presented a method called MST-DTW for automatic segmentation of multi-spectral time-series imagery.This method sets the edges and their weights based on the DTW measure using all the valid data in the time series to build a graph.It is worth noting that each edge on a graph can only connect the center node and its eight nearest neighbors in a 3 × 3 window.We chose Prim's algorithm, which constructs the MST by iteratively adding the frontier edge with the smallest weight.Given a multispectral time series, we constructed the input data using the segmentation method described in Section 3.1.Assuming Q × R total pixels in the input data, the eight-connection graph is defined as follows.Assume that the set of all vertices in is where the , are the image pixels.Then the set of all edges in is denoted by where , ; , are the edges that connect node , and one of its eight adjacent nodes , .For each edge , ; , , we set a weight , ; , that is the result of DTW measurement between two node vectors , ; , = , − , Then the weighted graph is where W is the set of w values corresponding to the weight of each edge ∈ .Then, according to the undirected graph , we constructed an MST using the tree spanning given by = ( , , ), ⊆ such that ( ) ∈ is a minimum.
For a multi-spectral time series, the MST represents a connection of adjacent pixels that is globally optimized according to regions of spectral-temporal similarity based on DTW.The weights of edges indicate the degree of similarity between nodes.The edges with high weights can be removed to split the MST into a set of sub-trees that represent the different segments of the image.The removed edges in this study were identified using the spectral averaging method proposed by Lersch [86]. Figure 6 illustrates the principle of segmentation based on MST.

Spectral-Temporal-Spatial (STS) Classification Method
Our proposed method extracts and combines the spectral-temporal and spatial characteristics of remote sensing images using all the available data.The workflow of the proposed time-series land cover classification method is shown in Figure 7.The proposed classification method consists of the following four steps.(1) Time-series data are constructed based on the spectral-temporal arrangement.Then the similarity measure matrix is calculated by computing the DTW distance between pixels using all the available data (Section 3.1).( 2) The spectral-temporal feature vectors extracted by the LE-DTW DR method (using the DTW similarity matrix to construct an adjacency graph) are classified to create a pixel-based classification map (Section 3.2).(3) A segmentation map is obtained by the MST-DTW method (an MST is created from a weighted graph based on the DTW similarity matrix; Section 3.3).(4) Spatial regularization is the strategy employed to combine the spectral-temporal and spatial features.In this step, majority voting [47,87] based on the segmentation map is utilized to post-process the pixel-based classification map.Finally, a land cover map is obtained based on the spectral-temporal-spatial (STS) classification method.

Spectral-Temporal-Spatial (STS) Classification Method
Our proposed method extracts and combines the spectral-temporal and spatial characteristics of remote sensing images using all the available data.The workflow of the proposed time-series land cover classification method is shown in Figure 7.The proposed classification method consists of the following four steps.(1) Time-series data are constructed based on the spectral-temporal arrangement.Then the similarity measure matrix is calculated by computing the DTW distance between pixels using all the available data (Section 3.1).( 2) The spectral-temporal feature vectors extracted by the LE-DTW DR method (using the DTW similarity matrix to construct an adjacency graph) are classified to create a pixel-based classification map (Section 3.2).(3) A segmentation map is obtained by the MST-DTW method (an MST is created from a weighted graph based on the DTW similarity matrix; Section 3.3).(4) Spatial regularization is the strategy employed to combine the spectral-temporal and spatial features.In this step, majority voting [47,87] based on the segmentation map is utilized to post-process the pixel-based classification map.Finally, a land cover map is obtained based on the spectral-temporal-spatial (STS) classification method.

Experiments
A series of quantitative classification experiments were undertaken in each study area to verify the performance of the proposed method.Five classification methods were used for comparison, as shown in Table 1.The reasons for choosing these five methods are as follows.First, LE-DTW and STS are the methods proposed in this paper.The comparison of these two methods is to illustrate whether spatial information is useful for improving the accuracy of time-series land cover classification.Second, principal component analysis (PCA) is a typical linear DR method [44], and is compared to LE-DTW to explore the internal structure of satellite time-series data.Third, LE-SAM-R is a refined LE algorithm using the spectral angle mapper (SAM) with satellite time-series data after temporal interpolation [39].We compared it with LE-DTW to illustrate which metric is more effective for satellite time-series data mining.Fourth, temporal interpolation (TI) only needs to interpolate the original time series data without DR, and then the interpolated data is directly input to the classifier to obtain the classification results.It is compared with LE-DTW to show whether there is a large amount of data redundancy in satellite images time series.
Because all of TI, PCA and LE-SAM-R require cloud-free land surface observations, we use temporal interpolation with the raw data using data from earlier and later dates.Specifically, if the band values for a pixel were covered by clouds on a certain date, these values were replaced by a temporally adjacent data point or the average of two temporally adjacent measurements if both the prior and following points are available.
Two popular supervised classifier random forests (RFs) [88] and support vector machine (SVM) [89,90] were used to generate pixel-based classification maps in experiments.The RF classifier is composed of multiple tree classifiers.In addition, each tree classifier casts an equal vote to choose the most popular classification of the input vector [91,92].A total of 500 trees were built using RFs in this paper.In addition, the SVM classifier with a radial basis function (RBF) kernel was performed by using LIBSVM [93].

Experiments
A series of quantitative classification experiments were undertaken in each study area to verify the performance of the proposed method.Five classification methods were used for comparison, as shown in Table 1.The reasons for choosing these five methods are as follows.First, LE-DTW and STS are the methods proposed in this paper.The comparison of these two methods is to illustrate whether spatial information is useful for improving the accuracy of time-series land cover classification.Second, principal component analysis (PCA) is a typical linear DR method [44], and is compared to LE-DTW to explore the internal structure of satellite time-series data.Third, LE-SAM-R is a refined LE algorithm using the spectral angle mapper (SAM) with satellite time-series data after temporal interpolation [39].We compared it with LE-DTW to illustrate which metric is more effective for satellite time-series data mining.Fourth, temporal interpolation (TI) only needs to interpolate the original time series data without DR, and then the interpolated data is directly input to the classifier to obtain the classification results.It is compared with LE-DTW to show whether there is a large amount of data redundancy in satellite images time series.
Because all of TI, PCA and LE-SAM-R require cloud-free land surface observations, we use temporal interpolation with the raw data using data from earlier and later dates.Specifically, if the band values for a pixel were covered by clouds on a certain date, these values were replaced by a temporally adjacent data point or the average of two temporally adjacent measurements if both the prior and following points are available.
Two popular supervised classifier random forests (RFs) [88] and support vector machine (SVM) [89,90] were used to generate pixel-based classification maps in experiments.The RF classifier is composed of multiple tree classifiers.In addition, each tree classifier casts an equal vote to choose the most popular classification of the input vector [91,92].A total of 500 trees were built using RFs in this paper.In addition, the SVM classifier with a radial basis function (RBF) kernel was performed by using LIBSVM [93].The same parameter settings were adopted for all three sets of experiments.Training data and testing data for classification in the three study areas were extracted from the 2014 CDL.In each study area, only the classes which covering over 2% on the CDL data were considered.In addition, 1% of the CDL pixels were randomly selected as training samples and the rest of pixels were used as testing data.A total of 1600, 1000, and 1250 training pixels were used for the Illinois, South Dakota, and Texas study areas, respectively.The traditional classification accuracy statistics obtained from confusion matrices, including the overall and single-class accuracies and kappa index were used to evaluate the performance of classification [3,94].Sensitivity to training data with different proportions was also undertaken (take RF classifier as an example) by selecting at random from 0.1% to 10% of the CDL pixels in three study areas.To ensure the reliability of the results, all the experiments were repeated 10 times.

Performance of RF and SVM in Five Classification Methods
The classification results and the classification "stability" for the five classification methods combined with RF classifier and SVM classifier is illustrated in this section.The mean overall accuracies and kappa index values of five classification methods combined with two classifiers using 1% CDL training data are shown in Table 2. Comparing the two classifiers from the table, we can see that both RF and SVM classifier had similar overall accuracy and kappa index for each classification method in all three study areas.The maximum difference of overall accuracy using RF classifier and SVM classifier for TI, PCA, LE-SAM-R, LE-DTW and STS method in three study areas were 1.93%, 1.53%, 2.66%, 3.22% and 0.63%, respectively, and the corresponding difference of kappa index were 0.0217, 0.0857, 0.0395, 0.0456 and 0.0107.It is worth noting that, in the STS classification method, the difference of overall classification accuracy of two classifiers in three study areas were 0.51%, 0.52% and 0.63%, respectively, and the corresponding difference of kappa index were 0.0096, 0.0079, and 0.0107.Obviously, the performance of the two classifiers are most similar in the STS classification method, meaning that that method is more stable than the other four classification methods.

Satellite Image Time Series Data Redundancy
The mean overall accuracies and kappa index values of the TI and LE-DTW classification results using 1% CDL training data are shown in Table 2.The overall classification accuracy of the LE-DTW method is at least 2.72%, 1.77% and 8.40% higher than that of TI in Illinois, South Dakota, and Texas study areas, respectively.In addition, the kappa index of the LE-DTW method are raised more than 0.0658, 0.0325 and 0.1210 compared to the TI method in three study areas.Classification maps constructed using the TI and LE-DTW methods combined with RF classifier in three study areas are shown in Figure 8a,d, Figure 9a,d and Figure 10a,d.Tables 3 and 4 summarize the mean producer's and user's accuracies for classification based on the TI and LE-DTW methods combined with RF classifier for each major crop category, respectively.The producer's accuracy indicates the probability that a pixel is classified correctly, which equal the ratio of all the pixels classified correctly in a class to the sum of true reference pixels for that class [95].The user's accuracy indicates the probability that a pixel is classified to a specific class, which equal the ratio of all the pixels classified correctly in a class to the sum of all of the pixels allocated to that class [95].The producer's and user's accuracies of TI method for most of the categories in the three study areas were significantly lower than those of the LE-DTW, especially in the CDL classes covering less than 10% of each study area.For example, the producer's and user's accuracies of the Illinois grass/pasture class (4.6% of the study area), the South Dakota developed/open space class (5.67% of the study area), and the Texas sorghum class (5.36% of the study area) in the TI method were 25.71%, 9.08% and 31.44%less than those of the LE-DTW method, respectively.
For satellite image time series, the classification accuracy of the data after DR is higher than that of the data without DR.Possible reasons are as follows.First, the time series data contains data redundancy, which cause the Hughes phenomenon [42].Second, the temporal interpolation for time series data brings new error, especially the continuous temporal data are unavailable.Third, LE-DTW method provides dimensionality-reduced data that have desirable classification properties.Therefore, prior to land cover classification, it is appropriate to apply the DR techniques to satellite image time series.

Satellite Image Time Series Nonlinear Characteristics
The mean overall accuracies and kappa index values of the PCA and LE-DTW classification results using 1% CDL training data are shown in Table 2.The overall classification accuracy of the LE-DTW method combined with RF classifier or SVM classifier is more than 6% greater than that of PCA in all study areas, and its kappa index is also higher than that of PCA by more than 0.04.Classification maps constructed using the PCA methods combined with RF classifier in three study areas are shown in Figures 8b, 9b and 10b.Table 5 summarize the mean producer's and user's accuracies of classification based on the PCA methods combined with RF classifier for each major crop category.For the PCA DR method, the producer's and user's accuracies exceeded 76%, 86%, and 50% for all CDL classes covering more than 10% of each study area in Illinois, South Dakota, and Texas, respectively.However, the producer's and user's accuracies of the Illinois developed/open space class (5.29% of the study area), the South Dakota grass/pasture class (2.10% of the study area) and the Texas developed/open space class (4.92% of the study area) ranged from only 11% to 45%.For developed/open space class and grass/pasture class, the main reason for the low accuracy include two aspects.One is the small covering area resulting in the small size of the training sample, which means that it is difficult to establish reasonable classification rules when training classifiers.Another is that these natural land and vegetation classes have less pronounced phenology characteristics than the cropland classes.
For the LE-DTW method, the producer's and user's accuracies for all the categories in the three study areas were significantly higher than those of the PCA.This was to be expected, because satellite image time series have intrinsic nonlinear characteristics.Thus, LE-DTW, as a nonlinear DR method, is better suited than the linear DR method for solving the high dimensionality problem of satellite image time series.In addition, multi-spectral time-series data require cloud removal before PCA DR.Error generated by this pre-processing may be another reason that PCA has lower accuracy than LE-DTW.

Satellite Image Time Series Metric
The mean overall accuracies and kappa index values of the LE-SAM-R classification results generated using 1% CDL training data are shown in Table 2.The overall classification accuracies and kappa index values of the LE-SAM-R method were lower than those of LE-DTW in all the three study areas.The largest relative improvement using the LE-DTW method was in the Texas study area.LE-DTW combined with RF classifier had a 3.26% greater overall accuracy and a 0.0450 increase in kappa index than LE-SAM-R combined with RF classifier.The classification maps for the LE-SAM-R method combined with RF classifier in the three study areas are shown in Figures 8c, 9c and 10c.Table 6 summarizes the mean producer and user accuracies of the classification based on the LE-SAM-R method combined with RF classifier for each major crop category.The producer and user accuracies exceeded 81%, 88%, and 70% for all CDL classes covering more than 10% of the study area in Illinois, South Dakota, and Texas, respectively.However, the producer's accuracies of LE-SAM-R for most categories were also lower than those of the LE-DTW method.
These results suggest that the DTW distance is better suited for similarity measurement of multi-spectral time-series data than the SAM distance.The DTW metric ensures that all the data from cloudless regions in each temporal image are used.This is of great importance to those sensors that have low-frequency observations.It is worth noting that the LE-DTW method does not need to reconstruct the value of the data in the cloud-covered regions, whereas the LE-SAM-R method requires this reconstruction process.In other words, the LE-DTW method uses all the available data directly.Therefore, the LE-DTW method is a dimensionality reduction method which is more suitable for satellite image time series data.

Satellite Image Time Series Spatial Features
The mean overall accuracies and kappa index values of the STS classification results generated using 1% CDL training data are shown in Table 2.The STS method provides unambiguously higher overall classification accuracies than LE-DTW in most cases.It is worth noting that, compared to the LE-DTW method (combining RF classifier), the overall classification accuracies improved by 1.66%, 4.01%, and 1.13% in Illinois, South Dakota, and Texas study areas, respectively, using the STS method (combining RF classifier), while the corresponding kappa index values were enhanced by 0.0258, 0.0594 and 0.0136.Similarly, the performance of STS method combined with SVM classifier is better than that of LE-DTW method combined with SVM classifier.Classification maps generated using the STS method combined with RF classifier in the three study areas are shown in Figures 8e, 9e and  10e.Table 7 summarizes the mean producer and user accuracies of classification based on the STS method combined with RF classifier for each major crop category.Compared to the producer and user accuracies in Table 6, the corresponding accuracies stated in Table 7 were higher for the great majority of classes.Furthermore, for CDL classes covering less than 10% of each study area, the producer and user accuracies of the STS method were also greatly improved.For example, the producer's accuracies of the Illinois developed/open space class (5.29% of the study area), the South Dakota developed/open space class (5.67% of the study area), and the Texas sorghum class (5.36% of the study area) in the STS method were 8.73%, 19.06% and 2.4% greater than those of the LE-DTW method, respectively.These results show that the MST-DTW method can effectively extract spatial features from satellite image time series using all available data.The STS method combined with spatial information extracted by MST-DTW significantly improves the land cover classification accuracies of multi-spectral time-series data.STS can mine a variety of features from limited data for classification; therefore, it is particularly effective for remote sensing image time series with low temporal resolution.

Classification Sensitivity to the Amount of Training Data
Classification accuracy always directly depends on the amount of the training data [96].Careful selection of training samples may help reduce the size of the training data without reducing supervised classification accuracy [97].When using fewer training samples, a given classification accuracy should be capacitated by a more optimal classification method than a less one [39].
Figure 11 illustrates the overall classification accuracies provided by the LE-DTW (green) and STS (blue) using different percentages of training data.At each training percentage, a total of 10 independent classifications were performed.As can be seen from Figure 11, the overall classification accuracies of STS approach are consistently higher than the LE-DTW approach.Moreover, when using fewer training samples, the classification performances of STS method are more stable than that of LE-DTW method.Both the overall classification accuracies of STS approach and LE-DTW become stable when approximately 1% of the training samples are used.For each set of 10 classifications, the standard deviations of the overall classification accuracies are not illustrated.This is because the standard deviation for each set experiments were less than 1% except for the results using the 0.1% training data which were less than 2.1% in the three study areas.These results show that the STS method using spectral-temporal-spatial features is more optimal than the LE-DTW method using spectral-temporal features only.

Classification Sensitivity to the Amount of Training Data
Classification accuracy always directly depends on the amount of the training data [96].Careful selection of training samples may help reduce the size of the training data without reducing supervised classification accuracy [97].When using fewer training samples, a given classification accuracy should be capacitated by a more optimal classification method than a less one [39].
Figure 11 illustrates the overall classification accuracies provided by the LE-DTW (green) and STS (blue) using different percentages of training data.At each training percentage, a total of 10 independent classifications were performed.As can be seen from Figure 11, the overall classification accuracies of STS approach are consistently higher than the LE-DTW approach.Moreover, when using fewer training samples, the classification performances of STS method are more stable than that of LE-DTW method.Both the overall classification accuracies of STS approach and LE-DTW become stable when approximately 1% of the training samples are used.For each set of 10 classifications, the standard deviations of the overall classification accuracies are not illustrated.This is because the standard deviation for each set experiments were less than 1% except for the results using the 0.1% training data which were less than 2.1% in the three study areas.These results show that the STS method using spectral-temporal-spatial features is more optimal than the LE-DTW method using spectral-temporal features only.Sensitivity to the different training data percentage (0.1%, 0.3%, 0.5%, 0.7%, 0.9%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9% and 10%).At each percentage on three study areas, the mean overall accuracies of the LE-DTW (green) and STS (blue) calculated by 10 independent classifications are shown.

Conclusions
Obtaining accurate and timely land cover maps is a difficult problem in remote sensing.Such maps require the mining of as much useful information as possible to improve land cover classification accuracy based on limited data.In this study, a novel method is developed for land cover classification using spectral-temporal-spatial data at an annual scale, assuming there is no land cover change within 1 year.This approach utilizes all the available multi-spectral time-series data to construct a graph based on the DTW similarity measure, and then utilizes graph theory-based dimensionality reduction and segmentation methods to extract spectral-temporal features and spatial features for identification and optimization of land cover classes.In addition, the proposed method is an automated classification method, and requires few training samples to perform well.These advantages are significant because land cover classification is labor-intensive and difficult to automate [98,99].Therefore, the classification method introduced in this paper should prove useful for improving the accuracy and reducing the mapping period for land cover classification.The proposed method was applied to Landsat multi-spectral reflectance time-series data, which has a Sensitivity to the different training data percentage (0.1%, 0.3%, 0.5%, 0.7%, 0.9%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9% and 10%).At each percentage on three study areas, the mean overall accuracies of the LE-DTW (green) and STS (blue) calculated by 10 independent classifications are shown.

Conclusions
Obtaining accurate and timely land cover maps is a difficult problem in remote sensing.Such maps require the mining of as much useful information as possible to improve land cover classification accuracy based on limited data.In this study, a novel method is developed for land cover classification using spectral-temporal-spatial data at an annual scale, assuming there is no land cover change within 1 year.This approach utilizes all the available multi-spectral time-series data to construct a graph based on the DTW similarity measure, and then utilizes graph theory-based dimensionality reduction and segmentation methods to extract spectral-temporal features and spatial features for identification and optimization of land cover classes.In addition, the proposed method is an automated classification method, and requires few training samples to perform well.These advantages are significant because land cover classification is labor-intensive and difficult to automate [98,99].Therefore, the classification method introduced in this paper should prove useful for improving the accuracy and reducing the mapping period for land cover classification.The proposed method was applied to Landsat multi-spectral reflectance time-series data, which has a resample period of 16 days.A series of supervised classification experiments using USDA CDL land cover maps as reference data were undertaken in three study areas with land cover complexity and different amounts of invalid data in the United States.
Although the STS classification method provided the anticipated classification results in this study, the computation required for the LE-DTW method in the STS system increases geometrically with the spatial dimensions of the image.In fact, this problem occurs in most manifold learning DR algorithms [100][101][102][103][104].This is an issue, especially when manifold learning DR algorithms are applied to land cover classification at continental to global scale [13,105].Due to the complexity of the MST-DTW algorithm, it is mainly implemented by building a DTW similarity measure matrix in the same manner as LE-DTW, and therefore MST-DTW does not significantly increase the computational intensity of the STS system.Reducing its computational requirements is a direction for future research.This could be accomplished, for example, by employing the landmark points strategy [106] developed for the ISOMAP (isometric mapping) global nonlinear DR method [67,107,108], GPU (Graphics Processing Unit) enhanced computing [109], and the STS method to classify and then merge image subsets.In addition, we are now conducting experiments using long-term multi-spectral time-series data with invalid values for land cover change detection with spectral-temporal-spatial features extracted from the STS system.

Figure 1 .Figure 1 .
Figure 1.The acquisition date and cloud coverage of each study area.(a) Illinois study area.(b) South Dakota study area.(c) Texas study area.Figure 1.The acquisition date and cloud coverage of each study area.(a) Illinois study area.(b) South Dakota study area.(c) Texas study area.

Figure 5 .
Figure 5.The optimal warping path between time series T and S (red squares).Figure 5.The optimal warping path between time series T and S (red squares).

Figure 5 .
Figure 5.The optimal warping path between time series T and S (red squares).Figure 5.The optimal warping path between time series T and S (red squares).

Figure 6 .
Figure 6.The principle of MST segmentation.

Figure 6 .
Figure 6.The principle of MST segmentation.

Figure 7 .
Figure 7. Flow chart of the STS classification method.

Figure 7 .
Figure 7. Flow chart of the STS classification method.

Table 1 .
Comparison of five classification methods.

Table 2 .
The overall accuracies and kappa index of five classification methods.

Table 3 .
Mean producer and user accuracies of classification based on TI + RF method.

Table 4 .
Mean producer and user accuracies of classification based on LE-DTW+RF method.

Table 5 .
Mean producer and user accuracies of classification based on PCA + RF method.

Table 6 .
Mean producer and user accuracies of classification based on LE-SAM-R+RF method.

Table 7 .
Mean producer and user accuracies of classification based on STS (using RF classifier) method.