Di ﬀ erent Spectral Domain Transformation for Land Cover Classiﬁcation Using Convolutional Neural Networks with Multi-Temporal Satellite Imagery

: This study compares some di ﬀ erent types of spectral domain transformations for convolutional neural network (CNN)-based land cover classiﬁcation. A novel approach was proposed, which transforms one-dimensional (1-D) spectral vectors into two-dimensional (2-D) features: Polygon graph images (CNN-Polygon) and 2-D matrices (CNN-Matrix). The motivations of this study are that (1) the shape of the converted 2-D images is more intuitive for human eyes to interpret when compared to 1-D spectral input; and (2) CNNs are highly specialized and may be able to similarly utilize this information for land cover classification. Four seasonal Landsat 8 images over three study areas—Lake Tapps, Washington, Concord, New Hampshire, USA, and Gwangju, Korea—were used to evaluate the proposed approach for nine land cover classes compared to several other methods: Random forest (RF), support vector machine (SVM), 1-D CNN, and patch-based CNN. Oversampling and undersampling approaches were conducted to examine the effect of the sample size on the model performance. The CNN-Polygon had better performance than the other methods, with overall accuracies of about 93%–95 % for both Concord and Lake Tapps and 80%–84% for Gwangju. The CNN-Polygon particularly performed well when the training sample size was small, less than 200 per class, while the CNN-Matrix resulted in similar or higher performance as sample sizes became larger. The contributing input variables to the models were carefully analyzed through sensitivity analysis based on occlusion maps and accuracy decreases. Our result showed that a more visually intuitive representation of input features for CNN-based classification models yielded higher performance, especially when the training sample size was small. This implies that the proposed graph-based CNNs would be useful for land cover classification where reference data are limited.


Introduction
Land cover is a primary information source that characterizes natural ecosystems and human activities on the surface of Earth. This information has been utilized for various research fields, such as landscape ecology, disaster management, urban planning, and environmental modeling [1][2][3][4][5]. Remote sensing, which can regularly capture surface information over large areas, is an efficient tool Remote Sens. 2020, 12, 1097 3 of 28

2-D Feature Extraction
This study introduces two new input features to CNNs: Polygon images and 2-D matrices. A polygon image is defined as a plane figure that is bounded by finite straight-line segments closing in a loop to form a closed polygonal chain. Different to a line graph, a polygon uses four quadrants, even though spectral reflectance values may be very low. Moreover, the vertical and horizontal differences of a polygon graph image by class would be higher than those of a line graph image because of its closed shape. Figure 1a shows how to create a polygon image of a pixel in the m-th row and n-th column of multi-temporal and multi-spectral images. The vertices of the polygon are located at the polar coordinate with the same angle interval as pixel values on the m-th row and n-th column of spectral bands. The number of vertices is equal to the number of image dates multiplied by the number of spectral bands. The line segments between neighboring vertices are connected. The filled polygon is converted to a gridded image with fixed rows and columns. In a gridded image, the polygon was filled with 1 and backgrounds were zero-filled. The second new feature type explored uses a 2-D matrix approach. Different to the 1-D vector approach [32], the 2-D matrices are aligned with spectral bands (x axis) and time (y axis) as shown in Figure 1b.

2-D Feature Extraction
This study introduces two new input features to CNNs: Polygon images and 2-D matrices. A polygon image is defined as a plane figure that is bounded by finite straight-line segments closing in a loop to form a closed polygonal chain. Different to a line graph, a polygon uses four quadrants, even though spectral reflectance values may be very low. Moreover, the vertical and horizontal differences of a polygon graph image by class would be higher than those of a line graph image because of its closed shape. Figure 1a shows how to create a polygon image of a pixel in the m-th row and n-th column of multi-temporal and multi-spectral images. The vertices of the polygon are located at the polar coordinate with the same angle interval as pixel values on the m-th row and n-th column of spectral bands. The number of vertices is equal to the number of image dates multiplied by the number of spectral bands. The line segments between neighboring vertices are connected. The filled polygon is converted to a gridded image with fixed rows and columns. In a gridded image, the polygon was filled with 1 and backgrounds were zero-filled. The second new feature type explored uses a 2-D matrix approach. Different to the 1-D vector approach [32], the 2-D matrices are aligned with spectral bands (x axis) and time (y axis) as shown in Figure 1b.

Convolutional Neural Networks
CNNs are a type of deep learning method that use convolutional multiplication based on artificial neural networks [37]. Recently, CNN have been widely used in land cover classification, showing remarkable performance [18,19,32,33,[38][39][40][41]. Typical CNNs are composed of convolutional layers, pooling layers, and fully connected layers. Given an image (or a vector for 1-D CNN), several filters with a specific window size sweep the image (or the vector) to create feature maps at convolutional layers. Filters are trained to extract significant features of the input data. Pooling layers reduce the spatial size of feature maps by extracting a representative value, such as a mean or maximum value, from a given window. This process is widely used to make the CNN model more robust by avoiding overfitting problems while considerably decreasing the computational cost [42]. Fully connected layers produce the final result of classification or regression with the features from previous layers. In addition, dropout is a widely used regularization method to alleviate the overfitting problem. Dropout randomly drops a few connections between layers by setting the weights of the connections to zero [43,44]. Dropout can be applied to any of the aforementioned layers.

CNN Architecture
In this study, CNN models with different 2-D inputs ( Figure 1) were developed. To find the optimal relationship between the input graph size and model performance, we compared the classification results for various input sizes (i.e., 50 × 50, 100 × 100, 200 × 200, and 400 × 400) for line and polygon graphs. Larger input graph sizes provide more detailed information, but a preliminary experiment found no improvement when input graphs were larger than 100 × 100 (not shown). The optimal size of the polygon-based input images was determined to be 100 × 100 in this study. The different input sizes of the polygon image (100 × 100) and the two-dimensional matrix (4 × 7) demand different CNN architectures, which are described in Figure 2. CNN models were optimized over each input dataset to compare their best result, rather than using a single CNN structure over all input types. We designated the CNN models according to the input features (i.e., the polygon image: CNN-Polygon; and the 2-D matrix: CNN-Matrix).
Parameters for the CNN-Polygon and CNN-Matrix models were determined based on multiple tests with different combinations of parameters in order to maximize performance and efficiency. Although a grid search approach testing every possible hyper-parameter combination was not conducted due to the extensive computational cost, more than 20 structures for each approach were tested to find the optimal CNN model. The tested models were combinations of 1-10 convolutional layers with 32-256 nodes, zero to multiple max-pooling layers, single or double fully connected layers with 32-1024 nodes considering both a shallow and deep structure. The final CNN-Polygon model consists of three convolutional layers, three max pooling layers, and two fully connected layers ( Figure 2a). The convolutional layers vary in the number and size of filters. The first convolutional layer has 32 filters with a 5 × 5 size; the second and third convolutional layers used 64 and 128 filters with a 3 × 3 size, respectively. Each convolutional layer was followed by a 2 × 2 max pooling layer. Dropout with the rate of 0.25 was used after the last max pooling layer. Extracted features after the convolutional and max pooling layers were passed to the fully connected layers. The numbers of nodes in the two fully connected layers were 256 and 16. The output layer has 9 nodes, which corresponds to the number of classes. The CNN-Matrix has a different structure than the CNN-Polygon model due to the much smaller size of the 4 × 7 matrix input (Figure 2b). To prevent the reduction of the feature size, zero-padding was added for every convolutional layer in the CNN-Matrix model. This model did not use a pooling layer because of the small input size. Three convolutional layers were used with 32, 64, and 128 filters with a 3 × 3 size. The fully connected layers have the same structure as the CNN-Polygon model.
Both the CNN-Polygon and CNN-Matrix models used a rectified linear unit (Relu) as an activation function. Recent neural network applications have been shown to provide better performance with Relu when compared to typical s-shape functions [13]. A softmax function was adopted as a classifier Remote Sens. 2020, 12, 1097 5 of 28 on the output layer with a categorical cross-entropy loss function. All CNN models were optimized with adaptive moment estimation (Adam) using the default values of Keras framework: 0.001 of the learning rate, 0.9 of beta 1, and 0.999 of beta 2 value. Adam is widely used for multi-class classification [45,46]. A high-level deep learning framework Keras was used to run CNN using Tensorflow as a background engine.
Remote Sens. 2020, 12, x FOR PEER REVIEW 5 of 28 Both the CNN-Polygon and CNN-Matrix models used a rectified linear unit (Relu) as an activation function. Recent neural network applications have been shown to provide better performance with Relu when compared to typical s-shape functions [13]. A softmax function was adopted as a classifier on the output layer with a categorical cross-entropy loss function. All CNN models were optimized with adaptive moment estimation (Adam) using the default values of Keras framework: 0.001 of the learning rate, 0.9 of beta 1, and 0.999 of beta 2 value. Adam is widely used for multi-class classification [45,46]. A high-level deep learning framework Keras was used to run CNN using Tensorflow as a background engine.

Study Areas
The proposed methods were evaluated for two local regions in the United States and one large region in South Korea with different climate and environmental characteristics ( Figure 3). Lake Tapps in Washington (WA) state has a Mediterranean climate with dry warm summers and mild winters according to the Köppen climate classification [47,48]. The annual high, low, and average temperatures of Lake Tapps are 15.7, 7.2, and 11.5℃, respectively, and annual precipitation is 943.1 mm. Concord in the state of New Hampshire (NH) shows a moist continental climate with warm summers and cold winters with no dry season. The annual high, low, and average temperatures of Concord are 14.3, 1.6, and 7.9℃, respectively. The annual precipitation is 1033.5 mm. Concord shows lower temperatures but higher annual average precipitation when compared to the Lake Tapps. Gwangju is the 6th largest city in South Korea with an area of about 501.18 km 2 . Gwangju is generally warm and temperate with a humid subtropical climate [47]. North Pacific high-pressure systems make the region hot and humid in summer, but moving high pressure systems from China create many dry and sunny days in the spring and autumn. The annual precipitation is 1427.9 mm. The annual high, low, and average temperatures are 28.4, −0.2, and 14.6℃, respectively.

Study Areas
The proposed methods were evaluated for two local regions in the United States and one large region in South Korea with different climate and environmental characteristics ( Figure 3). Lake Tapps in Washington (WA) state has a Mediterranean climate with dry warm summers and mild winters according to the Köppen climate classification [47,48]. The annual high, low, and average temperatures of Lake Tapps are 15.7, 7.2, and 11.5 • C, respectively, and annual precipitation is 943.1 mm. Concord in the state of New Hampshire (NH) shows a moist continental climate with warm summers and cold winters with no dry season. The annual high, low, and average temperatures of Concord are 14.3, 1.6, and 7.9 • C, respectively. The annual precipitation is 1033.5 mm. Concord shows lower temperatures but higher annual average precipitation when compared to the Lake Tapps. Gwangju is the 6th largest city in South Korea with an area of about 501.18 km 2 . Gwangju is generally warm and temperate with a humid subtropical climate [47]. North Pacific high-pressure systems make the region hot and humid in summer, but moving high pressure systems from China create many dry and sunny days in the spring and autumn. The annual precipitation is 1427.9 mm. The annual high, low, and average temperatures are 28.4, −0.2, and 14.6 • C, respectively.

Ground Reference Data
The collection of ground reference data was based on visual interpretation of high-resolution Google Earth images over the area whose land cover was not changed during the study period. Nine classes were identified for land cover classification: Barren, cropland, grassland, water, evergreen forests, mixed forests, deciduous forests, high impervious area, and low impervious area. The high impervious label was assigned when the proportion of impervious area exceeded approximately 75% within a pixel of Landsat images. The low impervious label was assigned when the proportion of the impervious area is between 50%-75%. When the impervious surface rate of a pixel is below 50%, the signal from other classes, such as vegetation, significantly influences the reflectance of the pixel, resulting in the classic mixed pixel problem. It is problematic to classify mixed pixels of mediumspatial resolution images, such as Landsat series [49]. For better visual interpretation of mixed pixels for reference data construction [50], we utilized additional spatial information and tools, such as interactive geographic information system (GIS) viewers with zoning information provided by US governmental agencies (http://esuite.concordnh.gov/arcgis/publicwebgis/, https://www.axisgis.com/pembrokenh/, https://www.axisgis.com/BowNH/) and the basic version of AcreValue (https://www.acrevalue.com/map/), which provides the value and productivity of farmlands.

Landsat 8 Images
The land cover classification inputs were derived from multi-temporal Landsat 8 OLI images in Level-1 precision and terrain-corrected product (L1TP) format provided from the U.S. Geological Survey Earth Explorer. We used the first seven spectral bands (bands 1-7) with the 30-m resolution for each image selected. Seven multispectral bands include coastal and aerosol (band 1), visible (band 2-4), near-infrared (band 5), and shortwave infrared (band 6-7). Seasonal images were selected for

Ground Reference Data
The collection of ground reference data was based on visual interpretation of high-resolution Google Earth images over the area whose land cover was not changed during the study period. Nine classes were identified for land cover classification: Barren, cropland, grassland, water, evergreen forests, mixed forests, deciduous forests, high impervious area, and low impervious area. The high impervious label was assigned when the proportion of impervious area exceeded approximately 75% within a pixel of Landsat images. The low impervious label was assigned when the proportion of the impervious area is between 50%-75%. When the impervious surface rate of a pixel is below 50%, the signal from other classes, such as vegetation, significantly influences the reflectance of the pixel, resulting in the classic mixed pixel problem. It is problematic to classify mixed pixels of medium-spatial resolution images, such as Landsat series [49]. For better visual interpretation of mixed pixels for reference data construction [50], we utilized additional spatial information and tools, such as interactive geographic information system (GIS) viewers with zoning information provided by US governmental agencies (http://esuite.concordnh.gov/arcgis/publicwebgis/, https://www.axisgis.com/pembrokenh/, https://www.axisgis.com/BowNH/) and the basic version of AcreValue (https://www.acrevalue.com/ map/), which provides the value and productivity of farmlands.

Landsat 8 Images
The land cover classification inputs were derived from multi-temporal Landsat 8 OLI images in Level-1 precision and terrain-corrected product (L1TP) format provided from the U.S. Geological Survey Earth Explorer. We used the first seven spectral bands (bands 1-7) with the 30-m resolution for each image selected. Seven multispectral bands include coastal and aerosol (band 1), visible (band 2-4), near-infrared (band 5), and shortwave infrared (band 6-7). Seasonal images were selected for each study site ( Table 1). The Landsat 8 OLI images were atmospherically corrected and converted to scaled reflectance with Fast Line-of-sight Atmospheric Analysis of Hypercubes (FLAASH) in ENVI software [51].

Experimental Design
A total of 28 input variables consisting of the reflectance data from the seven Landsat 8 bands for four seasons were used as input data to the machine learning models ( Table 1). The sample size of nine land cover classes for each site is summarized in Table 2. The reference data were randomly divided into training (80%) and testing (20%) sets, and this process was repeated 10 times (i.e., 10 different training/testing datasets) to mitigate the issue of test bias with a small dataset. Unlike traditional machine learning algorithms, such as RF and SVM, CNNs are generally known to require a huge dataset to train their deep structure and internal parameters [52]. Thus, oversampling was conducted for training the models for the Lake Tapps and Concord, which have a relatively small number of ground reference points per class. The oversampled data were randomly generated for each training sample with a subtle perturbation (within 5% for each reflectance value). Then, oversampled data were converted into 2D graphs or matrices for constructing CNN-Line, CNN-Polygon, and CNN-Matrix. As a result, each land cover class had 1000 samples after oversampling. To explore the variation of model performance due to sample size without oversampling, we randomly selected 50, 100, 200, and 400 samples per class for model training for the Gwangju area for the undersampling test. Table 2. The number of ground reference points used for training (tr) and testing (te). The training and test datasets were randomly divided 10 times with the ratio of sample size (~80:20) shown in the table (* The number of the oversampled (ovr) data is the sum of the original (ori) training data and perturbed data).

Lake Tapps
Concord Gwangju tr te tr te tr te ori ovr * ori ori ovr * ori ori ori Barren  178  1000  44  132  1000  32  400  100  Cropland  120  1000  30  164  1000  40  400  100  Grassland  197  1000  49  197  1000  49  400  100  Water  244  1000  60  182  1000  45  400  100  Evergreen Forest  144  1000  36  120  1000  30  400  100  Mixed Forest  160  1000  40  160  1000  40  400  100  Deciduous Forest  160  1000  40  160  1000  40  400  100  High Impervious area  200  1000  50  205  1000  51  400  100  Low Impervious area  172  1000  43  170  1000  42  400  100 The overall process is described in Figure 4. The original and oversampled datasets were transformed into various input formats according to the schemes of the SVM, RF, and CNN (i.e., CNN-Polygon and CNN-Matrix) approaches. A total of 500 trees were used in RF and the linear kernel with a cost value of 100 was used in SVM based on the grid search algorithm. In order to compare different input representations of 2-D images, the line graph image approach [35] was tested with the architecture of the CNN-Polygon model (hereafter, CNN-Line). The one-dimensional CNN (CNN-1D) was also implemented to examine the differences between 1-D and 2-D inputs for Remote Sens. 2020, 12, 1097 8 of 28 pixel-level land cover classification. The structure of the CNN-1D model is based on [32]. CNN-1D was optimized after testing several structures with 1-5 one-dimensional convolutional layers with 32-128 nodes, zero or multiple pooling layers with a stride of 2, and single or double fully connected layers with 32-512 nodes. The final CNN-1D model consisted of a single 1-D convolutional layer, a single pooling layer with a stride of 2, and a single fully connected layer with 500 nodes. A more detailed explanation about CNN-1D structure can be found in [32]. Neighboring pixels are typically used to improve land cover classification based on machine learning approaches, especially for CNNs [18]. A patch-based CNN (CNN-Patch) was also implemented to evaluate the results using a 11 × 11 window with neighboring pixels. To test CNN-Patch with the 11 × 11 × 28 input size (x, y, bands), four convolutional layers were used with a 3 × 3 size of 32, 64, 128, and 64 kernels. Two fully connected layers were also used with 256 and 16 nodes after convolutional layers. Since it is difficult to incorporate neighboring pixels during the oversampling process, CNN-Patch was only conducted for the Gwangju area, focusing on examining the effect of the sample size. Figure 5 shows the frequency images of the line and polygon graphs, and mean and standard deviation values of the 2-D matrix generated using the reference dataset.
Remote Sens. 2020, 12, x FOR PEER REVIEW 8 of 28 with the architecture of the CNN-Polygon model (hereafter, CNN-Line). The one-dimensional CNN (CNN-1D) was also implemented to examine the differences between 1-D and 2-D inputs for pixellevel land cover classification. The structure of the CNN-1D model is based on [32]. CNN-1D was optimized after testing several structures with 1-5 one-dimensional convolutional layers with 32-128 nodes, zero or multiple pooling layers with a stride of 2, and single or double fully connected layers with 32-512 nodes. The final CNN-1D model consisted of a single 1-D convolutional layer, a single pooling layer with a stride of 2, and a single fully connected layer with 500 nodes. A more detailed explanation about CNN-1D structure can be found in [32]. Neighboring pixels are typically used to improve land cover classification based on machine learning approaches, especially for CNNs [18]. A patch-based CNN (CNN-Patch) was also implemented to evaluate the results using a 11 × 11 window with neighboring pixels. To test CNN-Patch with the 11 × 11 × 28 input size (x, y, bands), four convolutional layers were used with a 3 × 3 size of 32, 64, 128, and 64 kernels. Two fully connected layers were also used with 256 and 16 nodes after convolutional layers. Since it is difficult to incorporate neighboring pixels during the oversampling process, CNN-Patch was only conducted for the Gwangju area, focusing on examining the effect of the sample size. Figure 5 shows the frequency images of the line and polygon graphs, and mean and standard deviation values of the 2-D matrix generated using the reference dataset.  The first and second rows show the rate of occurrence of line and polygon graphs as density using the reference data. An area with a high occurrence rate means that the majority of graphs were plotted over the area. A rate of 1 indicates that every converted graph was plotted over an area, while a rate of 0 means no graph was plotted in that area. The third and fourth rows show the normalized mean and standard deviation of the reflectance for the 2-D matrix, respectively.
Overall accuracy (OA) [53] and standard Kappa coefficient [54] were used for model assessment [55][56][57]. To statistically compare model performance with multiple datasets, Demšar [58] and Garcia and Herrera [59] suggested the Wilcoxon paired rank test and Friedman test, which are nonparametric. The Wilcoxon paired signed-rank test was used only when two models were compared [60][61][62]. The Friedman test was used for the multiple model comparisons [63][64][65] since multiple The first and second rows show the rate of occurrence of line and polygon graphs as density using the reference data. An area with a high occurrence rate means that the majority of graphs were plotted over the area. A rate of 1 indicates that every converted graph was plotted over an area, while a rate of 0 means no graph was plotted in that area. The third and fourth rows show the normalized mean and standard deviation of the reflectance for the 2-D matrix, respectively.
Overall accuracy (OA) [53] and standard Kappa coefficient [54] were used for model assessment [55][56][57]. To statistically compare model performance with multiple datasets, Demšar [58] and Garcia and Herrera [59] suggested the Wilcoxon paired rank test and Friedman test, which are non-parametric. The Wilcoxon paired signed-rank test was used only when two models were compared [60][61][62]. The Friedman test was used for the multiple model comparisons [63][64][65] since multiple testing potentially results in an increase of type I error. The significance level was calculated based on the p-value derived from Friedman's chi-square.
Sensitivity analysis was performed to understand how input features contribute to the models. The basic idea was to measure how much the accuracy changes when each band is removed from the model similar to the mean decrease accuracy in RF. Models whose inputs are reflectance values in 1-D (i.e., RF, SVM, CNN-1D) and 2-D (i.e., CNN-Matrix and CNN-Patch) were iteratively run with a zero-filled band. The sensitivities of CNN-Line and CNN-Polygon were analyzed by occluding each pixel on the 2D graph images with a zero-filled 7 × 7 window because the transformed 2D line and polygon images with zero reflectance for the specific band could largely distort the converted reflectance graphs. Occluded areas that corresponded to high drops in accuracy imply significant contribution in distinguishing the particular class. The detailed process for generating occlusion maps is depicted in Figure S1. The accuracy drop was normalized into the 0-1 per class, with higher values indicating more contributing features. Generally, the discriminative localization [66] is used to visualize the contributing area of input images in a CNN model. However, rather than using discriminative localization, we iteratively occluded input images to maintain the basic concept of measuring sensitivity (i.e., removing each band) to compare to other types of models used in this study. Sensitivity analysis was conducted for models with the original training dataset for all study areas.

Model Performance
The models developed with the 10 training datasets were evaluated using the resultant OA and Kappa coefficient values of the test datasets ( Figure 6). The average ranks for each OA and Kappa and the p-values from Friedman's test are summarized in Table 3. The p-values smaller than 0.05 in Table 3 indicate that the difference among models is significant.
Oversampled datasets from Lake Tapps and Concord improved the overall OA and Kappa coefficients by about 1.0%-1.2 % and 0.01-0.02, respectively (Figure 6a-b and Figure S1a-b). The CNN models with 1-D and 2-D matrix-based inputs, CNN-1D and CNN-Matrix, improved the model performance over the other models after oversampling ( Figure 6 and Figure S1). The CNN-Polygon showed the highest average rank of OA and Kappa in the original dataset using about 150-200 samples per class on Lake Tapps and Concord. The performance differences between CNN-Polygon, CNN-Line, and RF were not large, showing that the significance levels were less than 90% confidence with the original dataset (Figure 7a,c). After oversampling with around 1000 samples per class, the average rank of CNN-Polygon was pushed back slightly (Table 3), but there is little difference between CNN-Polygon and other newly ranked models (Figure 7b,d). In a previous study, Kim et al. [35] compared CNN-Line, RF, and SVM for sites in Concord, New Hampshire, USA, and South Korea. They reported that the CNN-Line model had better accuracy than RF for both study sites. However, the differences between the models for the Concord site were not statistically significant according to the Cochran's Q test and McNamar's test, while the South Korea site showed significant differences between models [35].
Undersampling was applied to the Gwangju dataset. As the number of samples per class increased from 50 to 400, the average of the OA and Kappa coefficient gradually increased from 71.43% to 82.06% and 0.64 to 0.77, respectively. The performance of CNN-Patch was worse than the other CNN models with the per-pixel input, despite the rapid increase in model performance as the number of samples increased (Figure 6c and Figure S1c). Neighboring pixel information has been contributed to enhance model performances in previous CNN studies, but a small number of samples using CNN-Patch tends to underperform compared to the other models in our results. The patch-based inputs seem to need a large sample size since various neighboring environmental conditions should be considered. The CNN-Polygon showed the best performance with 50 samples per class. CNN-Matrix and CNN-1D yielded a similar performance with the CNN-Polygon as the number of samples per class increased. The best performance for the CNN-Matrix and CNN-1D models peaked with 400 samples per class.
Interestingly, unlike the graph image-based models (i.e., CNN-Line and CNN-Polygon), the performance of the matrix-based CNN models (i.e., CNN-Matrix and CNN-1D) was improved when the training sample size became larger. The CNN-Polygon performed significantly better than the CNN-Line with the same CNN architecture, which demonstrates that model performance is impacted by different graph image representation. CNN-Patch showed the lowest performance compared to the other models when using a small number of samples but showed similar performance to RF and SVM as the number of training samples increased.   Significance levels based on the Wilcoxon signed-rank tests between models calculated for Lake Tapps, Concord, and Gwangju. Each matrix has four colors: red (significant at the 99% confidence level), orange (significant at the 95% confidence level), yellow (significant at the 90% confidence level), and white (not significant at the 90% confidence level). Areas over the diagonal denote significance levels for the overall accuracy. Areas under the diagonal show significance levels for the standard kappa coefficients.  Significance levels based on the Wilcoxon signed-rank tests between models calculated for Lake Tapps, Concord, and Gwangju. Each matrix has four colors: red (significant at the 99% confidence level), orange (significant at the 95% confidence level), yellow (significant at the 90% confidence level), and white (not significant at the 90% confidence level). Areas over the diagonal denote significance levels for the overall accuracy. Areas under the diagonal show significance levels for the standard kappa coefficients.

Sub-Class Analysis with Land Cover Classification Maps
Land cover classification maps were produced using the models that had the most frequent highest overall accuracies (Figures S2-S4). We further analyzed confusion in the classification focusing on a few classes that were problematic-barren vs. high impervious area and crop vs. low impervious area.
It is sometimes hard to distinguish among vegetation, barren, and built-up areas because of the confusing spectral response pattern [67]. Figure 8 shows a subset of the land cover maps for the Lake Tapps region that highlights a construction area. The highlighted area within a red circle in Figure 8 should be classified as a barren class as shown in the Google Earth image. The RF and CNN-1D models generally classified the construction material as a high impervious area, while the SVM model classified the plant as a high or low impervious area and grassland. The CNN models with the transformation (i.e., CNN-Line, CNN-Polygon, and CNN-Matrix) more consistently classified the construction material plant as barren, with fewer pixels misclassified as high impervious area.
Cropland is a challenging land cover class due to changes in cover associated with different crop cycles (i.e., phenology), the spectral similarity with grassland, and heterogeneity of the landscape [68]. In the present study, the polygon graph image for the cropland class shape that is more similar to the low impervious class than grassland ( Figure 5). The higher near-infrared reflectance in grassland compared to cropland makes the two classes rather distinguishable. Additional confusion arises because the low impervious areas generally contain a mixture of cover types frequently including some vegetation (e.g., trees, shrub, and grass). Figure 9 shows the clear misclassification between cropland and low impervious area. Cropland is well classified in CNN-Line, CNN-Polygon, and CNN-Matrix, whereas misclassified in RF, SVM, and CNN-1D. When the oversampled dataset was adopted, CNN-1D showed slightly reduced misclassification between cropland and low impervious, but SVM misclassified cropland as grassland. The center of the land cover subset shown in Figure 10 highlights a cropland region in Gwangju. The CNN-Patch model most often confused cropland with impervious areas when using the smallest number of training data, while the CNN-Polygon classified cropland well. As the number of training samples increased, most models correctly classified the region as cropland. Kim et al. [35] reported that CNNs sometimes struggled to classify natural grasslands, forests, and croplands using just summer and winter data.

Model Type, Sample Size, and Performance
This study compared classification models with different input types (i.e., spectral vector-based, graph image-based, matrix-based, and patch-based) focusing on model performance and the effect of training sample size. Among different input types, the graph image-based models (i.e., CNN-Line and CNN-Polygon) showed higher performance than the others models when the original dataset was used for Lake Tapps and Concord. This result agreed with Kim et al. [35], who used a graphbased CNN model very similar to CNN-Line in this study for land cover classification. CNN-Polygon also resulted in the highest performance when the sample size was less than 200 for Gwangju. This implies that the graph-based CNNs can yield successful classification results with a small training sample size, unlike recent CNN-based land cover classification studies that used large datasets with hundreds to thousands of training samples per class [18,19,31,36,64]. As the sample size increased, the performance of matrix-based models (i.e., CNN-Matrix and CNN-1D) increased to similar or slightly higher levels than the graph image-based models. The transformation from spectral reflectance values to a graph image could make the input variables less sensitive to small changes of reflectance values in contrast to the matrix-based models that directly use reflectance values. Such a

Model Type, Sample Size, and Performance
This study compared classification models with different input types (i.e., spectral vector-based, graph image-based, matrix-based, and patch-based) focusing on model performance and the effect of training sample size. Among different input types, the graph image-based models (i.e., CNN-Line and CNN-Polygon) showed higher performance than the others models when the original dataset was used for Lake Tapps and Concord. This result agreed with Kim et al. [35], who used a graph-based CNN model very similar to CNN-Line in this study for land cover classification. CNN-Polygon also resulted in the highest performance when the sample size was less than 200 for Gwangju. This implies that the graph-based CNNs can yield successful classification results with a small training sample size, unlike recent CNN-based land cover classification studies that used large datasets with hundreds to thousands of training samples per class [18,19,31,36,64]. As the sample size increased, the performance of matrix-based models (i.e., CNN-Matrix and CNN-1D) increased to similar or slightly higher levels than the graph image-based models. The transformation from spectral reflectance values to a graph image could make the input variables less sensitive to small changes of reflectance values in contrast to the matrix-based models that directly use reflectance values. Such a characteristic seemed to affect the improvement rate of the model performance as the number of training samples increased. When it came to the patch-based input (i.e., CNN-Patch) that consider neighboring pixels, the performance was significantly lower even with more information than the other single-pixel based models, especially with a small training dataset (less than 200 per class) for Gwangju. Not only the high variation of reflectance data in neighboring pixels but also mixed land cover classes with small patches within the spatial resolution of 30 m seemed to make it difficult to build a robust patch-based model without a massive amount of data [55,69,70]. CNN-Polygon performed better than CNN-Line for all study areas when applied with the same input size, hyperparameter, and CNN structure (Table 3 and Figure 7). This implies that the transformed polygon graph images appear to be more intuitive than the line graph images in a CNN framework. Interestingly, as the training sample size increased, the performance difference between CNN-Polygon and CNN-Line decreased, indicating that CNN-Line was slightly more sensitive to the training sample size than CNN-Polygon. On the other hand, the performance of CNN-Matrix and CNN-1D sharply increased as the training sample size increased. While CNN-Matrix generally performed better than CNN-1D for Lake Tapps and Concord, their performance for Gwangju were similar regardless of the training sample size (Figure 7, Table 3). Since there is a structural difference between CNN-Matix and CNN-1D (i.e., multiple rows by season vs. one-dimensional vectors), further investigation is needed to identify how such different structures affect the classification results.

Sensitivity Analysis
We performed a sensitivity analysis for each model to see how band importance might change when the same data were transformed into different input feature types. The normalized sensitivity of CNN-Matrix and CNN-Patch can be directly compared to those of RF, SVM, and CNN-1D, unlike CNN-Line and CNN-Polygon. For this reason, occlusion maps for CNN-Line and CNN-Polygon are shown in Figures 11-13 while the normalized sensitivity for RF, SVM, CNN-Matrix, and CNN-1D are shown in Figures 14-16 separately. The sensitivity analysis of CNN-Patch for Gwangju is described in Figure 16. Figures 11 and 14 show the occlusion maps for the graph and matrix-based CNN models and the sensitivity analysis results for Lake Tapps, respectively. Similarly, Figures 12 and 15 depict the occlusion maps and sensitivity results for Concord, and Figures 13 and 16 are for Gwangju. The sensitivity of input variables varied by model depending on the input feature type, algorithm, and study area. Nonetheless, some common characteristics were found among the models. The barren class showed the highest sensitivity in the summer near-infrared (NIR) band (band 5) for both study areas. In the CNN-Line and CNN-Polygon models, there was no accuracy drop in the occlusion maps for the water class at either study site, while the other models showed some variation in the sensitivity results. This indicates that the graph-based two-dimensional input data format may provide more reliable and stable learning than the matrix-based ones.
Even with the same CNN structure, the CNN-Line and CNN-Polygon models showed different input variable sensitivity. For example, for the cropland class at Lake Tapps, the CNN-Polygon had high sensitivity to summer NIR and winter visible bands, while the CNN-Line had high sensitivity only to the winter visible bands. On the other hand, the CNN-Line showed high sensitivity over spring NIR and summer visible/NIR bands for the grassland class in Concord, while the CNN-Polygon had no significant sensitivity. This implies that the contributing bands for classification can be different depending on the type of input features in two-dimensional CNNs. Forest-related classes showed a high sensitivity to NIR to SWIR bands for all three study sites. This corresponds to previous studies that show NIR bands play a key role in forest class classification [71,72]. The pattern of forest classes in Gwangju shows a clear sensitivity of band 5 (i.e., 0.85-0.88 µm) when compared to Lake Tapps and Concord sites. This might imply there is sensitivity variation over different phenology. High impervious areas had a high variance in most models since the mixed samples of dark impervious surfaces (e.g., asphalt and parking lot) and bright impervious surfaces (e.g., concrete, rooftop, and metal) caused large standard deviations of surface reflectance values. This makes it difficult to classify a high impervious area, resulting in confusion with barren and low impervious areas. As mentioned above, it should be noted that while input features came from the same reference dataset, important and contributing attributes were examined through different methods by the model. Thus, qualitative interpretation of results is more appropriate than quantitative.
Remote Sens. 2020, 12, x FOR PEER REVIEW 18 of 28 difficult to classify a high impervious area, resulting in confusion with barren and low impervious areas. As mentioned above, it should be noted that while input features came from the same reference dataset, important and contributing attributes were examined through different methods by the model. Thus, qualitative interpretation of results is more appropriate than quantitative.      The magenta color indicates that the attribute is more sensitive (i.e., contributing) to the models than others.

Novelty and Limitations
In this study, through a series of classifications for three study sites, we found that: 1) The proposed CNN-Polygon approach works well for land cover classification even when the number of training samples is very small, and 2) the proposed CNN-Matrix performs well when multi-temporal data are used for classification and the training sample size is relatively large. In particular, this study showed that the types (and structures) of input features are a critical consideration of CNN-based classification [73,74]. More visually intuitive input features tend to increase the classification accuracy even when training data are limited.
However, there are several limitations in this study. Many studies [32,75] reported higher classification accuracy when using multi-seasonal data than single images, especially better performance for classifying vegetation classes (e.g., forests and croplands) that have high inter-class The magenta color indicates that the attribute is more sensitive (i.e., contributing) to the models than others.

Novelty and Limitations
In this study, through a series of classifications for three study sites, we found that: (1) The proposed CNN-Polygon approach works well for land cover classification even when the number of training samples is very small, and (2) the proposed CNN-Matrix performs well when multi-temporal data are used for classification and the training sample size is relatively large. In particular, this study showed that the types (and structures) of input features are a critical consideration of CNN-based classification [73,74]. More visually intuitive input features tend to increase the classification accuracy even when training data are limited.
However, there are several limitations in this study. Many studies [32,75] reported higher classification accuracy when using multi-seasonal data than single images, especially better performance for classifying vegetation classes (e.g., forests and croplands) that have high inter-class spectral variability over time. However, seasonality might make it difficult to classify some other classes, such as inland water, due to the temporal difference of the water boundary. The seasonal sensitivity of classes should be carefully considered when constructing input features from multi-temporal data. Especially for the graph-based CNNs proposed in this study, how to connect multi-temporal data in a graph should be further examined. The transferability of the proposed approaches is another limitation. Although the proposed approaches were evaluated over three study sites, they should be tested more extensively over large areas with different sensor data to ensure their generalization. The relatively high computational cost of the graph-based CNNs compared to the matrix-based CNNs is another limitation, which requires further examination [35].

Conclusions
This study proposed two novel CNN frameworks by transforming spectral information into 2-D graph and matrix (i.e., CNN-Polygon and CNN-Matrix) data for use as input features. The proposed CNN approaches were compared to other types of CNNs-CNN-Line, CNN-1D, and CNN-Patch-and vector-based machine learning approaches (i.e., RF and SVM). The proposed CNN-Polygon resulted in better performance than the others, producing overall accuracies of 93%-95 % for both Concord and Lake Tapps and 80%-84% for Gwangju. The CNN-Polygon had a particular performance advantage when the training sample size was small (i.e., less than 200 per class), while CNN-Matrix resulted in similar or higher performance as the training sample size became larger. The graph-based CNN models could be applied for various classification fields where reference data are very limited. While some common contributing variables were found for specific classes (e.g., NIR for forests) for all approaches, the overall patterns of the contributing variables were different by model even when all input features came from the same dataset. The two proposed approaches (i.e., CNN-Polygon and CNN-Matrix) are pixel-based ones through the conversion of spectral vectors into two dimensional features. Given that most CNNs applied for land cover classification in the literature have used spatial contextual information, the proposed CNN frameworks can be further improved through the incorporation of such contextual data.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/12/7/1097/s1, Figure S1. The sensitivity analysis results of CNN-Line and CNN-Polygon using occlusion maps. To assess the contribution of each part of a graph, a moving window occludes the sub-area of the 2-D graph by zero-filling. The colorful legend indicates the normalized accuracy drop per class. The grey-scaled legend shows the occurrence rate of each graph, same as Figure 5. Figure S2. Box plots of Kappa for the seven models: RF, SVM, CNN-Line, CNN-Polygon, CNN-Matrix, CNN-1D, and CNN-Patch. Kappa coefficients are calculated for (a) Lake Tapps (b) Concord, and (c) Gwangju. The Lake Tapps and Concord models were trained using original (O) and oversampled (OV) datasets, while the Gwangju models were trained using datasets of 50, 100, 200, 300, and 400 samples per class. Dotted red lines indicate the average performance of all models over each number of samples.