Comparison of CNN Algorithms on Hyperspectral Image Classification in Agricultural Lands

Several versions of convolutional neural networks (CNNs) were developed to classify hyperspectral images (HSIs) of agricultural lands, including a 1D-CNN with pixelwise spectral data, a 1D-CNN with selected bands, a 1D-CNN with spectral-spatial features and a 2D-CNN with principal components. HSI data of crop agriculture in Salinas Valley and of mixed-vegetation agriculture in Indian Pines were used to compare the performance of these CNN algorithms. The highest overall accuracies on these two cases are 99.8% and 98.1%, respectively, both achieved by applying the 1D-CNN with augmented input vectors, which contain both the spectral and spatial features embedded in the HSI data.

Feature extraction and feature selection approaches have been proposed to curtail the redundancy of information among hyperspectral bands [12]. In feature extraction, a projection matrix maps the original spectral data to a feature space while retaining the dominant spectral information [13]. Typical feature extraction algorithms include principal component analysis (PCA) [14], linear discriminant analysis (LDA) [15], manifold learning [16], nonnegative matrix factorization (NMF) [17] and spatial-spectral feature extraction [18].
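As a concrete illustration of feature extraction by PCA, the following sketch (not from the original work; a minimal NumPy implementation with an assumed toy cube) projects pixel spectra onto the leading principal components:

```python
import numpy as np

def pca_project(pixels, q):
    """Project N-band pixel spectra onto the first q principal components.

    pixels: (num_pixels, N) array of spectra; returns (num_pixels, q) scores.
    """
    mean = pixels.mean(axis=0)
    centered = pixels - mean
    # Eigen-decomposition of the band covariance matrix; columns of `vecs`
    # are the principal directions, reordered by descending eigenvalue.
    cov = centered.T @ centered / (len(pixels) - 1)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    return centered @ vecs[:, order[:q]]

# Toy cube: 4 x 3 pixels with 8 bands, flattened to a (12, 8) matrix.
rng = np.random.default_rng(0)
cube = rng.normal(size=(4, 3, 8))
scores = pca_project(cube.reshape(-1, 8), q=2)
print(scores.shape)  # (12, 2)
```

In practice the scores of the first Q components replace the original N bands, reducing the input dimension per pixel from N to Q.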
Feature selection means selecting a subset of the original bands based on a proper criterion [19]. Typical feature selection algorithms include multitask sparsity pursuit [20], structure-aware selection [21], support vector machines [22], hypergraph models [23], the sparse Hilbert-Schmidt independence criterion [24] and the nonhomogeneous hidden Markov chain model [25]. Different measures have been used to select preferred bands, including mutual information [12], information divergence [13], variance [26] and local spatial information [27]. However, these algorithms are time-consuming because the classifiers must be retrained and retested whenever the set of selected bands changes. Pixels in an HSI are usually spatially correlated with their adjacent pixels [28], which can be exploited to complement the spectral information and achieve higher classification accuracy [29]. In [30], a spectral-spatial semisupervised training-set construction was proposed to mitigate the scarcity of labeled data, in which unlabeled pixels are recruited into a class training subset if they belong to the same spectral cluster and lie in the spatial neighborhood of a labeled pixel.
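The variance criterion mentioned above can be sketched as follows (an illustrative stand-in; the cited methods use more elaborate measures such as mutual information or information divergence, which would replace the `score` line):

```python
import numpy as np

def rank_bands_by_variance(pixels, n_select):
    """Return indices of the n_select bands with the largest variance.

    pixels: (num_pixels, N) spectra. Variance serves here as a simple
    band-selection score; other criteria plug in the same way.
    """
    score = pixels.var(axis=0)
    return np.argsort(score)[::-1][:n_select]

rng = np.random.default_rng(1)
spectra = rng.normal(size=(100, 10))
spectra[:, 3] *= 5.0   # inflate band 3 so it ranks first
selected = rank_bands_by_variance(spectra, n_select=4)
print(selected[0])  # 3
```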
Deep spectral features embedded in the HSIs of agricultural vegetation can be physically related to photosynthetic pigment absorption at wavelengths of 400-700 nm, the large spectral slope in 700-750 nm [31,32], the liquid-water inflection point in 1080-1170 nm [33], absorption by various leaf waxes and oils in 1700-1780 nm and cellulose absorption around 2100 nm, to name a few. The spectral features relevant to soil properties mainly appear in 2100-2300 nm [34,35]. These spectral features can be exploited for applications like precision agriculture [36,37], noxious weed mapping for rangeland management [38,39], forest health monitoring [40,41], vegetation stress analysis [42,43] and carbon sequestration site monitoring [44].
CNNs have the potential of exploiting deep-level features embedded in their input data for classification, making them suitable for terrain classification with HSI data that contain both spatial and spectral features. Although CNNs have been widely used for classification in agricultural lands, there are always some outliers or misclassifications between similar classes that share similar spatial and spectral features. In this work, we present several versions of CNN, each taking a different type of input vector that includes more feature information, to resolve these issues. These CNNs were trained and tested on the HSIs of Salinas Valley and Indian Pines, respectively. The former is a crop agriculture; the latter contains two-thirds crop agriculture and one-third forest or other natural perennial vegetation.

The rest of this article is organized as follows. The 1D-CNNs with input vectors composed of pixelwise spectral data and of spectral-spatial data are presented in Sections 2 and 3, respectively. The 2D-CNN with an input layer of principal components is presented in Section 4, simulation results are presented and analyzed in Section 5, and some conclusions are drawn in Section 6.

1D-CNN with Pixelwise Spectral Data

Figure 1 shows an HSI cube composed of N_x × N_y pixels, each containing spectral data in N bands. A one-dimensional (1D) input vector is prepared for each pixel by extracting the spectral data from that pixel. The input vectors from a selected set of training pixels are used to train the 1D-CNN shown in Figure 2; the input vectors from another set of testing pixels are then used to evaluate the performance of the 1D-CNN. In the schematic of the 1D-CNN shown in Figure 2, the compositions of convp-n(1 × 2) and convp-20(20 × 2) are shown in Figures 3 and 4, respectively, and FC(20N × M) is a fully connected layer shown in Figure 5.
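The processing chain of the 1D-CNN (convolution, batch normalization, ReLU, maxpooling, fully connected layer with softmax) can be sketched as a single forward pass in NumPy. This is a minimal illustration with assumed toy sizes (16 bands, 20 filters, 3 classes), not the trained network of this work; for this single-sample sketch, the BN statistics are taken over each feature map rather than over a mini-batch:

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution: x (n_in, L), w (n_out, n_in, k), b (n_out,)."""
    n_out, n_in, k = w.shape
    L = x.shape[1] - k + 1
    out = np.empty((n_out, L))
    for j in range(L):
        out[:, j] = np.tensordot(w, x[:, j:j + k], axes=([1, 2], [0, 1])) + b
    return out

def batchnorm(x, gamma, beta, eps=1e-5):
    """Normalize each feature map, then scale by gamma and shift by beta."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def maxpool(x, p=2):
    """MP(p): keep the maximum of each group of p samples."""
    L = x.shape[1] // p
    return x[:, :L * p].reshape(x.shape[0], L, p).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
spectrum = rng.normal(size=(1, 16))           # one pixel, 16 bands
w1, b1 = rng.normal(size=(20, 1, 2)), np.zeros(20)
h = maxpool(np.maximum(0.0, batchnorm(conv1d(spectrum, w1, b1), 1.0, 0.0)))
wf = rng.normal(size=(3, h.size))             # fully connected to 3 classes
probs = softmax(wf @ h.ravel())
print(probs.shape)  # (3,), a probability vector summing to 1
```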
Figure 3 shows the schematic of a convp-n(1 × 2) layer, where conv-n(1 × 2) is a convolutional layer composed of n filters of kernel size two, taking one input vector. The outputs of conv-n(1 × 2) are processed with batch normalization (BN), the rectified linear unit (ReLU) activation function and maxpooling, MP(2), in sequence. The BN is used to make the learning process less sensitive to initialization. The input to a BN is a mini-batch of M input vectors, x_m = [x_m1, x_m2, ..., x_mN]^t with 1 ≤ m ≤ M. The mean value and variance of the nth band in the bth mini-batch are computed as [45]

μ_bn = (1/M) Σ_{m=1}^{M} x_mn,    σ_bn² = (1/M) Σ_{m=1}^{M} (x_mn − μ_bn)²

Then, the original input vectors in the bth mini-batch are normalized as

x̂_mn = (x_mn − μ_bn) / √(σ_bn² + ε)

where ε is a regularization constant to avoid divergence when σ_bn² is too small. To further increase the degrees of freedom in the subsequent convolutional layers, the normalized input vectors are scaled and shifted as

y_mn = γ_n x̂_mn + β_n

where the offset β_n and the scaling factor γ_n are updated during the training phase. The ReLU activation function is defined as y = max{0, x}, with input x and output y. The maxpooling function, MP( ), reduces the computational load by picking the maximum from its input data, which preserves the main characteristics of the feature maps at the cost of coarser resolution. Figure 4 shows the composition of a convp-20(20 × 2) layer, where conv-n(20 × 2) is a convolutional layer composed of n filters of kernel size two, taking 20 input feature maps.

Band Selection Approach

Figure 6 shows the flowchart of a band selection (BS) approach based on CNN (BSCNN), which selects the best combination of spectral bands for classification. A CNN was first trained using all N spectral bands of the training pixels, and its configuration remained unchanged during the subsequent band selection process. The BS process was executed L times. Each time, N′ bands (N′ < N) were randomly selected, and the data in the other (N − N′) bands were reset to zero in the input vector.
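The random-subset trial just described can be sketched as follows, with a dummy `evaluate` function standing in for the pre-trained CNN (the function name and the toy scoring rule are assumptions for illustration only):

```python
import numpy as np

def band_subset_search(X, y, evaluate, n_keep, n_trials, rng):
    """Randomly try n_trials subsets of n_keep bands, zeroing the rest.

    evaluate(X_masked, y) -> overall accuracy of the fixed, pre-trained
    classifier; the best-scoring subset is returned for retraining.
    """
    best_bands, best_oa = None, -1.0
    for _ in range(n_trials):
        bands = rng.choice(X.shape[1], size=n_keep, replace=False)
        X_masked = np.zeros_like(X)
        X_masked[:, bands] = X[:, bands]       # keep selected bands only
        oa = evaluate(X_masked, y)
        if oa > best_oa:
            best_bands, best_oa = np.sort(bands), oa
    return best_bands, best_oa

# Dummy stand-in for the trained CNN: this "accuracy" rewards keeping band 0,
# so the search should return a subset containing it.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 8))
y = np.zeros(50)
evaluate = lambda Xm, y: float(np.abs(Xm[:, 0]).mean())
bands, oa = band_subset_search(X, y, evaluate, n_keep=3, n_trials=200, rng=rng)
print(bands.shape)  # (3,)
```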
Among all these L combinations, the N′ bands delivering the highest overall accuracy are adopted for retraining the CNN for classification.

Figure 8 shows the preparation of an augmented input vector by concatenating the spectral bands of the target pixel with the PCA data surrounding that pixel, exploiting the spatial correlation between neighboring pixels. The PCA is first applied to all the spectral bands of each pixel to extract the first Q principal components. Then, the first Q principal components of all the R × R pixels surrounding the target pixel are collected, vectorized and concatenated to the original N bands of the target pixel to form an augmented input vector of dimension N + R × R × Q, which is input to the 1D-CNN.

Figure 9 shows the preparation of input layers, composed of principal components from each pixel, to be input to the 2D-CNN shown in Figure 10. The PCA is first applied to all the N spectral bands of each pixel to extract the first Q principal components [46]. The Q principal components from each of the R × R pixels surrounding the target pixel form an input layer associated with that pixel. The PCA extracts the main features in the spectral dimension while exploiting the spatial features embedded in the hyperspectral data.

Figure 10 shows the schematic of the 2D-CNN used in this work, where the compositions of convp-n(1 × 2 × 2) and convp-20(20 × 2 × 2) are shown in Figures 11 and 12, respectively, and FC(20(R × R) × M) is a fully connected layer shown in Figure 13. Cascading four convp-20 layers makes the resulting 2D-CNN highly nonlinear and enables it to recognize more abstract spatial-spectral features embedded in the hyperspectral data. Figure 11 shows the schematic of a convp-n(1 × 2 × 2) layer, where conv-n(1 × 2 × 2) is a convolutional layer composed of n filters of kernel size 2 × 2, taking one input layer.
The outputs of conv-n(1 × 2 × 2) are processed with BN, the ReLU activation function and MP(2 × 2) in sequence. Figure 12 shows the composition of a convp-20(20 × 2 × 2) layer, where conv-n(20 × 2 × 2) is a convolutional layer composed of n filters of kernel size 2 × 2, taking 20 input feature maps. Figure 13 shows the composition of a fully connected layer, FC(20(R × R) × M), which connects the feature maps from the last convolutional layer, convp-20(20 × 2 × 2), to the input of the softmax function for final classification.

Salinas Valley HSI

Figure 14 shows the image and the ground truth, respectively, of Salinas Valley, acquired with the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in October 1998 [47]. The HSI is composed of 512 × 217 pixels, with a spatial resolution of 3.7 m. The spectral data in wavelengths of 400-2500 nm were recorded in 224 bands, among which bands 108-112, 154-167 and 224 were removed out of concern for dense water vapor and atmospheric effects, leaving 204 more reliable bands. Table 1 lists the ground truth of 54,129 pixels in 16 classes [47].
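The band-removal step can be checked with a few lines of NumPy (the 1-indexed band lists follow the text; the cube here is a small stand-in for the 512 × 217 × 224 scene):

```python
import numpy as np

# Bands flagged for dense water vapor / atmospheric effects (1-indexed,
# following the AVIRIS band numbering): 108-112, 154-167 and 224.
bad = list(range(108, 113)) + list(range(154, 168)) + [224]
keep = [b for b in range(1, 225) if b not in bad]
print(len(keep))  # 204 reliable bands remain out of 224

cube = np.zeros((4, 3, 224))              # small stand-in for the full scene
cleaned = cube[:, :, np.array(keep) - 1]  # 1-indexed bands -> 0-indexed axes
print(cleaned.shape)  # (4, 3, 204)
```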

The main objective of the AVIRIS project was to identify, measure and monitor the composition of the Earth's surface and atmosphere, based on the signatures of molecular absorption and particle scattering. Research with AVIRIS data has focused on understanding processes related to the global environment and climate change [48].
In the training phase of this work, 50% of the pixels were randomly selected and labeled with ground-truth data to determine the weights and biases associated with each neuron. Mini-batches of size M = 16 were used over 200 training epochs. The other 50% of the pixels were then used in the testing phase to evaluate the classification performance.

Figure 15 shows the mean value and standard deviation of all the pixels in each of the 16 classes, over all 224 bands. The bands of dense water vapor and atmospheric effects are marked by a grey shade.

Figure 16a shows the classification image with the 1D-CNN applied to the 204 selected bands. The overall accuracy (OA), defined as the ratio of the number of correctly classified pixels to the total number of testing pixels, is 91.8%. Table 2a lists the producer accuracy (PA) of each class, defined as the ratio of the number of pixels correctly classified to a specific class to the total number of pixels classified to that class. The PAs of classes #8 and #15 are 91.9% and 54.8%, respectively, consistent with the observation in Figure 16a that classes #8 and #15 are apparently misclassified. Also notice that some spectral curves in Figure 15 nearly overlap in certain bands, which may cause classification errors.

Figure 17 shows the effect of the band number on the overall accuracy, which indicates that choosing N′ = 70 bands renders the highest overall accuracy. Figure 16b shows the classification image obtained by applying the BSCNN.

Figure 8 shows the preparation of an augmented input vector by concatenating the N spectral bands of the target pixel with the Q principal components from each of the R × R pixels surrounding it. By choosing N = 204, Q = 1 and R = 21, the augmented input vector has a dimension of 204 + 21 × 21 × 1 = 645. With the additional spatial information, the accuracy is expected to improve [49]. Figure 16c shows the classification image with augmented input vectors of 645 bands.
Table 2c lists the PAs of all 16 classes, and the overall accuracy is 99.8%. By comparing Table 2c with Table 2a, the PAs of classes #8 and #15 are seen to increase significantly, to 99.8% and 99.7%, respectively. Figure 16d shows the classification image obtained with the 2D-CNN, using one principal component from each pixel to form an input layer. Table 2d lists the PAs of all 16 classes, and the overall accuracy is 99%.
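The OA and PA defined above can be computed from a confusion matrix; the following sketch uses an assumed toy 3-class example, with PA following the definition in the text (correct pixels of a class over all pixels assigned to that class):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t, p] counts pixels of true class t classified as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def overall_accuracy(cm):
    return cm.trace() / cm.sum()

def producer_accuracy(cm, c):
    # Correctly classified pixels of class c over all pixels that the
    # classifier assigned to class c (the definition used in the text).
    return cm[c, c] / cm[:, c].sum()

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, 3)
print(overall_accuracy(cm))       # 4/6
print(producer_accuracy(cm, 1))   # 2/3
```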
In summary, the OA of the 1D-CNN with 204 selected bands is the lowest at 91.8%; that of the BSCNN with 70 selected bands is 93.2%; that of the 1D-CNN with augmented input vectors of 645 bands is the highest at 99.8%; and that of the 2D-CNN with one principal component from each pixel is 99%.

Indian Pines HSI

Figure 18 shows a testing site in the Indian Pines, recorded on 12 June 1992 with AVIRIS sensors over the Purdue University Agronomy farm, northwest of West Lafayette. The image is composed of 145 × 145 pixels, each containing 220 spectral reflectance bands in wavelengths of 400-2500 nm. The number of bands is reduced to 200 after removing bands 104-108, 150-163 and 220, which suffer from significant water absorption. Two-thirds of the test site was covered with agricultural land and one-third with forest or other natural perennial vegetation. There were also two dual-lane highways, a rail line, some low-density housing, other man-made structures and local roads. At the time of recording, some crops were growing; corn and soybeans were in their early stages of growth, with less than 5% coverage. Table 3 lists the available ground truth in 16 classes, which are not mutually exclusive.

Table 3. Summary of ground truth in Figure 18 [50].

#    Class                           Sample Number
1    alfalfa                         46
2    corn-notill                     1428
3    corn-mintill                    830
4    corn                            237
5    grass-pasture                   483
6    grass-trees                     730
7    grass-pasture-mowed             28
8    hay-windrowed                   478
9    oats                            20
10   soybean-notill                  972
11   soybean-mintill                 2455
12   soybean-clean                   593
13   wheat                           205
14   woods                           1265
15   buildings-grass-trees-drives    386
16   stone-steel-towers              93

In the training phase of this work, 50% of the pixels were randomly selected and labeled with ground-truth data to determine the weights and biases associated with each neuron. Mini-batches of size M = 16 were used over 200 training epochs. The other 50% of the pixels were then used in the testing phase to evaluate the classification performance. Figure 19 shows the mean value and standard deviation of all the pixels in each of the 16 classes, over the 200 selected bands.

Figure 20b shows the classification image with augmented input vectors of 641 bands, where we choose N = 200, Q = 1 and R = 21. Table 4b lists the PAs of all 16 classes, with an overall accuracy of 95.4%. Compared with Table 4a, the PAs of classes #3, #7 and #15 improve to 94%, 94.7% and 99.5%, respectively. Figure 20c shows the classification image obtained by applying the 2D-CNN, with the input layer composed of one principal component from each pixel. Table 4c lists the PAs of all 16 classes, with an overall accuracy of 91.5%.
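The augmented input vector of N + R × R × Q = 200 + 21 × 21 × 1 = 641 entries can be sketched as follows (a minimal NumPy illustration; zero-padding at the image border is an assumption of this sketch, as the original work does not specify border handling):

```python
import numpy as np

def augment_pixel(cube, pc, i, j, R):
    """Concatenate a pixel's N bands with the first-Q principal-component
    scores of its R x R neighborhood.

    cube: (H, W, N) hyperspectral cube; pc: (H, W, Q) PCA scores per pixel.
    Neighbors falling outside the image are zero-padded.
    """
    H, W, Q = pc.shape
    r = R // 2
    patch = np.zeros((R, R, Q))
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            ii, jj = i + di, j + dj
            if 0 <= ii < H and 0 <= jj < W:
                patch[di + r, dj + r] = pc[ii, jj]
    return np.concatenate([cube[i, j], patch.ravel()])

rng = np.random.default_rng(4)
cube = rng.normal(size=(30, 30, 200))   # Indian-Pines-like band count
pc = rng.normal(size=(30, 30, 1))       # Q = 1 principal component per pixel
vec = augment_pixel(cube, pc, i=15, j=15, R=21)
print(vec.shape)  # (641,)
```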
In summary, the OA of the 1D-CNN with 200 selected bands is the lowest at 83.4%; that of the 1D-CNN with augmented input vectors of 641 bands is 95.4%; and that of the 2D-CNN with one principal component from each pixel is 91.5%. Figure 21 shows the overall accuracy of the 2D-CNN with different numbers of principal components. The highest OA is slightly below 98%, with 4, 30 or 60 principal components. Figure 20d shows the corresponding classification image.

The computational time for training and testing these CNNs, as well as the resulting accuracy, is affected by the sizes of the input vector, input layer, convolution kernel, batch and epoch. Table 5 lists the CPU time for each CNN developed in this work, on a desktop PC with an Intel® Core™ i7-8700 processor at 3.2 GHz.

Both sets of HSI data used in this work were recorded in about 200 bands and classified into 16 labels. The pixel numbers are 54,129 and 21,025, respectively. The complexity of the CNNs adopted in this work seems suitable for these HSI datasets; it is conjectured that more complicated CNN configurations should be considered if more bands or more labels are involved. The results on these two cases show that the overall accuracy of the 1D-CNN with augmented input vectors is higher than those of the plain 1D-CNN, the BSCNN and the 2D-CNN. The 2D-CNN turns out to be more accurate than the conventional 1D-CNN, indicating that the spatial features embedded in the spectral data can be useful. The small percentage of misclassifications between similar classes can be resolved by applying the 1D-CNN with augmented input vectors, which contain both the spatial and spectral features embedded in the HSI data.

Conclusions
Both the spectral and spatial features of HSIs are exploited to increase the overall accuracy of image classification with several versions of 1D-CNN and 2D-CNN. The PCA was applied to extract significant spectral information while reducing the data dimension. These CNNs were applied to the HSI data of Salinas Valley and Indian Pines, respectively, to compare their classification accuracies. The selection of the number of bands and the number of principal components was investigated by simulations. The highest OA on the Salinas Valley HSI is 99.8%, achieved by applying the 1D-CNN to augmented input vectors of 645 bands, with one principal component from the 21 × 21 pixels surrounding the target pixel. The highest OA on the Indian Pines HSI is 98.1%, achieved by applying the 1D-CNN to augmented input vectors of 1964 bands, with four principal components from the 21 × 21 pixels surrounding the target pixel. Possible misclassification between similar labels can be resolved by augmenting the input vectors to include more of the spatial and spectral features embedded in the HSI data.