Divide-and-Conquer Dual-Architecture Convolutional Neural Network for Classification of Hyperspectral Images

Abstract: The convolutional neural network (CNN) is well known for its powerful capability in image classification. In hyperspectral images (HSIs), a fixed-size spatial window is generally used as the input of a CNN for pixel-wise classification. However, a single fixed-size spatial architecture limits the performance of CNNs because it neglects the varied land-cover distributions in HSIs. Moreover, insufficient samples in HSIs may cause overfitting. To address these problems, a novel divide-and-conquer dual-architecture CNN (DDCNN) method is proposed for HSI classification. In DDCNN, a novel regional division strategy based on local and non-local decisions is devised to distinguish homogeneous and heterogeneous regions. Then, for homogeneous regions, a multi-scale CNN architecture with larger spatial window inputs is constructed to learn joint spectral-spatial features. For heterogeneous regions, a fine-grained CNN architecture with smaller spatial window inputs is constructed to learn hierarchical spectral features. Moreover, to alleviate the problem of insufficient training samples, unlabeled samples with high confidence are pre-labeled under an adaptive spatial constraint. Experimental results on HSIs demonstrate that the proposed method provides encouraging classification performance, especially in region uniformity and edge preservation, with limited training samples.


Introduction
With the rapid development of hyperspectral sensors, hyperspectral remote sensing images have become more available. Hyperspectral images (HSIs) often contain hundreds of narrow and contiguous spectral bands of the same scene, with wavelengths spanning the visible to infrared spectrum [1]. The detailed spectral information provided by hyperspectral sensors improves the capacity to differentiate the land-cover classes of interest. This makes HSI classification one of the most promising techniques in many practical applications, including agriculture [2], military [3], astronomy [4], mineralogy [5], surveillance [6], and environmental sciences [7,8].
HSI classification involves two key aspects: feature extraction and classification. Feature extraction is crucial in addressing the "Hughes phenomenon" [9] caused by the high-dimensional spectral bands of HSIs. In the early stage of HSI feature extraction, various spectral-based methods were proposed, such as principal component analysis (PCA) [10,11], independent component analysis (ICA) [12,13], manifold learning [14], sparse graph learning [15], and local Fisher's discriminant analysis (LFDA) [16]. These methods transform the original high-dimensional data into an appropriate low-dimensional space. However, it is difficult to precisely distinguish different land-cover classes by spectral information alone. To address this issue, some researchers use spatial information to extract features, such as Gabor filters [17], wavelets [18,19], extended morphological profiles [20], morphological attribute profiles [21], and extended multi-attribute profiles (EMAPs) [22]. In addition, multitask learning offers powerful feature extraction because it incorporates shared information across multiple tasks. In one study [23], a kernel low-rank multitask method was proposed to capture multiple features from the 2-D variational mode decomposition domain for multi-/hyperspectral image classification.
The above-mentioned methods perform feature extraction and classification separately. Moreover, they adopt manually extracted features, which involve massive effort in feature engineering. In 2006, Geoffrey Hinton proposed deep learning [32], and deep learning has since achieved great success in computer vision [33][34][35][36][37]. Compared with traditional methods, deep learning-based methods extract hierarchical features and train the classifier simultaneously. Moreover, these deep learning-based methods adopt two or more hidden layers to automatically extract more abstract and invariant features of the data.
A series of deep learning-based models have been introduced into the classification of HSIs. In one study [38], the stacked autoencoder (SAE) was proposed to extract deep features from a hierarchical architecture. Subsequently, sparse SAE [39], denoising SAE [40], and Laplacian SAE [41] were successively proposed. In another study [42], Chen et al. presented a deep belief network (DBN) that learns restricted Boltzmann machine networks layer by layer. However, these methods cannot make full use of spatial information, since flattening the training samples destroys the spatial structure in HSIs. In addition, the fully connected (FC) layers in these networks produce so many parameters that a large number of training samples is required.
Compared with SAE and DBN, the convolutional neural network (CNN) [33] exploits local connections to effectively extract spatial feature representations and shared weights to significantly decrease the number of parameters. Inspired by these properties, a series of CNN methods [43][44][45][46][47][48][49][50][51][52][53][54] have emerged for HSI classification. Hu et al. proposed a 1-dimensional (1D) CNN-based method to learn hierarchical spectral features of HSIs [50]. Makantasis et al. combined randomized PCA and CNN to encode spatial information of HSIs [51]. However, these two methods exploit only spectral information or only spatial information, respectively. Later, some joint spectral-spatial CNN-based methods were proposed [48,51,52]. A dual-channel CNN (DCNN) was constructed to extract spectral and spatial features by 1D-CNN and 2D-CNN separately, after which the extracted spectral and spatial features were concatenated [51]. Chen et al. presented another type of joint spatial-spectral feature extraction, where a 3-dimensional (3D) CNN (3DCNN) model was adopted to extract spectral and spatial information simultaneously [52]. However, the performance of these CNN methods depends greatly on the quantity of training samples, and the collection of training samples is generally difficult in HSIs. Recently, Li et al. proposed a pixel-pair CNN (PPF-CNN) method that reorganizes and relabels existing training samples [53]. In addition, in several studies [55][56][57], tensor-based models significantly reduced the number of weight parameters required to train the model via tensor decomposition. When the amount of training data is limited, tensor-based classification models can perform well. Makantasis et al. proposed tensor-based linear and nonlinear models for HSI classification [55]. The data from all the sensors was fused into a tensor, and damage-sensitive features were extracted for classification in tensor-based models [56]. Recently, some other deep learning models have been introduced for HSIs [58,59]. A new fully convolutional neural network was proposed to extract the deep features of HSIs, and an optimized extreme learning machine was then used for classification [58].
All the mentioned CNN-based methods [43][44][45][46][47][48][49][50][51][52][53][54] adopt a single fixed network structure for HSI classification. A single network structure ignores the complex land-cover distributions of HSIs. In heterogeneous regions, a large spatial window input covers samples from different classes. These neighbor samples from different classes may lead to misclassification of samples located around the boundaries. In this case, mainly spectral information is required for heterogeneous regions. In contrast, in homogeneous regions, neighbor samples have similar spectral signatures, and a small spatial window input may lack enough contextual information for classification. In this case, spatial and spectral information are both required to analyze homogeneous regions. Therefore, a single fixed network structure may limit the performance of CNNs for HSI pixel-wise classification.
To address this problem, a novel divide-and-conquer dual-architecture CNN (DDCNN) method is designed for HSI classification. In DDCNN, a new regional division strategy based on local and non-local decisions is devised to divide HSIs into homogeneous and heterogeneous regions. The non-local decision searches for superpixel-pair similarity in the whole image, while the local decision is made from the spatially adjacent samples in the superpixels. For the homogeneous regions, larger spatial windows are selected to extract adequate contextual information, and a multi-scale CNN architecture with larger spatial windows is constructed to learn joint spectral-spatial features. For the heterogeneous regions, smaller spatial windows are selected to ensure that the samples in a window belong to the same class, and a fine-grained CNN architecture with smaller spatial windows is constructed to learn hierarchical spectral features. Then, to alleviate the problem of insufficient training samples, unlabeled samples are selected by measuring the spectral similarity under an adaptive spatial constraint. The samples with high confidence on the spectral similarity are pre-labeled to expand the training set.
The main contributions of this paper can be summarized as follows. (1) A novel dual-architecture CNN is designed instead of the traditional single architecture, considering the various land-cover distributions of HSIs. In DDCNN, a multi-scale CNN architecture is constructed to improve the uniformity of homogeneous regions, and a fine-grained CNN architecture is constructed to avoid edge over-smoothing. (2) A regional division method based on local and non-local decisions is designed to divide the homogeneous and heterogeneous regions effectively, where superpixel-to-superpixel similarity is utilized in the non-local search. (3) DDCNN devises a new sample augmentation method based on spectral similarity under adaptive spatial constraints, which alleviates the over-fitting problem of CNNs caused by the imbalance between insufficient training samples and numerous parameters.
The rest of this paper is organized as follows. Section 2 reviews the CNN briefly. Section 3 describes the procedure of the proposed DDCNN method in detail. Then, the experimental validation and corresponding analysis on several hyperspectral datasets are discussed in Section 4. Finally, some concluding remarks and suggestions for further work are provided in Section 5.

The Review of Convolutional Neural Networks
The CNN, one of the deep learning models, achieves outstanding performance in computer vision tasks such as classification, detection, and recognition. The architecture of the CNN is inspired by neuroscience [60]. In the biological visual system, the cells in the cortex are sensitive to small regions, known as receptive fields. The strong capability of cells within receptive fields is used to exploit the local spatial correlation in images.
In contrast to other deep learning models, the CNN possesses three core ideas: local connections, shared weights, and pooling. Local connections, which correspond to the receptive fields, extract local spatial features effectively. Shared weights, that is, connection weights replicated across the entire layer, significantly reduce the number of parameters of deep networks. Pooling, also known as downsampling, extracts features that are more robust to translation and deformation.
A traditional CNN is constructed by stacking several convolutional layers, pooling layers, and fully connected layers to form a deep architecture, where the output of each layer is provided as the input of the next layer. In the convolutional layer, the value of a neuron $v_{ij}^{xy}$ at position $(x, y)$ of the $j$th feature map in the $i$th layer is denoted as follows:

$$v_{ij}^{xy} = f\left(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\right)$$

where $f(\cdot)$ is the activation function, $m$ indexes the feature maps in the $(i-1)$th layer connected to the current feature map, $w_{ijm}^{pq}$ is the weight at position $(p, q)$ connected to the $m$th feature map, $P_i$ and $Q_i$ are the height and width of the spatial window, and $b_{ij}$ is the bias of the $j$th feature map in the $i$th layer.
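As a minimal sketch of the convolutional-layer computation described above (not the authors' implementation), the following NumPy function evaluates every neuron as a bias plus a weighted sum over a P × Q window of each input feature map, followed by an activation. The function name `conv_layer`, the array layout, the absence of padding, and the default `tanh` activation are assumptions made for illustration.

```python
import numpy as np


def conv_layer(prev_maps, weights, bias, activation=np.tanh):
    """Evaluate one convolutional layer without padding.

    prev_maps: (M, H, W) feature maps of layer i-1
    weights:   (J, M, P, Q) one P x Q kernel per (output map j, input map m)
    bias:      (J,) one bias per output feature map
    """
    M, H, W = prev_maps.shape
    J, M_w, P, Q = weights.shape
    assert M == M_w, "weights must cover every input feature map"
    out = np.zeros((J, H - P + 1, W - Q + 1))
    for j in range(J):
        for x in range(H - P + 1):
            for y in range(W - Q + 1):
                # Window of all M input maps anchored at (x, y).
                patch = prev_maps[:, x:x + P, y:y + Q]
                # Bias plus the sum over m, p, q of weight * input value.
                out[j, x, y] = bias[j] + np.sum(weights[j] * patch)
    return activation(out)
```

Because the same kernel is reused at every position (x, y), the number of weights depends only on J, M, P, and Q, not on the image size, which is the parameter saving attributed to shared weights.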

Divide-and-Conquer Dual-Architecture CNN (DDCNN)
The flowchart of the proposed DDCNN method is shown in Figure 1. DDCNN consists of three stages: regional division with local and non-local decisions, dual-architecture CNN-based classification, and data augmentation based on spectral similarity under an adaptive spatial constraint. An HSI dataset contains $M$ training samples, each labeled with a class $k$, where $K$ is the number of classes and $1 \le k \le K$. At the regional division stage, the HSIs are divided into homogeneous and heterogeneous regions by using local and non-local decisions. Then, for the homogeneous regions, a multi-scale CNN architecture with larger spatial window inputs is constructed to learn joint spectral-spatial features. For the heterogeneous regions, a fine-grained CNN architecture with smaller inputs is constructed to learn hierarchical spectral features. Moreover, unlabeled samples with high confidence are selected to expand the training set by measuring the spectral similarity under the adaptive spatial constraint.


Superpixel Segmentation Based on Entropy Rate
In superpixel segmentation, the images are divided into many superpixels. Each of them consists of spatially adjacent pixels with similar texture, color, brightness, or other characteristics [61]. Compared with pixel-based methods, superpixel-based methods utilize the spatial structure of the images and show good regional uniformity.
In this paper, the entropy rate method [62] is adopted to generate a 2-D superpixel map of the HSIs. The entropy rate method is a graph-based clustering algorithm. It favors compact and homogeneous non-overlapping clusters, and has a fast computation speed approximated as $O(|V| \log |V|)$, where $V$ is the number of superpixels. More details of the entropy rate algorithm can be found in [62]. As shown in Figure 2, the first principal component of the HSIs extracted by PCA is utilized as the base image for the superpixel segmentation. Then the base image is divided into $V$ superpixels with adaptive sizes and shapes, denoted as $\{\pi_1, \pi_2, \ldots, \pi_V\}$, where $\pi_v$ ($1 \le v \le V$) represents the $v$th superpixel. The segmentation result will be utilized in the regional division and data augmentation methods.
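The base-image step can be sketched as follows, assuming the HSI is held as an (H, W, B) NumPy array. The helper `first_principal_component` is a hypothetical name, and the eigendecomposition route is only one of several equivalent ways to compute PCA. The entropy rate segmentation itself is not shown, since it requires the algorithm of [62].

```python
import numpy as np


def first_principal_component(cube):
    """Project an (H, W, B) hyperspectral cube onto its leading principal axis.

    Returns an (H, W) single-band image that could serve as the base image
    for superpixel segmentation.
    """
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(float)   # one row per pixel spectrum
    flat -= flat.mean(axis=0)                  # center every band
    cov = flat.T @ flat / (flat.shape[0] - 1)  # B x B band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    axis = eigvecs[:, -1]                      # direction of largest variance
    return (flat @ axis).reshape(h, w)
```

The sign of a principal axis is arbitrary, so the returned image may be inverted relative to another PCA implementation; this does not affect a segmentation that depends only on differences between neighboring pixels.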

Regional Division with Local and Non-local Decisions
Most CNN-based HSI classification methods [43][44][45][46][47][48][50][51][52] are designed to exploit the spatial correlation in the neighborhood around the central pixel. That is, the neighboring hyperspectral pixels in a spatial window are jointly represented by the CNN model for feature extraction. These CNN models commonly adopt a fixed-size spatial window as the input for feature extraction (e.g., 5 × 5, 27 × 27, etc.). This type of input limits the performance of CNNs for HSI classification. A large spatial window input may include between-class samples in the heterogeneous regions, and a small input may lead to extracting insufficient contextual information in the homogeneous regions. Figure 3 illustrates an example of these two situations. In Figure 3, i and j are two samples in the HSIs, located in the heterogeneous and homogeneous regions, respectively. Both of them belong to the "GREEN" class. For the sample i, a larger spatial window (i.e., black box) contains some samples belonging to the "BLUE", "PURPLE", and "YELLOW" classes instead of the "GREEN" class. In this case, the sample i may be easily misclassified as the "BLUE", "PURPLE", or "YELLOW" class. If a smaller spatial window (i.e., red box) is selected, all the samples in the window belong to the "GREEN" class. For the sample j, all the samples in both the larger and smaller spatial windows (i.e., black and red boxes) belong to the "GREEN" class. In this case, a larger spatial window contains more adequate contextual information for feature extraction.
To deal with these two situations, a novel regional division method based on local and non-local decisions is designed to divide the HSIs into homogeneous and heterogeneous regions, and a different CNN architecture is designed for each type of region. The divide-and-conquer strategy with homogeneous and heterogeneous regions is inspired by visual attention-based models. Doulamis et al. proposed a fuzzy representation of video content [63]. The divide-and-conquer concept was first proposed in the multiresolution recursive shortest spanning tree algorithm for video summarization and content-based retrieval [63]. Then, a neural network-based scheme was used to select adaptive regions of interest (ROI) [64]. Then, an ROI-based motion-compensated discrete cosine transform coder was proposed to extract foreground objects from the background in videophones. Derived from the pioneering work on ROI [64], a neurobiological model of visual attention was proposed for video compression [65]. Later, visual attention-based models were introduced into hyperspectral image processing [66,67].
(1) Regional Division with Local Decision: In the local decision, entropy rate-based superpixel segmentation is used to generate homogeneous superpixels. Similar to a mask in edge detection, a square frame (e.g., 3 × 3, 5 × 5) is chosen as the filter. If all the samples in the filter are within the same superpixel, the central sample is judged to be in the homogeneous regions. If these samples are spread over multiple superpixels, the central sample is located in the heterogeneous regions of the superpixel segmentation map. In practice, since the superpixel segmentation over-segments the HSIs, the status of such a central sample is uncertain in the ground truth: it may belong to either the homogeneous or the heterogeneous regions.
Figure 4 illustrates the local regional division based on superpixel segmentation, taking the Indian Pines HSI as an example. Figure 4a shows the ground truth of the Indian Pines HSI. Figure 4b shows the result of entropy rate-based superpixel segmentation on the Indian Pines HSI. The samples i, j, and k represent central samples located in different regions. Figure 4c-e corresponds to the filters of the samples i, j, and k. In Figure 4d, since all neighbor samples in the filter belong to the same superpixel, the central sample i is judged to be in the homogeneous regions. In Figure 4c,e, the neighbor samples of the central samples j and k in the filters come from different superpixels. In the superpixel-based local decision, both samples j and k are judged to be in the heterogeneous regions. However, the sample k is located at a boundary of the superpixel segmentation map in Figure 4b rather than at a boundary of the ground truth in Figure 4a. This is the "false boundary" phenomenon caused by the superpixel segmentation map, in which samples belonging to the same class may be divided into several superpixels.
Remote Sens. 2018, 10, x FOR PEER REVIEW
Let $x_i$ be a central sample and $N_i$ be the filter of $x_i$. If all the neighbor samples belong to the same superpixel $\pi_v$, the central sample $x_i$ is judged to be in the homogeneous regions, and vice versa. The regional division based on local decision is formulated as follows:

$$x_i \in \begin{cases} X_{Ho}, & \text{if } \pi_v(x) = \pi_v(x_i) \text{ for all } x \in N_i \\ X_{He}, & \text{otherwise} \end{cases} \quad (3)$$

where $\pi_v(x_i)$ denotes the superpixel that the sample $x_i$ belongs to, $X_{Ho}$ represents the sample set in the homogeneous regions, and $X_{He}$ represents the sample set in the heterogeneous regions of the superpixel segmentation map.
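The local decision can be sketched in a few lines of plain Python, assuming the superpixel map is given as a 2-D grid of integer superpixel labels. The function name `local_decision` and the treatment of image borders (neighbors that fall outside the image are ignored) are assumptions, since the text does not specify them.

```python
def local_decision(superpixel_map, filter_size=3):
    """Mark each sample as homogeneous (True) or heterogeneous (False).

    superpixel_map: 2-D list of superpixel labels, one per pixel.
    A sample is homogeneous when every pixel inside the square filter
    centered on it carries the same superpixel label as the center.
    """
    rows, cols = len(superpixel_map), len(superpixel_map[0])
    r = filter_size // 2
    homogeneous = [[True] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            center = superpixel_map[i][j]
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    # Assumption: positions outside the image are skipped.
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if inside and superpixel_map[ni][nj] != center:
                        homogeneous[i][j] = False
    return homogeneous
```

For a map whose left half is superpixel 0 and right half is superpixel 1, only the two columns that touch the shared boundary are flagged as heterogeneous with a 3 × 3 filter.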
(2) Regional Division with Non-Local Decision: To alleviate the misdivision caused by the false boundary, a novel regional division based on non-local decisions is devised. In HSIs, local information is used on the assumption that the samples in a local region belong to the same class. However, non-local information is also vital for HSI classification [68,69], since samples belonging to the same class may be located in different regions.
In the non-local decision, pixel similarity is extended to superpixel similarity, which considers the structural information of the current samples. For the samples judged to be in the heterogeneous regions by the local decision, the similarities between the neighbor samples and the current sample are calculated, where the current sample is represented by the samples of the same class found in the global search. Then, the similarities are compared with an adaptively calculated threshold. If the similarities of all the neighbors are larger than the threshold, the current sample is judged to be in the homogeneous regions, and vice versa.
Let $x_i$ represent a sample judged to be in the heterogeneous regions by the local decision, denoted as $x_i \in X_{He}$. The filter of $x_i$ contains the neighbor sample sets $N_i^l$. The similarities between the neighbor samples and the current sample are calculated by the superpixel-to-superpixel similarity $SS(\pi_v(x_i), \pi_v(N_i^l))$, where $\pi_v(N_i^l)$ represents the superpixel $\pi_v$ containing the sample set $N_i^l$. If all the similarities are larger than the threshold $T_k$ of the $k$th category, the sample $x_i$ is judged to be in the homogeneous regions, and vice versa. $T_k$ is set as the minimum superpixel-based similarity of the samples in the $k$th category. If $x_i$ is an unlabeled sample, $k$ is set as the label of the training samples with the highest similarity. In the regional division with non-local decisions, $x_j$ is a sample in the $k$th category, $\pi_v(x_j)$ represents the superpixel corresponding to the sample $x_j$, and $\psi_k$ is the set of training samples in the $k$th category.
To measure the similarity of two superpixels, the average pooling strategy is applied to exploit the most significant information of the superpixels. The similarity is calculated between two different superpixels $\pi_v(x_p)$ and $\pi_v(x_q)$, which correspond to the samples $x_p$ and $x_q$, respectively, and the similarity measure is calculated by the heat kernel. Combining the local and non-local decisions (3) and (4), each sample is assigned to the homogeneous or heterogeneous regions according to (6), where $X_{He}$ is the set of samples in the heterogeneous regions.
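The following plain-Python sketch illustrates one way to realize the superpixel similarity and the combined decision. It assumes that average pooling means taking the mean spectrum of each superpixel and that the heat kernel has the common Gaussian form exp(-d² / (2σ²)). The bandwidth `sigma`, the function names, and the rule that a sample is homogeneous if either decision says so are assumptions, not details taken from the paper.

```python
import math


def mean_spectrum(pixels):
    """Average-pool a superpixel: mean of its pixel spectra, band by band."""
    n = len(pixels)
    return [sum(band) / n for band in zip(*pixels)]


def superpixel_similarity(sp_a, sp_b, sigma=1.0):
    """Heat-kernel similarity between the mean spectra of two superpixels.

    Assumed form: exp(-||mu_a - mu_b||^2 / (2 * sigma^2)).
    """
    mu_a, mu_b = mean_spectrum(sp_a), mean_spectrum(sp_b)
    d2 = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    return math.exp(-d2 / (2 * sigma ** 2))


def non_local_decision(center_sp, neighbor_sps, threshold, sigma=1.0):
    """True (homogeneous) if every neighbor superpixel exceeds the threshold."""
    return all(
        superpixel_similarity(center_sp, sp, sigma) > threshold
        for sp in neighbor_sps
    )


def combined_decision(local_homogeneous, non_local_homogeneous):
    """Assumed combination: the non-local decision can rescue a sample
    that the local decision placed in the heterogeneous regions."""
    return local_homogeneous or non_local_homogeneous
```

With this form, identical mean spectra give a similarity of 1 and the value decays toward 0 as the spectra diverge, so a per-class threshold such as $T_k$ acts as a lower bound on how alike two superpixels must be.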

Multi-Scale CNN Architecture
In HSIs, the spectral signatures of samples in the same class may differ due to varied imaging conditions, e.g., changes in illumination, various environments, different atmospheric conditions, and temporal conditions. Therefore, spatial contextual information is critical for HSI classification. For the samples in the homogeneous regions, a multi-scale CNN architecture with larger spatial window inputs is constructed to extract joint spatial and spectral features. The multi-scale convolution consists of 1 × 1, 3 × 3, and 5 × 5 convolutional filters, where the 1 × 1 convolutional filter is used to extract spectral features, while the 3 × 3 and 5 × 5 filters are utilized to extract various spatial contextual features.
In the multi-scale CNN architecture, the multi-scale convolutional filter is inspired by the Inception module [35]. The Inception module is used to exploit diverse local spatial structures of the input image, which enables the network to become deeper and wider and achieves state-of-the-art performance in image classification. The effectiveness of the Inception module was demonstrated in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 [35]. In this paper, the multi-scale convolutional filter is used to extract joint spectral-spatial features for HSI classification.
The architecture of the multi-scale CNN network is shown in Figure 5. The input of the multi-scale CNN architecture is a larger spatial window with several principal components from PCA. A multi-scale filter is used in the first convolutional layer to jointly extract spatial structure and spectral correlation. The three resulting feature maps are concatenated to form a joint spectral-spatial feature map. Subsequently, three convolutional layers are stacked one by one to extract hierarchical abstract features of the HSIs. Then the extracted feature maps are flattened to a one-dimensional vector used as the input to two fully connected layers. Finally, the extracted features are fed into the last soft-max classification layer.
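The first, multi-scale layer can be sketched in NumPy as three parallel convolutions whose outputs are concatenated along the channel axis. This is an illustrative sketch only: the zero "same" padding (chosen so that the three branches share one spatial size), the channel counts, and the names `conv2d_same` and `multi_scale_layer` are assumptions, and bias, activation, batch normalization, and the later layers are omitted.

```python
import numpy as np


def conv2d_same(x, w):
    """Zero-padded convolution that keeps the spatial size.

    x: (C_in, H, W) input window, e.g. several principal components
    w: (C_out, C_in, k, k) kernels with odd k
    """
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zeros around the window
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]
            # Contract channel and both kernel axes for every output channel.
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


def multi_scale_layer(x, w1, w3, w5):
    """Apply 1x1, 3x3 and 5x5 branches and concatenate their feature maps."""
    branches = [conv2d_same(x, w) for w in (w1, w3, w5)]
    return np.concatenate(branches, axis=0)
```

The 1 × 1 branch mixes only the channels of a single pixel, whereas the 3 × 3 and 5 × 5 branches also aggregate progressively wider spatial context, which matches the roles the text assigns to the three filter sizes.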

In this model, several regularization methods, namely data augmentation, dropout, early stopping, and batch normalization (BN), are introduced to alleviate the over-fitting problem of CNNs. A new sample augmentation method is devised by pre-labeling the unlabeled samples based on spectral similarity under an adaptive spatial constraint. Dropout is used in the second and third convolutional layers as a regularization technique that relieves over-fitting by preventing complex co-adaptations. Early stopping relieves the over-fitting problem by limiting the number of iterations. In addition, batch normalization is used in all the convolutional layers to accelerate the training of the networks and reduce the internal covariate shift [70].

Fine-Grained CNN Architecture
For the samples in the heterogeneous regions, the spatial information is hard to use because different land-cover classes are mixed within the neighborhood. The distinction of these samples mainly depends on hundreds of contiguous and narrow spectral bands. For these samples, a fine-grained CNN architecture with smaller-sized spatial window inputs is constructed to extract spectral information, where 1 × 1 convolution is used in all the convolutional layers.
The architecture of the fine-grained CNN is shown in Figure 6. In the fine-grained CNN, all the spectral bands are retained, so the input is a smaller-sized spatial window with all the spectral bands. The 1 × 1 convolution is used in all four convolutional layers. The 1 × 1 convolutional filter was proposed in Network In Network (NIN) [71]; it allows complex and learnable interactions of cross-channel information and can also be used to adjust the dimensionality of the feature maps. Here, 1 × 1 convolution is used to learn spectral correlations in the proposed network. Two fully connected layers are stacked after the convolutional layers. Finally, the extracted spectral features are fed into the soft-max classification layer. As in the multi-scale CNN architecture, BN and dropout are used in the same positions.
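To make concrete why a 1 × 1 convolution learns spectral rather than spatial correlations, the following NumPy sketch treats it as the same linear map applied to the spectrum of every pixel. The window size (7 × 7), band count (200), and filter count (64) are illustrative assumptions, and `conv1x1` is a hypothetical helper rather than part of the described network.

```python
import numpy as np

def conv1x1(x, weights, bias):
    """1x1 convolution: one linear map over the band axis, shared by all pixels.

    x:       (H, W, B) patch with B spectral bands
    weights: (B, F) mixing matrix producing F feature channels
    bias:    (F,)
    returns: (H, W, F)
    """
    return np.einsum("hwb,bf->hwf", x, weights) + bias

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
patch = rng.standard_normal((7, 7, 200))       # assumed: 7x7 window, 200 bands
weights = rng.standard_normal((200, 64)) * 0.05
bias = np.zeros(64)

out = relu(conv1x1(patch, weights, bias))

# The output at a pixel depends only on that pixel's own spectrum.
centre_only = relu(patch[3, 3] @ weights + bias)
print(out.shape, np.allclose(out[3, 3], centre_only))
```

Since no neighbouring pixel enters the computation of an output location, samples near class boundaries are not contaminated by the spectra of adjacent land-cover classes at this stage.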

Data Augmentation Based on Spectral Similarity under the Adaptively Spatial Constraint
Deep learning models depend on a large quantity of training data because they are heavily parameterized. However, only limited training samples are available in HSI data, so the CNN model tends to over-fit in HSI classification. To overcome this issue, a novel data augmentation method based on spectral similarity under an adaptive spatial constraint is devised.
In the data augmentation method, superpixels with adaptive sizes and shapes are used as the spatial constraint. Unlabeled samples located in the same superpixel as training samples are considered as candidates. Then, the candidate samples with high confidence, i.e., those most spectrally similar to the training samples, are selected. Finally, the selected samples are pre-labeled with the same class as the training samples and used to expand the training set.
Table 1. The procedure of the proposed DDCNN method.

Specifically, $x_u$ denotes the current unlabeled sample, and $\pi_v(x_u)$ represents the superpixel in which $x_u$ is located. The similarities between $x_u$ and all the training samples $\{x_m \mid x_m \in \pi_v\}$ in the superpixel $\pi_v$ are calculated. These similarities are then compared with a threshold $T_{\pi_v}$, which is calculated as the minimum similarity between any two training samples in the superpixel $\pi_v$. If all the similarities are larger than the threshold, the current unlabeled sample is selected; otherwise, it is not. The selected unlabeled samples are pre-labeled with the same label as the training samples $\{x_m \mid x_m \in \pi_v\}$, which is formulated as (7). These pre-labeled samples are used to expand the training set.
where $I(\cdot)$ is the indicator function, and $y_u = 0$ indicates that the unlabeled sample $x_u$ is not selected to expand the training set.
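The selection rule for a single superpixel can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: cosine similarity stands in for the spectral similarity measure, class labels are assumed to be positive integers so that 0 can mean "not selected", and superpixels with fewer than two training samples or with training samples of different classes are skipped. The function name `prelabel_superpixel` is hypothetical.

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two spectra (assumed similarity measure)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def prelabel_superpixel(train_x, train_y, unlabeled_x):
    """Pre-label the unlabeled spectra that lie in one superpixel.

    train_x:     list of training spectra located in the superpixel
    train_y:     their class labels (positive integers)
    unlabeled_x: list of unlabeled spectra located in the same superpixel
    Returns one label per unlabeled sample; 0 means the sample is not selected.
    """
    # Assumption: the threshold needs at least one pair of training samples,
    # and a superpixel with mixed training labels is left untouched.
    if len(train_x) < 2 or len(set(train_y)) != 1:
        return [0] * len(unlabeled_x)

    # Threshold: minimum similarity between any two training samples.
    threshold = min(cosine(a, b) for a, b in combinations(train_x, 2))
    label = train_y[0]

    labels = []
    for u in unlabeled_x:
        sims = [cosine(u, m) for m in train_x]
        # Select the sample only if it is close to every training sample.
        labels.append(label if all(s > threshold for s in sims) else 0)
    return labels

train_x = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
train_y = [3, 3]
unlabeled_x = [np.array([2.0, 1.0]), np.array([-1.0, 0.2])]
print(prelabel_superpixel(train_x, train_y, unlabeled_x))  # [3, 0]
```

In the toy example the first unlabeled spectrum lies between the two training spectra and is pre-labeled with class 3, whereas the second points away from them and is rejected.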

The Procedure of DDCNN
The proposed DDCNN method uses the divide-and-conquer strategy to break HSI classification into pixel-wise classification over homogeneous and heterogeneous regions. The two sub-problems are then solved separately by two well-designed CNNs, and their solutions are combined to solve the original classification problem. In this way, DDCNN simultaneously guarantees region uniformity for homogeneous regions and edge preservation for heterogeneous regions of HSIs. The procedure of DDCNN is summarized in Table 1.
Segment the whole HSI into …
4. The training samples $X_{train}$ are expanded to a new training set by (7).
5. The training samples $X_{train}$ are divided into $X_{trainHo}$ and $X_{trainHe}$, and the test samples $X_{test}$ are divided into $X_{testHo}$ and $X_{testHe}$ by (6).
6. Initialize all the weight matrices and biases.
7. Input the training samples $X_{trainHo}$.
8. for every epoch
9.     for the n training samples of every mini-batch
10.         compute the objective function $l_{Ho}$ by the cross-entropy loss function
11.         update the parameters of the multi-scale CNN by minimizing the loss function
12.     end for
13. end for
14. Input the training samples $X_{trainHe}$.
15. for every epoch
16.     for …
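The divide-and-conquer idea behind this procedure, routing each sample to one of two classifiers and merging the predictions back in the original order, can be sketched in plain Python. The two `predict_*` callables stand in for the trained multi-scale and fine-grained CNNs, and the boolean mask stands in for the output of the regional division; all names here are hypothetical.

```python
def classify_divide_and_conquer(samples, is_homogeneous, predict_ho, predict_he):
    """Route samples to two classifiers and merge predictions in input order.

    samples:        list of per-pixel inputs
    is_homogeneous: list of booleans from the regional division step
    predict_ho:     callable(batch) -> labels, for homogeneous-region samples
    predict_he:     callable(batch) -> labels, for heterogeneous-region samples
    """
    ho_idx = [i for i, flag in enumerate(is_homogeneous) if flag]
    he_idx = [i for i, flag in enumerate(is_homogeneous) if not flag]

    predictions = [None] * len(samples)
    for idx, predict in ((ho_idx, predict_ho), (he_idx, predict_he)):
        if not idx:
            continue  # nothing was routed to this branch
        batch = [samples[i] for i in idx]
        for i, label in zip(idx, predict(batch)):
            predictions[i] = label
    return predictions

# Toy stand-ins for the two trained networks.
predict_ho = lambda batch: [s.upper() for s in batch]
predict_he = lambda batch: [s * 2 for s in batch]

result = classify_divide_and_conquer(
    ["a", "b", "c", "d"], [True, False, True, False], predict_ho, predict_he
)
print(result)  # ['A', 'bb', 'C', 'dd']
```

Keeping the index lists alongside the batches is what allows the two partial results to be stitched into a single classification map.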

Experimental Results
In this section, we validate the proposed DDCNN method on three benchmark HSI datasets. We investigate the performance of the proposed method from the following aspects: classification performance, running time, sensitivity analysis to the number of training samples, and sensitivity analysis of free parameters.

Data Description
In this study, we adopt three HSI datasets for the experiment: the Indian Pines, Pavia University, and Salinas.
(1) The Indian Pines dataset covers a mixed vegetation site over the Indian Pines test area in Northwestern Indiana. It was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor and has a size of 145 × 145 pixels. There are 220 spectral bands in the wavelength range of 0.4-2.5 µm, covering the visible and infrared spectrum. However, 200 spectral bands are preserved after 20 bands with lower signal-to-noise ratio are discarded. The dataset contains 16 different land-cover classes. The false-color composite image (bands 50, 27, 17) is shown in Figure 7a.
(2) The Pavia University dataset was gathered by the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor in an urban site over the city of Pavia, Italy. There are 610 × 340 pixels and 103 spectral bands after 20 water absorption bands are removed. The ROSIS sensor generates spectral bands in the wavelength range from 0.43 µm to 0.86 µm. There are 9 different land-cover classes, and the false-color image (bands 53, 31, 8) is shown in Figure 7b.

Experimental Setting
The performance of the proposed DDCNN method is compared with several state-of-the-art HSI classification approaches, which include five representative deep learning-based methods, SAE [39], DBN [42], CNN [49], PPF-CNN [53], and 3D-CNN [52], and a classical SVM method with radial basis function (RBF-SVM) [30]. The classification performance of all the methods is measured by three common measurements: overall accuracy (OA), average accuracy (AA), and kappa coefficient (Kappa) [72]. The experiments are implemented over 20 independent runs with a random division of training and test sets, and the average classification accuracy and the corresponding standard deviation over the 20 runs are calculated. When the training samples change through the random selection, the sample augmentation, the regional division, and the DDCNN model are all affected. In this way, the robustness of the proposed method is validated. All the experiments are carried out using the Python language and the TensorFlow [73] library on an NVIDIA 1080Ti graphics card. TensorFlow is an open source software library for numerical computation using data flow graphs.
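For reference, the three measurements can be computed from a confusion matrix as in the following plain-Python sketch, which uses the standard definitions of OA, AA, and Cohen's kappa. The toy labels are illustrative only and do not correspond to any result reported here; classes are assumed to be indexed from 0.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def oa_aa_kappa(cm):
    """Return overall accuracy, average accuracy and kappa coefficient."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(k)]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    # OA: fraction of all samples that are correctly classified.
    oa = sum(cm[i][i] for i in range(k)) / n

    # AA: mean of the per-class accuracies (classes without samples are skipped).
    per_class = [cm[i][i] / row_sums[i] for i in range(k) if row_sums[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement corrected for the agreement expected by chance
    # (undefined when the chance agreement equals 1).
    pe = sum(row_sums[i] * col_sums[i] for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
oa, aa, kappa = oa_aa_kappa(confusion_matrix(y_true, y_pred, 3))
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.8 0.806 0.701
```

AA weights every class equally, so it is more sensitive than OA to errors in classes with few samples, which is why both are usually reported together.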
For RBF-SVM, the one-against-all strategy is used for multi-class classification. The penalty and gamma parameters in RBF-SVM are determined by five-fold cross validation. For SAE and DBN, the radius of the spatial neighborhood window is set to 7. For CNN, as suggested in [49], the spatial window of the input is set to 5 × 5. For PPF-CNN, the size of the block window of neighboring pixels is set to the default value in [53]. For 3DCNN, the 3-D input is resized to 27 × 27 × 100 [52]. For DDCNN, the sizes of the spatial windows for the dual-architecture network are investigated in the next subsection.
Besides, there are also several important parameters in the deep learning models, such as the learning rate, the number of epochs, and the number of layers. The learning rate is set to 0.01 for all the models. SAE, DBN, CNN, and DDCNN are trained for 1000 epochs, PPF-CNN for 300 epochs, and 3DCNN for 500 epochs. SAE and DBN consist of 4 hidden layers. CNN, 3DCNN, and DDCNN include 3 convolutional layers and 2 fully connected layers, while PPF-CNN consists of 8 convolutional layers and 2 fully connected layers.

Classification Results of Hyperspectral Datasets
(1) Classification results of the Indian Pines dataset: The Indian Pines dataset is randomly divided into a 5% training set and a 95% test set. The numbers of training and test samples for each class are listed in Table 2. Table 3 records the class-specific accuracy, overall accuracy (OA), average accuracy (AA), and Kappa of all seven methods. The best classification results among the seven algorithms are highlighted in gray. Compared with RBF-SVM, the deep learning-based methods SAE, DBN, CNN, PPF-CNN, 3DCNN, and DDCNN obtain better classification results owing to hierarchical nonlinear feature extraction. Compared with SAE and DBN, CNN, PPF-CNN, 3DCNN, and DDCNN are superior because they make full use of the spatial information in HSI. Among the seven methods, DDCNN achieves the best classification results in the majority of classes owing to the powerful feature extraction capability of the dual-architecture CNN for various land-cover distributions. Furthermore, DDCNN outperforms the best baseline by 4.1% in OA, 7.2% in AA, and 4.4% in Kappa.
Figure 8 shows the classification maps of the seven algorithms on the Indian Pines dataset. As shown in Figure 8b-d,f, there are many noisy scattered points in the maps of SVM, SAE, DBN, and PPF-CNN, especially in the corn-notill, corn-mintill, soybean-notill, and soybean-mintill classes. Compared with these methods, CNN, 3DCNN, and DDCNN improve the region uniformity significantly. However, edge over-smoothing occurs in the visual maps of CNN and 3DCNN. Compared with CNN and 3DCNN, DDCNN obtains better boundary localization of the soybean-notill and soybean-mintill classes.
(2) Classification results of the Pavia University dataset: The Pavia University dataset is randomly divided into a 3% training set and a 97% test set. The numbers of training and test samples for each class are listed in Table 4. Table 5 records the classification results for the Pavia University dataset. As shown in Table 5, compared with the other methods, DDCNN gains a certain degree of improvement in most classes, especially in the gravel and bitumen classes. DDCNN outperforms SVM by 38.2% in the gravel class and DBN by 24.8% in the bitumen class. Over all the classes, the proposed DDCNN method improves the OA by 8.2%, 6.1%, 6.6%, 5.5%, 1.6%, and 3.3% over the other six methods.
The visual classification maps of the Pavia University dataset are shown in Figure 9. As shown in Figure 9b-f, many samples belonging to the bitumen class are misclassified as the asphalt class because of similar spectral signatures. The proposed DDCNN method provides a better distinction between these two classes. Besides, the samples in the gravel class are misclassified as the self-blocking bricks class by SVM, SAE, and DBN, and as the asphalt class by 3DCNN. Compared with them, DDCNN obtains better classification performance for the gravel class. Compared with the other methods, DDCNN achieves better region uniformity in the bare soil class and better boundary localization in the gravel and bitumen classes.
(3) Classification results of the Salinas dataset: The Salinas dataset is randomly divided into 1% for training and 99% for testing. The numbers of training and test samples for each class are listed in Table 6. The classification results of all seven algorithms on the Salinas dataset are summarized in Table 7. It can be seen that many samples in the grapes_untrained and vinyard_untrained classes are misclassified by RBF-SVM, SAE, DBN, CNN, and PPF-CNN. Compared with these methods, DDCNN clearly improves the classification results. For the vinyard_untrained class, DDCNN improves by 42.6%, 20.7%, 27.1%, 16.5%, and 23.7% over these methods. For the broccoli_green_weeds_1 class, DDCNN classifies all samples correctly. Among all the seven methods, DDCNN obtains the best classification performance, with OA = 98.8%, AA = 98.6%, and Kappa = 98.6%.

Investigation on Running Time and Parameters
Tables 8-10 list the training and test times of the seven methods on the Indian Pines, Pavia University, and Salinas datasets, respectively. Furthermore, the numbers of parameters involved in the seven methods are listed. As shown in Tables 8-10, compared with RBF-SVM, the six deep learning-based methods (SAE, DBN, PPF-CNN, CNN, 3DCNN, and DDCNN) require more training time because their models are heavily parameterized. Among all the comparison methods, 3DCNN requires a long training time because the three-dimensional convolution operation involves a large number of parameters. PPF-CNN is time-consuming because it greatly expands the number of training samples, especially when the number of training samples is large. DDCNN involves two CNN architectures, so it costs more time than CNN but less time than 3DCNN and PPF-CNN. The number of parameters for DDCNN is almost 376,000, of which the multi-scale CNN has nearly 347,000 parameters and the fine-grained CNN has nearly 29,000 parameters. In the testing procedure, DDCNN is more time-consuming than SAE, DBN, and CNN because of the computational burden of the two CNN architectures. Compared with PPF-CNN and 3D-CNN, DDCNN has an obvious advantage because PPF-CNN uses a voting strategy with the adjacent samples and 3D-CNN uses a complex 3D convolution operation. DDCNN takes 0.7 s, 2.3 s, and 4.7 s for testing on the Indian Pines, Pavia University, and Salinas datasets, respectively.
To verify the effectiveness of data augmentation, the proposed method without data augmentation (DDCNN-WDA) is added as a comparison method. To validate the effectiveness of the proposed dual-architecture design, a multi-scale CNN (MCNN) and a fine-grained CNN (FCNN) are also added as comparison methods. The experimental results on the Indian Pines, Pavia University, and Salinas datasets are recorded in Table 12. As shown in Table 12, compared with FCNN, DDCNN increases by 3.6%, 1.1%, and 2.5% on the Indian Pines, Pavia University, and Salinas datasets. Compared with MCNN, DDCNN increases by 1.1%, 0.7%, and 1.7% on the three HSI datasets. This shows that the dual architecture is more effective than a single network architecture for HSI classification. Compared with DDCNN-WDA, DDCNN increases by 1.0%, 0.8%, and 0.4% on the Indian Pines, Pavia University, and Salinas datasets. This shows that data augmentation is effective for HSI classification.

Analysis of Free Parameters in DDCNN
There are two important parameters, $w_1$ and $w_2$, in DDCNN; they represent the sizes of the spatial windows in the multi-scale CNN and the fine-grained CNN, respectively, and thus control the input size of samples in the homogeneous and heterogeneous regions. In Figure 12, $w_1$ is varied over [23, 25, 27, 29, 31], while $w_2$ is varied over [1, 3, 5, 7, 9]. Figure 12a-c shows the OA results of DDCNN on the Indian Pines, Pavia University, and Salinas datasets under different values of $w_1$ and $w_2$. As shown in Figure 12, the classification performance peaks when $w_1$ and $w_2$ are set to 27 and 7 on the Indian Pines, 31 and 9 on the Pavia University, and 31 and 9 on the Salinas. The Pavia University and Salinas datasets have higher spatial resolution than the Indian Pines dataset. Therefore, the values of $w_1$ and $w_2$ on the Pavia University and Salinas datasets are larger than those on the Indian Pines dataset.
The depth of the network plays an important role because it determines the quality of the extracted features. Table 13 shows the classification results of DDCNN as the number of convolutional layers increases from 1 to 5. The experimental results show that the model achieves the best classification results with 4 convolutional layers on the hyperspectral datasets. When the number of layers is large enough, the model extracts abstract and invariant features.
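How the window sizes translate into network inputs can be shown with a short NumPy sketch that cuts a $w \times w$ window around a pixel. The mirror padding used for border pixels is an assumption (the border handling is not specified here), the cube dimensions are toy values, and `extract_window` is a hypothetical helper.

```python
import numpy as np

def extract_window(cube, row, col, w):
    """Return the w x w spatial window centred on (row, col).

    cube: (H, W, B) hyperspectral cube (or its principal components)
    w:    odd window size, e.g. 27 for the multi-scale CNN or 7 for the
          fine-grained CNN on Indian Pines
    Border pixels are handled by mirror padding (an assumption).
    """
    if w % 2 == 0:
        raise ValueError("window size must be odd")
    r = w // 2
    # Padding the whole cube per call is simple but wasteful; in practice
    # the padded cube would be computed once and reused.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + w, col:col + w, :]

cube = np.arange(40 * 40 * 4, dtype=float).reshape(40, 40, 4)  # toy 40x40 image, 4 bands

large = extract_window(cube, 0, 0, 27)   # homogeneous-region input
small = extract_window(cube, 0, 0, 7)    # heterogeneous-region input
print(large.shape, small.shape)  # (27, 27, 4) (7, 7, 4)
```

With $w = 1$ the window reduces to the spectrum of the pixel itself, which corresponds to the smallest setting of $w_2$ examined in Figure 12.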

The number of superpixels is an important free parameter. The superpixel segmentation is utilized in both the regional division and the data augmentation of DDCNN. As shown in Table 14, DDCNN obtains the best classification performance when the number of superpixels is set to 100 on the Indian Pines and Salinas datasets, and to 1000 on the Pavia University dataset. The number of superpixels on the Pavia University dataset is larger than that on the other datasets because of its more complex land-cover distribution. When the number of superpixels is too small, the same superpixel may contain different classes. In this case, the classification results deteriorate because of the misdivision of homogeneous and heterogeneous regions. On the contrary, when the number of superpixels is too large, fewer unlabeled samples are pre-labeled to augment the data. In this case, DDCNN has limited ability to alleviate the overfitting problem.

Analysis of the Thresholds in DDCNN
There are two thresholds, $T_k$ and $T_{\pi_v}$, involved in the proposed method. $T_k$ is the threshold used in the regional division with non-local decision. It is not set empirically but calculated as $T_k = \min \{ SS(\pi_v(x_i), \pi_v(x_j)) \mid x_i, x_j \in \psi_k \}$, i.e., the minimum similarity between any two superpixels containing training samples of the $k$th category. For each class, an adaptive threshold $T_k$ is thus obtained by considering all the training samples of this class. When the value of $T_k$ is too large or too small, the classification performance degrades because of the misdivision of homogeneous and heterogeneous regions. Compared with an empirical setting, the proposed adaptive calculation is a better choice because it takes the data distribution into account.
$T_{\pi_v}$ is the threshold used in the data augmentation. It is calculated as the minimum similarity between any two training samples in the superpixel $\pi_v$. For the three hyperspectral datasets, $T_{\pi_v}$ is calculated as 0.921, 0.903, and 0.915 in the experiment. Figure 13 shows the OA results of DDCNN on the three hyperspectral datasets as $T_{\pi_v}$ increases from 0.5 to 1.0. When the value of $T_{\pi_v}$ is too large, the spatial constraint of sample augmentation becomes strict, and fewer unlabeled samples are selected for pre-labeling. In this case, DDCNN has limited ability to alleviate the overfitting problem. Conversely, when the value of $T_{\pi_v}$ is too small, unlabeled samples with low confidence may be selected, and these pre-labeled samples would deteriorate the classification performance. When $T_{\pi_v}$ is in the range of [0.88, …], promising classification results are obtained on the three hyperspectral datasets. It can be seen that the calculated values of $T_{\pi_v}$ fall within this range.

Conclusions
In this paper, a novel divide-and-conquer dual-architecture CNN (DDCNN) method is proposed for HSI classification. In DDCNN, a regional division method based on local and non-local decisions is designed to divide the HSIs into homogeneous and heterogeneous regions. A multi-scale CNN architecture and a fine-grained CNN architecture are constructed to learn features on the homogeneous and heterogeneous regions, respectively. The dual-architecture CNN guarantees region uniformity and edge preservation of HSI classification simultaneously. Moreover, to alleviate the problem of insufficient training samples, the unlabeled samples with high confidence are selected under adaptive spatial constraints. The experimental results on several hyperspectral datasets demonstrate the effectiveness of the proposed method for HSI classification.
In the future, more varied CNN architectures will be considered in DDCNN for complex land-cover distributions in HSIs.
DDCNN consists of three stages: regional division with local and non-local decisions, dual-architecture CNN-based classification, and data augmentation based on spectral similarity under adaptively spatial constraint. An HSI dataset contains $M$ training samples in the $\mathbb{R}^{d \times 1}$ feature space, where $d$ is the number of spectral bands and $1 \le m \le M$. The class label of training samples is represented by $Y = \{y_1, …$

Figure 3. Illustration of samples in the homogeneous and heterogeneous regions.

Figure 3 illustrates an example of these two situations. In Figure 3, i and j are two samples in the HSI, located in a heterogeneous region and a homogeneous region, respectively. Both of them belong to the "GREEN" class. For the sample i, a larger spatial window (i.e., black box) contains some samples belonging to the "BLUE", "PURPLE", and "YELLOW" classes instead of the "GREEN" class. In this case, the sample i may be easily misclassified as the "BLUE", "PURPLE", or "YELLOW" class. If a smaller spatial window (i.e., red box) is selected, all the samples in the window belong to the "GREEN" class. For the sample j, all the samples in both the larger and smaller spatial windows (i.e., black and red boxes) belong to the "GREEN" class. In this case, a larger spatial window contains more adequate contextual information for feature extraction.


Figure 4. Illustration of local regional division based on superpixel segmentation: (a) ground truth; (b) superpixel segmentation map; (c) the filter of samples in the homogeneous region; (d) the filter of samples in the heterogeneous region; (e) the filter of samples in the "false boundary".


Figure 6. The construction of fine-grained CNN.

(3) The Salinas dataset was collected by the AVIRIS sensor over Salinas Valley, California. The dataset comprises 512 × 217 pixels with a spatial resolution of 3.7 m per pixel. The sensor generates 224 bands in the wavelength range of 0.4-2.5 µm. In the experiments, 204 bands are preserved after 20 water absorption bands are omitted. The image contains 16 classes. The false-color composite image (bands 50, 170, 190) is shown in Figure 7c.

Figure 7. The false-color composite images of (a) the Indian Pines; (b) the Pavia University; (c) the Salinas valley.


Figure 10 shows the classification visual maps of the seven algorithms on the Salinas dataset. As shown in Figure 10b-f, many samples belonging to the grapes_untrained and vinyard_untrained classes are confused by RBF-SVM, SAE, DBN, CNN, and PPF-CNN. Compared with them, 3DCNN and DDCNN provide better distinction for these two classes. Compared with 3DCNN, DDCNN obtains better boundary localization for these two classes.

Figure 11 shows the classification performance with different numbers of training samples. The classification performance of deep learning-based methods depends greatly on the number of training samples. Thus, it is necessary to investigate the sensitivity to the number of training samples. In the experiment, the number of training samples per class is changed from 1% to 9% with an interval of 2% on the Indian Pines dataset, 1% to 5% with an interval of 1% on the Pavia University dataset, and 1% to 3% with an interval of 0.5% on the Salinas dataset. Generally, deep learning-based methods are heavily parameterized, and a large number of training samples are required to guarantee the performance. When the ratio of training samples is larger than 9% on the Indian Pines, 5% on the Pavia University, and 3% on the Salinas, the training samples are sufficient to estimate the models. The CNN-based methods (CNN, PPF-CNN, 3DCNN, and DDCNN) perform better than the other three methods. When the ratio of training samples decreases, the classification performance of all the seven algorithms declines. In this case, the deep learning-based methods SAE, DBN, and CNN have no obvious advantage over RBF-SVM. Compared with them, 3D-CNN, PPF-CNN, and DDCNN show better classification performance for the small-sized sample set. Among these methods, DDCNN consistently provides superior performance with different ratios of training samples. DDCNN improves by at least 6.8%, 5.6%, and 2.9% on the Indian Pines, Pavia University, and Salinas datasets, respectively, when the ratio of training samples is 1%. Thus, DDCNN is a better choice when the number of training samples is limited.

4.6. Comparison with Other Classification Techniques
Table 11 shows the classification results of different methods on three HSI datasets. RPCA-CNN obtains better classification results than CNN because RPCA-CNN makes full use of spatial information. Compared with CNN and RPCA-CNN, DCNN improves the classification performance by extracting joint spatial-spectral features. Compared with RPCA-CNN and DCNN, DDCNN obtains better classification results by using the divide-and-conquer dual-architecture CNN and effective sample augmentation. In terms of the OA index, DDCNN improves on RPCA-CNN and DCNN by 17.4% and 3.5% on the Indian Pines dataset, 19.7% and 7.1% on the Pavia University dataset, and 7.1% and 4.3% on the Salinas dataset.
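For readers who want to reproduce this kind of comparison, overall accuracy (OA) is the fraction of test samples that are classified correctly. The sketch below computes OA and the difference between two methods. The function names are hypothetical, the label vectors are made up, and it assumes that reported gains are differences in percentage points, which the text does not state explicitly.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def oa_gain_points(oa_new, oa_baseline):
    """Difference between two OA values, expressed in percentage points (assumption)."""
    return 100.0 * (oa_new - oa_baseline)


# Made-up reference labels and predictions from two hypothetical classifiers.
y_true = [1, 1, 2, 2, 3, 3, 3, 3]
pred_baseline = [1, 2, 2, 2, 3, 1, 3, 2]
pred_improved = [1, 1, 2, 2, 3, 3, 3, 2]

oa_base = overall_accuracy(y_true, pred_baseline)
oa_new = overall_accuracy(y_true, pred_improved)
print(f"baseline OA={oa_base:.3f}, improved OA={oa_new:.3f}, "
      f"gain={oa_gain_points(oa_new, oa_base):.1f} points")
```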

Figure 12. Sensitivity analysis to the spatial window sizes w1 and w2 for DDCNN on (a) the Indian Pines, (b) the Pavia University, and (c) the Salinas datasets.
Figure 12 shows the OA results of DDCNN on the Indian Pines, Pavia University, and Salinas datasets under different values of the parameters w1 and w2. As shown in Figure 12, the classification performance reaches its peak value when w1 and w2 are set to 27 and 7 on the Indian Pines, 31 and 9 on the Pavia University, and 31 and 9 on the Salinas datasets. The Pavia University and Salinas datasets have higher spatial resolution than the Indian Pines dataset. Therefore, the sizes of w1 and w2 in the Pavia University and Salinas datasets are larger than those in the Indian Pines dataset.
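To make the role of w1 and w2 concrete, the sketch below extracts a w x w spatial window centred on a pixel and picks the window size from the region type (larger window for homogeneous regions, smaller for heterogeneous ones). It is an illustration only: the function names are hypothetical, and mirror padding at the image border is an assumption, since the text does not say how border pixels are handled. The grid entries may be scalars or per-pixel spectra.

```python
def reflect(i, n):
    """Mirror an out-of-range index back into [0, n-1] without repeating the edge."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period  # Python's modulo is non-negative for a positive divisor
    return i if i < n else period - i


def extract_patch(image, row, col, w):
    """Return the w x w window centred on (row, col); w must be odd.

    `image` is a list of rows. Border pixels are filled by mirror padding,
    which is an assumption of this sketch.
    """
    if w % 2 == 0:
        raise ValueError("window size must be odd so the pixel sits at the centre")
    half = w // 2
    n_rows, n_cols = len(image), len(image[0])
    return [
        [image[reflect(r, n_rows)][reflect(c, n_cols)]
         for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]


def window_for(is_homogeneous, w1, w2):
    """Larger window w1 for homogeneous regions, smaller window w2 otherwise."""
    return w1 if is_homogeneous else w2


# Toy 40 x 40 single-band image; the Indian Pines setting from the text is w1=27, w2=7.
image = [[r * 100 + c for c in range(40)] for r in range(40)]
for homogeneous in (True, False):
    w = window_for(homogeneous, w1=27, w2=7)
    patch = extract_patch(image, row=3, col=20, w=w)
    print(homogeneous, len(patch), len(patch[0]), patch[w // 2][w // 2])
```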

value of similarities between any two superpixels containing the training samples of the k-th category. For each class, an adaptive threshold T_k is obtained by considering all the training samples of that class. When T_k is too large or too small, the classification performance degrades because homogeneous and heterogeneous regions are divided incorrectly. Compared with an empirical setting, the proposed adaptive calculation is a better choice because it takes the data distribution into account.

Figure 13. The sensitivity analysis of DDCNN to the threshold T_v.

Table 1. The procedure of the proposed DDCNN method.
Count the labels Y_test by Y_testHo and Y_testHe
22. END
23. OUTPUT: the labels of the test samples classified by the trained DDCNN
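The final step of the procedure, in which the test labels Y_test are assembled from the predictions on homogeneous regions (Y_testHo) and heterogeneous regions (Y_testHe), could look roughly like the sketch below. The representation is an assumption made for illustration: each set of predictions is a dictionary from a pixel index to a class label, and the function name `merge_predictions` is hypothetical.

```python
def merge_predictions(y_test_ho, y_test_he):
    """Combine per-region predictions into one label map.

    y_test_ho: {pixel_index: label} from the architecture used on homogeneous regions
    y_test_he: {pixel_index: label} from the architecture used on heterogeneous regions
    Assumption: the regional division assigns every test pixel to exactly one
    region type, so the two dictionaries must not share keys.
    """
    overlap = y_test_ho.keys() & y_test_he.keys()
    if overlap:
        raise ValueError(f"pixels assigned to both region types: {sorted(overlap)}")
    y_test = dict(y_test_ho)
    y_test.update(y_test_he)
    return y_test


# Made-up predictions for five test pixels.
y_test_ho = {0: 3, 1: 3, 4: 7}
y_test_he = {2: 5, 3: 3}
y_test = merge_predictions(y_test_ho, y_test_he)
print([y_test[i] for i in sorted(y_test)])
```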

Table 2. The 16 classes of the Indian Pines dataset and the numbers of training and test samples for each class.

Table 4. The 9 classes of the Pavia University dataset and the numbers of training and test samples for each class.

Table 6. The 16 classes of the Salinas dataset and the numbers of training and test samples for each class.

Table 11. Classification results of CNN, RPCA-CNN, DCNN, and DDCNN on the Indian Pines, Pavia University, and Salinas datasets.

As shown in Table 12, compared with FCNN, DDCNN increases by 3.6%, 1.1%, and 2.5% on the Indian Pines, Pavia University, and Salinas datasets. Compared with MCNN, DDCNN increases by [...], 0.7%, and 1.7% on the three HSI datasets. This shows that the dual architecture is more effective than a single network architecture for HSI classification: DDCNN exploits the dual-architecture CNN to improve the classification performance of HSIs. Compared with DDCNN-WDA, DDCNN increases by [...].0%, 0.8%, and 0.4% on the Indian Pines, Pavia University, and Salinas datasets. This shows that data augmentation is effective for HSI classification: DDCNN improves the classification performance of HSIs by exploiting data augmentation.

Table 13. The sensitivity analysis of numbers of convolutional layers.

Table 14. The sensitivity analysis of numbers of superpixels in DDCNN.