Combining Spectral Unmixing and 3D/2D Dense Networks with Early-Exiting Strategy for Hyperspectral Image Classification

Abstract: Recently, Hyperspectral Image (HSI) classification methods based on deep learning models have shown encouraging performance. However, the limited number of training samples, as well as the mixed pixels caused by low spatial resolution, have become major obstacles for HSI classification. To tackle these problems, we propose a resource-efficient HSI classification framework which introduces adaptive spectral unmixing into a 3D/2D dense network with an early-exiting strategy. More specifically, on the one hand, our framework uses a cascade of intermediate classifiers throughout the 3D/2D dense network, which is trained end-to-end. The proposed 3D/2D dense network, which integrates 3D convolutions with 2D convolutions, is more capable of handling spectral-spatial features, while containing fewer parameters than conventional 3D convolutions, and further boosts network performance with limited training samples. On the other hand, considering the existence of mixed pixels in HSI data, the pixels in HSI classification are divided into hard samples and easy samples. With the early-exiting strategy in these intermediate classifiers, the average accuracy can be improved by reducing the computation spent on easy samples, thus focusing on classifying hard samples. Furthermore, for hard samples, an adaptive spectral unmixing method is proposed as a complementary source of information for classification, which brings considerable benefits to the final performance. Experimental results on four HSI benchmark datasets demonstrate that the proposed method achieves better performance than state-of-the-art deep learning-based methods and other traditional HSI classification methods.


Introduction
Hyperspectral Images (HSIs) comprise hundreds of narrow and contiguous spectral bands, each of which represents the measured intensity over a narrow range of light frequencies [1]. The high spectral resolution of HSI improves the capability of precisely discriminating the surface materials of interest [2,3]. Such abundant spectral information is beneficial to a wide range of applications, especially in cases that cannot be directly detected by humans. For most of these applications, HSI classification has been an active area of research in remote sensing. Abundant spectral resolution is useful for classification problems but comes at the expense of much lower spatial resolution. Because of the low spatial resolution of HSI, the spectral signature of each pixel contains a mixture of different spectra, caused by the multiple components that form the ground surface materials.
If a pixel is highly mixed in HSI data, it is very difficult to categorize it in the original feature space. Therefore, the presence of mixed pixels is one of the major obstacles seriously affecting classifier accuracy [4].
In recent years, spectral unmixing techniques [5] have been employed in HSI data analysis to handle the mixed pixel issue. Spectral unmixing consists of two steps: (1) extracting the pure material spectra (endmembers) from the HSI and (2) calculating their relative proportions (abundances) in the HSI data [6]. Spectral unmixing has been extensively studied as a possible solution in HSI analysis, and related research and applications have been developed in many fields, such as HSI super-resolution, denoising, and change detection [7][8][9]. For instance, in Reference [7], Lanaras et al. proposed a method which performs hyperspectral super-resolution by jointly solving a coupled spectral unmixing problem; this joint formulation significantly improves hyperspectral super-resolution. Yang et al. [8] proposed a sparse representation framework that unifies denoising and spectral unmixing in a closed-loop manner: the spectral information from unmixing is fed back to correct spectral distortion, while denoising and unmixing serve as mutual constraints and are solved iteratively. In Reference [9], a general framework for HSI change detection using sparse unmixing is proposed; this model has the potential to capture more information than other change detection techniques.
Moreover, spectral unmixing also carries valuable information for the HSI classification problem. A brief review of existing HSI classification methods with spectral unmixing is given below. Generally, these algorithms can be divided into two groups. Firstly, spectral unmixing has been widely studied as a feature extraction strategy applied before classification [10][11][12][13]. For instance, in Reference [10], unmixing results are used as an alternative strategy to improve classification performance, showing that spectral unmixing can extract suitable features for subsequent image classification. Later, Dópido et al. [11] quantitatively evaluated unmixing-based feature extraction methods and further proved that these features can effectively improve classification accuracy. This strategy was further explored in many works [12,13], which also proved that unmixing before classification provides an effective solution for HSI classification. Secondly, several techniques have been proposed to exploit the complementarity of classification and spectral unmixing in a semi-supervised framework, where the abundance maps are applied as a supplementary source for the multinomial logistic regression (MLR) classifier [14][15][16][17]. First, the framework utilizes the information provided by spectral unmixing to select new training samples for classification, and then it integrates the abundance maps and the classification output to obtain the final classification results. This strategy considers the outputs of both classification and unmixing simultaneously, which provides a joint approach for HSI interpretation and can effectively improve the classification results, particularly when the available training set is very limited.
More recently, deep learning-based methods have shown state-of-the-art performance in HSI classification [18][19][20][21][22], thanks to their great success in computer vision and the fast advancement of computing facilities [23][24][25][26][27]. Instead of shallow manually-crafted features, deep learning models can extract high-level, hierarchical, and abstract features which are generally more robust to nonlinear processing. In Reference [18], Pan et al. proposed a simplified deep learning model called R-VCANet (vertex component analysis network) based on the deep learning baseline PCANet [27]. In recent studies, convolutional neural networks (CNNs) [23] are most often used in deep learning-based methods for HSI classification [19][20][21]. For example, a 3D CNN based on the 3D convolutional kernel is proposed in Reference [19], where discriminative spectral-spatial feature extraction and classification are performed in an end-to-end manner. Zhong et al. [20] proposed a supervised spectral-spatial residual network (SSRN) based on the residual neural network (ResNet) [24]. An SSRN consists of consecutive spectral and spatial residual blocks, which are used to extract spectral-spatial features of HSI.
Furthermore, dense convolutional networks (DenseNets) have demonstrated significant achievements among deep learning models and have also been used for HSI classification [28][29][30], particularly with limited training samples, because the dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes [31]. In Reference [21], a 3D dense convolutional network with multiple-scale dilated convolutions [32] and a spectral-wise attention mechanism (MSDN-SA) is proposed for HSI classification with limited training samples. 3D CNNs have a very important characteristic: they can directly create hierarchical representations of spectral-spatial data. However, the number of parameters grows sharply when convolution goes from 2D to 3D. Due to the additional kernel dimension, a 3D network has far more parameters than a 2D CNN, which makes it prone to over-fitting when only limited labeled samples are available. Besides, when a 3D network is applied to HSI classification, its power comes at a considerable cost, namely the computational cost of applying it to new examples. It is therefore necessary to design a network model for resource-efficient HSI classification with limited training samples [33].
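As a back-of-the-envelope illustration of this parameter growth, consider the weight count of a 2D convolution versus a 3D convolution that adds an 8-band spectral kernel dimension (the channel counts below are toy values chosen for the example, not taken from the paper):

```python
def conv2d_params(in_ch, out_ch, k):
    """Weight count of a 2D convolution with k x k kernels (bias ignored)."""
    return in_ch * out_ch * k * k

def conv3d_params(in_ch, out_ch, k, k_spec):
    """Weight count of a 3D convolution with k x k x k_spec kernels (bias ignored)."""
    return in_ch * out_ch * k * k * k_spec

# Toy layer: 16 -> 32 channels, 3 x 3 spatial kernel, 8-band spectral kernel.
p2d = conv2d_params(16, 32, 3)       # 4608 weights
p3d = conv3d_params(16, 32, 3, 8)    # 36864 weights
print(p3d // p2d)                    # the spectral kernel dimension multiplies the cost by 8
```

The multiplicative factor equals the spectral kernel size, which is why stacking many 3D layers quickly inflates both memory and the risk of over-fitting on small training sets.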
Considering the successful combination of HSI unmixing and classification, as well as the development of deep learning, we aim at integrating spectral unmixing with a deep learning-based classification algorithm to improve classification accuracy. Little research has been undertaken on the combination of these two techniques. Recently, Alam et al. [12] used spectral unmixing to generate abundance maps and then used these maps as the input for deep learning-based HSI classification. However, in some cases, it is important to exploit the unmixing and classification information in a complementary manner, whereas the algorithm in Reference [12] only uses the information provided by spectral unmixing before classification [15].
Based on the above motivations, a novel 3D/2D dense network is proposed, in which multiple intermediate classifiers are integrated with a spectral unmixing method for HSI classification. For HSI data with mixed pixels, compared with state-of-the-art CNNs, this model shows its superiority in terms of overall classification accuracy, especially with limited training samples. The three contributions of this paper can be summarized as follows.

1.
Our model adopts a specially designed network with multiple intermediate classifiers that is trained end-to-end. The 3D/2D dense network with multiple intermediate classifiers (3D/2DNets) is jointly optimized during training, and an early-exiting strategy is adopted for each sample during testing. This makes the model resource-efficient compared with other deep learning-based HSI classifiers.

2.
We propose a spectral-spatial 3D/2D convolution (SSDC) for the proposed framework. It enables the network to incorporate fewer 3D convolutions while taking advantage of 2D convolutions to obtain feature maps with richer spectral information and enhanced feature learning capability, thereby reducing the training complexity of each round of spectral-spatial fusion and mitigating overfitting on tasks with limited training samples.

3.
An adaptive spectral unmixing method is proposed as a complementary source for classification. The endmember composition of each pixel is determined adaptively from the probabilistic output of the softmax layer.
The remainder of this paper is organized as follows. In Section 2, we describe our approach for HSI classification. The experimental results are presented and discussed in Sections 3 and 4. Finally, in Section 5, the paper is summarized.

Overview
The proposed method aims to learn an early-exiting deep learning framework for HSI classification based on 3D/2D dense networks (denoted as 3D/2DNets) and adaptive spectral unmixing (ASU). The whole framework is abbreviated as ASU-3D/2DNets. We exploit the fact that HSI data is typically a combination of easy examples and hard examples. Based on the above facts, a 3D/2D dense network with early-exiting strategy is proposed, which can reduce the evaluation time without loss of accuracy. Furthermore, considering the pixels with a low probabilistic output are either mixed pixels or pixels that are difficult to classify due to spectral variability, we unmix the hard samples to get more accurate classification results.
The framework of the proposed method is shown in Figure 1. All available labeled samples are divided into three parts: training samples, validation samples, and testing samples. The method is mainly composed of three parts: 3D/2D dense networks; early-exiting strategy; and adaptive spectral unmixing.

3D/2D dense networks:
To be specific, the 3D/2D dense network is composed of one convolutional layer and three blocks, and each block is connected to a classifier exit. During training, the 3D/2D dense network with multiple intermediate classifiers is jointly optimized.
Early-exiting strategy: Considering the existence of mixed pixels in HSI, the pixels in HSI classification can be divided into hard samples and easy samples. We intend to output easy samples from early layers and hard-to-classify samples (hard samples) from later layers. This process is called the early-exiting strategy. All samples first pass through Block 1, where each sample can be assigned to a class; the probability of each class in the softmax layer is denoted as y_i, where 0 < y_i < 1, Σ_i y_i = 1, and i indexes the categories. If the softmax probability value max(y_i) obtained in the classification process is greater than a chosen threshold T_1, the system sends the sample to the exit; otherwise, it sends the sample (a hard sample in D_1) to Block 2. After the samples pass through Block 2, if the softmax probability value obtained in the classification process is greater than a chosen threshold T_2, the system sends them to the exit and further unmixes them. Otherwise, the samples (the hard samples D_2) are input into Block 3 to continue extracting deeper features. The samples output from each block are denoted as N_b1, N_b2, and N_b3, respectively.
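The confidence-based routing described above can be sketched in a few lines (a minimal numpy illustration; the threshold and the toy logits are invented for the example and are not values from the paper):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route_samples(logits, threshold):
    """Split a batch by softmax confidence: samples whose maximum class
    probability exceeds `threshold` exit at this block; the rest are
    treated as hard samples and sent to the next block."""
    probs = softmax(logits)
    conf = probs.max(axis=1)
    exit_mask = conf > threshold
    return exit_mask, probs.argmax(axis=1)

# Toy batch of 3 samples over 4 classes: two confident, one uncertain.
logits = np.array([[8.0, 0.1, 0.2, 0.1],
                   [1.0, 0.9, 0.8, 0.7],
                   [0.2, 0.1, 5.0, 0.3]])
exit_mask, preds = route_samples(logits, threshold=0.9)
print(exit_mask)   # [ True False  True]: the middle sample goes to the next block
```

The hard-sample set D_1 would simply be `logits[~exit_mask]`, passed on to Block 2.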
Adaptive spectral unmixing: At the exit of Block 2, by considering the results of the coarse classification step and applying the fully constrained least squares (FCLS) [34] method to each unlabeled pixel, spectral unmixing is performed on the unclassified pixels to obtain the abundance maps. As a result, the abundance maps provide additional information about the composition of each pixel. Finally, the contribution degrees of the abundance maps and the classification results are controlled by a weight, and the final classification map is obtained.
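For reference, a fully constrained least squares solve (non-negativity plus sum-to-one) can be sketched with projected gradient descent; this is a lightweight numpy stand-in for the dedicated FCLS solver cited in [34], not the paper's implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def fcls(E, x, n_iter=2000):
    """Approximate fully constrained least squares unmixing:
    minimize ||E a - x||^2 subject to a >= 0 and sum(a) = 1,
    solved by projected gradient descent."""
    L, P = E.shape                               # bands x endmembers
    a = np.full(P, 1.0 / P)                      # start at the simplex center
    step = 1.0 / (np.linalg.norm(E, 2) ** 2)     # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = E.T @ (E @ a - x)
        a = project_simplex(a - step * grad)
    return a

# Toy check: a pixel mixed from two of three known endmembers.
rng = np.random.default_rng(0)
E = rng.random((50, 3))                          # 50 bands, 3 endmembers
a_true = np.array([0.6, 0.4, 0.0])
x = E @ a_true
a_hat = fcls(E, x)
print(np.round(a_hat, 2))
```

Because the noiseless least-squares optimum is itself feasible here, the projected iteration recovers the true abundances to within numerical tolerance.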
Next, we will detail the 3D/2D dense networks with early-exiting strategy and adaptive spectral unmixing.

3D/2D Dense Networks with Early-Exiting Strategy
Recently, two-dimensional multi-scale dense networks (MSDNets) were proposed for resource-efficient image classification [35]. MSDNets use a cascade of intermediate early-exiting classifiers throughout the network. With these intermediate early-exiting classifiers, MSDNets can improve the average accuracy by reducing the amount of computation spent on easy samples, saving computation for hard samples [35].
Based on the fact that HSI data is typically a mix of easy examples and hard examples, we apply MSDNets to HSI classification, thereby increasing classification accuracy whilst reducing the computational requirements. As HSI data are 3D cubes, it is reasonable to extend the 2D model to a 3D model for HSI classification; however, this greatly increases both computational complexity and memory usage. To resolve this problem, an early-exiting dense network with mixed 3D and 2D convolutions (3D/2DNets) is proposed. In this section, we first give a detailed description of the early-exiting dense networks. Then, a 3D/2D convolution based on spectral and spatial information is presented for the early-exiting dense networks.

In sample extraction, we extract cubes with the size W × W × L, where W and L are the spatial size and the number of spectral bands, respectively. Each cube is extracted from a neighborhood window centered around a pixel, and the label of each sample is that of the pixel located in the center of the cube. Then, we feed each 3D cube into the multi-scale dense network model, which is itself composed of one convolutional layer and three blocks, to obtain the classification result.
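The cube extraction step can be sketched as follows (numpy only; mirror padding at the image border is an assumption for the sketch, since the paper does not state its border-handling strategy):

```python
import numpy as np

def extract_cube(hsi, row, col, w):
    """Extract a w x w x L cube centered at (row, col) from an
    (H, W, L) hyperspectral image; the cube inherits the label of
    its center pixel. Border pixels are handled by mirror padding."""
    r = w // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + w, col:col + w, :]

# Toy image: 10 x 10 pixels, 20 bands; 13 x 13 cubes as in the paper.
hsi = np.random.default_rng(1).random((10, 10, 20))
cube = extract_cube(hsi, row=0, col=9, w=13)
print(cube.shape)   # (13, 13, 20)
```

The center spectrum `cube[w//2, w//2]` is always the spectrum of the labeled pixel itself, even at image corners.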
(1) The convolutional layer functions in the first layer (ℓ = 1), h_1^s, denote a sequence of 3 × 3 × 8-sized 3D convolutions (Conv), a batch normalization layer with rectified linear unit (ReLU) activation, and a 3D max pooling layer with a 3 × 3 × 3-sized kernel and a stride of 2. The output feature maps at layer ℓ and scale s are denoted as x_ℓ^s.
(2) For subsequent feature layers in each block, the transformations h_ℓ^s and h̃_ℓ^s are defined following the design in DenseNets [32]. We set the number of output channels of the three scales to 6, 12, and 24, respectively. The output feature maps x_ℓ^s produced at subsequent layers ℓ > 1 and scales s are a concatenation of transformed feature maps from all previous feature maps of scale s and s − 1 (if s > 1).
(3) Each classifier has two down-sampling convolutional layers with 128-dimensional 3 × 3 × 8 filters, followed by a 2 × 2 × 2 3D average pooling layer and a linear layer. The classifier at layer ℓ uses all the features x_1^s, . . . , x_ℓ^s. Let f_k(·) denote the k-th classifier; every sample traverses the network and exits after classifier f_k(·) if its prediction confidence (we use the maximum value of the softmax probability as a confidence measure) exceeds a pre-determined threshold T.
During training, we use cross-entropy loss functions L(f_k) for all classifiers and minimize the cumulative loss

L_total = (1/|D|) Σ_{(x,y)∈D} Σ_k L(f_k(x), y),

where D denotes the training set.
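The cumulative objective above, summing the cross-entropy of every intermediate classifier over the training set, can be sketched as (a minimal numpy illustration with toy uniform predictions):

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy of softmax outputs against integer labels."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def cumulative_loss(classifier_probs, labels):
    """Sum of the cross-entropy losses of all intermediate classifiers,
    averaged over the training set D -- the joint objective that lets
    every exit be optimized end-to-end."""
    return sum(cross_entropy(p, labels) for p in classifier_probs)

# Three exits, 4 samples, 5 classes (uniform predictions as a toy case).
probs = [np.full((4, 5), 0.2)] * 3
labels = np.array([0, 1, 2, 3])
print(round(cumulative_loss(probs, labels), 4))  # 3 * ln(5) ≈ 4.8283
```

Minimizing this sum with respect to shared backbone weights trains all three exits jointly, so early exits stay accurate enough for the confidence test to be meaningful.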

3D/2D Convolution Based on Spectral-Spatial Information
In HSI classification, a 3D convolution couples spectral-spatial information to effectively extract spectral-spatial features. Though promising, compared with a 2D CNN, a 3D CNN extends the spatial kernel to the spectral-spatial space, which significantly increases the number of parameters, greatly increasing computational complexity and memory usage as well as the network's demand for huge training sets [36]. These facts limit the performance of existing 3D CNNs on HSI classification, especially in dense convolutional networks based on 3D convolution [21]. There are currently some efforts to ameliorate the downsides of the 3D convolution model in HSI classification. Zhong et al. [20] first employed residual connections to extract spectral features by continuous 1D convolutions, and then used 3D convolutions to extract spatial information. Furthermore, in Reference [37], the combination of a 2D spatial convolution and a 1D spectral convolution was used to replace the spectral-spatial 3D convolution, which means that this network structure was no longer a 3D CNN.
Recently, intertwined 3D/2D networks [38][39][40] have shown up as a hybrid between 2D CNN and 3D CNN in human action recognition. In Reference [38], a mixed convolutional tube (MiCT) was proposed to integrate 2D convolution with the 3D convolution to learn better spatio-temporal features. Compared to the 3D CNN, a benefit of using such 3D/2D networks is that the parameters involved in the networks are much reduced.
Inspired by Reference [38], to alleviate the drawbacks of 3D convolution in HSI classification, we propose a spectral-spatial 3D/2D convolution (SSDC) for HSI, as illustrated in Figure 3. The SSDC replaces each 3D convolution in the first layer of the proposed framework. Since HSI data contain considerable redundant spectral information among consecutive bands, the feature maps along the spectral dimension are also redundant. In the first layer, if all the bands are directly used as network input, the 3D sample block introduces too many parameters and increases the computational complexity. Therefore, the proposed SSDC is used to replace the 3D convolution in the first-layer network. It enables the network to incorporate fewer 3D convolutions while taking advantage of 2D convolutions to obtain feature maps with richer spectral information and enhanced feature learning capability, thereby reducing the training complexity of each round of spectral-spatial fusion and mitigating overfitting on tasks with limited training samples.
The shortcut in our SSDC is cross-domain [38], which differs from the residual connections in previous works [21,24]. The SSDC is obtained by a 3D convolution mapping for the 3D inputs and a 2D convolution mapping for the 2D inputs. By introducing a 2D convolution to extract the 2D feature information on each band, the 3D convolution in SSDC only needs to learn residual information along the spectral dimension. Thus, the cross-domain residual connection largely reduces the complexity of the 3D convolution kernel learning in SSDC.

An Adaptive Endmember Selection of Unmixing
As described above, methods combining classification and unmixing have achieved good results in pixel labeling, in which the abundance maps are used as an auxiliary information source in the MLR classifier [10,11,16]. However, all the above methods process all pixels in the same way, whereas in hyperspectral data some samples may not be highly mixed (in this case, the coarse classification step may be sufficient to characterize them) and some samples may be highly mixed (in this case, spectral unmixing is particularly useful for enhancing the classification) [15]. With these issues in mind, adaptive spectral unmixing is introduced into the 3D/2D dense networks with early-exiting classifiers. Through this network architecture, the easy examples are correctly classified and exit at the first classifier. The examples with low probabilistic outputs are either mixed pixels or pixels hard to classify due to spectral variability; we unmix these hard samples to achieve more accurate classification results. In general, adaptive spectral unmixing consists of two important parts: the collection of endmember spectra and adaptive endmember selection.
Firstly, in our framework, the spectral signatures used for unmixing are not obtained by endmember extraction but by averaging the spectral signatures of each labeled category in the training set. Although averaging endmembers causes a decrease in spectral purity, it can reduce the effects of noise and/or average out the subtle spectral variability of each spectral category, resulting in a more representative final endmember as a whole [10,13].
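The class-mean endmember collection described above reduces to a per-class average over the labeled training spectra (a minimal numpy sketch with toy spectra; the function name is ours):

```python
import numpy as np

def class_mean_endmembers(X_train, y_train, n_classes):
    """Build the endmember matrix E (bands x classes) by averaging the
    training spectra of each labeled class, instead of running an
    endmember-extraction algorithm."""
    return np.stack([X_train[y_train == c].mean(axis=0)
                     for c in range(n_classes)], axis=1)

# Toy training set: 6 pixels, 4 bands, 2 classes.
X = np.array([[1., 1, 1, 1], [3, 3, 3, 3], [2, 2, 2, 2],
              [5, 5, 5, 5], [7, 7, 7, 7], [6, 6, 6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
E = class_mean_endmembers(X, y, 2)
print(E[:, 0], E[:, 1])   # per-class mean spectra
```

Each column of E is then one candidate endmember for the FCLS unmixing step.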
In the spectral unmixing of mixed pixels, the choice of endmembers is extremely important. We performed a simple experiment, and Figure 4 shows the classification results output by each block. In Figure 4, we examine the hard samples (possibly highly mixed) from the second and third classifiers and their top-3/top-5 probabilistic outputs, where the top-3 (top-5) value refers to the three (five) classes with the highest softmax probabilities; a prediction is counted as correct as long as the true class appears among them. The top-3 accuracy is close to 99% for the second classifier, and the top-5 accuracy is also 95% for the third classifier. So, we speculate that the endmember composition of a mixed pixel can be represented by a few endmembers corresponding to its main components instead of all endmembers, and that, according to the different spectral purity, different processing strategies should be adopted for different types of pixels. This theory has also been verified in the literature [15]. Therefore, we propose an endmember selection scheme for unmixing in which the probabilistic output of the softmax is exploited to determine the endmember set for each pixel.
To be specific, all samples first pass through the first block, and every sample can be assigned to a class. If the probabilistic output obtained in the classification process is greater than a chosen threshold T (T_1 or T_2), the system sends the sample to the exit; otherwise, it sends the sample to the second block. It is worth mentioning that the easy examples are correctly classified and exit directly at the first classifier. For each sample output from the second classifier, we take the classes with the top-3 probabilistic outputs as the endmember set, and for each sample output from the third classifier, we take the classes with the top-5 probabilistic outputs as the endmember set.
S_i^k = {e_j ∈ E : j ∈ top-M(f_k(x_i))},

where S_i^k is the selected endmember set of sample i from the k-th classifier. If k = 2, then M = 3; if k = 3, then M = 5. E = [e_1, e_2, . . . , e_L] denotes the full endmember set. Lastly, the adaptive endmembers are used in the fully constrained least squares (FCLS) [34] unmixing model. As a result, the abundance map provides additional information about the composition of each pixel. Finally, the contribution degrees of the abundance map and the classification result are controlled by a weight λ, and the final classification map L_F is obtained as

L_F(x_i) = arg max_c [ λ f_k^c(x_i) + (1 − λ) f_a^c(x_i) ],

where f_k(·) is the probability obtained by the classification algorithm, i.e., the k-th classifier described in Section 2.2.1, and f_a(·) is the abundance fraction obtained by spectral unmixing with the adaptive endmember set S_i^k.
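The adaptive selection and λ-weighted fusion can be sketched as follows (numpy only; the identity endmember matrix and the way abundances are scattered back onto the selected class indices are illustrative assumptions for the example, not the paper's exact bookkeeping):

```python
import numpy as np

def select_endmembers(prob, E, k):
    """Adaptive endmember selection: keep the top-M endmembers ranked by
    the softmax output of the k-th classifier (M = 3 at the second exit,
    M = 5 at the third)."""
    M = {2: 3, 3: 5}[k]
    idx = np.argsort(prob)[::-1][:M]
    return idx, E[:, idx]

def fuse(prob, abundance, lam=0.75):
    """Final label: lambda-weighted combination of the classifier
    probability vector and the abundance fractions."""
    return int(np.argmax(lam * prob + (1.0 - lam) * abundance))

# Toy pixel at the second exit: 6 classes; unmixing abundances are
# scattered back onto the selected endmember indices.
prob = np.array([0.05, 0.40, 0.35, 0.10, 0.05, 0.05])
idx, _ = select_endmembers(prob, np.eye(6), k=2)
abund = np.zeros(6)
abund[idx] = [0.1, 0.7, 0.2]          # unmixing favors the second-ranked class
print(fuse(prob, abund))
```

In this toy case the unmixing evidence overturns the classifier's top choice, which is exactly the corrective effect the fusion weight λ is meant to allow for hard, mixed pixels.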

Experimental Data Sets
In this section, one synthetic dataset and four benchmark HSI datasets, including Indian Pines, Salinas Valley, Kennedy Space Center (KSC), and Pavia University, are used to evaluate the performance of the proposed method. The first three datasets were collected by the NASA Airborne Visible/Infrared Imaging spectrometer (AVIRIS) instrument; the last one was collected by the ROSIS-03 sensor.
To assess the classification performance in a totally controlled environment, we generate synthetic datasets of four classes (see Figure 5). It should be noted that the proposed approach exploits the linear mixture model. Let x_i^(k) be the i-th sample in class k:

x_i^(k) = Σ_j α^(k+j) m^(k+j) + n_i,

where the sum runs over the c(k) constituents of class k, m^(l), l = 1, . . . , 8 are pure spectra from the U.S. Geological Survey digital spectral library, α^(k+j) is the corresponding abundance fraction, and c(k) is the number of constituents in class k. For a given sample x_i, we assume that m^(k) receives the maximum abundance value, which, in turn, determines the corresponding label y_i = k. Zero-mean Gaussian noise n_i ∼ N(0, σ²I) is also added to the pixel x_i.

The KSC image was recorded by the AVIRIS sensor over the KSC site in Florida, with 18 m spatial resolution and a 0.4-2.5 µm wavelength range. After removing water absorption and low signal-to-noise-ratio (SNR) bands, the remaining 176 bands and 512 × 614 pixels are used for assessment. Training data were selected using land cover maps derived from color infrared photography provided by the KSC and Landsat Thematic Mapper (TM) imagery. The vegetation classification scheme was developed by KSC personnel in an effort to define functional types that are discernible at the spatial resolution of Landsat and these AVIRIS data. Discrimination of land cover for this scene is difficult due to the similarity of spectral signatures for certain vegetation types. For classification purposes, 13 upland and wetland classes representing the various land cover types that occur in this scene were defined for the site. Figure 8a,b shows the true color composite of the KSC image and the corresponding reference data.

The Pavia University image was gathered by the ROSIS sensor during a flight campaign over Pavia, northern Italy, having 610 × 340 pixels with 1.3 m spatial resolution. It consists of 115 spectral bands in the range 0.43-0.86 µm.
Twelve spectral bands were removed due to noise, and the remaining 103 bands were used for classification of nine classes. The true color composite of the Pavia University image and the corresponding reference image are shown in Figure 9a,b, respectively.
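The synthetic-data generation described earlier in this section (linear mixing with the label given to the maximum-abundance endmember, plus zero-mean Gaussian noise) can be sketched as follows; the Dirichlet abundance draw and the random stand-in library are assumptions of this sketch, since the paper does not state how the abundances are sampled:

```python
import numpy as np

def synth_pixels(M, n, c, sigma, rng):
    """Generate n linearly mixed pixels from c randomly chosen columns
    of the spectral library M (bands x spectra), with Dirichlet-drawn
    abundances and zero-mean Gaussian noise; the label of a pixel is the
    index of the endmember that receives the maximum abundance."""
    L, P = M.shape
    X, y = [], []
    for _ in range(n):
        ids = rng.choice(P, size=c, replace=False)
        a = rng.dirichlet(np.ones(c))               # abundances, sum to 1
        X.append(M[:, ids] @ a + rng.normal(0.0, sigma, L))
        y.append(ids[np.argmax(a)])
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
M = rng.random((100, 8))              # stand-in for 8 USGS library spectra, 100 bands
X, y = synth_pixels(M, n=50, c=3, sigma=0.01, rng=rng)
print(X.shape)
```

Varying `sigma` reproduces the different SNR conditions used in the noise experiment.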

Experimental Settings
All compared methods are assessed numerically using the following three criteria: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (κ). We implemented 10 trials of hold-out cross validation for each dataset; the mean values and standard deviations are reported for each dataset. For each trial, a limited number of training samples was randomly selected from each class, 10% of the labeled samples were chosen as validation samples, and the remaining samples were used as testing samples. The training samples are used to train the weights and biases of each neuron in the model, while the architecture variables are optimized based on the validation samples. More specifically, the numbers of training samples in the Indian Pines, Salinas Valley, Kennedy Space Center (KSC), and Pavia University datasets are set as 5%, 2%, 1%, and 1% per class, respectively.
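The per-class hold-out protocol above can be sketched as follows (numpy only; the rounding rule and the minimum of one sample per class are assumptions of this sketch, as the paper does not specify them):

```python
import numpy as np

def per_class_split(y, train_frac, val_frac, rng):
    """Randomly split labeled sample indices into train/val/test,
    drawing `train_frac` of each class for training and `val_frac`
    of each class for validation; the remainder is the test set."""
    train, val, test = [], [], []
    for c in np.unique(y):
        idx = rng.permutation(np.nonzero(y == c)[0])
        n_tr = max(1, int(round(train_frac * len(idx))))
        n_va = max(1, int(round(val_frac * len(idx))))
        train += list(idx[:n_tr])
        val += list(idx[n_tr:n_tr + n_va])
        test += list(idx[n_tr + n_va:])
    return np.array(train), np.array(val), np.array(test)

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 100)      # toy labels: 3 classes, 100 samples each
tr, va, te = per_class_split(y, train_frac=0.05, val_frac=0.10, rng=rng)
print(len(tr), len(va), len(te))      # 15 30 255
```

Re-seeding the generator per trial and averaging the resulting OA/AA/κ over 10 runs reproduces the reporting protocol described above.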
The performance of ASU-3D/2DNets is compared with several recent proposed HSI classification methods related to our algorithm, which are summarized as follows.
On the one hand, in dealing with mixed pixels in HSI classification, we compare two HSI classification methods for mixed pixels. MLRsubMLL (multilevel logistic) [40] is a supervised algorithm which integrates a subspace projection method with multinomial logistic regression (MLR), further combined with a Markov random field (MRF)-based multilevel logistic (MLL) prior for spatial-contextual information. Subspace projection methods can provide advantages in separating classes of mixed pixels which are very similar in a spectral sense. SVM (support vector machine)-MLRsub-MRF [41] is a spectral-spatial classifier for HSI data that specifically addresses the issue of mixed pixel characterization. More specifically, it combines a subspace-based multinomial logistic regression method (MLRsub) for learning the posterior probabilities with a pixel-based probabilistic support vector machine (SVM) classifier as an indicator to locally determine the number of mixed components participating in each pixel.
On the other hand, we compare several deep learning-based methods. For R-VCANet [18], in the input layer, rolling guidance filtering (RGF) is used to combine the spectral and spatial information of the original HSI data. Based on the result of RGF, two VCA-based convolutional layers follow to explore the deep information in the HSI data. At last, an output layer is used to determine the feature expression for each pixel. More specifically, the VCA-based convolutional kernels are extracted from the HSI by VCA. The SSRN [20] includes a spectral feature learning section, a spatial feature learning section, an average pooling layer, and a fully connected (FC) layer. The spectral feature learning section is composed of two convolutional layers and two spectral residual blocks, and the spatial feature learning section comprises one 3D convolutional layer and two spatial residual blocks. Besides, MSDN-SA [21] directly extends the 2D DenseNet architecture into a 3D DenseNet with multiple-scale dilated convolutions and a spectral-wise attention mechanism; the network structure is set as given in Reference [21].

For the proposed framework, λ is set as 0.75. For the 3D/2DNets algorithm, we set the spatial size to 13 × 13, following the practice in Reference [21]; the specification of the architecture employed on the four datasets in the experiments is given in Table 1. We use Nesterov momentum with a momentum weight of 0.9 without dampening and a weight decay of 10^−4. All models are trained for 90 epochs, with an initial learning rate of 0.1, which is divided by a factor of 10 after 30 and 60 epochs.

Experimental Results
In this section, we first present a synthetic dataset experiment showing the effects of additive noise. We select 100 samples per class from the image for training and use the remaining samples for testing. In this experiment, we use the synthetic datasets of linearly mixed classes to evaluate the algorithm performance under different noise levels; results for Gaussian additive noise with a signal-to-noise ratio (SNR) from 20 dB to 50 dB are shown in Table 2. It can be seen that the proposed ASU-3D/2DNets always achieves the best performance. For example, in the case of SNR = 40 dB, the OA of ASU-3D/2DNets is 5.61% and 3.25% higher than those of MLRsubMLL and SVM-MLRsub-MRF, respectively. Among the deep learning-based methods, R-VCANet adopts an advanced endmember-based convolution to explore deep features in the HSI data, SSRN learns deep spectral-spatial features by decomposing 3D convolutions, and MSDN-SA directly extends the 2D DenseNet architecture into a 3D DenseNet; ASU-3D/2DNets still performs the best. In addition, compared with subspace projection-based methods for separating mixed pixels (such as MLRsubMLL and SVM-MLRsub-MRF), whose best accuracies reach 87.82% on Indian Pines, 93.85% on Salinas Valley, 80.07% on KSC, and 93.85% on Pavia University, our proposed ASU-3D/2DNets performs the best.

Experimental Analysis
In this section, we first examine the effect of the number of training samples. We then qualitatively evaluate the performance of the early-exiting strategy in the proposed framework, investigate the efficacy of the adaptive spectral unmixing through the confusion matrices obtained from the classification results, and provide an ablation study of our ASU-3D/2DNets on the four datasets. Lastly, an experimental analysis on a more challenging HSI dataset is provided.

Effect of Training Samples
The above experimental results have shown that the proposed ASU-3D/2DNets method performs well in HSI classification, especially when training samples are scarce. In this part, we further investigate scenarios with extremely scarce training samples. The curves of AA with respect to the number of training samples are shown in Figure 14. As expected, accuracy increases as the number of training samples increases. We can see from Figure 14 that ASU-3D/2DNets outperforms the other methods in most cases. On the Salinas Valley, KSC, and Pavia University datasets, ASU-3D/2DNets achieves the best results with only a small number of training samples per class. Although the Indian Pines dataset is more challenging to classify, with 3-9% training samples per class, ASU-3D/2DNets scores significantly higher than the other compared methods. It is worth mentioning that 55% of the training samples selected in the Indian Pines dataset were allocated according to the GRSS DASE website [42], and the algorithm also showed good classification results under this fixed train/test split.

Analysis of Early-Exiting Strategy and Unmixing
In order to evaluate the performance of the early-exiting strategy in the proposed framework, quantitative experiments are carried out on the four datasets. As shown in Tables 7-10, the first column indicates the block number, the second column indicates the softmax output threshold T of each block, the third column indicates the number and accuracy of samples output at T without unmixing, and the fourth column indicates the number and accuracy of samples output at T with unmixing. It should be noted that we set λ to 0.75, following the practice in Reference [14]. In the early-exiting strategy, the number of samples output per block changes with the accuracy before and after unmixing. According to the value of T, classification results are output from the first two blocks successively, and the remaining samples are output at the last block, so the value of T in the last block is null.

Next, we investigate the efficacy of the adaptive spectral unmixing through the confusion matrices obtained from the classification results. Under the early-exiting strategy of our proposed ASU-3D/2DNets, we remove the adaptive spectral unmixing method and denote this model as 3D/2DNets. Taking the KSC dataset as an example, the confusion matrices of the classifications obtained by 3D/2DNets and ASU-3D/2DNets are shown in Tables 11 and 12, respectively. As shown in Table 11, row five reveals significant confusion between CP/oak hammock and Slash pine (class 4 and class 5). After adding the adaptive spectral unmixing strategy, as shown in Table 12, the number of class-5 samples misclassified as class 4 decreased from 48 to 9, a reduction of nearly 80%. In Table 11, for Cattail marsh (class 10) in row ten, the misclassified samples are distributed between Spartina marsh (class 9) and Mud flats (class 12). In Table 12, class 10 is completely separated from class 9.
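The per-block exit rule underlying Tables 7-10 can be sketched as follows. This is our hedged reading of the strategy, not the authors' code: `block_probs` holds the softmax outputs of the cascaded classifiers for one sample, and the last threshold is `None` because the final block always emits a label.

```python
def early_exit(block_probs, thresholds):
    """Return (block_index, predicted_class) for one sample.

    The sample exits at the first intermediate classifier whose top softmax
    score reaches its threshold T; the final block (T = None) always
    outputs a prediction, so every sample is eventually labeled.
    """
    for k, (probs, T) in enumerate(zip(block_probs, thresholds)):
        top = max(probs)
        if T is None or top >= T:
            return k, probs.index(top)
```

Easy samples thus leave the network after one or two blocks at low cost, while hard samples, which fall through to the last block, can additionally be handed to the adaptive spectral unmixing step.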
However, 3D/2DNets provides more accurate classification scores than ASU-3D/2DNets in class 8. In general, the performance of 3D/2DNets is further improved by incorporating the adaptive spectral unmixing, with its OA increasing from 88.62% to 92.95% on the KSC dataset. It can be inferred from the above analysis that spectral unmixing provides a useful source of information for classification and can further interpret mixed pixels, especially for classes dominated by highly mixed pixels, whose classification labels may be changed accordingly.

We then conduct an ablation study by removing the adaptive spectral unmixing or removing the first two classifiers. To make a fair comparison, the same preprocessing strategy is used. The obtained OA and AA values are displayed in Tables 13-16. Taking the Indian Pines dataset as an example, the results show that, compared to using 3D convolutional layers, using SSDC improves the OA from 0.9567 to 0.9634. This indicates that, compared with conventional 3D convolution, SSDC can learn and represent spectral-spatial features more efficiently and accurately. Besides, either multiple intermediate classifiers or adaptive spectral unmixing can improve performance to some extent. It is worth noting that passing the test samples through deeper network layers in Model-D leads to no further improvement, while that model contains many more parameters and has higher computational requirements. The same conclusions can be drawn from the other three datasets.
All these results show that the proposed ASU-3D/2DNets outperforms state-of-the-art HSI classification methods. The design of the multiple intermediate classifiers makes it possible to use adaptive spectral unmixing to facilitate classification, which brings considerable benefits in computational requirements and final performance. In addition, SSDC is more capable of processing spectral-spatial features than conventional 3D convolution, and each component of SSDC helps improve classification results.

Experimental Analysis on Challenging HSI Dataset
In this section, we explore the classification results of our algorithm on a more challenging high-resolution HSI dataset. The HSI we use is provided by the Image Analysis and Data Fusion Technical Committee in the 2018 IEEE GRSS Data Fusion Contest, covering the University of Houston Energy Research Park (UHEP), the Earth and Atmospheric Science building (UH01), and one temporary station located at the Baytown airport (KHPY) [43]. It contains 48 bands with a spectral range of 380-1050 nm and a spatial resolution of 1 meter. The size of this data is 601 × 2384, and it contains 504,856 labeled reference samples. The classes and the number of labeled samples in each are listed in Table 17. Classification results on the Houston 2018 dataset with different numbers of training samples are shown in Table 18. It can be seen from Table 18 that our algorithm does not perform well on this dataset, and we analyze the reasons as follows.
Firstly, Table 17 shows that the samples in the given hyperspectral image are severely unbalanced. Some classes, e.g., buildings and roads, have an adequate amount of data for training. However, classes such as water, unpaved parking lots, and artificial turf contain fewer than seven hundred samples. It is well known that unbalanced training data may cause a network to underperform. Therefore, for this dataset, future work on our algorithm should incorporate a data augmentation method for unbalanced data, so as to re-balance the training data while preserving data diversity.
Secondly, after adding the unmixing step, the classification result of the algorithm does not improve but instead degrades. We attribute this to the fact that our algorithm is designed for datasets with low spatial resolution; since the spatial resolution of Houston 2018 is relatively high, further unmixing of the preliminary classification results harms the final result. Therefore, it can be concluded that our algorithm is better suited to HSI with mixed pixels at low spatial resolution.

Conclusions
In this paper, we proposed a network architecture specifically designed for low-resolution hyperspectral datasets with mixed pixels. Based on the fact that HSI data are typically a mix of easy and hard examples, we proposed a specially designed framework for HSI classification that jointly uses 3D/2D dense networks with multiple intermediate classifiers (i.e., 3D/2DNets) and an adaptive spectral unmixing. The design of the multiple intermediate classifiers with the early-exiting strategy makes it possible to use adaptive spectral unmixing to facilitate classification, which decreases the computational requirements and improves the final classification results. Besides, we proposed a 3D/2D convolution based on spectral-spatial information for the proposed framework, which fully takes advantage of 2D convolutions so that fewer 3D convolutions are needed for spectral-spatial feature learning, thereby reducing the training complexity of spectral-spatial fusion. Experimental results on four benchmark datasets show that the proposed method outperforms state-of-the-art deep learning-based and traditional HSI classification methods.