Hybrid Dilated Convolution with Multi-Scale Residual Fusion Network for Hyperspectral Image Classification

The convolutional neural network (CNN) has been proven to outperform traditional methods in hyperspectral image (HSI) classification. However, traditional CNNs for HSI classification tend to focus on spectral features while neglecting spatial information. In this paper, a new HSI model called the local and hybrid dilated convolution fusion network (LDFN) is proposed, which fuses fine-grained local information with rich spatial features by enlarging the receptive field. The details of the LDFN method are as follows. First, several common operations are selected, such as standard convolution, average pooling, dropout and batch normalization. Then, fusion operations combining local and hybrid dilated convolution are included to extract rich spatial-spectral information. Last, the outputs of different convolution layers are gathered in a residual fusion network and finally fed into the softmax layer for classification. Three widely used hyperspectral datasets (i.e., Salinas, Pavia University and Indian Pines) have been used in the experiments, which show that LDFN outperforms state-of-the-art classifiers.


Introduction
Hyperspectral remote sensing makes full use of high-altitude detection equipment operating with visible light, infrared light, microwaves and other techniques; through information processing and transmission, it enables the remote, non-contact classification and recognition of ground objects. A hyperspectral image (HSI) has hundreds of adjacent narrow bands [1] and therefore a large number of channel dimensions, so it plays a significant role in the field of remote sensing. A hyperspectral image carries two complementary kinds of information: spectral information, which provides the ability to differentiate land-cover materials, and spatial information, which provides rich information about the spatial structure. Therefore, HSI is applied widely in many domains, such as military exploration [2,3], agriculture [4,5], environmental monitoring [6,7] and medical treatment [8].
In the early days of hyperspectral image classification, traditional machine learning methods were widely used, for example, support vector machines (SVM) [9,10], k-nearest neighbor (KNN) [11,12], multinomial logistic regression (MLR) [13,14] and decision trees [15,16]. However, the same material can show spectral differences at different spatial locations, and different materials may have similar spectral characteristics, so the obtained maps are still noisy due to the limited ability of these methods to extract spatial structure features. To address the difficulty of classifying hyperspectral images effectively using spectral features alone, many methods that manually extract spatial and spectral features have been proposed, for example, Markov random fields (MRFs) [17] and generalized composite kernel machines [18].
In recent years, with the development of technology, deep learning methods have come to provide features learned automatically rather than handcrafted. The basic idea of deep learning is that the training process determines which features are more significant than others, with few human constraints. Therefore, deep learning methods have been widely used in HSI classification. For instance, M. He et al. [19] proposed a multi-scale 3D deep convolutional neural network for HSI classification that can learn 2D multi-scale spatial features and 1D spectral features from the HSI data end-to-end. S. Mei et al. [20] proposed an unsupervised 3D convolutional auto-encoder (3D-CAE) with a 3D decoder that reconstructs the input patterns, so that all parameters can be trained without labeled training samples. In [21], a spectral-spatial residual network (SSRN) was put forward, which uses the original 3D cube as input data and continuously learns the discriminative feature information of HSI through spectral-spatial residual blocks. In [22], a contextual deep CNN (D-CNN) optimizes local context interaction by exploiting local spatial-spectral relationships of neighboring individual pixel vectors. In [23], a novel synergistic CNN that fuses 2D and 3D networks was proposed for accurate HSI classification. In [24], a 3D CNN based on a residual group channel and attention network was proposed for HSI classification, which strengthens spatial features by extracting spatial context information and can reduce the loss of meaningful information.
These models apply deep learning in different ways. However, as the networks become deeper, they face difficulties in training and declining accuracy.
Considering the problems above, this paper proposes a multi-scale feature fusion network based on local and hybrid dilated convolution (LDFN), which uses a fusion strategy that not only picks up fine-grained local information but also collects rich spatial features by enlarging the receptive field. A residual fusion network is designed to integrate local convolution and hybrid dilated convolution (HDC), giving a deeper network structure with fast connections between layers. Therefore, our method is robust and has an excellent ability to learn spatial-spectral information for classification.
In summary, the main contributions of this paper are three-fold.
(1) A hybrid dilated convolution stacked with different dilation rates is proposed to extract spatial information. (2) Local and hybrid dilated convolution methods are integrated in a way that can simply replace traditional standard convolution. Local convolution connects local pixels closely, which makes the convolution layers more flexible and expressive, while hybrid dilated convolution enlarges the receptive field without increasing the amount of computation, so the spatial-spectral features of hyperspectral images can be fully collected. (3) The proposed model also uses a specific residual network [25] to fuse the preceding HDC and standard convolutions on the main channels, which allows it to extract multi-scale fusion features.
The rest of this paper is arranged as follows. Section 2 discusses the related CNN methods and introduces the LDFN framework for HSI. Section 3 reports experiments on three benchmark hyperspectral datasets. Finally, conclusions are presented in Section 4.

Proposed Methods
Traditional CNN consists of several normal operations, such as convolution operations, activation operations, batch normalization operations and pooling operations. The details of the convolution operations are as follows.
(1) Convolutional Layers. The convolution layer is the most significant part of a convolutional neural network. The input of each node is only a small part of the upper layer. Convolution layers deeply analyze the feature maps of the previous layer through filters and obtain more abstract features; they can therefore deepen the network. Let x be the input data of size h × l × c, where h and l denote the height and width of the spatial dimensions and c represents the number of spectral channels. Let w_o and b_o represent the weight and bias parameters of the oth layer, y_o the output of the oth layer and k the kernel size. The formula of the convolutional layer [26] is y_o = σ(w_o ∗ x + b_o), where ∗ denotes the convolution operation and σ represents the activation function.
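To make the operation concrete, the following minimal NumPy sketch implements a single-filter valid convolution with a bias and a ReLU activation; the function name, shapes and the choice of ReLU are our illustration, not the paper's code.

```python
import numpy as np

def conv2d_single(x, w, b, sigma=lambda z: np.maximum(z, 0.0)):
    """One filter of a convolutional layer: y = sigma(w * x + b).

    x: h x l x c input patch, w: k x k x c kernel, b: scalar bias.
    Valid (no-padding) convolution with a ReLU activation as sigma.
    """
    h, l, c = x.shape
    k = w.shape[0]
    out = np.zeros((h - k + 1, l - k + 1))
    for i in range(h - k + 1):
        for j in range(l - k + 1):
            # Elementwise product of the kernel with the local window, summed
            out[i, j] = np.sum(x[i:i + k, j:j + k, :] * w) + b
    return sigma(out)
```

For example, a 5 × 5 × 2 all-ones patch convolved with a 3 × 3 × 2 all-ones kernel yields a 3 × 3 map whose entries are each 18.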
(2) Dilated Convolution and HDC. When designing the classification algorithm, spatial-spectral characteristics should also be considered. When extracting hyperspectral features, standard convolution relies on repeated operations, which greatly increases the computational cost, while local convolution ignores the spatial similarity of adjacent regions.
The comparison of standard and dilated convolution is shown in Figure 1. Dilated convolution can expand the receptive field of the convolution and capture multi-scale context information, which effectively mitigates insufficient spatial information extraction. However, traditional dilated convolution can lead to two problems: one is the gridding effect, meaning that the effective kernel is not continuous; the other is that long-range information might not be relevant, so the method may fail on small objects. Therefore, a hybrid dilated convolution, which stacks layers with different dilation rates, is adopted and can effectively solve the problems above. Figure 2 illustrates the transition from dilated convolution to HDC. HDC has three characteristics. First, the dilation rates are arranged in a zigzag structure. Second, the dilation rates of the stacked layers cannot have a common factor greater than 1. Last, HDC satisfies the formula [27] M_i = max[M_{i+1} − 2r_i, M_{i+1} − 2(M_{i+1} − r_i), r_i], with M_n = r_n, where r_i represents the dilation rate of the ith layer and M_i represents the maximum distance between two nonzero values in the ith layer; the design goal is M_2 ≤ k for kernel size k. With the integration of local convolution and HDC, the full spatial information can be covered.
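The gridding-free condition above can be checked mechanically. The sketch below (the function name is ours; the recurrence follows [27]) evaluates M_2 for a stack of dilation rates and tests it against the kernel size, confirming that the rates 2, 3 and 5 used later in this paper satisfy M_2 ≤ 3 while a stack such as 2, 4, 8 does not.

```python
def hdc_ok(rates, k=3):
    """Check the HDC gridding-free condition M_2 <= k.

    rates: dilation rates r_1..r_n of the stacked layers (in order).
    Applies M_i = max(M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i)
    backwards from M_n = r_n, then tests M_2 against the kernel size k.
    """
    M = rates[-1]            # M_n = r_n
    Ms = [M]
    for r in reversed(rates[:-1]):
        M = max(M - 2 * r, M - 2 * (M - r), r)
        Ms.append(M)         # Ms = [M_n, M_{n-1}, ..., M_1]
    M2 = Ms[-2] if len(Ms) >= 2 else Ms[-1]
    return M2 <= k
```

For rates (2, 3, 5) with a 3 × 3 kernel, M_3 = 5, M_2 = 3 and M_1 = 2, so the condition M_2 ≤ 3 holds; for (2, 4, 8), M_2 = 4 > 3 and the stack would suffer from gridding.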

HSI Classification Based on LDFN
HSIs have four characteristics: band correlation, high resolution, a large volume of data and spectral variability. To address these, the proposed deep CNN combines local convolution and hybrid dilated convolution, which extracts not only rich spectral content but also rich spatial information. The framework of the proposed LDFN model is shown in Figure 3. As shown there, the HSI data are first preprocessed with principal component analysis (PCA) [28]: some useless bands are removed and the dimensionality of the HSI is then reduced. This is done to pick out the most effective components of the hyperspectral information, after which patch blocks centered on labeled pixels are extracted to train LDFN.
The overall process of LDFN is as follows. The original input image block is X ∈ R^(H×W×C), where H and W represent the height and width of the image in the spatial dimensions and C represents the number of bands in the spectral dimension. First, the image block is fed into a 3 × 3 two-dimensional convolutional layer. After that, the main channel is divided into two branches. In one branch, the image block passes through two 1 × 1 local convolutional layers. In the other, it enters a hybrid dilated convolution (HDC) block composed of a stack of dilated convolution layers with dilation rates of 2, 3 and 5. The two branches are then integrated to form a composite layer, which is fed into a residual block. In the residual block, two 3 × 3 convolution layers extract input features and generate an output feature layer, and a cross-layer connection concatenates the HDC layer, the composite layer and the output feature layer.
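The PCA-based dimensionality reduction and the extraction of patch blocks centered on labeled pixels can be sketched as follows; this is a minimal NumPy illustration under our own naming (a library PCA, e.g. from scikit-learn, would be equivalent), not the authors' implementation.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Reduce the spectral dimension of an H x W x C cube via PCA
    on the band dimension, keeping n_components components."""
    H, W, C = cube.shape
    flat = cube.reshape(-1, C).astype(np.float64)
    flat -= flat.mean(axis=0)
    # Eigen-decomposition of the C x C band covariance matrix
    cov = flat.T @ flat / (flat.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return (flat @ top).reshape(H, W, n_components)

def extract_patch(cube, row, col, size):
    """Extract a size x size patch centered on pixel (row, col),
    zero-padding the borders so every labeled pixel yields a full patch."""
    pad = size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)))
    return padded[row:row + size, col:col + size, :]
```

For instance, a 10 × 10 × 30 cube reduced to 20 components yields a 10 × 10 × 20 cube, from which an 11 × 11 × 20 patch can be taken around any labeled pixel.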
Then, the fused feature map passes through a 1 × 1 two-dimensional convolutional layer, a 2 × 2 average pooling layer and a global average pooling layer. Finally, the high-level features are input to the softmax layer to predict the classification label.
The number of filters is 48, except for the first convolution, which has 16 filters. Batch normalization and ReLU activation follow each convolution operation except the first local convolution. The dilation rates in the HDC are 2, 3 and 5. Last but not least, dropout is applied after the two local convolutions, with rates of 0.2 and 0.5, respectively.
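A rough Keras sketch of the layout described above is given below, assuming the TensorFlow 2/Keras stack the paper reports using. The function name and several details are our assumptions: the paper's local convolutions are approximated here by ordinary 1 × 1 convolutions, "same" padding is assumed so the branches can be concatenated, and the final softmax is realized as a dense layer. This is an illustration of the structure, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ldfn_sketch(patch=11, bands=20, n_classes=16):
    """Sketch of the LDFN layout: first conv, local and HDC branches,
    composite layer, residual block with cross-layer concat, then softmax."""
    x_in = layers.Input((patch, patch, bands))
    x = layers.Conv2D(16, 3, padding="same")(x_in)   # first conv, 16 filters
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    # Upper branch: two 1 x 1 "local" convolutions with dropout 0.2 / 0.5
    up = layers.Conv2D(48, 1, padding="same")(x)
    up = layers.Dropout(0.2)(up)
    up = layers.Conv2D(48, 1, padding="same", activation="relu")(up)
    up = layers.BatchNormalization()(up)
    up = layers.Dropout(0.5)(up)

    # Lower branch: HDC block with dilation rates 2, 3 and 5
    low = x
    for rate in (2, 3, 5):
        low = layers.Conv2D(48, 3, padding="same", dilation_rate=rate,
                            activation="relu")(low)
        low = layers.BatchNormalization()(low)

    comp = layers.Concatenate()([up, low])           # composite layer

    # Residual block: two 3 x 3 convolutions, then the cross-layer concat
    res = layers.Conv2D(48, 3, padding="same", activation="relu")(comp)
    res = layers.Conv2D(48, 3, padding="same", activation="relu")(res)
    fused = layers.Concatenate()([low, comp, res])

    y = layers.Conv2D(48, 1, padding="same", activation="relu")(fused)
    y = layers.AveragePooling2D(2)(y)
    y = layers.GlobalAveragePooling2D()(y)
    y_out = layers.Dense(n_classes, activation="softmax")(y)
    return tf.keras.Model(x_in, y_out)
```

Built with an 11 × 11 × 20 input, the sketch ends in an n_classes-way probability vector per patch.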

Datasets and Baseline
In this paper, three benchmark hyperspectral datasets were used to verify the effectiveness of the proposed LDFN model: Indian Pines, Salinas and the University of Pavia. Figure 4 shows a band image, the ground truth and the color code of the Indian Pines dataset; Figure 5 shows the same for the Salinas dataset; and Figure 6 shows the same for the University of Pavia dataset. Supervised learning needs a large amount of labeled data, but hyperspectral labels are scarce and the labeling process is very complex. Therefore, the experiments use small samples, which effectively addresses the problem of insufficient labels in hyperspectral data. The proportion of training samples in each of the three datasets is less than 10%.
The Indian Pines dataset consists of 145 × 145 pixels and 224 spectral reflectance bands; the wavelength range is 0.4-2.5 µm with a spatial resolution of 20 m. The segmentation details of the samples are listed in Table 1. The Salinas dataset was collected over Salinas Valley in California; the covered area comprises 217 samples per line, and the Salinas ground truth also contains 16 classes. The segmentation details of the samples in the Salinas dataset are listed in Table 2. The University of Pavia dataset was collected during a flight over Pavia in northern Italy; the image is 610 × 340 pixels and the geometric resolution is 1.3 m. The segmentation details of the samples in the University of Pavia dataset are listed in Table 3.
The proposed LDFN model is built on TensorFlow 2.0 with the Keras framework, using the Python programming language. The experiments were trained and tested on a GeForce GTX 1660 GPU with 16.00 GB RAM. The Adam optimizer [29] is adopted, with 100 epochs and a mini-batch size of 64. The training and test groups are divided at a ratio of 1:9. The initial learning rate is 0.001. To unify the input pixels across the three datasets, adjacent pixel units of the same size are fed into the model. Figure 7 shows the OA obtained by LDFN on the three datasets under different hyperparameters.
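The 1:9 train/test division of labeled pixels can be sketched as a per-class random split, as below; the function name and the convention that label 0 marks unlabeled pixels are our assumptions, not taken from the paper.

```python
import numpy as np

def split_labeled_pixels(labels, train_ratio=0.1, seed=0):
    """Per-class random split of labeled pixels into train/test index sets.

    labels: array of class labels (0 = unlabeled, ignored).
    Returns flat pixel indices for training and testing at the given ratio.
    """
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        if cls == 0:
            continue
        idx = np.flatnonzero(labels.ravel() == cls)
        rng.shuffle(idx)
        # Keep at least one training pixel per class
        n_train = max(1, int(round(train_ratio * idx.size)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```

With a 10% ratio, a class of 20 labeled pixels contributes 2 training pixels and 18 test pixels, and so on per class.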

1. According to the curves in Figure 7a, with the number of principal components fixed at 20 for all three datasets, OA increases rapidly at first, gradually reaches a steady state and then drops, which indicates that patch blocks that are too large or too small prevent the model from being stable and optimal. Therefore, the patch size is set to 11 × 11, 13 × 13 and 11 × 11 for Indian Pines, Salinas and the University of Pavia, respectively.
2. Figure 7b shows how the curves change with the number of principal components, with the patch size fixed at 11 × 11 for all three datasets. The overall accuracy increases to a steady state and then drops, which means that a reasonable increase in the number of principal components helps extract rich spectral information, but an excessive number degrades network performance. Therefore, the numbers of principal components are set to 25, 20 and 20 for Indian Pines, Salinas and the University of Pavia, respectively.

Quantitative Metrics and Compared Methods
Three objective metrics are used to evaluate the performance of the different HSI classification methods: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient.
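These three metrics can be computed from a confusion matrix as in the following minimal NumPy sketch (the function name is ours; OA is the fraction of correctly classified samples, AA the mean of the per-class accuracies, and Kappa the chance-corrected agreement).

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, n_classes):
    """Overall accuracy, average accuracy and Cohen's kappa from labels.

    Assumes every class in 0..n_classes-1 appears at least once in y_true.
    """
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                              # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

For example, true labels [0, 0, 1, 1] with predictions [0, 0, 1, 0] give OA = 0.75, AA = 0.75 and Kappa = 0.5.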
The proposed LDFN is compared with several other methods, which can be broadly split into two groups. One is traditional machine learning, represented by SVM [5]. The other consists of deep learning methods, including 3D-CNN [19], 3D-CAE [20], D-CNN [22] and SSRN [21]. All compared methods use the same input patch size as our LDFN model.

Classification Results
The first experiment is carried out on the Indian Pines dataset. All methods use 10% of the samples to train the model and 90% to test. Table 4 reports the quantitative results of the different methods, averaged over 10 training runs. The accuracies of SVM, 3D-CNN and 3D-CAE are below 95% on all three metrics, while D-CNN, with its contextual deep CNN framework, and SSRN, with several residual blocks, exceed 95% on all three. Overall, our proposed LDFN model performs better than the other methods on all three metrics. Figure 8 shows the classification maps of the different methods for the Indian Pines dataset. SVM suffers from serious noise; 3D-CNN and 3D-CAE are smoother than SVM but still show obvious noise. SSRN and LDFN perform well with less noise, and our LDFN model is better than SSRN in terms of detail.
The second experiment is based on the Salinas dataset. For all methods, 1% of the samples were chosen to train the model and 99% to test. Table 5 reports the quantitative results, averaged over 10 training runs. The traditional SVM only reaches a classification accuracy of around 80%, whereas the deep learning methods generally exceed 95%; the deeper networks of 3D-CNN, 3D-CAE and D-CNN reach accuracies of about 95%. Our OA is 99.36%, AA is 99.56% and the Kappa coefficient is 98.29%. Figure 9 shows the classification maps for the Salinas dataset, where the LDFN result is clearly smoother than those of the compared methods; the performance of our LDFN model is therefore better.
The third experiment is carried out on the University of Pavia dataset, for which 2% of the samples are selected to train all models and 98% to test.
Table 6 reports the quantitative results of the different methods, averaged over 10 training runs. The OA of our LDFN model is 99.19%, AA is 98.89% and Kappa is 98.92%, outperforming the 98.57% OA, 97.16% AA and 98.27% Kappa of SSRN, as well as the other deep learning methods, whose accuracies are around 95%. Because it makes no use of spatial features, the traditional SVM only reaches an accuracy of about 80% on average. Figure 10 shows the classification maps of the different methods for the University of Pavia dataset, which intuitively confirm that our LDFN model performs better visually, especially in the details of edges and local parts.

Comparison of Different Local and HDC Fusion Strategies
In this section, different structures of local and HDC fusion models are compared to prove the effectiveness of the LDFN model. Since the local spatial-spectral content is extracted by local convolution, the dilation rates of the HDC start at 2 instead of 1. Table 7 reports the OA values obtained by the local and HDC fusion models. LDFN24 denotes dilation rates of 2 and 4, LDFN25 rates of 2 and 5, and LDFN34 rates of 3 and 4; in particular, LDFN234 stacks three dilation rates of 2, 3 and 4. Table 7 shows that models with the local and HDC structure achieve better HSI classification results than the compared fusion model D-CNN. Meanwhile, the experimental results confirm that the proposed LDFN indeed performs better than the variants with other dilation-rate combinations. In general, the extensive experiments on three HSI datasets prove that our LDFN model is not only stable and convenient to train but also effective and technically advanced.

Conclusions
In this paper, a novel deep learning method called the local and hybrid dilated convolution fusion network (LDFN) was proposed for HSI classification. The proposed network fuses local convolution and hybrid dilated convolution. Local convolution connects local pixels closely, which makes the convolution layers more flexible and expressive, while hybrid dilated convolution, stacked with dilation rates of 2, 3 and 5, enlarges the receptive field and accounts for the spatial correlation of adjacent areas in hyperspectral images without increasing the amount of computation. The model also uses a specific residual fusion network to integrate the preceding HDC and standard convolutions on the main channels, which not only solves the problem of an insufficient receptive field but also extracts multi-scale feature information. Experimental results demonstrate that the LDFN model achieves satisfactory classification accuracy for hyperspectral images under a lightweight standard. The proposed LDFN model still has considerable room for improvement: it retains redundant parameters and needs some training time to extract spectral-spatial features. In future research, more attention will be paid to multi-scale information fusion and to reducing model parameters, which may help optimize the LDFN model and better integrate spectral and spatial features for HSI classification.
Author Contributions: C.L., Z.Q. conceived the ideas; X.C., Z.C. and Z.H. gave suggestions for improvement; C.L. and Z.Q. conducted the experiment and compiled the paper. H.G. assisted and revised the paper. All authors have read and agreed to the published version of the manuscript.