Densely Connected Pyramidal Dilated Convolutional Network for Hyperspectral Image Classification

Abstract: Recently, with the extensive application of deep learning techniques, particularly convolutional neural networks (CNNs), in the hyperspectral image (HSI) field, research on HSI classification has stepped into a new stage. To enlarge the small receptive field of naive convolution, dilated convolution has been introduced into HSI classification. However, dilated convolution usually generates blind spots in the receptive field, so the spatial information obtained is discontinuous. To solve this problem, a densely connected pyramidal dilated convolutional network (PDCNet) is proposed in this paper. Firstly, a pyramidal dilated convolutional (PDC) layer that integrates different numbers of sub-dilated convolutional layers is proposed, where the dilation factor of the sub-dilated convolutions increases exponentially, achieving multi-scale receptive fields. Secondly, the number of sub-dilated convolutional layers increases in a pyramidal pattern with the depth of the network, thereby capturing more comprehensive hyperspectral information in the receptive field. Furthermore, a feature fusion mechanism combining pixel-by-pixel addition and channel stacking is adopted to extract more abstract spectral–spatial features. Finally, to reuse the features of previous layers more effectively, dense connections are applied in densely pyramidal dilated convolutional (DPDC) blocks. Experiments on three well-known HSI datasets indicate that the proposed PDCNet achieves good classification performance compared with other popular models.


Introduction
Hyperspectral remote sensing images are characterized by high dimensionality, high resolution, and rich spectral and spatial information [1], and have been widely used in numerous real-world tasks, such as sea ice detection [2], ecosystem monitoring [3,4], vegetation species analysis [5] and classification tasks [6,7]. With the rapid progress of remote sensing technology and artificial intelligence (AI), a great number of new deep learning theories and methods have been proposed to handle the challenges faced by the hyperspectral image field [8].
Hyperspectral image classification is a vital branch of the HSI field, and has gradually become a crucial research direction for scholars in the AI industry. It is worth noting that hyperspectral image pixel-level classification determines the category label of each pixel, whereas segmentation determines the boundaries of objects of a given category; the two tasks are related, since segmentation involves the classification of individual pixels. A number of conventional spectral-based classifiers, such as support vector machines (SVM) [9,10], random forest [11–13], k-nearest neighbors (kNN) [14–16] and Bayesian classifiers [17], only show good classification performance when abundant labeled training samples are available. Recently, more and more deep-learning-based methods have been proposed; for example, segmentation-based algorithms can decompose hyperspectral images into inhomogeneous blocks, which well maintains their homogeneous characteristics. Sun et al. proposed a fully convolutional segmentation network, which can simultaneously recognize the true labels of all pixels in an HSI cube [48]; for cubes that contain more land cover categories, it has better recognition capabilities. The DeepLab v3+ network shows strong performance in the field of semantic segmentation, and Si et al. applied it to HSI classification for feature extraction [49], with an SVM classifier then used to obtain the final classification result.
A CNN achieves better classification performance if its convolutional layers can capture more spectral-spatial information. Although the small receptive field of naive convolution can be effectively enlarged by dilated convolution, there are unrecognized regions (blind spots) in the receptive field of dilated convolution. Inspired by the densely connected multi-dilated DenseNet (D3Net) [50], a densely connected pyramidal dilated convolutional network for HSI classification is proposed in this paper to acquire more comprehensive feature information. The network is composed of several densely pyramidal dilated convolutional blocks and transition layers. In order to increase the size of the receptive field and eliminate blind spots without increasing the number of parameters, dilated convolutions with different dilation factors are combined into PDC layers. A hybrid feature fusion mechanism is applied to obtain richer information and reduce the depth of the network. The main contributions of the paper are summarized as follows. Firstly, a larger receptive field is obtained by applying dilated convolution to the CNN. Furthermore, in order to avoid blind spots in the receptive field of the feature maps extracted by dilated convolution, we set the dilation factors appropriately and increase the width of the network. Then, the hybrid feature fusion method of pixel-by-pixel addition and channel stacking is applied to extract more abstract feature information while effectively reusing features. In addition, by combining dilated convolution and dense connections, our network (PDCNet) achieves better performance than some popular methods on well-known datasets (the Indian Pines, Pavia University and Salinas Valley datasets).
The remainder of this paper is organized as follows: some state-of-the-art convolutional neural network technologies for HSI classification are introduced in Section 2. In Section 3, the methods and network architecture proposed in this paper are described in detail. The experimental settings and classification results are presented in Section 4. The discussion of training samples, the number of parameters and the running time of the networks is carried out in Section 5. The conclusion of the paper and the outlook for future work are given in Section 6.

Related Work
Before introducing the hyperspectral image classification network proposed in this paper, some relevant techniques are reviewed in this section, namely residual network structure, pyramidal network structure, and dilated convolution.

Residual Network Structure
CNNs can achieve good HSI classification performance. However, when the depth of the network reaches a certain degree, the vanishing gradient phenomenon becomes more and more obvious, which leads to the degradation of network performance. ResNet [37] addresses this problem by adding identity mappings between layers. Recently, the idea of ResNet has been applied to various network models with good results. In order to solve the problems of the small receptive field and localized feature information obtained by naive convolution, Meng et al. proposed a deep residual involution network (DRIN) for hyperspectral image classification by combining residual connections and involution [51]. It can model long-range spatial interactions through enlarged involution kernels, which makes the feature information obtained by the network more comprehensive. Hyperspectral images often have high-dimensional characteristics, and treating all bands equally causes the neural network to learn features from bands that are useless for classification, which affects the final classification results. To solve this problem, Zhu et al. combined residual connections with an attention mechanism and proposed a residual spectral-spatial attention network (RSSAN) for HSI classification [6]. Firstly, the spectral-spatial attention mechanism is used to emphasize useful bands and suppress useless ones. Then, the resulting feature information is sent to the residual spectral-spatial attention (RSSA) module. However, how to judge which bands are useful is a key problem, and the attention mechanism in the RSSA module increases the number of parameters and the computational cost.
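As a minimal sketch (not the DRIN or RSSAN implementation), the identity mapping that ResNet adds between layers simply sums a layer's transformation with its input; `f` below is an illustrative stand-in for the weighted layers:

```python
import numpy as np

def residual_unit(x, f):
    """Identity-mapping residual unit: output = F(x) + x.
    f stands in for the weighted layers; any shape-preserving callable works."""
    return f(x) + x

# Even if f collapses toward zero, the identity path keeps the signal
# (and the gradient) flowing, which mitigates vanishing gradients.
x = np.array([1.0, 2.0, 3.0])
y = residual_unit(x, lambda t: 0.0 * t)  # output equals the input
```

Because the identity path bypasses `f`, deep stacks of such units degrade gracefully rather than losing the signal.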

Pyramidal Network Structure
Based on the idea of ResNet, a pyramidal residual network (PresNet) for hyperspectral image classification was proposed in [41]; it can incorporate more location information as the depth of the network increases. In the basic unit of the pyramidal residual network, the number of channels of each convolutional layer increases in a pyramid shape. In order to extract more discriminative and refined spectral-spatial features, Shi et al. proposed a double-branch network for hyperspectral image classification by combining the attention mechanism and pyramidal convolution [52]. Each branch contains two modules, namely the pyramidal spectral block (with spectral attention) and the pyramidal spatial block (with spatial attention). To overcome the limitation that a pyramidal convolutional layer has a single-size receptive field, Gong et al. proposed a pyramid pooling module, which can aggregate multiple receptive fields of different scales and obtain more discriminative spatial context information [53]. The pyramid pooling module is mainly implemented by average pooling layers of different sizes, after which the feature maps are restored to the original image size through deconvolution. However, the multi-path network model has more parameters than a single-path structure, which increases the running time of the network. In addition, the average pooling layers reduce the size of the feature maps and lose some feature information.

Dilated Convolution
Convolutional neural networks have shown outstanding performance in the field of hyperspectral image classification in recent years. However, naive convolution focuses on local feature information of hyperspectral images, which prevents the network from learning the spatial similarity of adjacent regions. As shown in Figure 1, the receptive field of dilated convolution is usually larger than that of naive convolution, so more spatial information can be obtained, which effectively alleviates the limited-feature problem of naive convolution. It is worth noting that, as shown in Figure 1b, there are unrecognized regions (blind spots) in the receptive field of dilated convolution, which cause the obtained spatial information to be discontinuous. A hybrid dilated convolution method combined with multi-scale residuals was proposed for HSI classification and obtains good classification results [54]. Although it obtains a larger receptive field through hybrid dilated convolution, there are still many blind spots in the receptive field. Furthermore, traditional CNNs mostly use fixed convolution kernels to extract features, which is unfriendly to the multi-scale features in hyperspectral images. To solve the above problems, Gao et al. proposed a multi-depth and multi-scale residual block (MDMSRB), which can fuse multi-scale receptive fields and multi-level features [55]. Although MDMSRB can integrate multi-scale receptive fields, the problem of blind spots in the receptive field has not really been solved; in other words, when skip connections are introduced between different dilated convolutional layers, there are still unrecognized areas in the receptive fields corresponding to the skip connections.
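The blind-spot (gridding) effect can be checked with a small sketch: for 3-tap dilated convolutions in one dimension, the set of input positions that influence one output position is the Minkowski sum of each layer's tap offsets. This is an illustrative calculation, not code from [54] or [55]:

```python
def receptive_offsets(dilations, kernel=3):
    """1-D input offsets that influence one output position after
    stacking `kernel`-tap dilated convolutions with the given dilations."""
    half = kernel // 2
    offsets = {0}
    for d in dilations:
        taps = [k * d for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

def has_blind_spots(offsets):
    """True if the covered span contains unreached positions (gridding)."""
    return set(range(offsets[0], offsets[-1] + 1)) != set(offsets)

print(receptive_offsets([2]))                         # [-2, 0, 2]: -1 and 1 are blind spots
print(has_blind_spots(receptive_offsets([2, 2, 2])))  # True: repeating d=2 never fills the gaps
print(has_blind_spots(receptive_offsets([1, 2, 4])))  # False: exponential dilations tile the field
```

This matches the paper's design rule: a stack whose dilation factors grow exponentially starting from 1 covers the enlarged receptive field without holes, whereas a fixed large dilation leaves unrecognized positions.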
In order to take full advantage of dilated convolution, Xu et al. extended the idea of multi-scale feature fusion and dilated convolution from the spatial dimension to the spectral dimension by combining dilated convolution, 3D CNNs and residual connections, making it better suited to HSI classification [27]. This method can obtain a wider range of spectral information, which is a unique advantage of dilated convolution in 3D CNNs. However, introducing dilated convolution into the spectral dimension also brings the blind-spot problem, leading to discontinuity in the obtained spectral information. To overcome these problems, the PDCNet model is proposed in this paper.

Densely Connected Network Structure
With the development of deep learning, neural networks show excellent performance on image recognition tasks compared with traditional machine learning methods. Simonyan et al. proposed the famous VGGNet in 2014 [56], which is mainly used in large-scale image recognition. Subsequently, ResNet [37] and DenseNet [45] were applied to HSI classification; they can extract more abstract spectral-spatial features with fewer parameters. DenseNet has an advantage over ResNet in that it applies more skip connections, which improves the reuse of the spectral-spatial features of previous layers and alleviates the vanishing gradient.
All layers in DenseNet are directly connected to ensure the maximum transmission of information between network layers. Simply put, the input of each layer is the output of all previous layers. As depicted in Figure 2, the densely connected structure is composed of several basic units, where the input of the n-th basic unit (X^(n)) consists of the outputs of all previous units (1, 2, ..., n − 1) and the input of the 1st basic unit, and the output of each basic unit in turn becomes part of the input of the next. Each basic unit contains a batch normalization (BN) layer, the ReLU activation function and a convolutional layer. The BN layer scales the input data to an appropriate range, and the expressive ability of the neural network is then improved by the ReLU nonlinear activation function. The BN layer is defined as

Y^(i) = γ · (X^(i) − E[X^(i)]) / sqrt(Var[X^(i)] + ε) + β,

where γ and β are the scaling factor and the shift factor, respectively, E[·] and Var[·] are the mean and variance of the input data, and ε is a small constant for numerical stability. The BN layer can effectively avoid internal covariate shift and keep the data distribution stable. The output of the ReLU layer is sent to the convolutional layer to extract richer information.
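A minimal numpy sketch of the two ingredients described above; the `unit` callables are illustrative stand-ins for full BN–ReLU–Conv composites, not the paper's layers:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply the learnable
    scaling factor gamma and shift factor beta."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

def dense_forward(x, units):
    """Dense connectivity: each basic unit receives the channel-stacked
    outputs of the input and all previous units."""
    features = [x]
    for unit in units:
        features.append(unit(np.concatenate(features, axis=-1)))
    return np.concatenate(features, axis=-1)
```

Note how the channel count grows with every unit: the final output stacks the block input together with every intermediate feature map, which is exactly what enables feature reuse.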

Densely Pyramidal Dilated Convolutional Block
Dilated convolution, rather than naive convolution, is applied to DPDC blocks, which can integrate more multi-scale context information without loss of resolution [54], thereby improving spatial information utilization of HSI. The dilated convolution and receptive field will be described in detail in Section 3.3.
The three different convolutional blocks are depicted in Figure 3. As shown in Figure 3a, three naive convolutional layers are densely connected. In order to increase the receptive field and obtain richer hyperspectral information without shrinking the feature maps, dilated convolution is applied to replace naive convolution. As depicted in Figure 3b, a larger receptive field is obtained by densely connecting multiple dilated convolutions with different dilation factors, but there are blind spots in the receptive field, which make the acquired feature information discontinuous. Appropriately setting the dilation factors and increasing the width of the network like a pyramid are effective ways to obtain more abstract and comprehensive feature information (Figure 3c). The DPDC block in this paper is composed of several PDC layers, and dense connections are adopted between different PDC layers to increase the flow of information within the network. Each PDC layer is composed of sub-dilated convolutional layers with different dilation factors:

N_k = n_{d_1}^k Λ n_{d_2}^k Λ · · · Λ n_{d_k}^k, with d_i = 2^(i−1),

where N_k represents the k-th PDC layer, n_{d_i}^k denotes the i-th sub-dilated convolutional layer (with dilation factor d_i = 2^(i−1)) in the k-th PDC layer, and Λ represents the stacking of sub-dilated convolutional layers. Different skip connections correspond to different dilation factors; generally speaking, a shallower skip connection corresponds to a smaller dilation factor. For instance, the skip connection between the input feature and the 3rd PDC layer corresponds to a sub-dilated convolutional layer with a dilation factor of 1, while the skip connection between the 1st PDC layer and the 3rd PDC layer corresponds to a sub-dilated convolutional layer with a dilation factor of 2. The width of the network thus increases with the number of PDC layers.
The advantage of the structure is that more and larger ranges of spatial information can be obtained, while avoiding blind spots in the receptive field.
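A one-dimensional sketch of a PDC layer under the definitions above (kernels, shapes and the 1-D setting are illustrative assumptions, not the paper's trained weights): each skip input gets its own 3-tap kernel with an exponentially growing dilation factor, and the sub-layer outputs are fused by pixel-wise addition.

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """'Same'-padded 1-D convolution with a 3-tap kernel w and dilation d."""
    xp = np.pad(x, d)
    n = len(x)
    return sum(w[j] * xp[d + (j - 1) * d : d + (j - 1) * d + n] for j in range(3))

def pdc_layer(skips, kernels):
    """PDC layer: the i-th skip input is convolved with dilation 2**i,
    and the sub-layer outputs are merged by pixel-by-pixel addition."""
    return sum(dilated_conv1d(x, w, 2 ** i)
               for i, (x, w) in enumerate(zip(skips, kernels)))
```

With the identity kernel [0, 1, 0] each sub-layer passes its input through unchanged, so a two-skip PDC layer simply returns the sum of its inputs; with learned kernels, each skip contributes context at its own scale.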

Receptive Field
The receptive field is defined as the region that influences each neuron in the model; in other words, it is the area of the original image that a pixel in the output feature map of a given layer is mapped to. The receptive field of the 3rd layer of each block in Figure 3 is depicted in Figure 4, where the convolutional kernel size is 3 × 3. Red dots represent the points to which the filter is applied, and colored backgrounds represent the receptive field covered by the red dots. Suppose that the input data is fed directly into these three blocks. The receptive field of the 3rd layer in the densely naive convolutional block: as shown in Figure 4a, the 3 × 3 receptive field (purple shaded area) corresponds to the skip connection between the input and the 3rd layer (see Figure 3a); the 5 × 5 receptive field (green shaded area) corresponds to the skip connection between the 1st layer and the 3rd layer; and the 7 × 7 receptive field (blue shaded area) corresponds to the skip connection between the 2nd layer and the 3rd layer. All of them correspond to one grid point in the output feature map (yellow shaded area).
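The receptive-field sizes quoted above follow a simple recurrence: each stacked stride-1 3 × 3 convolution with dilation d adds 2d to the side length. A quick illustrative check:

```python
def receptive_field(dilations, kernel=3):
    """Side length of the receptive field after stacking stride-1
    convolutions with the given kernel size and dilation factors."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field([1]))        # 3  -> the 3x3 purple area
print(receptive_field([1, 1]))     # 5  -> the 5x5 green area
print(receptive_field([1, 1, 1]))  # 7  -> the 7x7 blue area
```

The same formula shows why dilation pays off: three layers with dilations 1, 2, 4 span a 15 × 15 field with no extra parameters.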
The receptive field of the 3rd layer in the densely naive dilated convolutional block: as shown in Figure 4b, the receptive field corresponding to the skip connection between the input and the 3rd layer (purple shaded area; see Figure 3b) contains a large number of unrecognized areas, which leads to discontinuous hyperspectral information. The skip connection from the 1st layer to the 3rd layer corresponds to a larger receptive field than in the densely naive convolutional block, but there are still blind spots in the receptive field, caused by the unreasonable setting of the dilation factors.
The receptive field of the 3rd layer in the DPDC block: as shown in Figure 4c, compared with the densely naive convolutional block, the skip connection from the 1st layer to the 3rd layer in the densely pyramidal dilated convolutional block (see Figure 3c) has a larger receptive field. Compared with the densely naive dilated convolutional block, there are no blind spots in the receptive fields corresponding to the skip connections from the 1st layer to the 3rd layer in the pyramidal dilated convolutional block. This mainly benefits from our reasonable setting of the dilation factors and the design of the PDC layer, which performs different convolutional operations on the feature maps from different skip connections. For instance, the 3rd PDC layer in Figure 3c performs d = 1 and d = 2 dilated convolutional operations on the feature maps from two different skip connections, respectively. In DenseNet, the feature maps of all previous k − 1 layers [x_0, x_1, · · · , x_{k−1}] are used as the input of the k-th layer:

x_k = w_k ⊗ R([x_0, x_1, · · · , x_{k−1}]),

where R(·) refers to the composite operation of batch normalization and the ReLU activation function, [x_0, x_1, · · · , x_{k−1}] denotes the channel-wise stacking of the feature maps from layer 0 to layer k − 1 (x_0 is the input feature), ⊗ denotes convolution, and the size of the convolutional kernel w_k is 3 × 3. Using ⊗_d with dilation factor d = 2^(k−1) to represent dilated convolution, a variation of this equation can be acquired by applying ⊗_d to DenseNet:

x_k = w_k ⊗_d R([x_0, x_1, · · · , x_{k−1}]).

However, the skip connections cause blind spots in the receptive field, so the feature information learned by the convolutional layer is not comprehensive.
To overcome this problem, the densely pyramidal dilated convolutional block is proposed and defined as follows:

x_k = Σ_{i=0}^{k−1} w_k^{i+1} ⊗_{d_{i+1}} R(x_i), with d_{i+1} = 2^i,

where x_k is the output of the k-th composite layer, W_k refers to the set of convolutional kernels at the k-th layer, and w_k^i denotes the convolutional kernel corresponding to the i-th skip connection of the k-th layer (w_k^i is a subset of W_k). The continuity of spatial information is well preserved in the DPDC block (Figure 4c). In other words, the blind-spot problem of the densely naive dilated convolutional block is effectively solved by choosing appropriate dilation factors and increasing the network width like a pyramid. The more comprehensive feature information of the PDC layer is obtained by pixel-level addition of the feature maps of its internal sub-layers. Furthermore, the dense connection mode is adopted between PDC layers, which enables more effective feature reuse.

PDCNet Model
Taking PDCNet with three DPDC blocks as an example, its network structure is shown in Figure 5. BN + ReLU + Convolution (hereinafter referred to as Conv) is used as our basic structure; the BN and ReLU operations are omitted in Figure 5. The DenseNet model for HSI classification, the DPDC block, receptive fields and dilated convolution were introduced in Sections 3.1-3.3. Although the size of the receptive field can be effectively increased by dilated convolution, the feature information obtained is discontinuous due to the existence of blind spots. Therefore, while the dilation factors are carefully set in the DPDC block, the network width gradually increases like a pyramid, which is conducive to eliminating blind spots and acquiring large-range, multi-scale feature information. Furthermore, to take advantage of the features of previous layers, the dense connection pattern is introduced into PDCNet. High classification accuracy is achieved by combining dilated convolution and dense connections to extract more comprehensive and richer features. The Indian Pines dataset is used as an example input to the proposed PDCNet model. Figure 5. The framework of PDCNet.
PDCNet is composed of three DPDC blocks and two transition layers, with the transition layers embedded between the DPDC blocks. The hyperspectral image is divided into cubes and fed into the proposed network. Firstly, the input features are sent to a convolutional layer (with a kernel size of 3 × 3) for feature extraction, and then to the subsequent modules of the network. Each DPDC block densely connects a different number of PDC layers, while each PDC layer N_k is stacked from sub-dilated convolutional layers n_d^k. The input features of the DPDC block are allocated to the dilated convolutional layers within each PDC layer through skip connections.
Secondly, a hybrid feature fusion mechanism is applied in PDCNet. As shown in Figure 3c, the DPDC block contains two feature fusion methods: pixel-by-pixel addition is used within each PDC layer, and channel stacking is applied between PDC layers. This hybrid feature fusion mechanism reuses the output features of all previous layers while integrating large-range, multi-scale feature maps. In order to flexibly change the number of channels and reduce the number of parameters, a Conv with a kernel size of 1 × 1 is applied in the transition layer. Finally, the classification results are obtained by an adaptive average pooling layer and a fully connected layer.
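The transition-plus-classifier tail can be sketched with illustrative shapes (a 1 × 1 convolution is just a per-pixel linear map over channels; the weight matrices below are placeholders, not trained parameters):

```python
import numpy as np

def classify(x, w_trans, w_fc):
    """1x1 Conv (channel reduction) -> adaptive average pooling to 1x1
    -> fully connected layer producing class scores."""
    t = x @ w_trans               # (H, W, C_in) @ (C_in, C_mid) -> (H, W, C_mid)
    pooled = t.mean(axis=(0, 1))  # global average pool: (C_mid,)
    return pooled @ w_fc          # (C_mid,) @ (C_mid, n_classes) -> (n_classes,)
```

Because the pooling averages over all spatial positions, the head works for any input patch size, which is what makes the pooling "adaptive" here.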

Description of HSI Datasets
Indian Pines (IP): As a famous dataset for HSI classification, the IP dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over a remote sensing test site in northwestern Indiana, USA, in 1992. It is composed of 200 valid bands with a spectral range from 0.4 to 2.5 µm after discarding 20 water absorption bands. The IP image has 145 × 145 pixels with a spatial resolution of 20 m per pixel, and 16 vegetation classes are considered, e.g., alfalfa, oats, wheat, woods, etc. The ground-truth map, the false-color image and the corresponding color labels are given in Figure 6.
Pavia University (UP): The UP image has 610 × 340 pixels with a spatial resolution of 1.3 m per pixel, and 9 feature categories are used, such as trees, gravel, bricks, etc. The ground-truth map, the false-color image and the corresponding color labels are shown in Figure 7.
Salinas Valley (SV): The SV dataset was obtained by the AVIRIS sensor over an agricultural region of the Salinas Valley, California, USA, in 1998, and it consists of 204 effective bands with a spectral range from 0.4 to 2.5 µm after discarding 20 bands with a low signal-to-noise ratio (SNR). The SV image has 512 × 217 pixels with a spatial resolution of 3.7 m per pixel, and 16 land cover classes are analyzed, e.g., fallow, stubble, celery, etc. The ground-truth map, the false-color image and the corresponding color labels are displayed in Figure 8.

Setting of Experimental Parameters
The PyTorch deep learning framework is used on a computer with a 2.90 GHz Intel Core i5-10400F central processing unit (CPU) and 16 GB of memory, and the average of five experimental runs is taken as the final classification result. Three evaluation indicators are used to evaluate the performance of the different networks: overall accuracy (OA), average accuracy (AA) and the kappa coefficient (Kappa).
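The three indicators can all be computed from a confusion matrix; a minimal sketch (rows index true classes, columns predicted classes):

```python
import numpy as np

def oa_aa_kappa(confusion):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    oa = np.trace(c) / n                                  # fraction correct
    aa = np.mean(np.diag(c) / c.sum(axis=1))              # mean per-class recall
    pe = np.sum(c.sum(axis=0) * c.sum(axis=1)) / n ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                          # chance-corrected accuracy
    return oa, aa, kappa
```

Kappa discounts agreement expected by chance, which is why it is reported alongside OA for class-imbalanced HSI ground truths.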
As shown in Table 1, 15% of the labeled samples in the IP dataset are used as the training set. Similarly, 5% and 2% of the labeled samples in the UP and SV datasets, respectively, are used as the training sets, with the remaining labeled samples as the testing sets (Tables 2 and 3). To better illustrate the robustness of the network, the performance of the compared networks under different proportions of training samples is shown in Section 5. To verify the effectiveness of the proposed method, several network models are adopted for comparative experiments. SVM [9] is a traditional machine learning method whose optimal parameters are obtained by a grid search algorithm. In addition, comparative experiments are also carried out on several deep-learning-based methods: 3-D CNN [32], FDMFN [42], PresNet [41] and DenseNet [45]. A baseline network (BMNet) composed of three densely naive convolutional blocks and two transition layers, and a dilated convolutional network (DCNet) composed of three densely naive dilated convolutional blocks, are also constructed for the comparison experiments in this paper.
The relevant hyper-parameters of the experiments are set as follows. The patch size for the comparative experiments with other models is set to 11 × 11, and the number of epochs and the batch size are both set to 100. The learning rates of 3D-CNN, FDMFN, DenseNet and PDCNet are set to 0.001, and the learning rate of PresNet is 0.1. The Adaptive Moment Estimation (Adam) optimizer is used for 3D-CNN, FDMFN, DenseNet and PDCNet, while the Stochastic Gradient Descent (SGD) optimizer is used for PresNet. The Cosine Annealing LR scheduler is used in all comparative experiments.
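As a hedged configuration sketch, these settings map onto PyTorch as follows; the `model` here is a placeholder `torch.nn.Module`, not the actual PDCNet definition:

```python
import torch

model = torch.nn.Linear(200, 16)  # placeholder for PDCNet or a compared model

# Adam with lr = 0.001 for 3D-CNN, FDMFN, DenseNet and PDCNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# For PresNet the paper instead uses SGD with lr = 0.1:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cosine Annealing LR over the 100 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one training pass over mini-batches of size 100 ...
    scheduler.step()  # decay the learning rate along a cosine curve
```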

Influence of Parameters
Growth Rate g: This parameter controls the number of output channels of the convolutional layers. In a DPDC block, the number of output channels increases by g with each PDC layer; for instance, the final number of output channels of a DPDC block with three PDC layers increases by 3g. By adjusting this parameter, the information flow in the network can be controlled flexibly. The growth rate g in PDCNet is set to 52 because it achieves the highest classification accuracy, as shown in Table 4.
Number of DPDC Blocks: The influence of the number of DPDC blocks on classification accuracy is shown in Table 6; note that here the number of PDC layers in each block is fixed to 3. PDCNet with 2 DPDC blocks has the highest OA on the IP dataset, while PDCNet with 3 DPDC blocks has the highest OA on the UP dataset and the highest accuracy on the SV dataset.
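The channel bookkeeping implied by the growth rate can be checked directly; the block input width used below is an illustrative assumption:

```python
def block_output_channels(c_in, g, num_pdc_layers):
    """Channels after a DPDC block: channel stacking adds g channels
    per PDC layer on top of the block's input channels."""
    return c_in + g * num_pdc_layers

# With g = 52 and three PDC layers, a block adds 3 * 52 = 156 channels.
print(block_output_channels(52, 52, 3))  # 208
```

This is why the 1 × 1 Conv in the transition layer matters: without it, the channel count would keep compounding from block to block.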
From the perspective of accuracy, comparing Tables 5 and 6, we choose PDCNet with 2 DPDC blocks and 3 PDC layers per block as the optimal PDCNet on the IP dataset, PDCNet with 3 DPDC blocks and 3 PDC layers per block as the optimal PDCNet on the UP dataset, and PDCNet with 3 DPDC blocks and 2 PDC layers per block, which has the highest accuracy, on the SV dataset.
Patch Size: The impact of different patch sizes on the overall accuracy of the network is shown in Table 7. The network proposed in this paper achieves good results under all of the tested patch sizes. When the patch size is 11 × 11, PDCNet has the highest accuracy on the UP dataset and good performance on the other datasets. Considering additionally the impact of patch size on training time, the patch size of the PDCNet model is set to 11 × 11.

Ablation Experiments
As shown in Figure 3, three different blocks are designed in this paper. BMNet: BMNet is constructed by stacking three densely naive convolutional blocks (Figure 3a) and two transition layers, where the dilation factor of each convolutional layer is 1 and three naive convolutional layers are densely connected to form each block. DCNet: Compared with BMNet, DCNet is constructed by stacking three densely naive dilated convolutional blocks (Figure 3b) and two transition layers, where the dilation factor (d = 2^(k−1)) of each dilated convolutional layer increases in turn, yielding a larger receptive field; however, there are blind spots in the receptive field. PDCNet: In order to reasonably increase the receptive field without introducing blind spots, the PDC layer is proposed, which contains sub-dilated convolutional layers with different dilation factors (Figure 3c). Each DPDC block is composed of three PDC layers, whose width increases with depth like a pyramid, and the basic structure of PDCNet consists of three DPDC blocks and two transition layers stacked alternately. To illustrate the effectiveness of the proposed network, BMNet, DCNet and PDCNet are evaluated under the same parameter settings (i.e., patch size, learning rate, growth rate, etc.).
The overall accuracy of BMNet, DCNet and PDCNet under different proportions of training samples is shown in Figure 9. The overall accuracy on the IP dataset is shown in Figure 9a, where the overall accuracy of PDCNet is represented by the red line; it is the highest among the compared models. As depicted in Figure 9b, as the proportion of training samples increases, the overall accuracies of the three networks become closer and closer, but on the whole, PDCNet still shows good performance. As shown in Figure 9c, the overall accuracy of PDCNet is much higher than that of the other networks with 2% of the training samples. The OA, AA and Kappa of BMNet, DCNet and PDCNet with the same hyper-parameter settings on the three datasets (IP, UP and SV) are shown in Figure 10. Overall, the proposed network has the best classification performance.

Classification Results (IP Dataset)
The classification results of the PDCNet framework and other comparison methods on the IP dataset are shown in Table 8. Correspondingly, Figure 11 shows the classification maps of the model designed in this paper and the other models, where Figure 11a,b are the false-color image and the ground truth, respectively. Clearly, compared with the other networks, the model designed in this paper has higher accuracy.
As shown in Table 8, the OA, AA and Kappa of the proposed network (PDCNet) are 99.47%, 99.03% and 99.39%, respectively. According to the classification results of Alfalfa (class 1), the accuracy of PDCNet reaches 97.95%, which is higher than that of other models.
Compared with SVM, 3-D CNN, FDMFN, PresNet and DenseNet, the overall accuracy of the proposed network is increased by 14.93%, 2.22%, 1.00%, 0.73% and 0.35%, respectively. The average accuracy and kappa coefficient are also improved to different degrees. As depicted in Figure 11, the classification accuracy of SVM is poor, and there is a lot of noise and many spots in its classification map. 3-D CNN has a poor ability to process edge information, which leads to edge classification errors for many categories, such as Corn-notill (class 2) and Soybean-notill (class 10). FDMFN, PresNet, DenseNet and PDCNet have better classification performance, but FDMFN classifies Alfalfa and Corn-notill poorly. Furthermore, PresNet cannot correctly classify the edges of Soybean-mintill (class 11) and Buildings-Grass-Trees-Drives (class 15). DenseNet achieves accuracy close to that of PDCNet, but its classification map contains more internal noise than that of PDCNet; this problem can be avoided by setting the dilation factors of the dilated convolutions reasonably.
While the PDC layer acquires a larger receptive field, it also ensures the continuity of spatial information, which effectively reduces noise pollution in the receptive field. Therefore, compared with the classification results of other models, the classification map of PDCNet (Figure 11) has less noise and fewer spots on the IP dataset.
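The absence of blind spots under exponentially increasing dilated factors can be illustrated with a small 1-D sketch (the helper names below are illustrative, not from the paper): stacking 3 × 3 convolutions with dilated factors 1, 2 and 4 covers every input position inside the receptive field, whereas a single convolution with a large dilated factor leaves gaps.

```python
from itertools import product

def coverage(dilations, k=3):
    """1-D offsets of input pixels that influence one output pixel
    after stacking k-tap convolutions with the given dilated factors."""
    taps = [range(-(k // 2) * d, (k // 2) * d + 1, d) for d in dilations]
    return {sum(choice) for choice in product(*taps)}

def blind_spots(dilations, k=3):
    """Input positions inside the receptive field that are never sampled."""
    cov = coverage(dilations, k)
    lo, hi = min(cov), max(cov)
    return sorted(set(range(lo, hi + 1)) - cov)

# Exponentially increasing dilated factors (as in a PDC layer): no gaps.
print(blind_spots([1, 2, 4]))   # []
# A single dilated convolution with a large factor: gridding gaps appear.
print(blind_spots([4]))         # [-3, -2, -1, 1, 2, 3]
```

The same counting argument extends to 2-D, where the gaps of a single dilated convolution form the checkerboard "gridding" pattern mentioned above.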

Classification Results (UP Dataset)
The classification results of the proposed network and other comparison methods on the UP dataset are reported in Table 9. Correspondingly, Figure 12 shows the classification maps of the PDCNet model and other models, where Figure 12a,b are the false color image and the ground truth, respectively. In summary, the model proposed in this paper has the highest accuracy compared to the other networks. As shown in Table 9, the OA, AA and Kappa of the designed network (PDCNet) reach 99.82%, 98.67% and 99.76%, respectively. Compared with SVM, 3-D CNN, FDMFN, PresNet and DenseNet, the Kappa coefficient of PDCNet is improved by 12.50%, 2.06%, 0.71%, 0.65% and 0.11%, respectively. The overall accuracy and average accuracy are also improved to different degrees. As depicted in Figure 12, there is considerable noise in the Gravel (class 3), Bare Soil (class 6) and Bitumen (class 7) areas of the SVM classification map. Relatively speaking, the methods based on deep learning reduce noise on the classification map of the UP dataset. However, 3-D CNN, FDMFN and PresNet still produce unsatisfactory classification results on Gravel and Bitumen. Although DenseNet performs better on Bitumen, it still clearly misclassifies Gravel. It is worth noting that PDCNet performs well on areas that are difficult to classify, such as Gravel, Bare Soil and Bitumen.
Compared with the single feature fusion method of DenseNet, the feature fusion mechanism that combines pixel-by-pixel addition and channel stacking applied in PDCNet is more effective, and a larger receptive field is captured by dilated convolution. The spectral–spatial features obtained by PDCNet are more abstract and comprehensive, which makes it possible to accurately classify some areas that are more difficult to distinguish.
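As a sketch of the two fusion primitives (the exact wiring inside PDCNet is not reproduced here; shapes and variable names are illustrative): pixel-by-pixel addition requires matching channel counts and preserves the shape, while channel stacking concatenates feature maps along the channel axis and therefore grows the channel dimension.

```python
import numpy as np

# Two feature maps of shape (channels, height, width); values are dummies.
x = np.random.rand(16, 9, 9)
y = np.random.rand(16, 9, 9)

# Pixel-by-pixel addition: channel counts must match; shape is preserved.
added = x + y                               # shape (16, 9, 9)

# Channel stacking: concatenate along the channel axis; channels grow.
stacked = np.concatenate([x, y], axis=0)    # shape (32, 9, 9)

# One possible hybrid fusion: stack the element-wise sum with one branch.
fused = np.concatenate([added, y], axis=0)  # shape (32, 9, 9)
```

Addition mixes information without increasing the channel budget, while stacking keeps both branches intact at the cost of more channels; combining the two yields features that are both mixed and preserved.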

Classification Results (SV Dataset)
The classification results of the network proposed in this paper (PDCNet) and other comparison methods on the SV dataset are shown in Table 10. Correspondingly, Figure 13 shows the classification maps of PDCNet and other models, where Figure 13a,b are the false color image and the ground truth, respectively. In short, the proposed model achieves higher accuracy than the other networks on the SV dataset.
As shown in Table 10, the OA, AA and Kappa of the proposed network (PDCNet) are 99.18%, 99.62% and 99.08%, respectively. Compared with SVM, 3-D CNN, FDMFN, PresNet and DenseNet, the overall accuracy of PDCNet is improved by 8.52%, 5.43%, 1.56%, 1.02% and 1.06%, respectively. The average accuracy and Kappa coefficient are also improved to different degrees. As depicted in Figure 13, SVM cannot classify Grapes_untrained (class 8) and Vinyard_untrained (class 15) well, and there is serious noise pollution in the classification areas of these categories. Although 3-D CNN alleviates the problem of noise pollution to a certain extent, it is more sensitive to edge information, such as Soil_vinyard_develop (class 9), Lettuce_romaine_7wk (class 14) and Corn_senesced_green_weeds (class 10). In addition, for Grapes_untrained and Vinyard_untrained, PDCNet shows less pollution and higher classification accuracy than FDMFN, PresNet and DenseNet.
The higher classification results are mainly attributed to the combination of two ideas in PDCNet. Firstly, the blind spot problem in the receptive field is solved by setting the dilated factor reasonably and increasing the network width like a pyramid, which makes the classification map have less noise and spots. Secondly, the feature fusion method of the hybrid mode is adopted to obtain richer and comprehensive feature information. Furthermore, a larger receptive field is acquired through dilated convolution, which allows the edge features of each category to be better distinguished.
From the perspective of experimental results, compared with some traditional classification methods, the PDCNet proposed in this paper shows the best classification results on the three datasets. Firstly, PDCNet obtains the highest classification accuracy on all three datasets. Secondly, the classification maps of PDCNet on the three datasets suffer the least pollution, containing the least noise and the fewest spots. From the point of view of the network structure, we introduce dilated convolution and short connections in the DPDC block; while obtaining a larger receptive field, this also eliminates the problem of blind spots caused by dilated convolution, which allows PDCNet to obtain more continuous and comprehensive spatial information.
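Two quantities behind this design can be sketched as back-of-the-envelope formulas (the helper names and the example channel numbers are illustrative, not taken from the paper): the receptive field gained by stacking 3 × 3 dilated convolutions, and the channel growth produced by dense connections, where each layer receives the concatenation of all preceding feature maps.

```python
def receptive_field(dilations, k=3):
    """Receptive field of stacked k x k convolutions: each layer with
    dilated factor d enlarges the receptive field by (k - 1) * d."""
    return 1 + sum((k - 1) * d for d in dilations)

def dense_channels(c0, growth, layers):
    """Channel count seen at the input of each layer in a densely
    connected block where every layer emits `growth` new feature maps."""
    return [c0 + i * growth for i in range(layers + 1)]

print(receptive_field([1]))        # 3 (a plain 3x3 convolution)
print(receptive_field([1, 2, 4]))  # 15, far larger at the same depth
print(dense_channels(32, 16, 3))   # [32, 48, 64, 80]
```

Three stacked convolutions with dilated factors 1, 2 and 4 thus cover a 15 × 15 neighborhood, where three plain 3 × 3 convolutions would cover only 7 × 7, while dense connections let each layer reuse every earlier feature map at the cost of linear channel growth.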

Comparison with Another Segmentation Method
In this section, we use the PDCNet model structure shown in Figure 5 to conduct a comparative experiment with another hyperspectral image segmentation method (DeepLab v3+) [49] on the UP and KSC datasets. The corresponding classification results are shown in Table 11. We randomly select 5% of the labeled training samples in the UP and KSC datasets.
As shown in Table 11, the network proposed in this paper and DeepLab v3+ achieve similar OA and Kappa on the KSC dataset, although the AA of PDCNet is lower than that of DeepLab v3+. It is worth noting that on the UP dataset the accuracy of PDCNet is higher than that of DeepLab v3+: its OA, AA and Kappa are 0.72%, 0.31% and 0.95% higher, respectively.

Influence of Training Samples
Different proportions of training samples on the IP, UP and SV datasets are adopted to measure the performance of different networks. The overall accuracies of SVM, 3-D CNN, FDMFN, PresNet, DenseNet and PDCNet are shown in Table 12. Note that PDCNet with 3 DPDC blocks (3 PDC layers in each block) is used for this comparison. On the IP dataset, the network proposed in this paper is 0.79%, 0.66%, 0.69% and 0.45% higher than PresNet, and 0.26%, 0.27%, 0.31%, 0.31% and 0.14% higher than DenseNet. The overall accuracy of PDCNet is also improved on the UP and SV datasets. The designed network shows good overall accuracy under different proportions of training samples. Table 13 shows the running time and parameters of the different networks on the IP, UP and SV datasets, again using PDCNet with three DPDC blocks and three PDC layers in each block. Since a PDC layer in PDCNet can contain several sub-dilated convolutional layers, the training time of the proposed network is longer than that of the other networks. In addition, the proposed network has more parameters than 3-D CNN and FDMFN, but fewer than DenseNet and PresNet.

Conclusions
In this paper, we propose a densely connected pyramidal dilated convolutional network for hyperspectral image classification, which can capture more comprehensive spatial information. Firstly, the PDC layer is composed of different numbers of dilated convolutions with different dilated factors to obtain receptive fields of multiple scales. Secondly, in order to eliminate blind spots in the receptive field, we densely connect different numbers of PDC layers to form a DPDC block. The classification maps on the three datasets show that the classification map of PDCNet suffers the least pollution and contains the least noise and fewest spots, which is mainly due to the design of the DPDC block. Finally, a hybrid feature fusion mechanism of pixel-by-pixel addition and channel stacking is applied in PDCNet to improve the discriminative power of the features; this is another reason for the good classification accuracy. In addition, the experimental results on three datasets show that our method obtains good classification performance compared with other popular models.
Since we have increased the width of the network, the training time of PDCNet is relatively long; in future work, methods to reduce the computational cost will be considered and applied to the network. In addition, in order to obtain more abstract spectral–spatial features, some new methods will be considered, such as channel shuffling and the use of more frequency-domain information in the pooling layer.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Three public datasets used in this paper can be found and experimented at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 3 June 2021).