Deep&Dense Convolutional Neural Network for Hyperspectral Image Classiﬁcation

Abstract: Deep neural networks (DNNs) have emerged as a relevant tool for the classiﬁcation of remotely sensed hyperspectral images (HSIs), with convolutional neural networks (CNNs) being the current state-of-the-art in many classiﬁcation tasks. However, deep CNNs present several limitations in the context of HSI supervised classiﬁcation. Although deep models are able to extract better and more abstract features, the number of parameters that must be ﬁne-tuned requires a large amount of training data (using small learning rates) in order to avoid the overﬁtting and vanishing gradient problems. The acquisition of labeled data is expensive and time-consuming, and small learning rates force gradient descent to take many small steps to converge, slowing down the runtime of the model. To mitigate these issues, this paper introduces a new deep CNN framework for spectral-spatial classiﬁcation of HSIs. Our newly proposed framework introduces shortcut connections between layers, in which the feature maps of inferior layers are used as inputs of the current layer, feeding its own output to the rest of the upper layers. This leads to the combination of various spectral-spatial features across layers that allows us to enhance the generalization ability of the network with HSIs. Our experimental results with four well-known HSI datasets reveal that the proposed deep&dense CNN model is able to provide competitive advantages in terms of classiﬁcation accuracy when compared to other state-of-the-art methods for HSI classiﬁcation.


Introduction
The goal of this section is to introduce a new deep neural model for remote sensing data processing, aimed at conducting classification of hyperspectral images (HSIs). To present the rationale and objectives of this work, this section introduces the challenges surrounding hyperspectral data processing, focusing on the classification task and indicating the most widely used classification methods and algorithms. In particular, this section highlights the use of deep learning (DL) strategies for data analysis, emphasizing those methods based on convolutional neural networks (CNNs) as the current state of the art in the DL field. This section also points out the limitations of these methods when working with complex HSI datasets and very deep architectures, describing recent improvements developed for very deep neural networks. Finally, the section enumerates the goals and contributions of the current work.

Hyperspectral Imaging Concept and Missions
The combination of spectroscopy and photography technologies in current imaging spectrometers allows for the acquisition of spatial-spectral features capturing the visible and solar-reflected infrared (near-infrared and short-wavelength infrared) spectrum at different wavelength channels for different locations in an image plane. Particularly, in the remote sensing field this image plane is usually obtained over an observation area on the surface of the Earth by airborne and spaceborne spectrometers [1], which produce large amounts of data (often on the order of gigabytes per hour) at increasingly fine spatial resolution. In the category of aerial spectrometers we can highlight, for instance, the Airborne Visible InfraRed Imaging Spectrometer (AVIRIS) [2], which records information in the range of 0.4-2.45 µm in 224 spectral bands, providing continuous spectral coverage at intervals of 10 nm over the spectrum. Another well-known aerial spectrometer is the Reflective Optics System Imaging Spectrometer (ROSIS-3) [3], which covers the region from 0.43-0.86 µm with more than 100 spectral bands at intervals of 4 nm. Other aerial spectrometers are the Hyperspectral Digital Imagery Collection Experiment (HYDICE) [4], the HyMAP [5,6], the PROBE-1 [7], and the family of spectrometers composed of the Compact Airborne Spectrographic Imager (CASI), which captures information in the visible/near-infrared regions, and its related SASI and TASI, which capture information in the short-wave infrared and in the thermal infrared, respectively [8][9][10]. Focusing on spaceborne sensors, we can highlight the Earth Observing-1 (EO-1) Hyperion [11,12], which also records information in the range 0.4-2.5 µm with 10 nm spectral resolution, obtaining data cubes with 220 spectral bands.
Other spectrometers deployed on satellite platforms are the Moderate Resolution Imaging Spectroradiometer (MODIS) [13], the Fourier Transform HyperSpectral Imager (FTHSI) [14], and the CHRIS Proba [15]. New missions include the German Environmental Mapping and Analysis Program (EnMAP) [16], the Italian Precursore IperSpettrale della Missione Applicativa (PRISMA) program [17], the commercial Space-borne Hyperspectral Applicative Land and Ocean Mission (SHALOM) [18], the NASA Hyperspectral Infrared Imager (HyspIRI) [19], and the Japanese Hyperspectral Imager Suite (HISUI) [20]. These missions are expected to generate a continuous stream of HSI data [21,22]. For instance, AVIRIS can collect nearly 9 GB/h, while Hyperion can collect 71.9 GB/h in many different locations over the world, which generates large HSI repositories [23] characterized by their heterogeneity and complexity.

Hyperspectral Image Classification
The data acquired by imaging spectrometers is a collection of images that measure the spectrum in n_bands narrow and contiguous spectral bands [24]. Usually, in hyperspectral images the parameter n_bands is in the order of hundreds or thousands, covering a wide spectral range of frequencies [25]. As a result, each pixel provides a spectral signature that contains a highly detailed and unique representation of the reflectance for each captured land-cover material. This leads to a better discrimination among the different materials contained in the image, allowing hyperspectral imagery (HSI) to serve as a tool for the analysis of the surface of the Earth in many applications [26][27][28][29]. The analysis of HSIs involves a wide range of techniques, including classification [29,30], spectral unmixing [31][32][33][34], and target and anomaly detection [35][36][37][38]. In recent years, HSI classification has become a popular research topic in the remote sensing field [39]. Given an HSI data cube X ∈ R^(n_1×n_2×n_bands), the goal of classification is to assign a unique class label (from a set of predefined classes) to each pixel vector in the scene. Let us denote by x_i = [x_{i,1}, x_{i,2}, ..., x_{i,n_bands}] ∈ R^(n_bands) a pixel vector in the scene, with i = 1, 2, ..., n_1 × n_2 [40]. Several efforts have been made in order to develop effective and efficient methods for HSI classification. Traditionally, the spectral information contained in each HSI pixel has allowed standard pixel-wise methods to achieve an improved characterization of land cover targets, i.e., by processing each HSI pixel as an independent entity whose content is considered to be a pure spectral signature, related to only one surface material.
These kinds of methods are known as spectral-based approaches [41], where the HSI X is considered to be a collection of n_1 × n_2 pixel vectors of dimension n_bands without a specific spatial arrangement, so that the classification process only takes into account the information contained in individual pixels x_i.
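As a minimal illustration of this spectral-based view (toy dimensions, not tied to any of the scenes used later), the cube can be flattened into independent pixel vectors:

```python
import numpy as np

# Toy cube standing in for a real HSI scene (dimensions are illustrative).
n1, n2, n_bands = 4, 5, 8
X = np.random.rand(n1, n2, n_bands)

# A spectral-based classifier ignores the spatial arrangement: the cube is
# flattened into n1 * n2 independent pixel vectors x_i in R^{n_bands}.
pixels = X.reshape(n1 * n2, n_bands)

# Pixel i corresponds to spatial location (i // n2, i % n2) in the scene.
i = 7
assert np.allclose(pixels[i], X[i // n2, i % n2])
```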
In the literature, several unsupervised and supervised pixel-wise classifiers have demonstrated good performance in terms of classification accuracy. On the one hand, unsupervised methods do not require labeled data, performing the classification based on the inherent similarities present in the data structure, which is usually measured as the distance between the features of each sample x_i, separating the data into groups accordingly and assigning a label to each group (clustering). The most representative unsupervised methods are k-means [21], k-nearest neighbors (KNN) [42], and the iterative self-organizing data analysis technique algorithm (ISODATA) [43][44][45]. In turn, supervised methods generally obtain better performance than their unsupervised counterparts by learning a function Ŷ = f(X) that models the relationship between the data and the output categories, using a training set made up of labeled samples D_train = {x_i, y_i}_{i=1}^{n_label}. Popular supervised methods are decision trees (DTs) and random forests (RFs), which have been successfully used in the past, providing good and accurate land cover maps [46][47][48]; the multinomial logistic regression (MLR) [49] and support vector machines (SVMs) [50,51], which are able to perform accurately in the presence of limited training sets (the former easily models the posterior probability distributions of the data, while the latter produces highly accurate results, even with large numbers of classes in high-dimensional feature spaces); Bayesian estimation models [52], which provide a flexible framework to represent the probabilistic features of HSI data and efficiently exploit prior knowledge about it; kernel-based methods [53,54], which present great simplicity in data modeling and effective performance over traditional learning techniques (although single-kernel approaches tend to overfit when dealing with high-dimensional data coupled with a sparsity of training samples [55]);
extreme learning machines (ELMs) [56], and artificial neural networks (ANNs) [26]. These methods face challenges related to the high spectral dimensionality of HSI data and the limited availability of training samples [54]. In fact, supervised classifiers tend to suffer from the Hughes phenomenon [57][58][59] when dealing with HSI data, where the obtained classification accuracy increases gradually at the beginning as the number of spectral bands increases, but decreases dramatically when n_bands exceeds a limit, introducing the need for dimensionality reduction methods in order to reduce n_bands. Some of these methods include principal component analysis (PCA) [60,61], independent component analysis (ICA) [62], and the maximum noise fraction (MNF) [63,64]. For instance, the SVM and MLR methods can deal with small D_train sets in a robust way [65] and exhibit good performance with high-dimensional data [66].
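The PCA-based band reduction mentioned above can be sketched in a few lines of NumPy (the SVD route shown here is one standard implementation; dimensions are illustrative):

```python
import numpy as np

def pca_reduce(pixels, k):
    """Project n_bands-dimensional pixel vectors onto the k leading
    principal components (a common remedy for the Hughes phenomenon)."""
    mean = pixels.mean(axis=0)
    centered = pixels - mean
    # SVD of the centered data matrix; rows of vt are the principal
    # directions, sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

pixels = np.random.rand(100, 30)          # toy data: 100 pixels, 30 bands
reduced = pca_reduce(pixels, 5)           # keep only 5 components
assert reduced.shape == (100, 5)
```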

Deep Neural Networks for Hyperspectral Image Classification
Traditional classification methods present some limitations when compared to ANNs. In fact, ANNs present more flexible architectures, which are able to scale much better with larger amounts of data, exhibiting great generalization power without the need for prior knowledge about the statistical distribution of the data. In this sense, deep ANN architectures [67,68] (also called deep neural networks or DNNs) have attracted broad attention in HSI data classification due to their ability to extract more abstract data representations from the original data in a hierarchical way. In other words, DNNs are able to learn simple features in the first (low-level) layers and then build more complex features in the last (high-level) layers by merging the simpler ones [69]. Particularly, convolutional neural networks (CNNs) [70] have become a widely representative deep model due to their feature detection power, which drastically improves the classification and detection of objects. As a result, the CNN is able to reach good generalization in HSI classification [71][72][73]. In particular, its kernel-based architecture naturally integrates the spectral and spatial information contained in the HSI, taking into account not only the spectral signature of each pixel x_i but also the spectral information of a d × d neighborhood (also called a patch) that surrounds it, denoted by p_i ∈ R^(d×d×n_bands). Spectral-spatial classification techniques are a tool of choice because it is well known that the simultaneous consideration of spectral and spatial-contextual information can significantly improve the performance of HSI classifiers [40], reinforcing the learning process of the classifier by adding more information provided by adjacent pixels (which commonly belong to the same class). This additional information helps to reduce label uncertainty, reducing the intra-class variance present in HSI pixels.

Convolutional Neural Networks and Their Limitations in Hyperspectral Image Classification
The aforementioned characteristics have placed the CNN model as the current state-of-the-art in Deep Learning (DL) frameworks for HSI classification, with increasing interest in improving the network architecture in order to adapt the model to large and complex datasets such as HSIs. This is because the pixels in an HSI often present intraclass variability and interclass similarity, together with high correlation between adjacent bands and high data redundancy due to the large number of spectral bands n_bands. The tendency of DL models is to develop very deep network architectures, increasing the number of layers under the assumption that deeper architectures are more expressive than shallow ones, allowing for the extraction of more complex and high-level relationships between input data sets [74], which leads to improvements in model accuracy and performance [75,76].
However, deep stacks of CNN building blocks (i.e., convolutional, pooling or fully connected layers, for instance) are hard to train [77], requiring a D_train with a large number of training samples in order to correctly fine-tune all the network's parameters and avoid the overfitting problem. DL-based approaches could lead one to think that the deeper the model is, the better the performance that can be achieved. However, although deep architectures are able to produce more high-level features at the end of the network thanks to their hierarchical structure [71,78,79], helping the classifier to make better decisions, the fact is that excessively increasing the network depth results in several negative effects [80], degrading the accuracy of the network. In particular, supervised DNN models such as the standard CNN are very susceptible to overfitting for two main reasons: (i) the lack of available labeled samples to train the model, and (ii) the large number of parameters that must be fine-tuned [81,82]. As a result, the model performs properly on training data, learning the details and features present in these samples, but its results over the testing or held-out data are quite poor [83], as the previously learned features do not help the model's performance and can even have a negative impact on its ability to generalize when new data must be processed. This can get worse due to the complexity of HSI data, composed of hundreds or thousands of spectral channels, which gives rise to the curse of dimensionality [57]. Usually, techniques such as data augmentation [84], active learning [85], or regularization methods (such as dropout [86] or L1/L2 regularization) have been employed in order to minimize this problem. However, deep CNN models still have to face the vanishing gradient problem, which arises from the backpropagation mechanism used for training an ANN or DNN.
In this sense, it must be taken into account that training is composed of two main steps:

1. The forward pass, where data is passed through the network until it reaches the final layer, whose output is used to calculate an optimization or cost function, normally the difference between the desired outputs y_i ∈ Y and the obtained ones ŷ_i ∈ Ŷ, calculated as the cross-entropy of the data φ_c = −∑_i y_i log₂(ŷ_i), which is normally optimized by a stochastic gradient descent method.

2. The backward pass, where the obtained gradient signal must be backpropagated through the network in order to ensure that the model's parameters are properly updated. However, this gradient fades slightly as it passes through each layer of the CNN, which in very deep networks produces its practical disappearance, or vanishing. As a result, the accuracy of deep CNNs saturates and then degrades rapidly. To avoid this problem, models implement their optimizer with a very small learning rate, making the training more reliable but forcing the gradient descent to perform many small steps until convergence.
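The cross-entropy cost used in the forward pass can be sketched as follows (a NumPy toy example; the base-2 logarithm follows the formula above, though implementations often use the natural logarithm):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """phi_c = -sum_i y_i * log2(y_hat_i), averaged over the batch."""
    y_pred = np.clip(y_pred, eps, 1.0)    # avoid log(0)
    return -np.mean(np.sum(y_true * np.log2(y_pred), axis=1))

# Toy targets: one-hot labels for 3 samples and 4 classes, against
# softmax-like predicted probabilities (values are illustrative only).
y_true = np.eye(4)[[0, 2, 1]]
y_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.7, 0.1, 0.1]])
loss = cross_entropy(y_true, y_pred)      # = -log2(0.7) for every sample
```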
Several solutions have been proposed in the literature with the aim of avoiding the vanishing gradient problem. These solutions can be categorized into four main tendencies: (i) developing new and better initialization strategies [77,[87][88][89], (ii) modifying the network's non-linear activation function [90,91], (iii) developing better optimizers [92][93][94], and (iv) changing the network topology. In this last category, several efforts have recently been made in order to increase the depth of the CNN architecture. For instance, Lin et al. [95] present a Network in Network (NIN) architecture, where multilayer perceptrons (MLPs) are inserted between convolutional layers. In fact, the architecture of NIN-based MLPs can be understood as a set of additional 1 × 1 convolutional layers followed by an activation function. This approach inspired the work of Szegedy et al. [96], where the inception architecture is presented as a stack of inception modules, each one composed of n parallel streams (which in turn are composed of several layers) whose outputs are merged by concatenation [97]. On the other hand, some works focus on increasing the network's depth by creating short paths from low-level layers to high-level layers. In this sense, we can highlight the residual neural networks (ResNets) [79,98], which are composed of residual units [99] (groups of several CNN layers whose inputs are connected to their outputs) in order to reformulate the layers as learning residual functions, where identity mappings help to avoid the degradation problem and promote gradient propagation. The hourglass CNN architecture [100,101] follows a similar approach, making extensive use of residual modules. In this case, each hourglass module presents an encoder-decoder architecture, where the encoder's blocks are connected to their decoder counterparts via skip connections, maintaining the overall hourglass shape.
Other deep models with connections between their layers are the ResNet in ResNet [102], the FractalNet [103], and the highway networks [74,104,105]. Also, for HSI classification, we highlight the deep feature fusion network (DFFN) presented by Song et al. [80], which combines the fusion of multiple-layer features with the use of residual blocks, adding levels of information that reduce the "distance" between each layer and the final classifier while enhancing, at the same time, the final performance without sacrificing depth. Another interesting work is presented by Mou et al. [106], who propose a fully Conv-Deconv network with residual blocks and shortcut connections between the convolutional and deconvolutional network branches for unsupervised spectral-spatial feature learning from hyperspectral data. Lee and Kwon [107] also suggest the fusion of features from different levels, developing the deep contextual CNN (DC-CNN), a fully convolutional network (FCN) with 9 convolutional layers that combines, after applying an inception module, feature maps from different layers to form a joint feature map used as input to the subsequent layers. This DC-CNN model was extended by Lee and Kwon [108], increasing the richness and diversity of the extracted information using a multi-scale filter bank at the initial stage of the network. Finally, Zhang et al. [109] propose the diverse region-based CNN (DR-CNN) to encode a semantic context-aware representation of hyperspectral scenes by implementing an architecture of several CNN branches or "multi-scale summation" modules, which combine local fine details and high-level structure information by implementing two cross-layer aggregations through shortcut connections between the top and deep layers.

Contributions of This Work
Inspired by [110], this paper proposes a new deep&dense CNN model architecture for spectral-spatial classification of HSIs. Following the structure of a production line, each convolutional layer of the proposed model extracts an output volume composed of several feature maps. These volumes represent the original input HSI data at different abstraction levels. The proposed architecture presents some important advantages for HSI classification:
• It exploits the rich and diverse amount of information contained in HSI data, integrating the spectral and the spatial-contextual information in the classification process by analyzing, for each sample, its full spectrum and surrounding neighborhood.

• It improves the network's generalization while avoiding the vanishing of the model gradient.
• It combines both low-level and high-level features in the classification process. This is done by concatenating the output volume of each convolutional layer c^(i) with the corresponding inputs of the subsequent high-level layers i + 1, i + 2, ..., i + L.
• It can perform properly in the presence of limited training samples, as will be shown in our experimental assessment.
The remainder of the paper is organized as follows. Section 2 describes the proposed method. Section 3 validates the newly proposed deep&dense model by drawing comparisons with other state-of-the-art HSI classification approaches over four well-known HSI datasets. Finally, Section 4 concludes the paper with some remarks and hints at plausible future research lines.

Classification of Hyperspectral Images Using Traditional Neural Networks
HSI data cubes X ∈ R^(n_1×n_2×n_bands) have traditionally been processed and analyzed as a collection of n_1 × n_2 pixel vectors, each one x_i ∈ R^(n_bands) being an array of n_bands spectral values measured at different wavelengths. In this sense, standard ANNs (such as MLPs) process HSIs in a pixel-by-pixel fashion, so that each pixel x_i is passed through their computational units (neurons or nodes), organized as fully connected layers, from input to output, passing through the hidden ones [111]. Each neuron obtains the corresponding representation of the original input HSI vector by applying a dot product (·) between the connection weights of the previous and current layers (when the current layer is not the input one) and its input data vector as follows:

o^(l)_k = ∑_{j=1}^{N^(l−1)} w^(l)_{j,k} · x^(l)_{i,j} + b^(l),    (1)

where o^(l)_k is the output value of the k-th neuron in the l-th layer, N^(l−1) is the number of neurons in the previous layer l − 1, w^(l)_{j,k} is the weight of matrix W^(l) that connects the k-th neuron in layer l with the j-th neuron in layer l − 1, b^(l) is the bias of layer l, and x^(l)_{i,j} is the incoming data provided by the neurons in l − 1 (i.e., x^(l) = o^(l−1)). This output is passed through a non-linear activation function H(·), implemented by the sigmoid, hyperbolic tangent or the rectified linear unit (ReLU) [112]. As we can observe from Equation (1), a standard ANN with L fully connected layers needs to learn ∑_{l=1}^{L−1} N^(l) · N^(l+1) different and independent weights, which in practice is extremely inefficient in deep architectures for HSI classification, due to the large number of parameters that must be fine-tuned.
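A minimal NumPy sketch of this fully connected forward pass (layer widths are illustrative; ReLU is used as the activation H):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense_forward(x, weights, biases):
    """Forward pass through fully connected layers:
    o^(l) = H(W^(l) x^(l) + b^(l)), with x^(l) = o^(l-1)."""
    o = x
    for W, b in zip(weights, biases):
        o = relu(W @ o + b)
    return o

rng = np.random.default_rng(0)
sizes = [8, 16, 4]                      # toy layer widths, not from the paper
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
bs = [np.zeros(sizes[l + 1]) for l in range(2)]
out = dense_forward(rng.standard_normal(8), Ws, bs)
assert out.shape == (4,) and (out >= 0).all()
```

Counting the entries of Ws shows the quadratic parameter growth discussed above: even this toy network learns 8·16 + 16·4 = 192 independent weights.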

Classification of Hyperspectral Images Using Convolutional Neural Networks
The CNN architecture resorts to locally connected blocks (also called D-dimensional blocks, with D = {1, 2, 3}, or operation layers), applying kernels with parameter sharing over small regions of the input data, which allows the network to look for specific features at any location in the HSI while drastically reducing the number of parameters to train. In this sense, the CNN offers an efficient model, which can be easily adapted to naturally integrate the spectral and spatial-contextual information in HSI classification. This is done by considering a neighborhood region around each x_i instead of only the spectral information contained in x_i. To this end, a spatial window or patch p_i ∈ R^(d×d×n_bands) needs to be extracted for each x_i, usually setting d to an odd value in order to leave each x_i as the central pixel. Also, for those pixels that belong to the image borders, a mirroring technique is applied in order to replicate the spatial information, as described in Paoletti et al. [71]. This design decision implies that the CNN takes each spatial patch p_i as 3-dimensional input data, imposing a 3D structure throughout the entire network (3D-CNN).
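The patch extraction with border mirroring described above can be sketched as follows (toy cube dimensions; NumPy's reflect padding is used as the mirroring technique):

```python
import numpy as np

def extract_patch(X, i, j, d):
    """Return the d x d x n_bands neighborhood p_i centered on pixel (i, j),
    mirroring the scene at the borders (d is assumed odd)."""
    r = d // 2
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)), mode="reflect")
    # After padding, original pixel (i, j) sits at (i + r, j + r), so the
    # d x d window starting at (i, j) is centered on it.
    return Xp[i:i + d, j:j + d, :]

X = np.random.rand(10, 12, 6)           # toy cube: 10 x 12 pixels, 6 bands
p = extract_patch(X, 0, 0, 5)           # corner pixel: mirroring is needed
assert p.shape == (5, 5, 6)
assert np.allclose(p[2, 2], X[0, 0])    # central pixel is x_i itself
```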
Conventional CNNs are composed of several blocks, stacked one after the other, where the output volume of the (l − 1)-th layer, O^(l−1) (composed of several feature maps), is used as the input volume of the next l-th layer (O^(l−1) = X^(l)). The final objective of this stack of layers is to perform the feature extraction process. In this way, the l-th convolutional layer adapts Equation (1) in order to apply its K^(l) kernels over small windows of X^(l), defined by its receptive field k^(l) × k^(l) × q^(l), as follows:

O^(l) = W^(l) * X^(l) + b^(l),    (2)

where W^(l) and b^(l) are the matrix of weights and the bias vector of the l-th layer, and (*) is the convolutional operator. As a result of Equation (2), an output volume O^(l) is generated as a collection of K^(l) feature maps, where the element (i, j, z) of the z-th feature map can be calculated by an element-wise multiplication of the input features and the corresponding weights as follows:

o^(l)_{i,j,z} = ∑_{î,ĵ,t̂} w^(l)_{î,ĵ,t̂} · x^(l)_{i·s+î, j·s+ĵ, t̂} + b^(l)_z,    (3)

where w^(l)_{î,ĵ,t̂} is the (î, ĵ, t̂)-th weight, which performs the dot product between the weight kernel W^(l)_z and the input feature element x that it overlaps, b^(l)_z is the bias of the z-th kernel, and s is the kernel stride, which defines how the connected regions overlap. Then, as in standard ANNs, the output volume O^(l) is sent to a non-linear layer that computes the activation function H(·) of the CNN model, usually the ReLU, so the output volume will be O^(l) = H(W^(l) * X^(l) + b^(l)). Also, a sample-based discretization step can be added, sending the obtained O^(l) to a down-sampling or pooling layer, which is usually employed to perform a spatial reduction without affecting the volume's depth (K^(l)), by applying an average, sum or max pooling operation. This allows for a simplification of the parameters of the network, reducing the chances of overfitting while, at the same time, providing translation invariance to the model. Finally, the resulting output volume is considered to be the new input volume of the next convolutional layer, O^(l) = X^(l+1).
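A naive (loop-based) sketch of the convolutional layer of Equation (2), with valid padding and illustrative kernel sizes:

```python
import numpy as np

def conv_layer(X, W, b, s=1):
    """Naive valid convolution: X is (h, w, q), W holds K kernels of shape
    (K, k, k, q), b is (K,).  Each output element o[i, j, z] is the dot
    product of kernel z with the input window it overlaps, plus the bias."""
    K, k, _, q = W.shape
    h, w, _ = X.shape
    oh, ow = (h - k) // s + 1, (w - k) // s + 1
    O = np.empty((oh, ow, K))
    for z in range(K):
        for i in range(oh):
            for j in range(ow):
                window = X[i * s:i * s + k, j * s:j * s + k, :]
                O[i, j, z] = np.sum(window * W[z]) + b[z]
    return O

X = np.random.rand(7, 7, 3)             # toy input volume
W = np.random.rand(4, 3, 3, 3)          # K = 4 kernels, 3x3 receptive field
O = conv_layer(X, W, np.zeros(4))
assert O.shape == (5, 5, 4)             # K feature maps of reduced size
```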
This stack of convolution-reduction layers normally ends with an MLP made up of several fully connected layers, whose goal is to obtain the final representation Ŷ of the original input data X. The final output is used to calculate an empirical cost function φ_c, which in fact measures the error incurred by the network. Then, in the backward pass, this error φ_c is propagated through all the layers of the CNN in order to compute how much a change in the network's parameters would affect the cost function. Particularly, the backward phase tries to compute how each weight w^(l)_{i,j,t} (derived from a weight kernel W^(l), with l = 1, 2, ..., L being the number of layers) affects φ_c:

∂φ_c/∂w^(l)_{i,j,t} = ∑ (∂φ_c/∂o^(l)_{i,j,t}) · (∂o^(l)_{i,j,t}/∂w^(l)_{i,j,t}),    (4)

where o^(l)_{i,j,t} is obtained by Equation (3). From Equation (4), it is clear that the weights of the CNN are updated in proportion to ∂φ_c/∂w_{i,j,t}, i.e., the partial derivative of the cost function φ_c with respect to the current weight w_{i,j,t}, in each training iteration [113]. However, error signals fade away during the backpropagation phase as they go deeper inside the CNN model. If the gradient is very close to zero (and, usually, traditional activation functions work in the range [0, 1]), this can prevent the weight from changing its value, stopping the learning process.
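The vanishing effect can be illustrated numerically: with a sigmoid activation, whose derivative is at most 0.25, the backpropagated signal shrinks geometrically with depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at
# 0.25 (at z = 0).  Backpropagation multiplies the gradient by one such
# factor per layer, so after 20 layers the signal is essentially gone.
grad = 1.0
for _ in range(20):                     # 20 layers, activations near z = 0
    grad *= sigmoid(0.0) * (1 - sigmoid(0.0))   # = 0.25 per layer
assert grad < 1e-11                     # 0.25**20 is about 9.1e-13
```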

Proposed Deep&Dense Architecture for Hyperspectral Image Classification
To avoid the vanishing gradient problem described in the previous subsection, the proposed deep&dense architecture employs several inter-layer connections to reinforce the learning process. In this sense, the architecture of the proposed model can be regarded as a hierarchy of layers. From the upper levels to the lower levels, the architecture is organized as a collection of M dense blocks, where the m-th block, D^(m), is composed of N densely connected inner blocks, i.e., the input of the n-th inner block allocated into the m-th dense block, B^(m)_n, is composed of the outputs of all the preceding blocks 1, 2, ..., n − 1, while its output is sent as an input to the following inner blocks n + 1, n + 2, ..., N. The main idea behind this model is that, if each inner block B^(m)_n is connected directly (in a feed-forward fashion) to every other inner block in a dense block D^(m), the network model will be easier to train, because each D^(m) will take its decisions by making use of more advanced and elaborate features, reaching a more accurate performance [114]. While standard CNN models have only L connections (corresponding to the number of layers), our model employs N(N + 1)/2 connections per D^(m). Such an increasing number of connections does not imply a growth in the number of model parameters that must be fine-tuned. Quite the opposite: these connections allow their number to be reduced thanks to the amount of redundant information, which encourages the reuse of the output volumes (i.e., using more than once those feature maps obtained by each inner block) and reinforces feature propagation along the network. The reuse of redundant information at different blocks also helps to reduce the overfitting problem, while the connections between the inner blocks have a regularizing effect on the network [110], which also tends to avoid the vanishing gradient effect and performance degradation.
This idea is inspired by the architecture of the ResNet. Following the previous notation and looking at a standard CNN, we can observe that the output volume of the l-th convolutional layer (including the non-linear activation function) can be obtained as:

O^(l) = H(X^(l)).    (5)

As the output volume of the l-th layer in any CNN will feed the next layer, becoming the input volume of the (l + 1)-th layer, we can replace O^(l) by X^(l+1), so Equation (5) can be re-written and simplified as:

X^(l+1) = H(X^(l)).    (6)

From Equation (6) we can observe how the extracted features are hierarchically processed by the successive layers that compose the classical CNN architecture. Instead of that, the ResNet combines the output volumes of previous layers (which are in fact groups of layers called residual units) with those obtained by the top layers. In this case, Equations (5) and (6) can be expressed as O^(l) = H(X^(l)) + X^(l) and, changing O^(l) by X^(l+1) and simplifying, X^(l+1) = H(X^(l)) + X^(l), which can also be expressed as:

X^(l+1) = O^(l) + X^(l),    (7)

where O^(l) = H(X^(l)) contains the convolutions and activations of those layers that compose the l-th residual block, O^(l) and X^(l) being, indeed, the output and input volumes of the entire block. In the end, Equation (7) implies the reuse of those features obtained by previous residual blocks, which also allows the gradient of Equation (4) to be transmitted to previous and earlier blocks of layers in a simpler and easier way. Our proposal follows the same simple idea. However, instead of combining the outputs of previous inner blocks by additive identity transformations as the ResNet does, our model employs the concatenation (⌢) of all the previous feature maps, so each dense block D^(m) can be interpreted as an iterative concatenation of those feature maps extracted by its inner blocks.
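The difference between the additive (ResNet-style) and concatenative (dense) combinations can be sketched as follows (H is an elementwise stand-in for a block's convolutions and activations; shapes are illustrative):

```python
import numpy as np

# H stands in for an inner block's convolutions + activations; elementwise
# here so that any input depth is accepted (a sketch, not the real block).
def H(X):
    return np.maximum(X * 0.5, 0.0)

X1 = np.random.rand(5, 5, 8)                 # input volume of the block

# ResNet-style combination: additive identity mapping, shapes must match.
res_next = H(X1) + X1
assert res_next.shape == X1.shape

# Dense-style combination: concatenate every previous output volume along
# the depth axis, so later blocks see everything computed before them.
O1 = H(X1)                                   # block 1 output (5, 5, 8)
X2 = np.concatenate([O1, X1], axis=-1)       # block 2 input  (5, 5, 16)
O2 = H(X2)                                   # block 2 output (5, 5, 16)
X3 = np.concatenate([O2, O1, X1], axis=-1)   # block 3 input  (5, 5, 32)
```

Note how the dense input depth grows as earlier outputs accumulate, which is what encourages feature reuse across the block.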
In this sense, the input of each B^(m)_n is calculated as X^(n+1) = H(X^(n)) ⌢ H(X^(n−1)) ⌢ · · · ⌢ H(X^(1)), which can also be expressed as:

X^(n+1) = O^(n) ⌢ O^(n−1) ⌢ · · · ⌢ O^(1),    (8)

X^(n) and O^(n) being, in this case, the input and output volumes of each inner block B^(m)_n, while H(·) contains all the convolution and activation operations of every layer that composes the inner block, as we can observe in Figure 1. From Equation (8) we can infer that the internal connections of each dense block strongly encourage the reuse of all the features obtained by every inner block. As a result, all layers in the architecture receive a direct supervision signal [114]. Each inner block B^(m)_n is composed of two packs, where each one implements its corresponding convolution preceded by one normalization and one non-linearity layer, following the standard composition of those building blocks present in He et al. [79], as we can observe in Figure 1. In the following, we describe the main idea behind these layers. Each pack begins with a normalization layer, which normalizes the input data by scaling and adjusting previous activations into the non-linear activation function's range, as shown by Equation (9):

x̂ = γ · (x − µ)/√(σ² + ε) + β,    (9)

where γ and β are learnable parameter vectors of a linear transformation that scales and shifts the input data, µ and σ² are the mean and variance of the input activations, and ε is a parameter for numerical stability. The goal of Equation (9) is to avoid the internal covariance shift [115], where the data distribution in each network layer changes during training, allowing a more independent learning process in each layer. Then, the normalized data are sent to the non-linearity layer, which performs a ReLU function over the input data and feeds a subsequent convolutional layer, which applies Equation (2) in order to obtain its output volume. In the first bottleneck, K^(n)_1 identity kernels of size 1 × 1 × q^(n) are used [95].
Their goal is to combine the data by applying the same operation to the full input volume, changing the depth of the output volume without affecting the spatial dimension [96]. In the second bottleneck K_2^(n), kernels of size 3 × 3 × q^(n) have been implemented in order to extract spatial-spectral features from the data. Precisely, the feature maps obtained by the second pack's convolutional layer are the inner block's final output volume, O^(n), which is concatenated with the outputs of the previous layers to form the next input volume, following Equation (8), i.e., X^(n+1) = O^(n) ⊕ O^(n−1) ⊕ ··· ⊕ O^(1). To perform the actual data concatenation, maintaining the spatial dimensions after the second convolutional layer becomes crucial, so zero-padding has been added before the convolution. Finally, a dropout layer [86] has been added to each inner block for regularization purposes. This layer deactivates a percentage of the activations in order to improve the network's generalization, preventing overfitting.
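A shape-level sketch of one inner block follows. The bottleneck widths (α·g output maps for the 1 × 1 pack and g for the 3 × 3 pack) are assumptions based on the α = 4, g = 32 setting of Table 1, and random weights stand in for the learned kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    # 1 x 1 x q kernels: mix channels without touching the spatial dims
    w = rng.normal(0, 0.01, size=(x.shape[-1], out_ch))
    return np.maximum(np.tensordot(x, w, axes=([-1], [0])), 0.0)

def conv3x3_same(x, out_ch):
    # 3 x 3 x q kernels with zero-padding, so the spatial dims are preserved
    h, w, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    k = rng.normal(0, 0.01, size=(3, 3, x.shape[-1], out_ch))
    out = np.empty((h, w, out_ch))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3], k, axes=3)
    return out

g, alpha = 32, 4
x = rng.normal(size=(11, 11, 64))             # input volume X(n)
o = conv3x3_same(conv1x1(x, alpha * g), g)    # packs K1 then K2 -> O(n)
x_next = np.concatenate([o, x], axis=-1)      # concatenated input for the next block
assert x_next.shape == (11, 11, 64 + g)
```

The key invariant is that both packs leave the 11 × 11 spatial dimensions untouched (the 3 × 3 pack thanks to zero-padding), so the channel-wise concatenation at the end is always well defined.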
In the same way that inner blocks can be stacked within a dense block, dense blocks can also be stacked in the global CNN architecture. To achieve this, intermediate or transition layers T^(m) are added between each pair D^(m−1) and D^(m+1), connecting the feature maps obtained by D^(m−1) with the inner blocks of D^(m+1). Transition layers are also composed of several operations. In particular, they perform a convolution implemented by K^(m) filters of size 1 × 1 × q^(m) (preceded by a normalization layer and a non-linearity layer, and followed by a dropout layer) and a spatial downsampling, which implements an average pooling. The goal of these layers is to halve both the spatial dimensions and the depth of the feature maps, i.e., K^(m) = K^(z)/2 (with K^(z) being the depth of the dense block's input data). In summary, the proposed topology is composed of two main parts: (i) a feature extractor, which implements a first convolutional layer that transforms and prepares the input HSI patches into feature maps and is followed by two dense blocks connected by a transition layer, and (ii) a classifier, composed of a first group of layers that normalize and reduce the obtained feature maps, and a fully connected layer that performs a softmax operation. The proposed topology is summarized in graphical form in Figure 2, while Table 1 details the parameters of the network. The cost function φ has been minimized using the Adam optimizer [116], with 100 epochs and a batch size of 100 samples. The learning rate has been adapted to the different considered HSI datasets, described in the following section. Table 1. Parameters used in the different layers of the proposed network topology. Kernels are indicated as K × k × k × q, while strides are represented by s. The growth rate has been fixed to g = 32, with α = 4.

Classification layers
ReLU, global average pooling with output 1 × 1 × K^(3)*, with K^(3)* = K^(1)*/2 + 16 · g
Fully connected layer of n_classes neurons, with softmax

Figure 2. Architecture of the proposed deep&dense CNN for spectral-spatial HSI data classification. The input patch p_i ∈ R^(d×d×n_bands) is passed through a first convolutional layer C^(0) that transforms the data into an output volume of K^(0) feature maps. This O^(0) feeds the first dense block, D^(1), which is composed of 6 inner blocks (see Figure 1). The obtained output volume is reduced spectrally and spatially by the transition layers T^(2), and then feeds the second dense block D^(3), composed of 16 inner blocks. The final output is normalized and reduced with global average pooling (GAP) [95] after applying a ReLU, calculating an average value for each feature map and obtaining a final 1D-tensor that feeds the final fully connected layer, which performs the final classification.
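The transition layer and the depth bookkeeping of Table 1 can be sketched together. The depth-halving 1 × 1 convolution and the 2 × 2 average pooling follow the description above, while the initial depth K^(0) = 2g is an assumption made only to ground the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(x):
    # 1 x 1 convolution halves the depth...
    out_ch = x.shape[-1] // 2
    w = rng.normal(0, 0.01, size=(x.shape[-1], out_ch))
    x = np.tensordot(x, w, axes=([-1], [0]))
    # ...and a 2 x 2 average pooling halves the spatial dimensions
    h, wd, c = x.shape
    return x.reshape(h // 2, 2, wd // 2, 2, c).mean(axis=(1, 3))

g = 32
K0 = 2 * g                    # assumed depth after the first convolution C(0)
K1 = K0 + 6 * g               # after D(1): 6 inner blocks, each appends g maps
x = rng.normal(size=(10, 10, K1))
y = transition(x)             # T(2): depth and spatial dims halved
K3 = y.shape[-1] + 16 * g     # after D(3): 16 blocks -> K(3)* = K(1)*/2 + 16*g
assert y.shape == (5, 5, K1 // 2) and K3 == K1 // 2 + 16 * g
```

Under the assumed K^(0), the depths evolve as 64 → 256 → 128 → 640 through C^(0), D^(1), T^(2) and D^(3), matching the K^(3)* = K^(1)*/2 + 16·g formula above.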

Experimental Settings
The experiments have been carried out in a hardware environment composed of a 6th Generation Intel Core i7-6700K processor (8 MB of cache, up to 4.20 GHz, 4 cores/8 threads), 40 GB of DDR4 RAM at 2400 MHz, an NVIDIA GeForce GTX 1080 graphics processing unit (GPU) with 8 GB of GDDR5X video memory and 10 Gbps memory frequency, a Toshiba DT01ACA 7200 RPM HDD with 2 TB of storage, and an ASUS Z170 Pro Gaming motherboard. The software environment comprises Ubuntu 16.04.4 x64 as the operating system, CUDA 9 and cuDNN 5.1.5, TensorFlow [117], and Python 2.7 as the programming language.

Hyperspectral Datasets
To demonstrate the performance of the proposed deep&dense model for HSI classification, four well-known datasets have been used: AVIRIS Indian Pines (IP), ROSIS University of Pavia (UP), AVIRIS Salinas Valley (SV) and AVIRIS Kennedy Space Center (KSC). Table 2 provides a summary of these datasets, indicating the number of labeled samples per class and the available ground-truth information.

Results and Discussion
Several experiments have been carried out with the goal of evaluating the performance of the proposed method. At this point, it is important to indicate the adopted learning rates, which have been set to 0.001 for the IP and KSC datasets and to 0.0008 for the UP and SV datasets, based on their spectral characteristics.

Experiment 1: Comparison between the Proposed Model and Standard HSI Classifiers
Our first experiment conducts an experimental comparison between the proposed method and a total of five different and well-known classification methods available in the HSI literature. As Tables 3-5 indicate, we have compared the proposal with three different kinds of classifiers: (i) three spectral-based classifiers, i.e., the SVM with radial basis function kernel (SVM-RBF) [118], the RF, and the MLP; (ii) a spatial-based classifier, the 2D-CNN, for which the data are reduced to a single component using PCA, so that each patch p_i ∈ R^(d×d); and (iii) a spectral-spatial classifier, the 3D-CNN [71], which receives the complete patch p_i ∈ R^(d×d×n_bands), just like the proposed method.
Tables 3-5 report the classification accuracies obtained for the IP, UP and SV datasets, calculated as the average of 5 Monte Carlo runs (also indicating the standard deviation values) and using 15% of the available labeled data (randomly selected per class) to carry out the training phase of the different supervised classifiers. For the spatial and spectral-spatial methods, patches of 11 × 11 pixels have been extracted from the original data in order to feed the classifiers. Three widely used quantitative metrics have been adopted: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. As we can observe, the proposed method obtains high accuracies in all the considered HSI scenes. In general, the spectral-based classifiers (SVM, RF and MLP) obtain the lowest results in terms of OA, with the SVM achieving the best values among them (86.24%, 95.20% and 94.15% for IP, UP and SV, respectively) due to its ability to handle large datasets, while the RF yields the lowest accuracies (78.55%, 92.03% and 90.76%). Regarding the results obtained by the 2D-CNN and the 3D-CNN, it can be observed, on the one hand, that the spatial classifier is not able to outperform the SVM and the MLP, which indicates that more spatial information (i.e., larger d × d patches) is needed to improve the classification accuracy. On the other hand, the combination of spatial information (given by the pixel's neighborhood) and spectral signatures can improve the classification, as indicated by the results obtained by the 3D-CNN model. In this sense, the use of both sources of information (spatial and spectral) plays a fundamental role, and the architecture of the proposed model helps to naturally integrate them. If we focus on the results obtained for the IP dataset (Table 3), the proposed method is able to outperform the spectral-based methods SVM, RF and MLP (in terms of OA) by 13.22, 20.91 and 14.19 percentage points, respectively.
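The three metrics can be sketched from a confusion matrix; the small matrix below is illustrative only, not taken from the reported experiments:

```python
import numpy as np

def oa_aa_kappa(C):
    # C[i, j] counts samples of true class i predicted as class j
    C = np.asarray(C, dtype=float)
    n = C.sum()
    oa = np.trace(C) / n                                # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))            # mean per-class accuracy
    pe = (C.sum(axis=0) * C.sum(axis=1)).sum() / n**2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                      # Kappa coefficient
    return oa, aa, kappa

oa, aa, kappa = oa_aa_kappa([[50, 0], [10, 40]])
# -> OA = 0.9, AA = 0.9, Kappa = 0.8
```

OA weights every sample equally, AA weights every class equally (which matters for small classes such as Oats), and Kappa discounts the agreement expected by chance.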
The proposed method also outperforms the spatial and spectral-spatial classifiers 2D-CNN and 3D-CNN (in terms of OA) by 15.87 and 1.65 percentage points, respectively. An interesting issue is the classification of the Oats class, with only 20 labeled samples. The limited number of samples and their complexity negatively affect the considered classifiers, the MLP being the only one able to achieve 70% accuracy. In this sense, the internal connections of the proposed network become crucial to its generalization power, contributing to the final accuracy by adding more features to each block of the network, which greatly helps to characterize the data. These additional connections also allow reducing the standard deviation values, as can be observed in the first class (Alfalfa), which is very hard to classify due to its spectral variability. This demonstrates that the proposed architecture is quite robust and stable, suggesting the potential of our method to deal with the inherent complexity of HSI datasets. Figure 3 shows some of the obtained classification maps for the IP scene. The spectral-based methods present the characteristic "salt and pepper" noise in their classifications, which is reduced by the 2D-CNN (although it still exhibits misclassifications within large regions). The classification maps obtained by the 3D-CNN and the proposed method significantly reduce misclassifications, with the proposed method providing a slightly more similar map with regard to the available ground-truth. Table 4 reports the results obtained for the UP dataset. In this case, the proposed method is able to outperform the spectral-based methods SVM, RF and MLP (by 4.76, 7.93 and 5.14 percentage points, respectively). The proposed method also reaches an OA that is 5.19 percentage points higher than the 2D-CNN, and 0.68 percentage points higher than the 3D-CNN.
We can also observe that certain classes, such as the third and the seventh (Gravel and Bitumen, respectively), are hard to classify. In the case of the third class, the 2D-CNN and 3D-CNN try to overcome the problem using spatial information, reaching better results than the spectral-based methods. The proposed method is able to reach the best accuracies due to its communication channels between layers, which provide a variety of information that helps to obtain higher precision. Figure 4 shows some of the obtained classification maps, where the SVM, RF and MLP exhibit salt and pepper noise in the classification results; the 2D-CNN also presents noise, which is reduced by the 3D-CNN. The clearest map is the one provided by the proposed method, which also exhibits better delineation of urban classes such as Bitumen. Table 5 shows the results obtained for the SV dataset, where the proposed method reaches an OA that is 5.81, 9.2 and 5.99 percentage points better than the SVM, RF and MLP, respectively. This superiority is also observed in the comparison with the CNN classifiers, the 2D-CNN being 7.65 percentage points worse than the proposed method and the 3D-CNN 2.52 percentage points worse. Again, in those classes that are hard to classify, such as the eighth and the fifteenth (Grapes-untrained and Vinyard-untrained, respectively), the spectral classifiers exhibit sharp drops in classification accuracy. For instance, the Vinyard-untrained class is difficult to classify using only spectral information, while spectral-spatial classifiers such as the proposed one significantly increase the classification accuracy. Figure 5 shows some of the obtained classification maps. The one obtained by the proposed method exhibits almost perfect classification of the Vinyard-untrained and Grapes-untrained regions, as opposed to the other methods, which exhibit misclassified pixels (SVM, RF and MLP) and patches (2D-CNN and 3D-CNN) in the same regions.
Experiment 2: Performance with Different Amounts of Training Data
In this experiment, we use different percentages of training data with the aim of providing a relevant comparison of the performance of the proposed method against other HSI data classifiers when few training samples are available. In this sense, we compare the classification accuracy of the proposed approach with that obtained by spectral methods (in particular, SVM, RF, MLR, MLP and 1D-CNN) by considering different training percentages over the IP and UP datasets, following the same configuration as in Experiment 1. Specifically, we use 1%, 5%, 10%, 15% and 20% randomly selected labeled samples per class to perform the training step. We also compare these results with those obtained by spatial and spectral-spatial classifiers (in particular, 2D-CNN and 3D-CNN) when 1%, 3%, 5%, 10%, 15% and 20% of the training samples are used, employing spatial patches of 11 × 11 for the 2D-CNN model, and spectral-spatial blocks of 11 × 11 × n_bands for the 3D-CNN and the proposed model.
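The per-class random selection of training samples used throughout these experiments can be sketched as follows; the minimum-of-one-sample rule for tiny classes is an assumption, not stated in the text:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_pct, seed=0):
    # draw train_pct of the labeled samples of every class for training
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        per_class[lab].append(idx)
    train = []
    for idxs in per_class.values():
        rng.shuffle(idxs)
        k = max(1, round(len(idxs) * train_pct))  # assumed: keep at least one
        train.extend(idxs[:k])
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return train, test

labels = [0] * 100 + [1] * 20                 # toy ground-truth labels
train, test = stratified_split(labels, 0.15)  # 15% per class
assert len(train) == 15 + 3 and len(test) == 102
```

Sampling per class (rather than globally) keeps minority classes such as Oats represented in the training set even at the 1% setting.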
The results obtained for the IP and UP datasets can be observed in Figure 6. Focusing on the IP dataset, we can observe how the spectral classifiers are strongly affected by the number of available labeled samples and by the characteristic complexity of IP spectral signatures, the RF and MLR being the most strongly affected methods, and the SVM the one with the best results when fewer than 10% of the labeled samples per class are used. For 10-20% training percentages, the 1D-CNN and MLP obtain results similar to those of the SVM. The 2D-CNN is negatively affected by the limited number of training samples and also by the fact that the 11 × 11 neighborhood cannot properly capture spatial information, suggesting that IP samples are more difficult to characterize using only small patches of spatial information than using the full spectral information. However, the combination of spatial and spectral features allows the model to substantially improve its performance, even when only 1% of the available labeled samples are used, as the results obtained by the 3D-CNN suggest. In this sense, we can observe how the proposed method is able to reach the best accuracy for the different training percentages, closely followed by the 3D-CNN model when 15-20% of the labeled data are used. These results demonstrate that, although the proposed model is the deepest classifier, the reuse of different levels of features and the regularization performed by the internal connections indeed help to mitigate the overfitting and vanishing gradient problems. Similar conclusions can be obtained by looking at the UP results. In this case, when comparing the different spectral methods, we can observe that the spectral characteristics allow the 1D-CNN to reach the best results, closely followed by the SVM and the MLP when 5% of the available labeled data are used for training.
Also, the spatial features present in the UP dataset provide more discriminative information, allowing the 2D-CNN model to obtain better accuracy when more than 10% of the available training data are used. This represents a significant improvement with regard to the IP results. However, looking at the accuracy results for 1-10% training percentages, we can conclude that the spatial classifier is still the worst method when very few training samples are available. Again, the combination of spectral and spatial features allows the spectral-spatial models (in our case, the 3D-CNN and the proposed method) to achieve the best accuracy results, even with a small neighborhood window, which suggests that UP spatial features are more discriminative than those of the IP dataset. Finally, we emphasize that the proposed model is able to outperform the accuracy reached by the 3D-CNN model with all the considered training percentages.

Experiment 3: Deep&Dense Model Learning Procedure
Once we have tested how the proposed model deals with overfitting, we now study the characteristics of its learning process. To do so, we perform a comparison between the 3D-CNN model and the proposed one, taking into account the training accuracy and the overall accuracy (OA) reached by both models as the number of epochs increases, with the goal of observing how the network learns the features at different epochs and how this affects the final OA during the testing step. The obtained results are shown in Figure 7. For both the IP and UP datasets, we can observe that the training accuracy is higher than the OA during the first training epochs, whereas the OA tends to vary below the training accuracy until reaching a more stable state from the 48th epoch onwards.
Regarding the variations present in the OA during the earliest epochs, these are produced by the employed optimizer and the selected learning rate. In fact, the Adam optimizer [116] is characterized by being quite aggressive during the gradient descent process, producing large jumps in the function to be optimized. Also, the selected learning rates (0.001 for IP and 0.0008 for UP) are large enough to allow those jumps in the optimization process, the models being robust enough to stabilize themselves in relatively few epochs (between 40 and 48).
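For reference, the textbook Adam update [116] is sketched below; the bias-corrected moment estimates make the very first steps close to the full learning rate in magnitude, which is consistent with the early-epoch jumps discussed above:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # exponential moving averages of the gradient and its square
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    # bias correction: crucial in the first iterations, when m and v are ~0
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# one step from scratch: the step magnitude is ~lr regardless of gradient scale
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1, lr=0.001)
# theta moves from 1.0 to ~0.999, i.e., a step of ~lr
```

Because the effective step is roughly lr · m̂/√v̂, early updates are near ±lr per parameter, so a comparatively large learning rate such as 0.001 directly produces the visible jumps in the OA curves.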
Finally, we can observe that the proposed model is able to better adjust its OA values to the accuracies obtained during the training procedure, as compared with the standard 3D-CNN model. This suggests that the reuse of features through dense connections allows for deep supervision, where the additional information in each inner block enforces discriminative feature learning.

Experiment 4: Comparison between the Proposed Deep&Dense Model and Fast Convolutional Neural Networks for HSI Data Classification
In this experiment, we compare the proposed method with the DFCNN architecture in Paoletti et al. [71]. In this case, two HSI datasets have been considered, IP and UP, using a fixed number of samples per class, as shown in the samples column of Tables 6 and 7. These samples have been randomly extracted from the original scenes in patches of size 9 × 9, 19 × 19 and 29 × 29 for the DFCNN, and 9 × 9 for the proposed method, when considering the IP dataset. For the UP dataset, patches have been fixed to 15 × 15, 21 × 21 and 23 × 23 for the DFCNN, and 15 × 15 for the proposal (the difference is due to the higher spatial resolution of the UP scene). The main objective of this experiment is to demonstrate the performance of the proposed architecture when employing the minimum amount of spatial information in comparison with the DFCNN, highlighting its robustness and precision compared to that model. The obtained results have been calculated as the average of 5 Monte Carlo runs. Table 6 shows the results obtained for each class of the IP dataset, as well as the OA, AA and Kappa statistic. As we can see in Table 6, the DFCNN improves its accuracy as the input patch size increases. The proposed method is able to reach an OA value 4.71 percentage points higher than the DFCNN with the same amount of spatial information (9 × 9 patch size). It is also able to reach an OA 2.36 percentage points higher using 77.56% less spatial information than the DFCNN with a 19 × 19 patch size, and 0.78 percentage points higher using 90.37% less spatial information than the DFCNN with a 29 × 29 patch size. Focusing on runtime, we can observe that the fast full-GPU implementation of the DFCNN is able to reach the lowest execution time when spatial patches of size 9 × 9 are processed, while the proposed model is faster than the DFCNN when patches of 19 × 19 and 29 × 29 are adopted. Table 7 shows the results obtained for the UP dataset.
Again, the DFCNN improves its results by adding more spatial information but, after a saturation point, the improvement is not relevant (the difference between the 21 × 21 and 23 × 23 patch sizes is just 0.01 points in OA). With a 15 × 15 patch size, the proposed architecture is 1.09 percentage points better. Even with patches of size 21 × 21 and 23 × 23, the DFCNN cannot improve the accuracy obtained by the proposed method. The aforementioned results demonstrate that the proposed model architecture is able to achieve the best global performance on two HSIs with different spatial resolutions, exhibiting relevant performance improvements even with reduced input patch sizes. Finally, focusing on runtime, we can observe that the fast full-GPU implementation of the DFCNN again reaches the lowest execution time. In both cases (IP and UP datasets), we emphasize that the DFCNN has been implemented to be highly efficient on the GPU. On the other hand, as Figure 7 demonstrates, the proposed network is able to reach a good OA value in the first epochs (around 40-48 epochs), although we run several additional epochs in order to ensure the correct convergence of the model.
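The "less spatial information" figures quoted above for the IP comparison follow directly from the patch areas, since a d × d patch contains d² pixel positions:

```python
def spatial_reduction(d_small, d_large):
    # percentage of pixel positions saved when shrinking the input patch
    return (1.0 - d_small ** 2 / d_large ** 2) * 100.0

r_19 = spatial_reduction(9, 19)   # 9 x 9 instead of 19 x 19 -> 77.56%
r_29 = spatial_reduction(9, 29)   # 9 x 9 instead of 29 x 29 -> 90.37%
assert round(r_19, 2) == 77.56 and round(r_29, 2) == 90.37
```

This quadratic growth in patch area also explains why larger patches quickly dominate both memory consumption and runtime in patch-based classifiers.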

Experiment 5: Comparison between the Proposed Deep&Dense Model and Spectral-Spatial Residual Networks for HSI Data Classification
Our last experiment performs a comparison between the proposed network and a recent state-of-the-art spectral-spatial classification method. In particular, we compare the proposed architecture with the spectral-spatial ResNet (SSRN) proposed by Zhong et al. [119] for HSI data classification, a model that also employs shortcut connections between its residual units, increasing the variety of the input information of each one by adding an identity mapping. To carry out this experiment, 20% of the available labeled samples in the IP and KSC datasets have been randomly selected, while for the UP dataset 10% of the available labeled samples have been randomly selected. Also, patches with different spatial sizes have been considered: 5 × 5, 7 × 7, 9 × 9 and 11 × 11. The results have been obtained as the average of five Monte Carlo runs. Table 8 shows the results of this experiment, indicating that the proposed architecture provides remarkable improvements in terms of OA regardless of the size of the considered input patches (only in the case of KSC with a patch size of 7 × 7 does the SSRN reach an OA that is slightly superior to the one obtained by the proposed approach). This suggests that the proposed deep&dense model exhibits good generalization ability, extracting discriminative features that are effectively exploited in each dense block, adding more information to each layer. Table 8 also shows that the standard deviation values for the proposed approach are very small, indicating that our deep&dense CNN architecture exhibits robust and stable behavior.

Conclusions and Future Lines
This work introduces a new deep neural network architecture for the spectral-spatial classification of HSI datasets. The proposed model adds internal connections between the hidden blocks of convolutional layers in order to forward-propagate all the obtained feature maps to all the layers that compose the network. As a result, a greater quantity and variety of information is provided to the internal operators, which are able to better discriminate the features contained in the original HSI, resulting in improved classification accuracies. Our method has been evaluated using four well-known HSIs and compared to traditional spectral-based classifiers (SVM, RF and MLP), convolutional neural models (2D-CNN, 3D-CNN and DFCNN) and new deep techniques (SSRN), using different training percentages as well as input spatial dimensions. The obtained results reveal that the proposed model exhibits competitive advantages with regard to state-of-the-art HSI data classification methods, demonstrating higher robustness and stability when dealing with limited amounts of training data. This makes the proposed approach a powerful tool for the analysis of large and complex hyperspectral datasets. In future developments, we will consider additional HSIs for the evaluation of the method.