Adaptable Convolutional Network for Hyperspectral Image Classiﬁcation

: Nowadays, a large number of remote sensing instruments are providing a massive amount of data within the frame of different Earth Observation missions. These instruments are characterized by the wide variety of data they can collect, as well as the impressive volume of data and the speed at which it is acquired. In this sense, hyperspectral imaging data has certain properties that make it difﬁcult to process, such as its large spectral dimension coupled with problematic data variability. To overcome these challenges, convolutional neural networks have been proposed as classiﬁcation models because of their ability to extract relevant spectral–spatial features and learn hidden patterns, along their great architectural ﬂexibility. Their high performance relies on the convolution kernels to exploit the spatial relationships. Thus, ﬁlter design is crucial for the correct performance of models. Nevertheless, hyperspectral data may contain objects with different shapes and orientations, preventing ﬁlters from “seeing everything possible” during the decision making. To overcome this limitation, this paper proposes a novel adaptable convolution model based on deforming kernels combined with deforming convolution layers to ﬁt their effective receptive ﬁeld to the input data. The proposed adaptable convolutional network (named DKDCNet) has been evaluated over two well-known hyperspectral scenes, demonstrating that it is able to achieve better results than traditional strategies with similar computational cost for HSI classiﬁcation.


Introduction
Within the remote sensing (RS) field, major advances in spectroscopy and imaging technologies since the first approaches in 1979 have enriched the knowledge gained about the Earth's surface, significantly enhancing our capacity to explore and exploit terrestrial resources [1]. As a consequence, hyperspectral imaging (HSI) has become one of the foremost data acquisition methods within RS, as it captures the spatial distribution of large observation areas whilst differentiating the solar absorption features of the different materials contained in the target scene. Indeed, HSI facilitates the acquisition of detailed spectral information, gathering large datacubes composed of a mesh of N 1 × N 2 pixels, which collect in C bands of the solar radiation reflected by materials along the visible-to-shortwave infrared (VSWIR) regions of the electromagnetic spectrum. As a result, each hyperspectral pixel contains the continuous radiance spectrum (known as spectral signature) of a particular material, in the form of peaks and valleys that capture the physico-chemical behavior of the material at different wavelengths [2]. The large amount of rich spectral information provides a detailed characterization of the land cover, as each signature is unique for each type of material. Hence, HSI data are very useful in a wide variety of fields with a high economic and social impact, such as: climate change study; crop management, which ranges from pest and plague detection to the prevention and recovery of degraded and eroded soils; forestry; mineral identifications and resource exploitation; urban planning and land-use/land-cover (LULC) classification; military, defense and intelligence applications; risk prevention; and disaster monitoring, among many others [3][4][5][6][7].
Nowadays, HSI data analysis is considered a hot topic due to the great relevance of this data in RS and its consequent impact on the applications mentioned above. A multitude of methods have been implemented to efficiently process and extract the relevant spectral-spatial information contained in HSI cubes. Nevertheless, these methods must cope with several challenges, drawbacks and limitations that characterize this kind of data. For instance, there is extensive literature on unmixing methods that attempt to separate the spectral signatures contained in the scene by searching for pure or almost-pure pixels (endmembers) and measuring the spectral mixture and composition of the remaining non-pure pixels [8,9]. Indeed, spectrometers often exhibit a low spatial resolution due to instrument limitations and the restriction imposed by their high data rate, so the compactness of channels results in the limitation of pixel size. In addition, other uncontrolled phenomena may degrade and introduce non-desired artifacts within the data obtained [10,11], for instance limitations in the radiometric and geometric resolution of the sensor, atmospheric scattering, lighting disturbances (such as light obstruction by aerosols and clouds), sensor noise or imperfect imaging optics hinder the data acquisition, obtaining noisy images in the end with an important number of mixed pixels that greatly degrade the detection and recognition performance, preventing the material identification.
The resulting high data variability is compounded by the lack of labeled samples. Indeed, although it is relatively easy to collect HSI data, obtaining image-like label information is utterly arduous [12]. These challenges severely impair classification algorithms, which analyze the HSI data distribution in order to identify the land cover contained in each pixel, assigning them the corresponding label. Traditionally, HSI classification methods implemented isolated pixel analysis strategies through non-guided analysis of data divergence and similarity-based measurements (unsupervised methods) [13][14][15] or guided training with known ground-truth information (supervised classifiers) [16][17][18][19]. More recently, semi-supervised and hybrid strategies are gaining popularity to deal with the lack of data problem, estimating the distribution of unlabeled data from a very small pool of labeled data (semi-supervised models) [20][21][22]. Among these methods, the support vector machine (SVM) stands out for its efficiency in processing large data, it being relatively easy to include kernels to adjust and improve its performance in the data [23,24]. Other widely used methods are: probabilistic methods, such as multinomial logistic regression (MLR) because of its outstanding results when few training samples are provided [25,26]; ensemble methods such as decision trees (DTs) and random forests (RaF) [27,28]; and iterative classifiers, such as multilayer perceptrons (MLP) and other shallow/deep artificial neural networks (ANNs), which do not require prior information of the data distribution [29]. In fact, the great generalization capability of ANNs, their great architectural flexibility and their impressive ability to extract abstract features and relationships from raw data, i.e., without the need for handcrafted features, have made them very popular classifiers within HSI data processing, achieving a never-before-seen performance on current HSI benchmarks [30,31].
Nevertheless, learning based on isolated spectrum analysis is not stable due to the large spectral mixture of the HSI data, which is mainly introduced by the low spatial resolution. In this context, some works introduce spatial information in order to eliminate the classification uncertainty by stating that neighboring pixels must belong to the same class. This mitigates the spectral mixture, as more spectral signatures are introduced to support the classification as well as spatial information in the form of spatial patterns such as shapes, edges or textures. As a result, many classification algorithms have been implemented to process the spectral-spatial information contained in HSI data [32][33][34]. These algorithms may usually be divided into those that concatenate the spectra of several pixels, applying or not a band reduction/selection step in order to process them all together, and those that extract shape and texture features, such as morphological profiles (MP) [35] or global Gabor features [36]. Despite their accurate results, many of these methods are based on a feature vector, vectorizing the original 3D data in order to process it. In this sense, the spatial and structural relationships of the data are disregarded [37]. In contrast with these methods, the convolutional neural network (CNN) is based on multidimensional kernels that process the spectral-spatial information of the HSI data without flattening the data. Indeed, the convolutional model is designed to process multi-dimensional input samples, respecting the original distribution. This is performed using kernels locally connected to small regions of the input data. These kernels are defined as arrays with one dimension more than the input data, thus indicating, on the one hand, the number of filters to be applied and, on the other hand, the size of the region where they will be applied. Hence, different kernels can be defined, such as: 1D kernels, which are applied locally to several channels of the input data; 2D kernels, which are locally applied to areas of the input maps, and 3D kernels, which are locally applied to both areas and channels of the input maps. This provides different ways to process HSI data, where 2D and 3D kernels are the most widely used because of the spectral-spatial processing they conduct [31].

The Rigid Kernel: Limitation of CNNs for Spatial Analysis
It is noteworthy that ANNs (in general) and CNNs (in particular) try to model the visual cortex by creating a deep stack of layers composed of neurons. These layers are connected by weights that modify the inputs to extract their features. As a result, each layer produces a feature representation, which is increasingly abstract and tailored to the task for which the network is designed. In this way, by learning the weights that best fit the data, they are able to extract the best data representations to enhance the classification performance [38]. Focusing on HSI data processing, after the success of the first attempts [39], a vast amount of deep learning (DL) approaches have been used on HSI processing [31]. Particularly in the case of CNNs, the flexibility of the convolution kernel provides a natural processing as it is applied as a sliding and locally-connected multidimensional-array to the input data. In addition, the backpropagation mechanism adjusts the kernel weights, creating a filter adapted to the data to automatically extract the most discriminative features. Hence, the network extracts abstract representations tailored to the classification task, obtaining very accurate results. This behaviour has positioned CNN as the current state-of-the-art in HSI analysis, and there is a vast literature of adapted and improved HSI-CNNs [2,[40][41][42][43][44][45][46][47][48][49].
Nevertheless and despite its impressive results, the CNN has some drawbacks related to its training procedure and its tendency to overfit. Indeed, the CNN consumes a large amount of labeled samples in order to adjust its parameters properly and to be effective in generalizing to unseen data, i.e., the CNN requires a representative training set to cover the high data variability and its classification results tend to degrade in the absence of labeled samples. Recently, more complex and deeper models have been developed to improve the performance of the HSI classification process, introducing skip connections that reuse extracted data representations at different levels of the architecture in order to avoid overfitting and vanishing gradient problems, such as residual models (ResNets) [50,51], developing dual-path models to enhance the feature extraction [52] or introducing new architectures based on capsule networks (CapsNets) to process the spatial relationships of extracted features [53,54]. Despite these improvements, the CNN still struggles with a significant loss of information caused by the rigid structure of the convolution kernels. Inspired by the visual cortex, the multidimensional array that composes the kernel is organized as a regular square mesh, which is overlapped over the data, applied as an affine transformation, and sliding by means of a stride hyperparameter. In this regard, the structure of the kernel has a decisive impact on how the input data affect the output features, and hence the final classification. This is due to the dependencies between the generated output feature and a certain input region, known as the receptive field (RF), i.e., the size of the input region that has produced the output feature. Therefore, the receptive field depends on the size and shape of the convolution kernel. Due to the their rigid structure, convolution kernels suffer a considerable loss of information, as the objects located in remotely sensed images do not usually have regular shapes and may appear different in size and orientation, as is observed in Figure 1. Although data augmenting and multiscale-based strategies [55,56] have traditionally been employed to overcome this limitation, their kernels exhibit the same restraint, i.e., they cannot be adaptively tailored with respect to the spatial features.  Often, objects located in remotely sensed hyperspectral images do not have regular shapes and may appear with different sizes and orientations. For instance, some data augmentation methods apply spatial transformations to the input samples to increase the size of the datasets and cover the variability in the data (a), whilst other scenes contain the same elements with different orientations and poses; for example, in (b) we can observe several parking areas, with different sizes and where the cars are placed differently. However, the convolution kernel disregards these spatial variations, defining a rigid regular mesh. Some strategies have been implemented to face the limitations of the rigid kernel. For instance, Zhu et al. [57] proposed the deformable HSI classification network (DHCNet), where deformable convolutional sampling locations are introduced by applying 2D offsets to the input features. As a result, deformable feature images are obtained, which adaptively combine the similar neighboring information of each pixel into fixed grids that will be employed by standard convolution layers. However, this proposal does not control the impact that each item of the fixed grid has on the outputs due to the intrinsic behavior of the convolution RFs, where central items exhibit a higher impact than those items in the outer area of the RF, following a Gaussian distribution. In this sense, there is a much smaller effective area (ERF) within the theoretical RF (TRF) [58], whose information flow is crucial for the model performance.

Contributions
This paper attempts to resolve the aforementioned limitation by adjusting the ERF to the HSI data through deformable kernels (DKs) [59], which adapt the kernel spaces leaving the data intact, instead of reconfiguring the data to meet geometric deformations. In this sense, not the input features (which remain untouched), but the kernel itself is modified by means of offsets to produce the deformation. Moreover, to enhance the data deformation modeling, deformable convolution (DC) [60] is combined with DK. The resulting model is an adaptable convolutional network, which adapts the application of each convolutional kernel to the features of the input data by deforming its kernel space. In this context, the main contributions of this work are:

1.
A strategy based on DKs is proposed to train CNNs for HSI data classification tasks in an effective way. Moreover, CNN with DKDC (DKDCNet) has been implemented and tested over two widely-used HSI scenes.

2.
An experimental section is provided to compare the performance of KDCNet with common DL-based classification models. Conducted experiments demonstrate the DKDCNet can outperform the classification performance achieved by popular DLbased classification models.
The rest of this work is organized as follows. Section 2 discusses the role of the RF, providing the details of the deformable kernel method. Section 3 describes the experimental results, evaluating the classification performance exhibited by the implemented DKDCNet over two well-known HSI datasets. Finally, Section 4 presents the obtained conclusions and future work.

The Receptive Field
Traditionally, an ANN defines an architecture of interconnected neurons that communicate with each other through weighted connections. However, this fully-connected (FC) architecture presents severe limitations in image processing, involving an expensive computational burden coupled with severe overfitting due to a large number of parameters. On the contrary, CNNs have adapted their behaviour to mimic the visual cortex, where output neurons are activated by some input visual stimuli that fall within their RF, following a hierarchical organization where deeper neurons are activated by combined and more complex stimuli. In this context, the visual cortex behaviour is artificially implemented by the CNN through a deep stack of convolution layers, where each layer defines a filter bank, i.e., a set of learnable, locally-connected and shared weights that composes a linear n-dimensional kernel. Indeed, the kernel is basically a group of data-fitted filters which slide over the input data following a stride, overlapping and applying themselves to the data. Each application produces a weighted sum between the filter weights and the inputs falling within the application region. Moreover, in order to determine the existence of the features they filter (such as edges and shapes), a non-linear activation function is introduced to encode the neural response of the convolution layer to the seek features. Mathematically, this behavior is defined by Equation (1): where the superscript l indicates the l-th layer of a CNN. Hence, X (l−1) ∈ R P×P×K (l−1) and X (l) ∈ R Q×Q×K (l) denote the input and output data, also known as input and output feature maps, respectively, where the former comprises K (l−1) maps of P × P features (pixels) and the latter K (l) maps of Q × Q features. W (l) ∈ R K (l−1) ×M×M×K (l) denotes the convolution weights of the l-th layer (to simplify the notation, the bias is ignored), composed of K (l) filters of M × M weights. Coordinate k enumerates discrete kernel positions at the l-th layer within the support of the regular mesh K = [−M/2, M/2] 2 ∩ Z (where Z denotes the set of integers), whilst j is indexing each output coordinate. Finally, H defines the activation function, which is usually implemented with ReLU. This is graphically depicted by Figure 2. Particularly, Figure 2b depicts a 3 × 3 kernel which is applied with stride 1 and zero padding over a 6 × 6 input feature map. In this context, for the output coordinate j = (3, 5, z), k will sample the input features positions from (2, 4) to (4, 6) (using part of the padding added to the map border) according to the grid based on K = [−1, 1]. As we can observe, the output unit X (l) i only depends on a small region of the input feature map that the kernel is "looking" at. This region is denoted as the RF for that unit, so any information within the input feature map but outside the RF will not affect the value of the output unit [61]. Figure 3a represents the number of times that a 3 × 3 kernel has been applied on a 13 × 13 feature map with stride 1, i.e., the overlapping of all RFs of the output features. In this sense, the central pixels are used more often than the pixels located at the edges of the map, i.e., central RF pixels are part of more calculations, playing a major role within the convolution layer as they propagate their information more times. This follows the behavior of the visual cortex, where local features are detected and processed in early visual stages (or layers) of the cortex and then progressively combined in a hierarchical fashion, creating more complex patterns [62,63]. The connotations behind this behaviour are quite intuitive: the pixels in the RF do not contribute equally to the layer output response [58]. There is an effective receptive field (ERF) that defines the areas of the input data that influence and modulate the output activations by tracing the hierarchy back from the considered output feature to the input image. This extends to the entire network; for instance, Figure 3b highlights the RF of a CNN output response relative to the input data. In this sense, the ERF of a CNN follows a Gaussian distribution, defining not only a region to "look at", but also exponentially concentrating the attention on the middle of the feature map. Indeed, it can be identified as a Gaussian-based soft attention map [59,64].
Following Equation (1), the computation of the output coordinate j within the final (L-th) layer can be generalized by Equation (2) as follows (H is omitted to simplify the formulation): where X is the original HSI input and k l the discrete kernel positions related to the kernel weights W (l) at the l-th convolution layer. In this regard, the ERF value of an output coordinate j that depends on input i can be defined as R (L) (i; j): (a) (b) Figure 3. Graphical representation of the receptive field. (a) Considers a unique input layer to which a 3 × 3 kernel is applied with a stride of 1. The color map indicates the number of times the kernel has been applied on each feature (i.e., the darker the color, the more kernel applications). The number of times the kernel has been applied over a feature is also included. For instance, the edge pixels are used between 0 and 3 times, while the center pixels are used 9 times. (b) Receptive field of an output feature obtained by a CNN composed of two 3 × 3 convolution layers (in red) and a 2 × 2 pooling layer (in yellow).
where 1 defines the indicator function. As we can observe, the ERF depends on the data and kernel sampling locations j and k, and the kernel weight arrays {W (l) } L l=1 . In addition, if the m-th convolution kernel W (m) is implemented by a 1 × 1 kernel of a single learnable parameter at positionk m , i.e., W (m) k m , the ERF can be obtained by extracting the weight of the summation and multiplying it directly to the remaining kernels, following Equation (4): where S = [1, L] \ m contains all the layers of the CNN architecture except the m-th layer. Furthermore, any M × M kernel can be considered as a composition of M 2 point-wise 1 × 1 kernels [59], hence Equation (3) can be reformulated as: In this regard, although the ERF can be modified in several ways, for instance by stacking more or less convolution layers, changing the kernel size M, implementing subsampling stages through pooling layers and striding, or including dilated convolutions and multiple data paths, the rigid shape of the convolution kernel imposes a serious limitation, as it applies a regular mesh regardless of the spatial characteristics of the elements contained in the HSI image. In this context, the convolution kernel must be properly controlled to ensure that the entire relevant region is covered, dealing with data deformations. Taking Equation (5) into account, we automatically adapt the ERF to spatially fit HSI data, developing a deformable kernel (DK)-based architecture [59] for spectral-spatial HSI analysis.

Adaptable Deep Model with Deformable Kernels and Deformable Convolution
The strategy for adapting the ERF to geometric deformations and transformations of the data is quite simple. Indeed, the DK method takes the layer input features X (l−1) and automatically generates a group of kernel offset ∆k l . Then, taking the original kernel weights W (l) with a certain size M × M , the DK method performs an interpolation between them and the generated offsets ∆k l , obtaining as a result the desired kernelW (l) with size M × M. This final kernel is convolved with the input data X (l−1) in order to obtain the output feature maps. The procedure is detailed below.
The DK method attempts to adapt the computation of how much each input pixel contributes to the final output response directly during runtime. In this context, its purpose is to modify the kernel space by resampling the kernels weights, adapting them to the input data but leaving the data untouched. As a result, Equations (2) and (5) are re-interpreted as: Focusing on Equation (6a), the DK can be addressed as a bilinear interpolation problem within the rigid kernel grid and the desired adaptive/deformable kernel.
Denoting W (l) ∈ R M ×M as the original kernel weights andW (l) ∈ R M×M as the desired deformable kernel of the l-th layer (where the M is any kernel scope size that meet the restriction of the number of sampling locations M 2 determined by M), the DK conducts a sub-pixel sampling in the kernel space to handle data deformation by resampling the original weights intoW (l) through a group of learned kernel offsets ∆k l . These offsets are automatically computed from the input data X (l−1) through a generator G; therefore, they are automatically adapted to the characteristics of the input data. It should be noted that local generators have been adopted to build the DK method. In this sense, each G (l) has been implemented by a lightweight rigid convolution layer with the same configuration as the target kernelW (l) and 2M 2 output channels. Moreover, each offset of ∆k l is indexed by output locations j. Knowing the original kernel weights W (l) and the offsets {∆k l }, the DK samples a new set of kernel weightsW (l) employing a bilinear sampler B following Equation (7):W where B = max(0, 1 − |k l,x + ∆k l,x − k l,x |) max(0, 1 − |k l,y + ∆k l,y − k l,y |). Finally, the ob-tainedW (l) is applied over the input volume following Equation (1). In contrast to other methods, such as the DC which resamples the features to counter the effects of geometric data deformation [57,60], the application of the kernel offsets ∆k l to each kernel location k l and deforms the rigid grid imposed by K, modifying the kernel space but keeping the data untouched, which preserves the input information intact. In this sense, DK and DC are perfectly compatible methods that may be combined to enhance the feature extraction. Therefore, with the aim of completing the data deformation modeling, DK is combined with DC to implement the proposed adaptable model for HSI data classification, which have been denoted as DKDCNet. In this sense, the DC learns a group of data offsets {∆j} with respect to data positions j. DC is combined following Equation (8): As can be observed, the DC method introduces a learnable 2D offset ∆j to the regular grid, sampling the positions of the input feature map, thereby deforming the ERF of the activation unit. Similarly to ∆k l , the obtained ∆j offsets are learned from the input feature maps using additional lightweight convolutional layers. As a result, the data is first deformed and then the deformable kernel is obtained by the DK method and applied to the feature maps. In this sense, the deformations applied within each DKDC block depend on the input features in a local and adaptive manner. Figure 4 provides the graphical representation of the implemented adaptable network, detailing the DK computation. As we can observe, the developed model computes three principal components from the original HSI datacube by PCA, obtaining 25 × 25 × 3 patches from each HSI pixel to encode the most relevant spectral-spatial information whilst reducing the dimensionality. These patches are sent to the adaptable DKDCNet, whose architecture is detailed in Table 1, following [57]. In particular, the implemented DKDCNet defines a convolution block with DK and DC mechanisms (denoted as DKDC). These layers define an offset (first row), original kernel scope (second row), and final depth-wise layer (third row), where F = 3 is the number of spectral bands, and N = number of classes. Maxpooling and average pooling have also been introduced. The model has been trained in about 5000 iterations, applying the ADAM optimizer with a learning rate of 1 × 10 −3 and a batch size of 150.

Experimental Environment
An implementation of the proposed HSI DKDCNet has been developed and tested on a hardware environment with an X Generation Intel ® Core™i9-9940X processor with 19.25 M of Cache and up to 4.40 GHz (14 cores/28 way multi-task processing), installed over a Gigabyte X299 Aorus with 128 GB of DDR4 RAM. Furthermore, a graphic processing unit (GPU) NVIDIA Titan RTX GPU with 24 GB GDDR6 of video memory and 4608 cores has been used.

Hyperspectral Datasets
One of the greatest achievements of DL is to design and exploit the same neural architecture for the classification of different images. In this sense, the experiments developed an aim not only to evaluate the performance of the model in terms of accuracy and complexity (understood as number of parameters), but also its generalization capacity. Thus, following [57], the same architecture has been exploited for classifying two real widely-used HSI scenes with different spectral-spatial characteristics: (i) University of Pavia and (ii) University of Houston scenes. Details are provided below.

•
The University of Pavia (UP, Figure 5) dataset [65] is an HSI scene gathered by the airborne reflective optics system imaging spectrometer (ROSIS-3) over the university campus of Pavia, northern Italy, in July 2002. The scene consists of 610 × 340 pixels with 1.3 m of spatial resolution, and 113 spectral bands with a spectral range coverage ranging from 430 to 860 nm. The ground truth contains a total of 42,776 labeled samples grouped into nine different land-cover classes, covering different urban elements, including asphalt, meadows, gravel, trees, metal sheet, bare soil, bitumen, brick, and shadows. • The University of Houston (UH, Figure 6) dataset [66] is an HSI scene collected by the lightweight compact airborne spectrographic imager (CASI) over the Houston University area. It consists of 349 × 1905 pixels with 2.5 m of spatial resolution and 144 channels in the 380 nm to 1050 nm spectral region. The ground truth comprises 15,029 labeled samplescontaining 15 different classes within an urban environment too.

Experimental Discussion
The performance of the DKDCNet for HSI data classification has been evaluated through widely used classification metrics, in particular: (i) the overall accuracy (OA), which measures the probability that a sample will be correctly classified by the classification model; (ii) the average accuracy (AA), which measures the mean of each obtained classification accuracy per class; and (iii) the statistical Cohen's Kappa coefficient K(×100), which measures the inter-rater agreement between the observed and expected agreements. Moreover, the number of parameters has been considered.
In addition, the accuracy performance of the network has been compared with standard CNN [57] and with other deformable-based models, in particular the DC-based network (DCNet) and DHCNet [57]. Finally, several well-known HSI classification methods have also been considered, such as the pyramidal ResNet (pResNet) [51], and the spectral-spatial squeeze-and-excitation residual bag-of-feature (S3EResBoF) model [67].

Experimentation on UP Dataset
The first experiment evaluates the performance of the proposed model in the UP image. In this sense, a different number of training samples have been randomly selected from the pool of labelled data; particularly, 45, 55, and 65 random samples per class have been considered to train the proposed DKDCNet and the standard CNN, DCNet, DHCNet, pResNet and S3EResBoF models.
In this context, it is quite interesting to compare the behavior of the original convolutional model and the proposed model for HSI data analysis throughout the training epochs, both in terms of training and validation loss and in terms of accuracy during the training and validation stages. In this regard, Figure 7a depicts the comparison between the proposed model and the standard CNN considering 55 randomly selected training samples. In particular, from left to right we can observe the evolution of both models regarding their training and validation loss and their training and validation accuracy along different epochs. Focusing on training and test loss, the proposed model seems to achieve a training loss quite similar to CNN during the first epochs; even its test loss is higher than CNN in the first 10 epochs. Nevertheless, the training loss drops drastically, below the CNN already in the first five epochs, whilst the validation loss drops abruptly from the first 15 epochs onward. This suggests the model parameters are not over-fitted to the training data, which allows the optimizer to converge to a more optimal solution than the CNN. Indeed, during the training stage, the obtained accuracy climbs rapidly above 90%, whilst the validation accuracy evolves slower but better than the CNN.   Figure 8a provides the measurements of the OA, AA, and Kappa coefficients for each model, along with their respective standard deviations, in order to compare the accuracy achieved by each network. These data have been obtained considering 45, 55, and 65 randomly selected training samples after 10 Monte Carlo runs. As an overall conclusion, the proposed DKDCNet model outperforms the results of the other models. When evaluating in detail, the S3EResBoF model achieves the worst results with the highest deviation due to its feature processing; indeed, its Kappa coefficient does not exceed 88%, as we can observe in Table 2, whilst its OA and AA measurements involve the highest uncertainty due to their large variability. On the contrary, the CNN and the pResNet models achieve better results, where the CNN has a lower standard deviation compared to the pResNet. This is due to the higher complexity of the pResNet model, which introduces 3.22M parameters that must be appropriately learned. Indeed, behaviours of the CNN and the pResNet vary depending on the amount of training data; thus, the pResNet, which is more complex than the standard CNN, is greatly affected by the lack of samples, but outperforms the CNN when it has enough samples to properly fit its parameters. Regarding deformable models, the implemented DKDCNet clearly outperforms the DCNet and DHCNet models, exhibiting a higher OA and a very small deviation, which makes it very stable even when there are few training samples.   Table 2 provides the estimated number of parameters. The pResNet model is the most complex classifier with more than 3M parameters, followed by DHCNet, DCNet, and CNN by far. Indeed, such deformable solutions as DHCNet and DCNet exhibit a large number of parameters. On the contrary, the proposed adaptive model has significantly less parameters, exhibiting not only less over-parametrization (which helps to reduce overfitting problems), but also a lower memory consumption. In this sense, the proposed network contains light depth-wise layers that prevent the overgrowth of the parameters. This feature makes it not only a lightweight, but also a pretty efficient model for HSI data classification. In this context, the adaptation of this model to low-power environments or embedded devices for onboard processing can be highly interesting.  Figure 9. Trainable parameters of the models considering the image University of Pavia.
Finally, Table 2 reports in more detail the obtained classification results in terms of OA, AA, Kappa (with the respective standard deviations after 10 MonteCarlo runs), and number of trainable parameters. As already observed, the implemented HSI DKDCNet outperforms the accuracy results achieved by the other five deep classification models. Moreover, its low deviation results indicate its classification performance is quite stable. Finally, the model lightness in terms of the number of parameters makes it a potential candidate for onboard processing.

Experimentation on UH dataset
The second experiment evaluates the performance of the proposed model on the UH image. Similar to the first experiment with the UP scene, different numbers of training samples have been randomly selected from the pool of labeled data; particularly, 30, 40, and 50 random samples per class have been considered to train the proposed DKDCNet and the standard CNN, DCNet, DHCNet, pResNet, and S3EResBoF models.
Once more, a comparison between the standard CNN and the proposed DKDCNet has been conducted to evaluate the behaviour of the models along different epochs. Figure 7b depicts, from left to right, the evolution of both models regarding their training and validation loss and their training and validation accuracy along different epochs considering 40 randomly selected training samples. Focusing on training and test loss, the proposed model has a fairly similar evolution to that presented in the previous experiment. Indeed, the training loss drops drastically between the first 10 epochs, overcoming the CNN model, whilst the validation loss drops abruptly from the first 15 epochs onward. Furthermore, during the training stage, the obtained accuracy climbs rapidly above 90%, whilst the validation accuracy evolves slower but better than the CNN, reaching an OA of over 90%.
Furthermore, a deeper comparison is conducted between the proposed DKDCNet and the considered CNN, pResNet, S3EResBoF, DCNet, and DHCNet models. Figure 8b provides the obtained OA, AA and Kappa coefficients for each model, along with their respective standard deviations, in order to compare the accuracy achieved by each deep network. Following the previous experiment, these results have been obtained considering 30, 40, and 50 randomly selected training samples after 10 Monte Carlo runs. Once more, similar behaviors are observed in the UH scene, where the proposed method reaches the best measurements in terms of OA, AA, and Kappa. Again, the S3EResBoF model achieves the worst results with the highest deviation for each measurement. Focusing on the CNN and the pResNet models, in this case, due to the data distribution of UH scene, the pResNet obtains better performance than CNN, since its skip connections provide a path to reuse local features. On the other hand, the DKDCNet outperforms the deformable DCNet and DHCNet models. Table 2 reports not only the OA, AA, and Kappa measurement, but also the complexity of each model. Indeed, the first column of Table 3 provides the estimated number of parameters. It must be noted that the size of the model depends on the size of the input data and the number of classes, which increases or decreases the number of parameters to be trained. In this sense, models with a UH scene are slightly larger than those with UP since they contain 16 classes instead of 9. This increase in the number of classes is hardly noticeable in very large models, such as the pResNet or the DHCNet, or even the S3EResBoF (with only 220K), so their approximation is practically the same as the parameters in UP. However, in the proposed DKDCNet model, the increase can be observed because of its reduced number of parameters. Finally, Figure 10 provides the classification maps of the UH scene obtained by each network considering 30 randomly selected training samples. As we can observe, the HSI DKDCNet extract classification maps with high quality, even when small training sets are employed.

Conclusions and Future Lines
This paper presents a new deep convolution-based neural network for HSI data classification. The model is inspired by deformable kernels, which are combined with deformable convolutions to automatically adapt the effective receptive field of the CNN classifier to spatial deformations within remote sensing HSI data. In this regard, the adaptable classification network fits its parameters to the HSI data patches in an optimal way, not only by being able to deform the convolution (i.e., adding an offset to the feature positions), but by automatically exploiting the deformation of the kernel itself applied to each receptive field on the input feature volume. Moreover, the deformations included in the model through different offsets are learned directly from the input data in a local and adaptive way. The conducted experiments over two widely-used HSI scenes (i.e., the University of Pavia and University of Houston scenes) demonstrate the HSI DKDC model is able to overcome the current state-of-the-art HSI data classification considering different deep models, some of them also with data deformation modeling, such as DCNet and DHCNet. In this sense, the implemented network is able to extract relevant spectral-spatial features more effectively from HSI data, using fewer parameters and reaching satisfactory accuracy results.
In future works, new deformable schemes will be studied to enhance the data deformation modeling within deep architectures, also improving the flexibility of current neural models.
Author Contributions: The authors have contributed as equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by Junta de Extremadura FEDER under Grant GR18060 and Leonardo Grants for Researchers and Cultural Creators awarded by the Fundación BBVA under grant 2021/00203/00.

Conflicts of Interest:
The authors declare no conflict of interest.