Deep Residual Involution Network for Hyperspectral Image Classification

Abstract: Convolutional neural networks (CNNs) have achieved great results in hyperspectral image (HSI) classification in recent years. However, convolution kernels are reused among different spatial locations, known as spatial-agnostic or weight-sharing kernels. Furthermore, the preference for spatial compactness in convolution (typically, a 3 × 3 kernel size) constrains the receptive field and the ability to capture long-range spatial interactions. To mitigate the above two issues, in this article, we combine a novel operation called involution with residual learning and develop a new deep residual involution network (DRIN) for HSI classification. The proposed DRIN could model long-range spatial interactions well by adopting enlarged involution kernels and realize feature learning in a fairly lightweight manner. Moreover, the vast and dynamic involution kernels are distinct over different spatial positions, which could prioritize the informative visual patterns in the spatial domain according to the spectral information of the target pixel. The proposed DRIN achieves better classification results when compared with both traditional machine learning-based and convolution-based methods on four HSI datasets. Especially in comparison with the convolutional baseline model, i.e., the deep residual network (DRN), our involution-powered DRIN model increases the overall classification accuracy by 0.5%, 1.3%, 0.4%, and 2.3% on the University of Pavia, the University of Houston, the Salinas Valley, and the recently released HyRANK HSI benchmark datasets, respectively, demonstrating the potential of involution for HSI classification.


Introduction
Hyperspectral images (HSIs) are three-dimensional (3D) data with hundreds of spectral bands, which contain both spatial information and approximately continuous spectral information. The abundant spatial-spectral information offers the opportunity for accurate discrimination of diverse materials of interest in the observed scenes. Therefore, HSIs have been applied in many fields related to Earth observation (EO), such as geological exploration [1,2], precision agriculture [3], and environmental monitoring [4]. Classification is a basic and important technique in the field of HSI processing, which aims to identify the land-cover category of each pixel in the HSI [5][6][7].
In the early approaches, handcrafted features [8,9] are first extracted from HSIs and then classified using traditional classifiers, e.g., support vector machine (SVM) [10,11]. In general, the feature extraction and classification are implemented separately, and the adaptability between these two processes is not fully considered.
For the past few years, deep learning-based methods have been widely applied in the classification of HSIs and have achieved great success [12]. Since the pioneering work that utilized stacked autoencoders (SAEs) for HSI classification [13], various deep learning models have been explored for accurately classifying hyperspectral data. Experiments on four benchmark datasets show that DRIN is able to achieve better classification performance than other state-of-the-art approaches. Moreover, our proposed DRIN demonstrates the huge potential of designing involution-based neural network architectures for accurate HSI classification, which opens a new window for future research.
The rest of this paper is organized as follows. Section 2 presents the related works of CNN-based HSI classification approaches. Our method is elaborated in Section 3. Section 4 shows the experimental results on four real HSI benchmarks. Section 5 gives a further discussion of the specialty of our approach. Section 6 presents the conclusion and prospect.

Related Works
HSIs are usually made up of hundreds of bands, which record reflectance measurements of various objects at hundreds of different wavelength channels, as shown in Figure 1. In the early study of deep learning-based HSI classification methods, each pixel in an HSI (a high-dimensional vector) is directly fed into deep networks, such as SAE [42], DBN [43], and 1D CNN [44], to extract discriminative spectral features for classification, which ignores the inherent spatial structure of HSIs. After that, in order to correctly identify the land-cover class of each hyperspectral pixel, researchers proposed to utilize not only the unique spectral information of the pixel itself but also the spatial context information and the corresponding spectra from its neighboring pixels [45,46]. To reduce the high redundancy between the spectral bands of HSIs and the complexity of classification, dimensionality reduction methods are often employed, which either transform the raw spectral features from high-dimensional to low-dimensional ones [34] (feature extraction-based methods) or select a band subset from the raw band set [47,48] (band selection-based methods). Meng et al. [34] evaluated the influence of five feature extraction-based dimensionality reduction methods on the HSI classification performance, including principal component analysis (PCA), sparse PCA, independent component analysis (ICA), incremental PCA (iPCA), and singular value decomposition (SVD). Lorenzo et al. [47] proposed to use an attention-based CNN coupled with an anomaly detection technique to select informative bands from HSIs. The CNN architecture was used to extract attention heat maps, which can quantify the importance of specific parts of the spectrum during the training phase, and the anomaly detection technique was utilized to further select the most important bands within an HSI. Zhang et al. [48] proposed a dense spatial-spectral attention network for HSI band selection, in which an embeddable spectral-spatial attention module is developed to adaptively select the bands and pixels that play an important role in classification during the training phase.
CNN-based HSI classification models can take 3D HSI cubes as input directly, which is effective in exploiting spectral-spatial information. In [49], Zhang et al. proposed a lightweight 3D CNN for spectral-spatial HSI classification and introduced two transfer learning strategies to further improve the classification performance. Zhu et al. [50] proposed a deformable convolution-based HSI classification framework, which applies regular convolutions on deformable feature images to extract more effective spatial features. However, when conventional CNN models become deeper, the classification accuracy decreases due to the gradient vanishing and overfitting phenomena.
To ease the training of deeper CNN models, residual learning [51] has been introduced in HSI classification. In [52], a deep residual network model with more than 30 layers is constructed to extract more discriminative features from HSIs. Zhong et al. [53] presented a 3D spectral-spatial residual network, which can extract discriminative features from raw HSI cubes. Meng et al. [54] proposed a wide multipath residual network for HSI classification, which utilizes short and medium neural connections to enable more efficient gradient flow throughout the entire depth of the network. In [55], a deep pyramidal residual network (DPRN) is proposed, which is able to extract more spatial-spectral features from HSIs as the network depth increases. The dense convolutional network [56] extended the idea of the residual network, using a dense connectivity pattern to encourage feature reuse and strengthen feature propagation, alleviating the vanishing-gradient problem and making the network easy to train. In [57], the deep&dense CNN is proposed for pixel-wise HSI classification. Meng et al. [58] developed a fully dense multiscale fusion network (FDMFN), which exploits multiscale feature representations learned from different convolutional layers for HSI classification. Moreover, some advanced CNN architectures that integrate the benefits of the residual network and the densely connected CNN have also been developed for the classification of HSIs [17,59]. In addition, several works proposed to utilize attention-aided CNN models that focus on more discriminative spectral channels and/or spatial positions for HSI classification. For instance, Mou et al. [6] combined a CNN with a spectral attention module for HSI classification, which can adaptively emphasize informative and predictive bands. Wang et al. [60] utilized squeeze-and-excitation modules for adaptive feature refinement, which can excite or suppress features in the spectral and spatial domains simultaneously.
Considering that the number of convolution kernel parameters increases quadratically with the kernel size, modern CNN-based HSI classification models generally restrict the convolution kernel size to no more than 3 × 3 for efficiency [6,17,58-60], which limits the receptive field of the convolution operation. For instance, in [58-60], CNN models with a large number of convolutional layers with 3 × 3 kernel size are used to extract robust spectral-spatial features. Gao et al. [32] mixed 1 × 1 and 3 × 3 kernel sizes into a depthwise convolution operation, which can learn feature maps at different scales and reduce the trainable parameters in the network. Paoletti et al. [33] combined standard 3 × 3 convolution operations with cheap linear operations to build efficient CNN models. Zheng et al. [15] used a fully convolutional network (FCN) to classify HSIs, performing training and inference over the whole image directly. However, the kernel size of the convolutional layers in the FCN is still restricted to 3 × 3, posing challenges for capturing wider spatial context in a single shot. In [55], Paoletti et al. gradually increased the number of kernels across layers and utilized a larger convolutional kernel (i.e., 7 × 7 or 8 × 8) to extract spatial features, which incurs massive trainable parameters and high memory requirements during training. Therefore, how can we harness a large receptive field to exploit the context in a wider spatial extent without introducing unacceptable extra parameters and computations? In addition, convolution kernels are reused among different spatial locations to pursue translation equivariance [61], which also deprives them of the ability to adapt to different visual elements in an HSI. How can we design content-adaptive dynamic filters to extract more discriminative features? Zhu et al. [50] and Nie et al.
[62] proposed to use deformable CNNs to classify HSIs, in which the convolution kernel shape can be adaptively adjusted according to the spatial contexts. However, only the footprint of the convolution kernels is determined in an adaptive fashion, and they also employ small convolution kernels. To overcome the aforementioned issues, a DRIN model is proposed for HSI classification in this work. Figure 2 illustrates the framework of our DRIN model for HSI classification. The proposed DRIN utilizes HSI patches as data input and is mainly constructed by stacking multiple residual involution (RI) blocks, in which the involution operation is the core ingredient. In this section, we first give a brief review of the standard convolution operation. Then, we introduce the involution operation and detail the proposed spectral feature-based dynamic kernel generation function. Finally, the RI block and the DRIN-based HSI classification framework are detailed.

Convolution Operation

Figure 3 gives the diagram of standard convolution. Suppose that the spatial height and width of the input feature maps are H and W, respectively. We denote the input feature maps as X ∈ R^{H×W×C_in}, where C_in indicates the number of input channels. Let F ∈ R^{C_out×C_in×K×K} denote a cohort of C_out convolution filters, where each filter contains C_in convolution kernels and K × K is the kernel size. Specifically, we denote each filter as F_p ∈ R^{C_in×K×K}, p = 1, 2, ..., C_out, and let F_{p,q} ∈ R^{K×K}, q = 1, 2, ..., C_in, denote the convolution kernels contained in each filter. To obtain the output feature maps Y ∈ R^{H×W×C_out}, the convolution filters are applied on the input feature maps and execute multiply-add operations in a sliding-window manner, defined as

Y_{i,j,p} = ∑_{q=1}^{C_in} ∑_{(u,v)∈Ω} F_{p,q,u+⌊K/2⌋,v+⌊K/2⌋} X_{i+u,j+v,q},

where 1 ≤ i ≤ H and 1 ≤ j ≤ W index the spatial positions. Ω denotes the set of offsets in the neighborhood considered by the convolution with respect to position (i, j), written as

Ω = [−⌊K/2⌋, ⌊K/2⌋] × [−⌊K/2⌋, ⌊K/2⌋],

where × indicates the Cartesian product.
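As a concrete illustration, the sliding-window formula above can be written out directly. The following is a minimal NumPy sketch with zero padding, not the paper's implementation; all names are our own:

```python
import numpy as np

def conv2d_naive(X, F):
    """Sliding-window convolution as in the text:
    Y[i,j,p] = sum_q sum_{(u,v) in Omega} F[p,q,u+K//2,v+K//2] * X[i+u,j+v,q].
    X: (H, W, C_in), F: (C_out, C_in, K, K); zero padding keeps the H x W size."""
    H, W, C_in = X.shape
    C_out, _, K, _ = F.shape
    k = K // 2
    Xp = np.pad(X, ((k, k), (k, k), (0, 0)))  # zero-pad the spatial borders
    Y = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = Xp[i:i + K, j:j + K, :]   # K x K x C_in neighborhood of (i, j)
            for p in range(C_out):
                # multiply-add between filter p and the local patch
                Y[i, j, p] = np.sum(F[p].transpose(1, 2, 0) * patch)
    return Y
```

Because the same filter bank F is applied at every position (i, j), the kernels are spatial-agnostic; involution removes exactly this weight sharing.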

Involution
Different from convolution kernels, involution kernels are distinct over different positions in the spatial domain, but could be shared in the channel domain, i.e., they are spatial-specific and channel-agnostic kernels [41].
Let H ∈ R^{H×W×K×K×G} denote the involution kernels, where G denotes the number of groups. Note that the same involution kernel is shared across the channels in each group, which could reduce the number of parameters and hence the computational complexity. Specifically, for each position (i, j), we denote the corresponding involution kernel as H_{i,j,·,·,g} ∈ R^{K×K}, g = 1, 2, ..., G. Analogously, to obtain the output feature maps Y ∈ R^{H×W×C}, involution kernels are applied on the input feature maps and multiply-add operations are performed (see Figure 4), defined as

Y_{i,j,c} = ∑_{(u,v)∈Ω} H_{i,j,u+⌊K/2⌋,v+⌊K/2⌋,⌈cG/C⌉} X_{i+u,j+v,c},

where C represents the input and output channel number and ⌈cG/C⌉ indexes the group to which channel c belongs.

Figure 4. Illustration of the involution. The involution kernel H_{i,j} ∈ R^{K×K×G} is generated from the function φ conditioned on a single spectral feature vector at (i, j). For ease of demonstration, the group number G is set to 1, which means that H_{i,j} is shared among all the channels in this example. ⊗ indicates multiplication broadcasting across C channels, and ⊕ refers to the summation operation, which aggregates features within the K × K spatial neighborhood.
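The involution multiply-add, with its position-specific kernels and channel-group sharing, can be sketched as follows (a NumPy illustration of our own; with 0-based channel indices the group of channel c becomes c·G/C rounded down):

```python
import numpy as np

def involution_apply(X, Hk):
    """Apply position-specific involution kernels:
    Y[i,j,c] = sum_{(u,v)} Hk[i,j,u+K//2,v+K//2,g(c)] * X[i+u,j+v,c],
    where channel c uses the kernel of its group g(c).
    X: (H, W, C), Hk: (H, W, K, K, G); C must be divisible by G."""
    H, W, C = X.shape
    _, _, K, _, G = Hk.shape
    k = K // 2
    Xp = np.pad(X, ((k, k), (k, k), (0, 0)))      # zero padding keeps H x W
    Y = np.zeros_like(X)
    for i in range(H):
        for j in range(W):
            patch = Xp[i:i + K, j:j + K, :]       # (K, K, C) neighborhood
            for c in range(C):
                g = c * G // C                    # group index of channel c
                Y[i, j, c] = np.sum(Hk[i, j, :, :, g] * patch[:, :, c])
    return Y
```

Unlike the convolution sketch, a different K × K kernel is used at every (i, j), so the operation is spatial-specific, while channels in the same group share one kernel.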
As shown in Figure 5, in this work, the involution kernel H_{i,j} is generated solely conditioned on the spectral feature vector X_{i,j} ∈ R^C for efficiency, defined as

H_{i,j} = φ(X_{i,j}) = W_1 δ(BN(W_0 X_{i,j})),

where φ : R^C → R^{K×K×G} denotes the kernel generation function. Specifically, we employ two fully connected (FC) layers for kernel generation, which form a bottleneck architecture (see Figure 5). The first FC layer with parameters W_0 ∈ R^{(C/r)×C} is utilized to reduce the dimensionality of the input spectral feature from C to C/r. The second FC layer with parameters W_1 ∈ R^{(K×K×G)×(C/r)} is used to increase the feature dimensionality to the desired involution kernel size. r is a reduction ratio, which controls the intermediate channel dimension between the two transformations and is used to reduce the parameters of the kernel generation function. BN denotes batch normalization and δ is the rectified linear unit (ReLU). In this way, the involution kernel can learn relationships between the target pixel and its neighboring pixels in an implicit fashion and adaptively allocate weights over different spatial positions, prioritizing informative visual patterns in the spatial extent according to the spectral information of the target pixel. Note that the choice of the reduction ratio r, kernel size K × K, and group number G is discussed in Section 4. After the generation of the involution kernels, the output feature maps can be derived by performing multiply-add operations on the local blocks of the input with their corresponding kernels, as shown in Figure 6. The sliding local blocks can be easily extracted by using the unfold technique in PyTorch [63]. Note that the unfold operation materializes intermediate tensors with the shape of (H × W) × K × K × C, which causes involution to consume more memory resources than standard convolution.
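A minimal sketch of the bottleneck kernel generation function φ is given below (batch normalization is omitted for brevity; the sizes mirror the Salinas setting C = 24, r = 2, K = 9, G = 12, and the variable names are our own):

```python
import numpy as np

def generate_kernel(x, W0, W1, K, G):
    """Bottleneck kernel generation phi: R^C -> R^(K x K x G),
    H_ij = W1 @ relu(W0 @ x); batch normalization is left out of this sketch.
    x: spectral feature vector of one pixel, shape (C,)."""
    z = np.maximum(W0 @ x, 0.0)        # first FC layer: C -> C/r, then ReLU
    return (W1 @ z).reshape(K, K, G)   # second FC layer, reshaped to a kernel

C, r, K, G = 24, 2, 9, 12              # Salinas-style setting from the text
rng = np.random.default_rng(0)
W0 = 0.1 * rng.standard_normal((C // r, C))
W1 = 0.1 * rng.standard_normal((K * K * G, C // r))
kernel = generate_kernel(rng.standard_normal(C), W0, W1, K, G)
assert kernel.shape == (9, 9, 12)      # one K x K kernel per group, per pixel
```

Calling this function at every spatial position yields the full kernel tensor H ∈ R^{H×W×K×K×G}; in practice the two FC layers are applied to all pixels at once as 1 × 1 convolutions.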

RI Block
It has been widely demonstrated that residual learning [51] is helpful for enhancing the information flow throughout the network, alleviating the vanishing/exploding gradient problem effectively. It enables networks to learn deeper discriminative features without sacrificing the feasibility of optimization.
Owing to its simple design principle and elegant architecture, residual learning is introduced in the proposed residual involution (RI) block. Figure 7a illustrates the residual block. There are two types of connections/paths, namely, the feed-forward path and the shortcut path. For each convolutional (Conv) layer in the feed-forward path, its input is the output of the previous Conv layer, and its output is the input of the next Conv layer. The lateral shortcut path performs identity mapping to preserve information across layers, which not only contributes to effective feature reuse but also enables gradients to propagate directly from later layers to earlier ones.
Specifically, a bottleneck residual block with pre-activation is introduced here. As shown in Figure 7a, the kernel size of the first and the last Conv layers is 1 × 1, while that of the middle Conv layer is 3 × 3. For the residual block, the input and output features have the same size and can be aggregated directly. In addition, the 1 × 1 Conv layers are employed to first reduce and then recover the channel number, leaving the middle 3 × 3 Conv layer with fewer input and output channels for efficient processing. Given an input X, the computation process in the residual block is formulated as

F(X) = h(X) + g(X),

where F(X) represents the residual block's output, h(·) is the residual function to be learned during network training, and g(·) denotes the identity mapping, i.e., g(X) = X. Specifically, the residual function h(·) performs a nonlinear transformation and is implemented by executing a series of BN, ReLU, and Conv layers. Note that each Conv layer is preceded by BN and ReLU layers, known as pre-activation [64]. In this work, the RI block is proposed for extracting more discriminative spectral-spatial features, where the 3 × 3 Conv layer in the bottleneck residual block is replaced with an involution layer, as shown in Figure 7b. The 1 × 1 Conv layers are retained and dedicated to spectral feature learning. The involution layer is used to extract key informative spatial features. Compared with the static 3 × 3 convolution, the involution operation can adaptively allocate weights for different spatial positions in an HSI scene and prioritize the key informative visual elements that have positive contributions to the discrimination of the targets. In addition, thanks to the delicately designed involution operation, the proposed RI block can harness a large involution kernel without introducing prohibitive memory cost.
Therefore, it can achieve dynamic reasoning in an enlarged spatial range and capture long-range spatial interactions better than the compact and static 3 × 3 convolution.
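The data flow of the RI block described above can be condensed into a functional sketch (our own NumPy illustration; batch normalization is omitted, a 1 × 1 convolution is written as a per-pixel linear map, and `involve` stands in for the involution layer):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ri_block(X, W_reduce, W_expand, involve):
    """Pre-activation residual involution block (batch norm omitted):
    1x1 conv (reduce channels) -> involution -> 1x1 conv (expand) -> shortcut.
    X: (H, W, C); W_reduce: (C, C_mid); W_expand: (C_mid, C);
    `involve` maps an (H, W, C_mid) tensor to one of the same shape."""
    Z = relu(X) @ W_reduce          # 1x1 conv: C -> C_mid
    Z = involve(relu(Z))            # dynamic spatial aggregation, size-preserving
    Z = relu(Z) @ W_expand          # 1x1 conv: C_mid -> C
    return X + Z                    # identity shortcut: F(X) = g(X) + h(X)
```

With the expansion weights set to zero the residual branch vanishes and the block reduces to the identity, which is exactly the property that keeps deep stacks of such blocks easy to optimize.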

DRIN-Based HSI Classification
Taking the popular Salinas Valley HSI dataset as an example, Table 1 summarizes the topology of the proposed DRIN. As can be seen, the RI blocks adopt a larger, dynamically parameterized involution kernel, which could adaptively summarize the spatial context information in a wider spatial range, thus capturing long-range spatial interactions well. Specifically, in the proposed DRIN (see Figure 2), a Conv layer with 1 × 1 kernel size and 96 filters is first used to reduce the spectral dimension of the original HSI cubes with the size of 11 × 11 × 204, generating feature maps with the size of 11 × 11 × 96. Then, the obtained feature maps are transmitted to three cascaded RI blocks, which are used to further learn discriminative spectral-spatial features. In each RI block, the first 1 × 1 Conv layer is utilized to condense features along the spectral dimension, and the size of the input feature maps is changed from 11 × 11 × 96 to 11 × 11 × 24. Next, the channel-reduced feature maps are processed by the involution layer with 9 × 9 kernel size. For the Salinas Valley dataset, the default reduction ratio r and group number G are 2 and 12, respectively. Since the involution operation does not change the size of the input feature maps, the size of its output remains 11 × 11 × 24. After that, a Conv layer with 1 × 1 kernel size increases the spectral dimension of the feature maps from 24 to 96. Note that the proposed DRIN does not employ any spatial downsampling operation in the RI block, so the spatial resolution of the feature maps remains unchanged, in order to preserve the spatial context information that is important for pixel-level object recognition. Finally, global average pooling (GAP) is adopted to transform the learned features with the size of 11 × 11 × 96 into a 1 × 1 × 96 feature vector. The FC layer takes the obtained feature vector as input and outputs a feature vector with the dimension of c, where c denotes the number of land-cover classes.
For the Salinas Valley HSI, c is 16.
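The tensor sizes described above can be traced end to end in a few lines (random placeholder weights; BN/ReLU are skipped, and the 9 × 9 involution step is elided since it preserves the feature-map size — only the shapes are of interest here):

```python
import numpy as np

# Shape trace of DRIN on one Salinas patch (illustrative, not trained weights).
rng = np.random.default_rng(0)
x = rng.standard_normal((11, 11, 204))            # input HSI cube
x = x @ rng.standard_normal((204, 96))            # 1x1 conv: 204 -> 96 channels
for _ in range(3):                                # three cascaded RI blocks
    z = x @ rng.standard_normal((96, 24))         # 1x1 conv reduces 96 -> 24
    # ... 9x9 involution keeps the 11 x 11 x 24 size unchanged ...
    x = x + z @ rng.standard_normal((24, 96))     # 1x1 conv expands 24 -> 96
assert x.shape == (11, 11, 96)                    # no spatial downsampling
vec = x.mean(axis=(0, 1))                         # global average pooling
logits = vec @ rng.standard_normal((96, 16))      # FC layer, c = 16 classes
assert logits.shape == (16,)
```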
The objective function for training the proposed DRIN is the categorical cross entropy loss, defined as

L = − ∑_{i=1}^{c} y_i log(p_i),

where p_i denotes the output of the final classification layer, that is, the output of the last FC layer with a softmax function, y_i ∈ {0, 1} refers to the label value (y_i = 0 when a sample does not belong to the ith category, and y_i = 1 otherwise), and c denotes the number of land-cover categories in a hyperspectral scene. Considering that the involution operation comprising two FC layers is differentiable, the proposed DRIN can be optimized in the same way as a typical CNN. To be specific, the training procedure of the proposed DRIN lasts for 100 epochs, using the Adam optimizer with a weight decay of 0.0001 and a mini-batch size of 100. In addition, the learning rate starts from 0.001 and gradually approaches zero following a half-cosine-shaped schedule. The code of our DRIN model is released at: https://github.com/zhe-meng/DRIN (accessed on 3 June 2021).
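For reference, the loss and the learning-rate schedule can be stated in a few lines (a sketch of our own; the exact half-cosine form below is an assumption matching the usual cosine decay):

```python
import numpy as np

def cross_entropy(p, y):
    """Categorical cross entropy: L = -sum_i y_i * log(p_i).
    p: softmax outputs, y: one-hot label vector."""
    return -np.sum(y * np.log(p + 1e-12))   # small epsilon for stability

def half_cosine_lr(epoch, total_epochs=100, base_lr=0.001):
    """Half-cosine schedule: base_lr at epoch 0, decaying towards zero."""
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * epoch / total_epochs))
```

With a one-hot label and a confident correct prediction the loss approaches zero, and the learning rate falls smoothly from 0.001 at epoch 0 to zero at epoch 100.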

Datasets Description
To verify the effectiveness of the proposed DRIN, we conducted experiments on four hyperspectral benchmark datasets: the University of Pavia (UP), University of Houston (UH), Salinas Valley (SV), and HyRANK datasets.
(1) UP: It was collected by the ROSIS sensor and contains 610 × 340 spectral samples with 103 bands. The spatial resolution is 1.3 m/pixel and the wavelength range of bands is between 0.43 and 0.86 µm. The corresponding ground truth map consists of nine classes of land cover.
(2) UH: It was captured by the CASI sensor and has 349 × 1905 spectral samples with 144 bands. The spatial resolution is 2.5 m/pixel and the wavelength range of bands is between 0.38 and 1.05 µm. The corresponding ground truth map consists of 15 classes of land cover.
(3) SV: It was gathered by the AVIRIS sensor over Salinas Valley, CA, USA, containing 512 × 217 pixels and 204 available spectral bands. The wavelength range is between 0.4 and 2.5 µm. The spatial resolution is 3.7 m/pixel. The corresponding ground truth map consists of 16 different land-cover classes.
(4) HyRANK: The ISPRS HyRANK dataset is a recently released hyperspectral benchmark dataset. Different from the widely used hyperspectral benchmark datasets that consist of a single hyperspectral scene, the HyRANK dataset comprises two hyperspectral scenes, namely Dioni and Loukia. The available labeled samples in the Dioni scene are used for training, while those in the Loukia scene are used for test. The Dioni and Loukia scenes comprise 250 × 1376 and 249 × 945 spectral samples, respectively, and they have the same number of spectral reflectance bands, i.e., 176.
Note that the widespread random sampling strategy overlooks the spatial dependence between training and test samples, which usually leads to information leakage (i.e., overlap between the training and test HSI patches) and overoptimistic results when performing spectral-spatial classification [65]. To reduce the overlap and select spatially separated samples, researchers propose using spatially disjoint training and test sets to evaluate the HSI classification performance [66,67].
For the UP, UH, and HyRANK datasets, the official disjoint training and test sets were considered. The standard fixed training and test sets for the UP scene are available at: http://dase.grss-ieee.org (accessed on 3 June 2021). The UH dataset is available at: https://www.grss-ieee.org/resources/tutorials/data-fusion-tutorial-in-spanish/ (accessed on 3 June 2021). The HyRANK dataset is available at: https://www2.isprs.org/commissions/comm3/wg4/hyrank/ (accessed on 3 June 2021). Taking the HyRANK dataset as an example, the available labeled samples in the Dioni scene are used for training, while those in the Loukia scene are used for testing. Therefore, there is no information leakage between the patches contained within the training and test sets. Since the spatial distribution of the training and test samples is fixed for these three datasets, we executed our experiments five times over the same split, in order to avoid the influence of the random initialization of network parameters on the performance.
For the SV dataset, since it does not have official spatially disjoint training and test sets, we first randomly selected 30 samples from each class in the ground truth for network training, and the remaining samples were used for testing. Tables 2-5 summarize the detailed information of each category in these four datasets. Figures 8-11 show the false color images and the distribution of the available labeled samples of the four hyperspectral scenes.
The classification performance of all approaches was evaluated quantitatively with the per-class classification accuracy, overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ).
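These metrics can all be computed from a confusion matrix, as in the following standard sketch (our own code, not the paper's):

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA and kappa from a confusion matrix
    (rows: ground truth, columns: predicted class)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                          # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)     # per-class accuracy
    aa = per_class.mean()                            # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

For instance, the 2-class matrix [[9, 1], [2, 8]] gives OA = AA = 0.85 and κ = 0.7.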

Parameters Analysis
The classification performance of our DRIN is affected by the parameter selection to a certain extent. We therefore experimentally analyzed the influence of the main parameters involved in the proposed network, including the involution kernel size, the reduction ratio r, and the group number G; Figure 12 presents the results. When employing a larger involution kernel, the proposed DRIN could utilize spatial context information in a wider spatial range, capturing long-range spatial dependencies for correctly identifying the land-cover type of each pixel. To verify this benefit, DRINs with different involution kernel sizes (i.e., 3 × 3, 5 × 5, 7 × 7, and 9 × 9) were implemented, and the obtained results are shown in Figure 12a. As can be seen, the best OA values are achieved when the involution kernel size is set to 5 × 5, 5 × 5, 9 × 9, and 9 × 9 for the UP, UH, SV, and HyRANK datasets, respectively, which suggests that utilizing a larger involution kernel is helpful for increasing the classification accuracy. Note that the UP and UH scenes contain more detailed regions, so considering too large a spatial context could weaken the information of the target pixel. As for the SV and HyRANK scenes, since they have larger smooth regions, dynamic reasoning in an enlarged spatial range could capture long-range dependencies well and hence offer better classification performance.
The reduction ratio r controls the intermediate channel dimension between the two linear transformations in the kernel generation function φ. An appropriate value of r reduces the parameter count and permits the use of a larger involution kernel under the same budget. Figure 12b illustrates the influence of different values of the reduction ratio r on the OA of our DRIN. For different datasets, the degree of influence and the optimal value are different. Based on the classification outcomes, the optimal values of the reduction ratio r for the UP, UH, SV, and HyRANK datasets are 6, 4, 2, and 4, respectively.
The group number G is also a crucial factor in the feature representation of DRIN. The more groups there are, the more distinct involution kernels are involved for discriminative spatial feature extraction. We select the optimal group number G from {4, 8, 12, 24} for each dataset. As shown in Figure 12c, the proposed DRIN with G = 12 achieves the highest OA on the UP, SV, and HyRANK datasets, and the best performance is obtained with G = 24 on the UH dataset. As a result, to obtain the optimal results, G is set to 12, 24, 12, and 12 for the four datasets, respectively.
Taking the HyRANK dataset as an example, the influence of these three hyperparameters on the number of parameters and the computational complexity (in terms of the number of multiply-accumulate operations (MACs)) was further analyzed. Tables 6-8 show the classification performance, the number of parameters, and the MACs of the proposed DRIN with different involution kernel sizes, G, and r on the HyRANK dataset, respectively. As can be seen in Table 6, increasing the involution kernel size can improve the classification accuracies. Note that the parameter count of convolution kernels increases quadratically with the kernel size. For the involution operation, however, we can harness a large kernel while avoiding too many extra parameters. In addition, we adopt a bottleneck architecture to achieve efficient kernel generation. As shown in Table 7, although aggressive channel reduction (r = 12) significantly reduces the number of parameters and MACs, it harms the classification performance (i.e., the lowest OA score of 49.9% is obtained). As long as r is set within an acceptable range, the proposed DRIN not only obtains good performance but also reduces the parameter count and computational cost. In addition, for the proposed DRIN, we share the involution kernels across different channels to reduce the parameter count and computational cost. The smaller G is, the more channels share the same involution kernel. As shown in Table 8, the non-shared DRIN (G = 24) incurs more parameters and MACs, while obtaining only the second best OA score. A possible reason for this phenomenon is that the limited training samples in the HSI classification task are not enough to train networks with excessive parameters.
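A back-of-the-envelope count illustrates the difference (assuming the bottleneck kernel generation described in Section 3 and ignoring biases and batch-norm parameters; these figures are our own, not taken from Tables 6-8):

```python
def conv_params(K, c_in, c_out):
    """Weights of a K x K convolution layer: K * K * c_in * c_out."""
    return K * K * c_in * c_out

def involution_gen_params(K, C, r, G):
    """Weights of the two FC layers generating K x K x G involution kernels:
    W0 has (C/r) * C entries, W1 has (K*K*G) * (C/r) entries."""
    return C * (C // r) + (C // r) * (K * K * G)

C, r, G = 24, 2, 12                        # Salinas-style setting from the text
print(conv_params(9, C, C))                # 46656 weights for a 9 x 9 convolution
print(involution_gen_params(9, C, r, G))   # 11952 weights generate 9 x 9 kernels
```

Replacing a 9 × 9 convolution over 24 channels by an involution layer thus cuts the layer's weights by roughly a factor of four, and the quadratic dependence on K enters only through the lightweight W_1.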
DRN and the proposed DRIN have a similar architecture. The DRN is regarded as a baseline model and can be obtained by replacing all involution layers in the DRIN with standard 3 × 3 Conv layers. For all network models, the input patch size was fixed to 11 × 11 for a fair comparison. All the network models were implemented with the PyTorch deep-learning framework and accelerated with an NVIDIA GeForce RTX 2080 GPU (with 8 GB of GPU memory).
For the UP dataset, the proposed DRIN achieves the highest OA, as shown in Table 9. To be specific, the OA obtained by our DRIN is 96.4%. In comparison with the traditional machine learning approaches, SVM and EMP, the proposed network significantly increases the OA by 16.4% and 13.2%, respectively. In addition, compared with DenseNet, DPRN, FDMFN, MSRN, LWRN, SSSERN, and DRN, the improvements in κ achieved by the proposed DRIN are 6.7%, 6.0%, 4.9%, 8.1%, 4.2%, 2.7%, and 0.6%, respectively. Table 10 presents the classification results for the UH dataset. The proposed DRIN attains the highest overall metrics. Specifically, the OA, AA, and κ values are 86.5%, 88.6%, and 85.4%, respectively. In comparison with the second best approach, i.e., the MSRN, the performance improvements in the OA, AA, and κ metrics are +0.6%, +0.5%, and +0.6%, respectively. The baseline model DRN is able to achieve satisfactory performance on this dataset. In comparison with it, the proposed involution-powered DRIN achieves higher OA, AA, and κ values, proving the superiority of the larger dynamic involution kernel over the compact and static 3 × 3 convolution kernel. To be specific, our proposed model improves the OA, AA, and κ values from 85.2% to 86.5%, from 87.5% to 88.6%, and from 84.0% to 85.4%, respectively. Table 11 reports the classification results for the SV dataset. SVM shows the worst performance, and EMP performs better than SVM. Thanks to their excellent nonlinear and hierarchical feature extraction ability, the deep learning-based approaches, i.e., DenseNet, DPRN, FDMFN, MSRN, LWRN, SSSERN, DRN, and the proposed DRIN, outperform SVM and EMP on the SV dataset. Specifically, compared to the spectral classification approach SVM, DRIN significantly enhances the OA and κ by about 10%. Besides, the proposed DRIN achieves 96.7% OA, 98.6% AA, and 96.3% κ, which are higher than those obtained by DRN. This again demonstrates the effectiveness and superiority of involution.
The classification accuracies of different approaches for the HyRANK dataset are summarized in Table 12, and the proposed DRIN achieves the highest overall accuracy among all the compared approaches. In comparison with DRN, our DRIN significantly increases the OA by 2.3%. In addition, the AA and κ values obtained by our DRIN are nearly 3% higher than those achieved by DRN. This experiment on the recently released HyRANK benchmark again verifies that the proposed involution-powered DRIN can achieve better classification performance than the convolutional baseline DRN.
The classification maps of different approaches are illustrated in Figures 13-16. Taking the SV dataset as an example, the classification maps generated by SVM and EMP present many noisy scattered points, while the deep learning-based methods largely eliminate such misclassified scattered points (see Figure 15). Clearly, our DRIN's classification map is close to the ground truth map (see Figure 10b), especially in the region of grapes-untrained (Class 8). This is consistent with the quantitative results reported in Table 11, where our DRIN model achieves the highest classification accuracy of 88.7% for the grapes-untrained category.

Focusing on the obtained overall classification accuracies, the proposed involution-based DRIN model consistently outperforms its convolution-based counterpart DRN on the four datasets. Specifically, in comparison with DRN, the proposed network increases the OA by 0.5%, 1.3%, 0.4%, and 2.3% on the UP, UH, SV, and HyRANK datasets, respectively, demonstrating the effectiveness of our DRIN model.

In addition, our proposed DRIN delivers enhanced performance with fewer parameters than its counterpart DRN. Table 13 gives information about the OA and the corresponding model parameters. Specifically, for the UP dataset, DRIN achieves a 0.5% higher OA than DRN while using 26.07% fewer parameters. For the UH dataset, the proposed DRIN significantly increases the OA by 1.3% as compared with the DRN model, while requiring fewer parameters. For the SV dataset, the proposed DRIN with a 3 × 3 involution kernel (denoted as DRIN*) also obtains a higher OA than DRN, with 23.27% fewer parameters. As for the HyRANK dataset, the proposed DRIN* obtains a 1.0% gain in OA over DRN with 14,430 fewer parameters. Moreover, with only 5328 more parameters, the OA of the proposed DRIN is 2.3% better than that of DRN.
Regarding the runtimes, the proposed DRIN consumes more time than the DRN. Taking the UP dataset as an example, the running time of the DRN is 119.54 s, while the proposed DRIN consumes 183.19 s. The reason is that the convolution layer is trained more efficiently than the involution layer on the GPU. In the future, a customized CUDA kernel implementation of the involution operation will be explored to better utilize the computational hardware (i.e., the GPU) and reduce the time cost.
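To make the contrast with convolution concrete, the involution operation can be sketched as follows. This is a minimal NumPy illustration of the core idea (per-pixel, group-shared K × K kernels), not the PyTorch implementation used in our experiments; the kernel generation step, which would produce the kernels from the spectral feature of each target pixel, is omitted here.

```python
import numpy as np

def involution(x, kernels, K, groups):
    """Apply per-pixel involution kernels to a feature map.

    x:       (C, H, W) input features
    kernels: (groups, K, K, H, W) per-pixel kernels; the C // groups
             channels in each group share one kernel
    Returns  (C, H, W) output features.
    """
    C, H, W = x.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    cg = C // groups  # channels per group
    for g in range(groups):
        for i in range(K):
            for j in range(K):
                # shifted neighbourhood, weighted by the (i, j) kernel
                # entry at every spatial position simultaneously
                patch = xp[g * cg:(g + 1) * cg, i:i + H, j:j + W]
                out[g * cg:(g + 1) * cg] += kernels[g, i, j] * patch
    return out
```

Unlike convolution, where one kernel is reused at every location, here each spatial position (h, w) is filtered by its own K × K weights, which is what makes the operation spatial-specific.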
To understand the contributions of the proposed DRIN's two core components, residual learning and the involution operation, we further constructed a plain CNN model (without residual learning) for comparison, obtained by eliminating the skip connections from the DRN model. The classification results obtained by the CNN, DRN, and DRIN are summarized in Table 14. Note that, for a fair comparison, these three networks have the same number of layers and are trained under exactly the same experimental setting. As can be seen in Table 14, in comparison with the CNN, the performance improvements (in terms of OA) obtained by DRN are +2.1%, +3.0%, +0.4%, and +2.1% on the UP, UH, SV, and HyRANK datasets, respectively, which demonstrates that residual learning plays a positive role in enhancing the HSI classification performance. Compared with DRN, the proposed involution-powered DRIN model further increases the OA by +0.5%, +1.3%, +0.4%, and +2.3% on the four datasets, which shows that the involution operation is also useful for improving the HSI classification accuracy. The improvement achieved by the residual learning technique exceeds that obtained by the involution operation on the UP and UH datasets. However, compared with DRN, the proposed involution-powered DRIN significantly increases the OA by +1.3% and +2.3% on the UH and HyRANK datasets, respectively, which demonstrates the great potential of designing involution-based networks. More importantly, our DRIN model, using both techniques, performs the best on all four HSI datasets, demonstrating its effectiveness for HSI classification. In addition, considering that the random sampling strategy usually leads to overoptimistic results, we further compared the proposed DRIN with other deep models on the SV dataset using spatially disjoint training and test samples.
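The ablation above can be summarized with a schematic building block. The following sketch is only an illustration of how removing the skip connection turns a residual block into the plain CNN block; the mapping f is a hypothetical stand-in for the learned convolution (or involution) layers, not our actual network code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def block(x, f, residual=True):
    """One schematic building block. f stands in for the learned mapping
    (e.g. two convolution or involution layers). With residual=True the
    block computes relu(f(x) + x), as in DRN/DRIN; with residual=False
    the skip connection is removed, yielding the plain CNN block used in
    the ablation."""
    y = f(x)
    return relu(y + x if residual else y)
```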
We used a novel sampling strategy [65] to select training and test samples, which avoids overlap between the training and test sets and hence information leakage. We extracted six-fold training-test splits for cross-validation (see Figure 18) and repeated the experiments five times for each fold, i.e., 30 independent Monte Carlo runs for the proposed DRIN, and report the overall classification accuracies. In this experiment, the proportion of training samples in each fold is about 3%, and the training HSI patches can only be selected within the black area shown in Figure 18. For more details of this sampling strategy, please refer to the work of Nalepa et al. [65]. As shown in Table 15, the proposed DRIN again achieves better performance than the other compared approaches. In addition, we further analyzed the influence of the involution kernel size on the classification performance using these six non-overlapping training-test splits. The experimental results are summarized in Table 16. We can easily observe that using a larger involution kernel delivers improvements in accuracy.

In summary, extensive experimental results on four benchmark datasets demonstrate that the proposed DRIN delivers very promising results in terms of increasing classification accuracy and reducing model parameters. In addition, in order to present a global view of the classification performance of the compared deep models, we aggregated the results across all four benchmarks by averaging the OA scores obtained by the different deep models on the four HSI datasets. As can be seen in Figure 19, the proposed DRIN achieves the highest averaged OA score (i.e., 80.8%). Besides, to verify the significance of the differences across the different CNN models, we executed Wilcoxon signed-rank tests over OA. As can be observed in Table 17, the differences between the proposed DRIN and the other models are statistically significant.
For instance, the proposed DRIN delivers significant improvements over DRN in accuracy, according to the Wilcoxon tests at p < 0.005.
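For reference, the statistic underlying these tests can be sketched as follows. This is a simplified illustration (zero differences are dropped and tied ranks are not corrected); in practice, a library routine such as scipy.stats.wilcoxon should be used.

```python
def wilcoxon_w(a, b):
    """Minimal Wilcoxon signed-rank statistic for paired samples a and b.

    Zero differences are dropped and no tie correction is applied; this
    is an illustration of the idea, not a replacement for a statistics
    library. A small W relative to the sample size indicates that the
    paired differences are systematically one-sided."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    # rank the differences 1..n by absolute magnitude
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)
```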

Discussion
Different from the convolution operation, the involution operation adaptively allocates weights over different spatial positions. To verify the effectiveness of the dynamic involution operation, we further compared the proposed DRIN with the baseline DRN model that uses static convolution kernels of the same size for spatial feature learning. As shown in Table 18, the proposed DRIN consistently achieves better performance than DRN while containing fewer parameters, again demonstrating the effectiveness and efficiency of the proposed spectral feature-based dynamic involution kernel. Taking the HyRANK dataset as an example, the proposed DRIN achieves a 3.5% higher OA than DRN while using 3.23× fewer parameters.

Note that the spectral feature-based involution kernels used in the proposed DRIN are distinct over different spatial positions (i.e., spatial-specific), which allows them to prioritize the informative visual elements. For each involution kernel, the sum of its K × K kernel weights can be taken as its representative value [41]. By plotting the representative values at diverse spatial positions, we can construct a heat map to visualize the learned involution kernels. Figure 20 shows the heat maps of the involution kernels learned in the third RI block, separated by groups. As can be seen, different kernels automatically attend to varying semantic concepts for correct pixel recognition. For example, sharp edges or corners, the outlines of different objects, smoother regions, and peripheral parts are highlighted in different heat maps. In addition, the enlarged involution kernel captures long-range spatial interactions, which enhances the utilization of the key spatial context information for target pixel recognition.
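The heat-map construction described above amounts to collapsing each K × K kernel to a single representative value. A minimal NumPy sketch, assuming the learned kernels of one involution layer are stored as an array of shape (groups, K, K, H, W):

```python
import numpy as np

def kernel_heatmaps(kernels):
    """Collapse learned involution kernels into one heat map per group.

    kernels: array of shape (groups, K, K, H, W) holding the per-pixel
             kernels of one involution layer.
    Returns  (groups, H, W): at each spatial position, the sum of the
             K*K kernel weights serves as that kernel's representative
             value [41], suitable for plotting as a heat map.
    """
    return kernels.sum(axis=(1, 2))
```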
Moreover, each involution kernel corresponds to an individual feature channel (similar to the depthwise convolution operation [68]) and different channels could share the same kernel, which enables the involution operation to be implemented in a fairly lightweight manner. Therefore, our proposed involution-based DRIN model could achieve better classification performance than its convolution-based counterpart with fewer parameters.
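The parameter savings can be illustrated by counting weights. The following sketch compares a standard convolution layer with the two 1 × 1 kernel-generation layers of an involution layer as described by Li et al. [41]; the channel width, reduction ratio, and group count used in the example are illustrative assumptions, not necessarily the exact configuration of our DRIN.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution layer (bias omitted)."""
    return c_in * c_out * k * k

def involution_params(channels, k, groups, reduction=4):
    """Weight count of an involution layer's kernel-generation branch:
    two 1 x 1 layers mapping channels -> channels // reduction -> k*k*groups.
    The spatial kernels themselves are generated on the fly, not stored,
    so the cost grows only linearly with k*k rather than with c_in * c_out."""
    hidden = channels // reduction
    return channels * hidden + hidden * k * k * groups

# Even a large 7 x 7 involution can be far cheaper than a 3 x 3 convolution:
print(conv_params(64, 64, 3))       # 36864
print(involution_params(64, 7, 4))  # 4160
```

This linear (rather than quadratic-in-channels) scaling is why enlarging the involution kernel remains affordable while enlarging a convolution kernel is not.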

Conclusions
In this paper, an involution-powered network, DRIN, is proposed for HSI classification. As a notable characteristic, the proposed DRIN could capture long-range spatial interactions through harnessing large involution kernels, while avoiding prohibitive memory consumption. Moreover, the dynamically parameterized involution kernel could adapt to different visual patterns with respect to diverse spatial locations. Experiments conducted on four benchmark datasets demonstrate that DRIN not only offers better performance than the convolution-based counterparts but also outperforms other state-of-the-art HSI classification algorithms.
Future work can be devoted to exploring more sophisticated kernel generation functions to enhance the discriminative feature learning ability of the involution kernel. In addition, we believe that exploring more effective involution-equipped neural networks will aid future research on accurate HSI classification.