Article

SiamPKHT: Hyperspectral Siamese Tracking Based on Pyramid Shuffle Attention and Knowledge Distillation

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(23), 9554; https://doi.org/10.3390/s23239554
Submission received: 31 October 2023 / Revised: 22 November 2023 / Accepted: 29 November 2023 / Published: 1 December 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

Hyperspectral images provide a wealth of spectral and spatial information, offering significant advantages for object tracking. However, Siamese trackers are unable to fully exploit spectral features due to the limited number of hyperspectral videos, and the high-dimensional nature of hyperspectral images complicates the model training process. To address these issues, this article proposes a hyperspectral object tracking (HOT) algorithm called SiamPKHT, which leverages the SiamCAR model by incorporating pyramid shuffle attention (PSA) and knowledge distillation (KD). First, the PSA module employs pyramid convolutions to extract multiscale features. In addition, shuffle attention is adopted to capture relationships between different channels and spatial positions, thereby obtaining features with stronger discriminative power. Second, KD is introduced under the guidance of a pre-trained RGB tracking model, which addresses the problem of overfitting in HOT. Experiments using HOT2022 data indicate that the designed SiamPKHT achieves better performance compared to the baseline method (SiamCAR) and other state-of-the-art HOT algorithms. It also meets real-time requirements, running at 43 frames per second.

1. Introduction

Single object tracking is a significant research direction in the field of computer vision [1,2,3,4], which aims to continuously track a specific object based on information from the initial frame. Visual object tracking has made considerable progress, yet it still encounters challenges, including target occlusion, appearance variations, fast motion, and scale changes. This is largely because RGB-based trackers are limited to the red, green, and blue (RGB) color bands, which constrains their capability for feature representation. Hyperspectral imaging technology, on the other hand, captures the reflected or emitted spectra of objects at various wavelengths, offering more comprehensive spectral information [5] than RGB. This enables hyperspectral tracking to excel in distinguishing objects with subtle color differences and in scenarios where environmental lighting conditions change. Nowadays, it has become easier and more cost-effective to acquire hyperspectral videos (HSVs), providing opportunities for hyperspectral object tracking (HOT). As shown in Figure 1, an HOT algorithm that exploits material information is capable of distinguishing objects with similar appearances.
Early HOT algorithms were based on correlation filtering (CF). Uzkent et al. [7] introduced an HOT method based on deep kernelized correlation filters (DeepHKCF) that effectively tracks airborne targets with an adaptive multimodal hyperspectral sensor. However, the method reduces the available spectral information when processing hyperspectral images (HSIs), thereby degrading tracker performance. Qian et al. [8] utilized convolutional kernels for feature extraction by extracting a collection of patches around the object in each spectral band. However, this approach neglects the correlations that exist between different spectral bands. Xiong et al. [5] proposed a material-based hyperspectral tracker (MHT) that incorporates the spectral-spatial information of HSVs into a multidimensional oriented gradient histogram. This technique enables an accurate representation of the target by utilizing global material features, but its performance degrades in scenes with illumination variations. Zhang et al. [9] took advantage of multi-feature integration using spatial, spectral, and temporal information. Although the integrated feature discriminates the target from the background well, parameter tuning requires a large amount of experimental optimization. Recently, Zhao et al. [10] proposed a method based on pixel-wise spectral matching reduction and deep cascading of spectral textural features, successfully mitigating the effects caused by illumination.
Compared to CF trackers, deep learning (DL)-based trackers demonstrate superior accuracy. Li et al. [11] introduced an HSV tracker based on the Band Attention-aware Ensemble Network (BAENet), which can obtain several groups of spectral bands. However, BAENet tends to lose the target in the case of scale variations. Wang et al. [12] combined band selection with an improved Siamese [3] network. Their genetic optimization model selects three bands according to the optimal joint entropy, and transfer learning (TL) then enhances the anti-deformation capability of the tracker. Nevertheless, the generalizability of this method is not strong enough. Furthermore, Li et al. [13] proposed a Siamese network with Band Attention-based Grouping (SiamBAG), which combines all bands into many 3-channel images to exploit all spectral features, but this is time-consuming. In addition, Li et al. [14] introduced a deep ensemble network using Spectral Self-Expression (SEENet), with which the importance of each spectral band can be obtained. It is worth mentioning that a dynamic aggregation module is used to integrate the prediction results of each recomposed image, suppressing unreliable tracking results from low-importance images.
Today, researchers also address HOT by integrating deep features into CF-based trackers. DL-based trackers rely on training with large amounts of data, whose adequacy has a substantial effect on tracking performance. Most DL-based HOT methods use models developed on RGB datasets as backbone networks. However, models trained on visual images cannot be applied directly to HSVs due to the difference in the number of bands. Generally, two approaches are utilized to deal with this issue. One approach is to convert HSVs to false-color images [13,15]. However, this method loses some spectral information. An alternative strategy decomposes the HSI into a series of images [14,16] by sequentially organizing the spectral bands. However, the strong similarity between adjacent bands makes this information redundant, and the high computational cost poses challenges in meeting real-time requirements. Due to the scarcity of hyperspectral data, fine-tuning for TL tends to compromise the original robustness of the model, rendering the tracker more susceptible to noise or other forms of interference.
Motivated by the above analysis, we introduce an HOT approach that combines pyramid shuffle attention (PSA) and knowledge distillation (KD) in a Siamese framework, namely SiamPKHT. The proposed framework is illustrated in Figure 2 and mainly includes four components: feature extraction, similarity enhancement, target box prediction, and hyperspectral knowledge distillation (HKD). In our previous work [17], we employed a genetic algorithm to identify three spectral bands with the highest joint information entropy. These bands are used to create false-color images, which are then fed into the ResNet-50 network [18]. This step captures essential spectral information while minimizing redundant information in HSVs, thus speeding up the computations. The initial similarity features are then obtained by cross-correlating the template and search region features, and a PSA module is applied to improve them. By integrating multi-scale information into the similarity features and capturing pixel-level pairwise relationships and channel dependencies, the PSA module enables the model to handle the complexity of hyperspectral data more effectively. It also enhances the model's generalization capability, allowing it to better cope with a diverse range of real-world application scenarios. Finally, classification and regression maps are obtained to predict the target state. Moreover, the proposed model incorporates KD to transfer hyperspectral information through TL. A teacher model, trained on a color image dataset [19], serves as a guide in the training phase, and a distillation loss is designed to avoid the overfitting caused by insufficient training samples. Compared to previous works [12,16], employing KD techniques helps the model learn deeper levels of feature abstraction and significantly enhances training efficiency. In summary, the proposed SiamPKHT tracker achieves better generalization.
As shown in Figure 3, the proposed algorithm outperforms other state-of-the-art (SOTA) trackers significantly with respect to the area under the curve (AUC) [20] while meeting the requirements of real-time tracking. The key contributions of this paper are described below.
  • The hyperspectral target tracking algorithm, which uses SiamCAR as the backbone, works at 43 FPS, meeting real-time requirements.
  • A feature enhancement module based on pyramid shuffle attention is designed to increase the representation capability of similarity maps by establishing relationships between features and fusing multiscale information.
  • To effectively alleviate the problem of overfitting, we design a hyperspectral knowledge distillation training approach that benefits from RGB datasets.
  • Comprehensive experiments carried out with the HOT2022 benchmark demonstrate that PSA and HKD can improve the performance of the baseline method. Furthermore, the proposed tracker achieves exceptional performance in all HSVs.
Figure 3. The performance comparison with several SOTA trackers on HOT2022. We visualize the area under the curve (AUC) relative to frames per second (FPS). The radii of the circles represent the speed of the trackers.

2. Related Work

This section reviews three topics relevant to our work: Siamese trackers, attention mechanisms, and knowledge distillation.

2.1. Siamese Trackers

Recently, Siamese networks have been widely used in single-object tracking [3,21,22,23,24,25,26,27,28,29]. The primary idea behind the Siamese tracker is to track a target by comparing similarities between features from the template and the search region. This method employs a Siamese network architecture with shared weights, where two identical networks jointly learn the representation of the target. The Siamese Instance Search Tracker (SINT) [21] is a pioneering Siamese tracker that utilizes dual branches of an identical backbone network to produce feature maps. The fully convolutional Siamese (SiamFC) network [22] is an innovative Siamese tracker that first utilizes a cross-correlation layer to merge information from the two branches. It employs the template features as convolutional kernels to perform convolution over the search area, obtaining similarity features between both parts; this similarity map is used to predict the target position. The Siamese Region Proposal Network (SiamRPN) [3] introduces a region proposal network to achieve more precise tracking at high speed. However, tracking remains challenging when a distractor is similar in appearance to the object of interest. Zhu et al. [24] proposed the Distractor-aware SiamRPN (DaSiamRPN), which utilizes data augmentation strategies to address the imbalanced distribution of the training data. The introduction of deep networks into the Siamese framework for feature extraction was pioneered by SiamRPN++ [23]. In addition, an innovative deep correlation layer proposed in SiamRPN++ can efficiently incorporate information from both branches. These anchor-based methods have to spend considerable time searching for appropriate anchor parameters.
To avoid these issues, anchor-free models have been adopted to directly forecast bounding boxes rather than relying on pre-established anchor boxes. Zhang et al. [30] proposed Ocean, an anchor-free network with object awareness, which estimates the position of the target by forecasting the distance of pixels from the object box. This mechanism enables the learning of object-aware features, thus improving tracking performance. Chen et al. [31] proposed a flexible Siamese Box Adaptive Network (SiamBAN) that allows for end-to-end offline training, thus avoiding parameter tuning for candidate boxes. Furthermore, SiamFC++ [32] adopts a dedicated estimation quality assessment branch, which focuses on high-quality bounding boxes. SiamCAR [6] adds a supplementary center-ness branch alongside the classification branch, which helps suppress low-quality predicted bounding boxes. To avoid the time required for anchor parameter search, we have adopted SiamCAR as the backbone of our approach. Generally speaking, our strategy is applicable to all tracking methods based on Siamese networks, as Siamese networks share common characteristics, such as target feature extraction and similarity measurement. Further experimental validation is needed to confirm this.

2.2. The Attention Mechanism

The study of attention mechanisms can be traced back to early research on visual attention, which focuses on salient objects in visual scenes. As DL has gained popularity, attention mechanisms have been introduced into neural networks, giving models the ability to dynamically focus attention. Recently, attention mechanisms have gained widespread application and research interest in computer vision [33,34,35,36,37,38]. Specifically, Squeeze-and-Excitation Networks (SENet) [39] enhance representational capacity by integrating spatial and channel features in local receptive fields to extract informative features. The Convolutional Block Attention Module (CBAM) [40] effectively enhances intermediate features by sequentially deducing attention maps over both the spatial and the channel dimensions. The Residual Attentional Siamese Network (RASNet) [41] integrates spatial and channel attention to increase the capacity to discern features.
In the field of HOT, attention mechanisms have also found widespread application in enhancing performance. For instance, SiamBAG [13] and BAENet [11] utilize band attention blocks to discern relationships among spectral bands, segmenting HSIs into multi-channel images for tracking tasks. However, this approach can lead to a reduction in tracking speed. In SiamHYPER [42], a spatial-spectral cross-attention module was designed for the fusion of spectral features. However, this module only works on one branch. To obtain cross-band information from HSIs, CBFF-Net [15] introduced a cross-band group attention (CBGA) module. However, the lack of spatial attention in this approach could lead to an inability to capture essential details within specific regions. Inspired by the shuffle attention network (SANet) [43], we design our HOT method with a pyramid shuffle attention module, with the aim of improving similarity features to address challenges related to target scale variations and interference from similar objects.

2.3. Knowledge Distillation

Knowledge distillation has been extensively used for model compression and transfer learning. Its primary objective is to impart the knowledge of a complex model (teacher model) to a simplified model (student model). Hinton et al. [44] presented a groundbreaking KD design, which uses the output of the teacher network to help train the student network. Tarvainen et al. [45] and Liu et al. [46] proposed strategies for distilling knowledge from multiple teacher models to improve the efficiency of KD. Lin et al. [47] introduced a comprehensive framework that utilizes low-rank decomposition to eliminate redundancy between fully connected layers and convolutional layers. Yim et al. [48] adopted an information measurement-based approach to improve the efficiency of knowledge transfer. Sepahvand et al. [49] decomposed the feature representation of the teacher model to obtain a core tensor, allowing the student to better comprehend the knowledge.
In summary, KD has been extensively applied in the field of DL, including in image classification [50,51,52], object detection [53,54,55], and semantic segmentation [56,57,58]. However, there is limited research that investigates the implementation of KD within the realm of hyperspectral target tracking.

3. Methodology

In this section, the proposed HOT method is described in detail, including the Siamese model, the PSA module, and the knowledge distillation of hyperspectral data.

3.1. The SiamCAR Tracker

Crucial to the successful implementation of DL models is the ability to perform offline learning on massive datasets, enabling them to learn complex and intricate relationships from extensively annotated data. Generally, Siamese networks treat tracking as a similarity learning task, which utilizes end-to-end offline training to learn the similarity between target images and search regions. Typically, tracking methods based on Siamese networks consist of two stages: feature extraction and bounding-box prediction. Initially, the template $Z$ and the search region $X$ are fed into two separate branches sharing the same parameters of the backbone network. It is crucial to fuse visual attributes (represented by low-level features) with semantic information from high-level features, which helps to accurately recognize the target. Specifically, the fused feature $\varphi(X) \in \mathbb{R}^{7 \times 7 \times 768}$ of the search region concatenates the outputs of the final three residual blocks of the backbone (ResNet-50) [3], which is represented by
$$ \varphi(X) = \mathcal{C}\big(F_3(X),\, F_4(X),\, F_5(X)\big) \tag{1} $$
where $\mathcal{C}(\cdot)$ denotes the concatenation function and $F_3(X)$, $F_4(X)$, and $F_5(X)$ represent the outputs of the final three blocks of the backbone network. In addition, the fused feature $\varphi(Z)$, which refers to the target template, is obtained similarly.
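As a concrete illustration of this multi-level fusion, the following PyTorch sketch concatenates three backbone outputs into a single feature; the 1 × 1 adjustment layers and per-level channel counts are our own assumptions for illustration, since the text only specifies the concatenated size of 7 × 7 × 768:

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Concatenate the last three backbone outputs into phi(X) (Eq. 1)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs bring every level to the same channel count (assumption).
        self.adjust = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, f3, f4, f5):
        feats = [adj(f) for adj, f in zip(self.adjust, (f3, f4, f5))]
        return torch.cat(feats, dim=1)   # e.g. 7 x 7 x 768 for the search region

# toy usage with a common 7 x 7 spatial size per level
f3, f4, f5 = (torch.randn(1, c, 7, 7) for c in (512, 1024, 2048))
phi_x = MultiLevelFusion()(f3, f4, f5)
print(phi_x.shape)  # torch.Size([1, 768, 7, 7])
```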
In order to incorporate information from the template and search branches, a deep cross-correlation is applied to $\varphi(X)$ and $\varphi(Z)$ to obtain a response feature map $M$ with 256 channels:

$$ M = \varphi(X) \star \varphi(Z) \tag{2} $$

where $\star$ denotes the channel-wise correlation operation. $M$ retains abundant information, which also increases the difficulty of prediction at a later stage.
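The channel-wise correlation of Equation (2) is commonly implemented with grouped convolution, as in the sketch below; this follows the usual depth-wise cross-correlation formulation rather than code released by the authors, and the subsequent reduction to 256 channels would be an extra 1 × 1 convolution that is omitted here:

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Channel-wise correlation (Eq. 2): each template channel is convolved
    with the matching search channel."""
    b, c, h, w = search.shape
    # fold the batch into the channel dimension and use groups so channels do not mix
    search = search.reshape(1, b * c, h, w)
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

# toy usage: 31x31 search features vs. 7x7 template features
resp = depthwise_xcorr(torch.randn(2, 768, 31, 31), torch.randn(2, 768, 7, 7))
print(resp.shape)  # torch.Size([2, 768, 25, 25])
```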
The next stage, target box prediction, performs classification and regression on the extracted features: one branch classifies each location, while the other regresses the bounding box of the target. The classification branch outputs feature maps of size $25 \times 25$ with 2 channels. Each point on the classification feature map holds a 2-D vector representing the scores for the target and the background. The classification loss $L_{cls}$ uses the cross-entropy loss:

$$ L_{cls} = -\frac{1}{N} \sum_{d} g_d \log(p_d) \tag{3} $$

where $N$ represents the total number of points, $g_d$ is the ground-truth label for the $d$-th point, and $p_d$ represents the predicted probability for the $d$-th point.
The regression branch outputs feature maps of size $25 \times 25$ with 4 channels. As a result, each point $(i, j)$ on the regression feature map has a four-dimensional vector $v(i, j) = (l, t, r, b)$, where $l$ denotes the distance between a point within the search area and the left boundary of the real target; similarly, $t$, $r$, and $b$ refer to the upper, right, and lower boundaries, respectively.

The center-ness branch assigns lower scores to points farther away from the target center, concentrating the predicted boxes around the target. This branch outputs a $25 \times 25 \times 1$ feature map, where the value at each point $(i, j)$ corresponds to the centrality score of the corresponding position on the search map. The centrality score $S_{Cen}(i, j)$ at the corresponding position in the predicted feature map is computed by

$$ S_{Cen}(i,j) = Q \times \frac{\min(l,r)}{\max(l,r)} \times \frac{\min(t,b)}{\max(t,b)} \tag{4} $$

where $Q$ takes values of 0 or 1. If a point $(i, j)$ is not within the manually annotated bounding box of the initial frame, $Q$ is 0.
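The centrality score of Equation (4) can be sketched as follows, assuming the regression targets are stored as a (4, H, W) tensor of (l, t, r, b) distances and Q is a binary in-box mask of matching size:

```python
import torch

def centerness_score(ltrb, q_mask):
    """Eq. (4): S_Cen(i, j) for a (4, H, W) tensor of (l, t, r, b) distances."""
    l, t, r, b = ltrb.unbind(dim=0)
    ratio = (torch.minimum(l, r) / torch.maximum(l, r).clamp(min=1e-6)) * \
            (torch.minimum(t, b) / torch.maximum(t, b).clamp(min=1e-6))
    return q_mask * ratio  # zero outside the ground-truth box

# toy usage on a 25 x 25 grid
ltrb = torch.rand(4, 25, 25)
q = (torch.rand(25, 25) > 0.5).float()
print(centerness_score(ltrb, q).shape)  # torch.Size([25, 25])
```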
The loss function $L_{reg}$ of the regression branch is calculated using the intersection over union (IOU) loss [59], $L_{IOU}$, to regress the position of the predicted bounding box, as shown in Equation (5):

$$ L_{reg} = \frac{1}{\sum_{i,j} Q} \sum_{i,j} Q \times L_{IOU}\big(u(x,y),\, v(i,j)\big) \tag{5} $$

where $u(x, y)$ represents the distances from the real target position $(x, y)$ to the four sides of the ground-truth bounding box.
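For illustration, the IOU loss of Equation (5) for boxes expressed in (l, t, r, b) distance form can be written as below; the negative-log-IoU form follows the UnitBox loss cited in [59] and is an assumption about the exact variant used:

```python
import torch

def iou_loss_ltrb(pred, target, eps=1e-6):
    """UnitBox-style IoU loss for (..., 4) tensors of (l, t, r, b) distances,
    both measured from the same grid point."""
    pl, pt, pr, pb = pred.unbind(dim=-1)
    tl, tt, tr, tb = target.unbind(dim=-1)
    area_p = (pl + pr) * (pt + pb)
    area_t = (tl + tr) * (tt + tb)
    # boxes share the anchor point, so overlap widths/heights are min sums
    inter = (torch.minimum(pl, tl) + torch.minimum(pr, tr)) * \
            (torch.minimum(pt, tt) + torch.minimum(pb, tb))
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=eps)
    return -torch.log(iou.clamp(min=eps))

# toy usage
pred = torch.rand(8, 4) + 0.1
target = torch.rand(8, 4) + 0.1
print(iou_loss_ltrb(pred, target).mean())
```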
The loss function for the center-ness branch, $L_{cen}$, is represented by

$$ L_{cen} = \frac{-1}{\sum_{i,j} Q} \sum_{i,j} \Big( S_{Cen}(i,j) \log D_{Cen}(i,j) + \big(1 - S_{Cen}(i,j)\big) \log\big(1 - D_{Cen}(i,j)\big) \Big) \tag{6} $$

where $D_{Cen}(i, j)$ denotes the value at the corresponding position in the center-ness branch map.

Therefore, the final loss $L$ is obtained by

$$ L = L_{cls} + \lambda_1 L_{cen} + \lambda_2 L_{reg} \tag{7} $$

where $\lambda_1$ and $\lambda_2$ are penalty factors.

3.2. The Pyramid Shuffle Attention Module

Significant differences between color images and HSIs make it difficult for most trackers to exploit the abundant spectral information. HSIs increase the complexity of similarity features in Siamese trackers, which in turn limits the performance of the prediction modules. This limitation is particularly evident when dealing with scale variations, resulting in poor tracker performance. To address these issues, we introduce a pyramid shuffle attention module for similarity matching, the specific design of which is depicted in Figure 4.
Firstly, multiscale similarity features extracted through pyramid convolution [60] are grouped together. Next, the shuffle unit [43] is used to incorporate spatial and channel attention into a block. Finally, all sub-features are combined, benefiting from the channel shuffle.
A set of convolution kernels $K_I$ ($I = 1, 2, 3, 4$) with different sizes, $3 \times 3$, $5 \times 5$, $7 \times 7$, and $9 \times 9$, is utilized to obtain fused features at multiple scales, as seen in Figure 5, which can be described as

$$ M_I^{o} = F_{gc}(M, K_I, G_I) \tag{8} $$

where $G_I$ represents the number of groups in the grouped convolution (GConv) $F_{gc}(\cdot)$, which controls the connectivity of the convolution operation. Subsequently, the response feature maps $M_I^{o}$ of each pyramid level are concatenated along the channel dimension:
$$ M^{o} = \mathcal{C}\big(M_1^{o}, M_2^{o}, M_3^{o}, M_4^{o}\big) \tag{9} $$

where the fused feature map $M^{o} \in \mathbb{R}^{V \times H \times W}$, and $V$, $H$, and $W$ respectively represent the number of channels and the spatial height and width. Then, $M^{o}$ is divided into $J$ groups along the channel dimension, that is, $M^{o} = [M_1^{o}, \ldots, M_J^{o}]$, so the shape of each $M_k^{o}$ becomes $(V/J) \times H \times W$. Next, $M_k^{o}$ is split into two branches along the channel dimension, namely $M_{k1}^{o}, M_{k2}^{o} \in \mathbb{R}^{(V/2J) \times H \times W}$. The first branch uses channel attention, while the other branch uses spatial attention.
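A minimal sketch of the pyramidal grouped convolutions in Equations (8) and (9) is given below; the per-level output channels and group numbers are illustrative assumptions, since the text only fixes the kernel sizes:

```python
import torch
import torch.nn as nn

class PyramidConv(nn.Module):
    """Four grouped convolutions with kernels 3/5/7/9, concatenated (Eqs. 8-9)."""
    def __init__(self, in_ch=256, out_ch_per_level=64, groups=(1, 4, 8, 16)):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch_per_level, k, padding=k // 2, groups=g)
            for k, g in zip((3, 5, 7, 9), groups)
        )

    def forward(self, m):
        # padding=k//2 keeps the spatial size so the levels can be concatenated
        return torch.cat([level(m) for level in self.levels], dim=1)

# toy usage on a 25 x 25 similarity map
m = torch.randn(1, 256, 25, 25)
m_o = PyramidConv()(m)
print(m_o.shape)  # torch.Size([1, 256, 25, 25])
```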
For the channel attention branch, we first apply global average pooling (GAP) $F_{GAP}$ to $M_{k1}^{o}$, reducing the feature map's size from $(V/2J) \times H \times W$ to $(V/2J) \times 1 \times 1$. This is performed to integrate the information from each channel and transform it into a global channel importance weight:

$$ F_{GAP}(M_{k1}^{o}) = \frac{1}{H \times W} \sum_{m=1}^{H} \sum_{n=1}^{W} M_{k1}^{o}(m, n) \tag{10} $$

where $F_{GAP}(M_{k1}^{o})$ represents channel-wise statistical data, indicating the average importance of each channel in $M_{k1}^{o}$.
The channel attention $M_{k1}$ is calculated by

$$ M_{k1} = \sigma\big(F_c(F_{GAP}(M_{k1}^{o}))\big) \cdot M_{k1}^{o} \tag{11} $$

where $F_c$ denotes the convolution operation and $\sigma(\cdot)$ is a non-linear activation function.
Channel attention is used to improve the associations between channels, while spatial attention serves to capture the importance of different spatial positions. The spatial information $M_{k2}$ is obtained from the second branch $M_{k2}^{o}$ with group normalization [61], $F_{GN}$:

$$ M_{k2} = \sigma\big(F_c(F_{GN}(M_{k2}^{o}))\big) \cdot M_{k2}^{o} \tag{12} $$
In the end, the outputs of the channel and spatial attention branches are concatenated, that is, $M_k = [M_{k1}, M_{k2}] \in \mathbb{R}^{(V/J) \times H \times W}$. We aggregate all the sub-features $M_k$ and then perform channel shuffling [62] on the aggregated features, which is represented by

$$ M_a^{o} = \mathcal{S}\big(\mathcal{C}(M_1, M_2, \ldots, M_J)\big) \tag{13} $$

where $\mathcal{S}(\cdot)$ denotes channel shuffling (Figure 5), which enhances the information exchange between channels in the aggregated feature. $M_a^{o}$ is the final similarity feature, which provides a more powerful representation for the subsequent prediction of the target boxes.
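The shuffle unit of Equations (10)-(13) can be sketched as a single module, as below; the use of 1 × 1 convolutions for $F_c$ and the group sizes follow SA-Net [43] conventions and are our assumptions:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Eq. (13): interleave channels across groups to mix information."""
    b, c, h, w = x.shape
    return x.reshape(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleAttention(nn.Module):
    """Split each group into a channel branch (Eqs. 10-11) and a spatial
    branch (Eq. 12), then concatenate and shuffle (Eq. 13)."""
    def __init__(self, channels=256, groups=8):
        super().__init__()
        self.groups = groups
        branch_ch = channels // (2 * groups)
        self.gap = nn.AdaptiveAvgPool2d(1)                     # F_GAP
        self.fc_channel = nn.Conv2d(branch_ch, branch_ch, 1)   # F_c (channel branch)
        self.gn = nn.GroupNorm(branch_ch, branch_ch)           # F_GN
        self.fc_spatial = nn.Conv2d(branch_ch, branch_ch, 1)   # F_c (spatial branch)

    def forward(self, m):
        b, c, h, w = m.shape
        m = m.reshape(b * self.groups, c // self.groups, h, w)
        m1, m2 = m.chunk(2, dim=1)
        # channel attention: global pooling -> conv -> sigmoid gate
        m1 = torch.sigmoid(self.fc_channel(self.gap(m1))) * m1
        # spatial attention: group norm -> conv -> sigmoid gate
        m2 = torch.sigmoid(self.fc_spatial(self.gn(m2))) * m2
        out = torch.cat([m1, m2], dim=1).reshape(b, c, h, w)
        return channel_shuffle(out, self.groups)

# toy usage on the fused similarity map
x = torch.randn(1, 256, 25, 25)
print(ShuffleAttention()(x).shape)  # torch.Size([1, 256, 25, 25])
```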

3.3. Hyperspectral Knowledge Distillation

Due to the insufficient training data in the field of hyperspectral imaging, the overfitting problem easily arises when training DL-based trackers. To address this challenge, we designed a novel hyperspectral knowledge distillation approach aimed at transferring knowledge from an RGB model to a hyperspectral model. The objective is to improve the performance and generalizability of the hyperspectral model in situations where data availability is limited. HKD is a special knowledge distillation technique in which the teacher and student models share the same network architecture. As shown in Figure 6, we employ a tracking model trained on a comprehensive RGB dataset as the teacher model.
The student model designed in our study is an HOT model. In the realm of HKD, the process of training the student model is directed by the teacher model through the utilization of activation values derived from its output layer. In this study, we consider the final output of the prediction head in the tracking model to be logits encompassing both the classification and regression branches. Therefore, the output of the teacher model is regarded as soft targets, and the student model is trained by minimizing the discrepancy between its own output and that of the teacher model.
In the classification branch, we treat the classification results of the RGB tracking model as soft labels, which are used to guide the training of the HOT model. Furthermore, a temperature parameter $T$ is introduced to soften the classification outputs, allowing them to carry a greater amount of information. The soft classification loss function $L_{cls}^{soft}$ can be calculated according to the Kullback–Leibler divergence [63]:

$$ L_{cls}^{soft} = T^2 \times \sum \mathrm{SM}\big(A_t^{cls}/T\big) \log\!\left(\frac{\mathrm{SM}(A_t^{cls}/T)}{\mathrm{SM}(A_s^{cls}/T)}\right) \tag{14} $$

where $\mathrm{SM}$ denotes the softmax function, and $A_s^{cls}$ and $A_t^{cls}$ are the classification branch outputs of the student and teacher models, respectively.
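A sketch of the temperature-scaled soft classification loss in Equation (14), written with the standard KL-divergence helper; treating the teacher logits as fixed (detached) soft targets is our reading of the text:

```python
import torch
import torch.nn.functional as F

def soft_classification_loss(student_logits, teacher_logits, T=2.0):
    """Eq. (14): T^2 * KL( softmax(teacher/T) || softmax(student/T) )."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # "batchmean" sums over classes and averages over samples
    return (T ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# toy usage: 2-class logits for each point on a 25 x 25 map
s = torch.randn(25 * 25, 2)
t = torch.randn(25 * 25, 2)
print(soft_classification_loss(s, t, T=2.0))
```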
Subsequently, the soft regression loss is obtained by
$$ L_{reg}^{soft} = \frac{1}{\sum_{i,j} Q} \sum_{i,j} Q \times L_{IOU}\big(u(x,y),\, v^{soft}(i,j)\big) \tag{15} $$

where $v^{soft}(i, j)$ represents the distances between the feature map position $(i, j)$ and the four edges of the soft target.
Similarly, the calculation of centrality loss is represented by
$$ L_{cen}^{soft} = \frac{-1}{\sum_{i,j} Q} \sum_{i,j} \Big( S_{Cen}^{soft}(i,j) \log D_{Cen}(i,j) + \big(1 - S_{Cen}^{soft}(i,j)\big) \log\big(1 - D_{Cen}(i,j)\big) \Big) \tag{16} $$

where $S_{Cen}^{soft}(i, j)$ represents the centrality score calculated using the soft target.
Therefore, the total loss calculation is formulated as
$$ L_{total} = \lambda_3 L + \lambda_4 \big(L_{cls}^{soft} + L_{cen}^{soft} + 3 L_{reg}^{soft}\big) \tag{17} $$

where the hard loss $L$ is calculated by Equation (7), and $\lambda_3$ and $\lambda_4$ denote penalty factors.
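The weighting of Equation (17) then amounts to a simple combination of the hard and soft terms; the scalar loss values below are placeholders purely to show the weighting:

```python
import torch

def total_loss(hard_loss, soft_cls, soft_cen, soft_reg, lam3=0.9, lam4=0.1):
    """Eq. (17): weighted combination of the hard loss (Eq. 7) and soft losses."""
    return lam3 * hard_loss + lam4 * (soft_cls + soft_cen + 3.0 * soft_reg)

# placeholder scalar losses just to illustrate the weighting
print(total_loss(torch.tensor(1.2), torch.tensor(0.4),
                 torch.tensor(0.3), torch.tensor(0.5)))
```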
To illustrate the effect of HKD, we visualize the response maps under three different training strategies for the toy1, kangaroo, and coke videos in Figure 7. In the response map of the baseline algorithm, similar objects also exhibit high response values, which may cause the tracker to drift. When training the tracker using the TL and HKD methods, a significant reduction in the response to distracting objects is observed, and the effect is more pronounced with the HKD training method.

4. Experiments

4.1. Experimental Data

The GOT10K [19] dataset was used as the training set for the teacher model in this article. The dataset consists of over 1000 unique color video sequences, totaling more than 180,000 frames. For the training of the HOT model, the HOT2022 [5] dataset was used. This was provided by the Hyperspectral Object Tracking 2022 (HOT2022) Challenge, available at hsitracking.com (accessed on 28 November 2023). The dataset consists of 40 training sequences and 35 test sequences, with a mean length of 500 frames per sequence. It encompasses a diverse range of objects, scenes, and activities. Additionally, each video is annotated with 11 challenging attributes [20], covering factors such as occlusion (OCC), out-of-view (OV), background clutter (BC), deformation (DEF), motion blur (MB), fast motion (FM), low resolution (LR), out-of-plane rotation (OPR), illumination variation (IV), in-plane rotation (IPR), and scale variation (SV). These datasets provide rich, diverse, and challenging benchmarks that contribute to the advancement of HOT technology.

4.2. Experimental Setup

The proposed method was implemented in PyTorch and run on an RTX 3080 GPU. We set the input sizes of the template and search area to 127 and 255 pixels, respectively. We used ResNet-50 [18] as our Siamese subnetwork; it was pre-trained on ImageNet [64], and its parameters were retrained within our model. For the teacher model, the batch size was set to 80, the number of epochs was 20, and the initial learning rate (LR) was 0.001. During the initial 10 epochs, the parameters of the Siamese subnetwork were kept fixed while the classification and regression subnetworks were trained. In the concluding 10 epochs, a concurrent training approach was adopted, which involved unfreezing the last three blocks of ResNet-50. The teacher model was trained with the GOT-10K dataset, while the HOT2022 hyperspectral dataset was utilized to train our tracker; for the latter, the LR was changed to 0.0005. $T$ in Equation (14) was set to 2. In the loss function, $\lambda_1 = 1$, $\lambda_2 = 3$, $\lambda_3 = 0.9$, and $\lambda_4 = 0.1$. The testing details were identical to those in [6], employing an offline tracking strategy: only the object in the initial frame of each video was used as the template.
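The staged freezing schedule described above can be sketched as follows; the torchvision layer names (and its weights API) are assumptions, since the text only specifies that the backbone is frozen for the first 10 epochs and its last three blocks are unfrozen afterwards:

```python
import torchvision

# torchvision >= 0.13 API; the paper initializes from ImageNet weights
backbone = torchvision.models.resnet50(weights=None)

def set_trainable(epoch, unfreeze_epoch=10):
    """Freeze the whole backbone for the first 10 epochs; afterwards unfreeze
    only its last three blocks (layer2-layer4 in torchvision naming)."""
    for name, param in backbone.named_parameters():
        if epoch < unfreeze_epoch:
            param.requires_grad = False
        else:
            param.requires_grad = name.startswith(("layer2", "layer3", "layer4"))

set_trainable(epoch=0)
print(sum(p.requires_grad for p in backbone.parameters()))   # 0: fully frozen
set_trainable(epoch=15)
print(sum(p.requires_grad for p in backbone.parameters()))   # only the last blocks
```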

4.3. Experimental Results and Analysis

4.3.1. Qualitative Comparison

We conducted a qualitative comparison between SOTA trackers, including DaSiamRPN [24], SiamRPN++ [23], DeepHKCF [7], BS-SiamRPN [65], ADSiamRPN [12], MHT [5], MFIHVT [9], and BAENet [11]. The first two are RGB trackers, while the rest are hyperspectral trackers. BAENet is a method that divides HSIs into multiple three-channel images by learning the nonlinear relationship between different spectral bands and generating band weights. It employs ensemble learning to fuse weak trackers, enabling end-to-end training and achieving excellent performance. MFIHVT [9] uses a pre-trained VGG-19 network and an oriented gradient histogram to extract convolutional features, which are merged to produce the ultimate feature representation. MFIHVT employs a kernelized correlation filter framework for target detection in hyperspectral videos and utilizes adaptive weight coefficients to fuse response maps generated from different features. Additionally, MFIHVT incorporates scale estimation and dimensionality reduction strategies to improve tracking robustness and efficiency. MHT [5] integrates spatial spectral information into multidimensional gradient histograms and global material features to construct hand-made features, which are combined with the BACF [66] framework to fuse material-based features and locate the target accurately. MHT effectively encodes the spatial spectral structure and material composition of HSIs using material information, thereby improving tracking performance. ADSiamRPN [12] uses a genetic optimization algorithm to select informative spectral bands. Additionally, it utilizes the SiamRPN model to extract hyperspectral features for target localization and classification. In addition, HSUpdateNet, a designed network, is introduced to obtain accumulated templates and address target deformation issues. BS-SiamRPN [65] utilizes intelligent optimization algorithms to determine key spectral bands. It employs TL to acquire semantic information and utilizes specific spectral band information as a network input for target matching. DeepHKCF [7] is an HOT based on deep kernel-based correlation filtering. It involves the transformation of HSIs into false-color images, followed by the extraction of deep features with a VGG-19 network. SiamRPN++ [23] is a Siamese network-based tracking algorithm that uses deep models to learn similarity maps. Deep network architectures and feature pyramid networks can enhance detection performance. The primary contribution of SiamRPN++ lies in its effective reduction in the impact of negative samples through the hard negative mining training strategy, improving the tracker’s performance. The DaSiamRPN method [24] improves the discriminative capacity of the Siamese network, introduces an interference-aware module to suppress the influence of distractors, and utilizes a local-to-global search strategy to achieve offline training and online tracking objectives.
Figure 8 presents the comparative result of these trackers in six challenging video sequences, namely car2, forest2, paper, pedestrian2, rider1, and student. The resolution and frame count of the six sequences are shown in Table 1. In the car2 video, a white car approaches from a distance, gradually increasing in size. Simultaneously, there are two similar cars parked on the roadside. Both SiamRPN++ and DeepHKCF fail to achieve effective discrimination and separation between targets and distractors. This finding demonstrates that hyperspectral features can help to effectively distinguish color-similar distractors. The DeepHKCF model is based on a single depth feature, resulting in the insufficient use of hyperspectral features. In frame #120, it is evident that the aspect ratio of the white car changes, causing all algorithms except our proposed one to fail in accurately enclosing the target. Our algorithm’s success is attributed to the PSA module, which fuses multi-scale features into the similarity map. This enables the tracker to more accurately capture target changes.
Within the forest2 video, the target subject, wearing a green jacket, navigates through a wooded area. When the target is occluded, the lack of spectral information in DaSiamRPN and SiamRPN++ results in the loss of the target. After frame #313, both BAENet and MFIHVT fail to maintain target tracking and are unable to reestablish tracking in subsequent frames.
In the paper video, a blank sheet of paper continuously flips and moves against a similar background. When the target undergoes frequent changes, most of the trackers do not respond adequately to these changes. However, our anchor-free tracker is not limited by specific anchor boxes. Therefore, it exhibits good robustness to target deformations, rotations, and other appearance variations, allowing accurate target tracking in complex scenarios.
In the pedestrian2 video, the pedestrian in front of the tree transitions from bright to dark. In the subsequent target motion, there are occurrences of tree occlusion, causing the target to temporarily disappear from the screen, which presents significant challenges for the trackers. During this tracking process, BAENet struggles to avoid interference from similar objects. When the target experiences occlusion, the HOTs (MFIHVT and MHT) also exhibit tracking migration, making it difficult to reestablish tracking in the following frames. Our method can track the target again even after the target is blocked. This is because the training method of HKD uses color data to learn general computer vision knowledge, such as context information and semantic information.
Within the rider1 video, the target encounters challenges of illumination variations, scale changes, and low resolution. The DeepHKCF tracker struggles to deal effectively with scale variation because it is reliant upon a sole depth feature. BAENet lost the target after 292 frames due to interference in the redundant bands, which caused the tracker to not be able to discriminate similar objects well enough. Due to the enhanced similarity features of the PSA module, our tracker can distinguish targets even in complex environments.
In the student video, a student target continues to move; its shape changes and gradually fades into darkness. All trackers were able to track the target, but after 188 frames, none of the trackers could fit the target except our tracker and BS-SiamRPN. In summary, the proposed tracker shows robustness in handling these challenges and achieves good tracking performance in this video.

4.3.2. Quantitative Comparison

In this section, we use four comprehensive evaluation metrics, namely precision plots, success plots, the distance precision at a threshold of 20 pixels (DP@20P), and the AUC. The success plot illustrates the percentage of frames whose overlap score (OS) is greater than a designated threshold, where the OS is the IOU between the predicted bounding box and the ground-truth bounding box. The precision plot is the percentage of video frames in which the distance between the centers of the predicted and ground-truth bounding boxes is less than a given threshold, and DP@20P denotes the value of the precision plot when the threshold is set to 20 pixels. The AUC computes the area beneath the success plot curve, reflecting the tracker's average performance across various overlap ratios. All results are obtained by one-pass evaluation (OPE), which initializes the tracker using the target ground truth from the first frame of the tracking sequence.
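The evaluation protocol can be sketched as follows; averaging the success curve over uniformly sampled overlap thresholds to obtain the AUC follows the common OTB convention and is an assumption here:

```python
import numpy as np

def success_precision(overlap_scores, center_errors):
    """Per-sequence OPE metrics: success plot, precision plot, AUC, DP@20P."""
    iou_thresholds = np.linspace(0.0, 1.0, 21)
    dist_thresholds = np.arange(0, 51)
    success = np.array([(overlap_scores >= t).mean() for t in iou_thresholds])
    precision = np.array([(center_errors <= t).mean() for t in dist_thresholds])
    auc = success.mean()               # area under the success curve
    dp_at_20 = precision[20]           # precision at the 20-pixel threshold
    return success, precision, auc, dp_at_20

# toy usage with simulated per-frame overlap (IoU) and center distances (pixels)
rng = np.random.default_rng(0)
ious = rng.uniform(0.0, 1.0, size=500)
dists = rng.uniform(0.0, 50.0, size=500)
_, _, auc, dp20 = success_precision(ious, dists)
print(round(auc, 3), round(dp20, 3))
```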
The success and precision plots of these trackers are shown in Figure 9. As shown in Table 2, compared to the BAENet method, SiamPKHT achieves a 2.4% increase in the success rate and a 1.4% improvement in precision. Although the precision of MFIHVT is slightly higher than that of our method, its success rate is 0.03 lower. Moreover, MFIHVT suffers from a slow tracking speed and does not meet real-time requirements. To gain deeper insight into the performance of SiamPKHT, we evaluated the 9 trackers on the 11 attributes of the sequences. Success plots of these trackers for the 11 distinct attributes are shown in Figure 10. In Table 3 and Table 4, it can be observed that SiamPKHT performs best in several attributes (DEF, IV, LR, OCC, OV, and SV) and second best in FM and MB in terms of both the success rate and the precision. This is due to the guidance of a large color dataset during the HKD process, which helps the model learn common features and capabilities. In Figure 11a, it can also be seen that SiamPKHT exhibits excellent performance in all challenges, particularly LR and SV, where it significantly outperforms the other SOTA tracking algorithms. This is attributed to the pyramid shuffle attention module, which enhances the discriminability of the similarity features and captures changes in target appearance by fusing information from different scales.

4.3.3. The Ablation Study

To assess the effectiveness of each part of the SiamPKHT tracker, an ablation study was performed on the HOT2022 benchmark. The five different testing models were as follows.
  • Baseline: The SiamCAR model trained solely on the GOT10K dataset was used as the baseline model.
  • Baseline-PSA: The baseline exclusively utilizing the PSA module.
  • Baseline-PSA-TL: The baseline with both the PSA module and transfer learning.
  • Baseline-PSA-SelfKD: The baseline with both the PSA module and self-knowledge distillation.
  • Baseline-PSA-HKD: The baseline with both the PSA module and hyperspectral knowledge distillation.
In the baseline model, the tracker was first trained on the GOT10K dataset and then tested using false-color sequences from HOT2022. In the Baseline-PSA model, only the PSA module was utilized without employing TL with hyperspectral data. The Baseline-PSA-TL model fine-tuned the Baseline-PSA model using hyperspectral data and served as the student model in the Baseline-PSA-HKD model. The Baseline-PSA-SelfKD model incorporated hyperspectral data and applied self-knowledge distillation [67] for TL from the Baseline-PSA model. Finally, the Baseline-PSA-HKD model combined all the proposed tracker components.
The evaluation curves and the effectiveness of each component of the proposed method are depicted in Figure 12. Compared to the first two models, models 3, 4, and 5 exhibit better AUC performance because they take advantage of the hyperspectral data. In Table 5, with PSA, our method obtains a better result (+0.018) with respect to the AUC score. Furthermore, it can be seen that the integration of PSA and HKD yields a notable increase (+0.032) in the AUC score, improving overall performance. We attribute these benefits to the inclusion of more spectral information in HSIs compared to color images. The additional spectral information provides a richer feature representation, which helps the tracker better understand the nature of the observed objects. Due to the complexity of HSVs, traditional prediction and decoding methods may struggle to effectively extract accurate information about targets. The proposed PSA module explores feature information at different depths along the channel dimension, allowing the tracker to more fully utilize the abundant information in the HSVs during the subsequent prediction stage. In addition, the HKD module overcomes the limitation of limited hyperspectral data, which mitigates the risk of overfitting in transfer learning and effectively extracts valuable information from the hyperspectral data.

5. Conclusions and Future Work

In this article, we introduced the SiamPKHT method, which integrates pyramid shuffle attention (PSA) and hyperspectral knowledge distillation (HKD) into the SiamCAR framework for enhanced object tracking. The PSA module significantly improves the model's perception capabilities, while HKD addresses overfitting issues due to limited hyperspectral data, thereby improving the tracker's generalizability. Our experimental results on HOT2022 demonstrate that SiamPKHT outperforms the baseline tracker and shows substantial competitiveness against state-of-the-art (SOTA) HOT methods, achieving a tracking speed of 43 FPS, which is suitable for real-time applications.
However, the proposed SiamPKHT, despite its promising performance, encounters challenges in scenarios involving occlusion and low-resolution environments. These limitations highlight the need for further research in these areas. Future work will also focus on refining feature extraction techniques for hyperspectral videos (HSVs), aiming to mitigate the limitations posed by insufficient visible data. This exploration is crucial for advancing the field and ensuring that our method remains relevant and effective in a broader range of tracking scenarios.

Author Contributions

Conceptualization, K.Q.; data curation, S.W. and S.Z.; investigation, S.Z.; methodology, K.Q. and S.W.; resources, S.W.; software, K.Q., S.W. and J.S.; supervision, K.Q. and J.S.; visualization, S.Z.; writing—original draft, K.Q.; writing—review and editing, K.Q. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China (62076110) and the Natural Science Foundation of Jiangsu Province (BK20181341).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: New York, NY, USA, 2010; pp. 2544–2550. [Google Scholar] [CrossRef]
  2. Dai, K.; Wang, D.; Lu, H.; Sun, C.; Li, J. Visual tracking via adaptive spatially-regularized correlation filters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4670–4679. [Google Scholar] [CrossRef]
  3. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar] [CrossRef]
  4. Yang, K.; He, Z.; Zhou, Z.; Fan, N. SiamAtt: Siamese attention network for visual tracking. Knowl.-Based Syst. 2020, 203, 106079. [Google Scholar] [CrossRef]
  5. Xiong, F.; Zhou, J.; Qian, Y. Material based object tracking in hyperspectral videos. IEEE Trans. Image Process. 2020, 29, 3719–3733. [Google Scholar] [CrossRef]
  6. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar] [CrossRef]
  7. Uzkent, B.; Rangnekar, A.; Hoffman, M.J. Tracking in aerial hyperspectral videos using deep kernelized correlation filters. IEEE Trans. Geosci. Remote Sens. 2018, 57, 449–461. [Google Scholar] [CrossRef]
  8. Qian, K.; Zhou, J.; Xiong, F.; Zhou, H.; Du, J. Object tracking in hyperspectral videos with convolutional features and kernelized correlation filter. In Proceedings of the Smart Multimedia: First International Conference, ICSM 2018, Toulon, France, 24–26 August 2018; Revised Selected Papers 1. Springer: Cham, Switzerland, 2018; pp. 308–319. [Google Scholar]
  9. Zhang, Z.; Qian, K.; Du, J.; Zhou, H. Multi-features integration based hyperspectral videos tracker. In Proceedings of the 2021 11th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Amsterdam, The Netherlands, 24–26 March 2021; IEEE: New York, NY, USA, 2021; pp. 1–5. [Google Scholar] [CrossRef]
  10. Zhao, D.; Zhu, X.; Zhang, Z.; Arun, P.V.; Cao, J.; Wang, Q.; Zhou, H.; Jiang, H.; Hu, J.; Qian, K. Hyperspectral video target tracking based on pixel-wise spectral matching reduction and deep spectral cascading texture features. Signal Process. 2023, 209, 109033. [Google Scholar] [CrossRef]
  11. Li, Z.; Xiong, F.; Zhou, J.; Wang, J.; Lu, J.; Qian, Y. BAE-Net: A band attention aware ensemble network for hyperspectral object tracking. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: New York, NY, USA, 2020; pp. 2106–2110. [Google Scholar] [CrossRef]
  12. Wang, S.; Qian, K.; Shen, J.; Ma, H.; Chen, P. AD-SiamRPN: Anti-Deformation Object Tracking via an Improved Siamese Region Proposal Network on Hyperspectral Videos. Remote Sens. 2023, 15, 1731. [Google Scholar] [CrossRef]
  13. Li, W.; Hou, Z.; Zhou, J.; Tao, R. SiamBAG: Band Attention Grouping-based Siamese Object Tracking Network for Hyperspectral Videos. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5514712. [Google Scholar] [CrossRef]
  14. Li, Z.; Xiong, F.; Zhou, J.; Lu, J.; Qian, Y. Learning a Deep Ensemble Network with Band Importance for Hyperspectral Object Tracking. IEEE Trans. Image Process. 2023, 32, 2901–2914. [Google Scholar] [CrossRef]
  15. Gao, L.; Liu, P.; Jiang, Y.; Xie, W.; Lei, J.; Li, Y.; Du, Q. CBFF-Net: A New Framework for Efficient and Accurate Hyperspectral Object Tracking. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5506114. [Google Scholar] [CrossRef]
  16. Lei, J.; Liu, P.; Xie, W.; Gao, L.; Li, Y.; Du, Q. Spatial–spectral cross-correlation embedded dual-transfer network for object tracking using hyperspectral videos. Remote Sens. 2022, 14, 3512. [Google Scholar] [CrossRef]
  17. Qian, K.; Chen, P.; Zhao, D. GOMT: Multispectral video tracking based on genetic optimization and multi-features integration. IET Image Process. 2023, 17, 1578–1589. [Google Scholar] [CrossRef]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  19. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  20. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar] [CrossRef]
  21. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429. [Google Scholar] [CrossRef]
  22. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10, 15–16 October 2016; Proceedings, Part II 14. Springer: Cham, Switzerland, 2016; pp. 850–865. [Google Scholar]
  23. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar] [CrossRef]
  24. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  25. Zhang, Z.; Liu, Y.; Wang, X.; Li, B.; Hu, W. Learn to match: Automatic matching network design for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13339–13348. [Google Scholar] [CrossRef]
  26. Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.v.d.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019. [Google Scholar] [CrossRef]
  27. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar] [CrossRef]
  28. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar] [CrossRef]
  29. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar] [CrossRef]
  30. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16. Springer: Cham, Switzerland, 2020; pp. 771–787. [Google Scholar]
  31. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar] [CrossRef]
  32. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar] [CrossRef]
  33. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef]
  34. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  35. Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. Ocnet: Object context network for scene parsing. arXiv 2018, arXiv:1809.00916. [Google Scholar]
  36. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar] [CrossRef]
  37. Yang, J.; Ren, P.; Zhang, D.; Chen, D.; Wen, F.; Li, H.; Hua, G. Neural aggregation network for video face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4362–4371. [Google Scholar] [CrossRef]
  38. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar] [CrossRef]
  39. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  40. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  41. Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4854–4863. [Google Scholar] [CrossRef]
  42. Liu, Z.; Wang, X.; Zhong, Y.; Shu, M.; Sun, C. SiamHYPER: Learning a hyperspectral object tracker from an RGB-based tracker. IEEE Trans. Image Process. 2022, 31, 7116–7129. [Google Scholar] [CrossRef] [PubMed]
  43. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 2235–2239. [Google Scholar] [CrossRef]
  44. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  45. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Advances in Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  46. Liu, I.J.; Peng, J.; Schwing, A.G. Knowledge flow: Improve upon your teachers. arXiv 2019, arXiv:1904.05878. [Google Scholar]
  47. Lin, S.; Ji, R.; Chen, C.; Tao, D.; Luo, J. Holistic cnn compression via low-rank decomposition with knowledge transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2889–2905. [Google Scholar] [CrossRef]
  48. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141. [Google Scholar] [CrossRef]
  49. Sepahvand, M.; Abdali-Mohammadi, F.; Taherkordi, A. An adaptive teacher–student learning algorithm with decomposed knowledge distillation for on-edge intelligence. Eng. Appl. Artif. Intell. 2023, 117, 105560. [Google Scholar] [CrossRef]
  50. Gao, W.; Xu, C.; Li, G.; Zhang, Y.; Bai, N.; Li, M. Cervical Cell Image Classification-Based Knowledge Distillation. Biomimetics 2022, 7, 195. [Google Scholar] [CrossRef]
  51. Zhao, H.; Sun, X.; Gao, F.; Dong, J. Pair-Wise Similarity Knowledge Distillation for RSI Scene Classification. Remote Sens. 2022, 14, 2483. [Google Scholar] [CrossRef]
  52. Kim, S. A virtual knowledge distillation via conditional GAN. IEEE Access 2022, 10, 34766–34778. [Google Scholar] [CrossRef]
  53. Liu, T.; Lam, K.M.; Zhao, R.; Qiu, G. Deep cross-modal representation learning and distillation for illumination-invariant pedestrian detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 315–329. [Google Scholar] [CrossRef]
  54. Chai, Y.; Fu, K.; Sun, X.; Diao, W.; Yan, Z.; Feng, Y.; Wang, L. Compact cloud detection with bidirectional self-attention knowledge distillation. Remote Sens. 2020, 12, 2770. [Google Scholar] [CrossRef]
  55. Wang, W.; Hong, W.; Wang, F.; Yu, J. Gan-knowledge distillation for one-stage object detection. IEEE Access 2020, 8, 60719–60727. [Google Scholar] [CrossRef]
  56. Park, S.; Heo, Y.S. Knowledge distillation for semantic segmentation using channel and spatial correlations and adaptive cross entropy. Sensors 2020, 20, 4616. [Google Scholar] [CrossRef]
  57. Qin, D.; Bu, J.J.; Liu, Z.; Shen, X.; Zhou, S.; Gu, J.J.; Wang, Z.H.; Wu, L.; Dai, H.F. Efficient medical image segmentation based on knowledge distillation. IEEE Trans. Med. Imaging 2021, 40, 3820–3831. [Google Scholar] [CrossRef] [PubMed]
  58. An, S.; Liao, Q.; Lu, Z.; Xue, J.H. Efficient semantic segmentation via self-attention and self-distillation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15256–15266. [Google Scholar] [CrossRef]
  59. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar] [CrossRef]
  60. Duta, I.C.; Liu, L.; Zhu, F.; Shao, L. Pyramidal convolution: Rethinking convolutional neural networks for visual recognition. arXiv 2020, arXiv:2006.11538. [Google Scholar]
  61. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  62. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  63. Hershey, J.R.; Olsen, P.A. Approximating the Kullback Leibler divergence between Gaussian mixture models. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Honolulu, HI, USA, 15–20 April 2007; IEEE: New York, NY, USA, 2007; Volume 4, p. IV-317. [Google Scholar] [CrossRef]
  64. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  65. Wang, S.; Qian, K.; Chen, P. BS-SiamRPN: Hyperspectral video tracking based on band selection and the Siamese region proposal network. In Proceedings of the 2022 12th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Rome, Italy, 13–16 September 2022; IEEE: New York, NY, USA, 2022; pp. 1–8. [Google Scholar] [CrossRef]
  66. Kiani Galoogahi, H.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1135–1143. [Google Scholar] [CrossRef]
  67. Kim, K.; Ji, B.; Yoon, D.; Hwang, S. Self-knowledge distillation: A simple way for better generalization. arXiv 2020, arXiv:2006.12000. [Google Scholar]
Figure 1. Our tracker compared to the baseline algorithm [6]. Our algorithm effectively addresses challenges, such as deformation, severe occlusion, and fast motion.
Figure 2. The framework of the proposed SiamPKHT algorithm. It consists of four components: feature extraction, a PSA module for similarity enhancement, prediction heads for target box prediction, and hyperspectral knowledge distillation (HKD).
Figure 4. The PSA module. M represents the initial similarity features. Multi-scale features are obtained by pyramid convolution and then divided into multiple sub-features along the channel dimension. The dependencies between spatial positions and channels are captured by applying spatial and channel attention in parallel.
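For readers who prefer code, the multi-scale branch of the PSA module can be illustrated with a minimal PyTorch-style pyramid convolution sketch in the spirit of [60]. The class name, kernel sizes, and group counts below are illustrative assumptions, not the exact configuration used in SiamPKHT.

```python
import torch
import torch.nn as nn

class PyramidConv(nn.Module):
    """Minimal pyramid convolution sketch: parallel convolutions with growing
    kernel sizes (and group counts) whose outputs are concatenated, so each
    branch sees the similarity map at a different spatial scale."""
    def __init__(self, in_ch, out_ch, kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        branch_ch = out_ch // len(kernels)
        # Note: in_ch and branch_ch must be divisible by each group count.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2, groups=g, bias=False)
            for k, g in zip(kernels, groups)
        ])
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate the multi-scale branches along the channel axis.
        return self.relu(self.bn(torch.cat([b(x) for b in self.branches], dim=1)))
```

With, for example, in_ch = out_ch = 256, each branch contributes 64 channels, and same-padding keeps the spatial resolution of the input similarity map unchanged.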
Figure 5. (a) Pyramid convolution. BN stands for Batch Normalization, and ReLU stands for Rectified Linear Unit. (b) Channel shuffle. The operation handles intergroup information exchange by dividing each group into several smaller blocks and recombining those blocks between the different groups.
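The channel shuffle operation in Figure 5b is the standard ShuffleNet-style permutation [62]; the following is a minimal sketch, assuming the channel count is divisible by the number of groups.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: split each group into smaller blocks
    and interleave them so that information is exchanged across groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and block axes
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)
```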
Figure 6. Hyperspectral knowledge distillation.
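Figure 6 summarizes how the pre-trained RGB teacher guides the hyperspectral student. The exact HKD losses are defined in the main text; the snippet below only sketches the generic soft-target distillation term of Hinton et al. [44], i.e., a temperature-scaled KL divergence, with the temperature value chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            T: float = 2.0) -> torch.Tensor:
    """Generic soft-target distillation: KL divergence between the
    temperature-softened teacher and student distributions, scaled by T^2."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
```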
Figure 7. Visualization of response maps in three videos (from top to bottom, toy1, kangaroo, and coke). The red box indicates the ground truth. Specifically, ‘None’, ‘TL’, and ‘HKD’ represent the baseline tracker, the baseline tracker with TL, and the baseline tracker with HKD, respectively.
Figure 8. Qualitative comparison of our tracker and other trackers on several videos (from top to bottom: car2, forest2, paper, pedestrian2, rider1, and student).
Figure 9. Success rate and precision rate for the overall videos.
Figure 10. The success plots for the eleven challenges on the HOT2022 dataset.
Figure 11. The success rate and precision of the eleven challenges for the overall videos.
Figure 12. The ablation study with all test HSVs.
Table 1. The details of the experimental videos.

Video | Car2 | Forest2 | Paper | Pedestrian2 | Rider1 | Student
Frames | 131 | 363 | 278 | 363 | 336 | 396
Resolution | 351 × 167 | 512 × 256 | 446 × 224 | 512 × 256 | 512 × 256 | 438 × 256
Table 2. Performance comparison with other SOTA trackers (Red and brown fonts denote the best and sub-optimal results, respectively).

Algorithm | Video Type | AUC | DP@20P | FPS
DaSiamRPN | false-color | 0.558 | 0.831 | 48
SiamRPN++ | false-color | 0.529 | 0.834 | 41
DeepHKCF | HSV | 0.385 | 0.737 | 2
BS-SiamRPN | HSV | 0.533 | 0.845 | 55
ADSiamRPN | HSV | 0.575 | 0.861 | 35
MHT | HSV | 0.584 | 0.876 | 2
MFIHVT | HSV | 0.601 | 0.891 | 2
BAENet | HSV | 0.616 | 0.876 | ≈1
Ours | HSV | 0.631 | 0.888 | 43
Table 3. Attribute-based comparisons of the success rate (Red and brown fonts denote the best and sub-optimal results, respectively).

Attributes | Ours | BAENet | MFIHVT | MHT | ADSiamRPN | BS-SiamRPN | DeepHKCF | SiamRPN++ | DaSiamRPN
BC | 0.624 | 0.631 | 0.627 | 0.594 | 0.625 | 0.495 | 0.422 | 0.555 | 0.623
DEF | 0.699 | 0.679 | 0.639 | 0.664 | 0.689 | 0.613 | 0.542 | 0.648 | 0.667
FM | 0.590 | 0.607 | 0.589 | 0.541 | 0.528 | 0.576 | 0.377 | 0.539 | 0.532
IPR | 0.676 | 0.699 | 0.692 | 0.670 | 0.632 | 0.565 | 0.549 | 0.569 | 0.632
IV | 0.603 | 0.440 | 0.472 | 0.474 | 0.473 | 0.551 | 0.154 | 0.473 | 0.360
LR | 0.579 | 0.491 | 0.513 | 0.478 | 0.535 | 0.558 | 0.187 | 0.531 | 0.449
MB | 0.584 | 0.594 | 0.570 | 0.560 | 0.562 | 0.554 | 0.411 | 0.553 | 0.564
OCC | 0.565 | 0.555 | 0.546 | 0.565 | 0.528 | 0.467 | 0.312 | 0.493 | 0.497
OPR | 0.654 | 0.693 | 0.675 | 0.631 | 0.653 | 0.570 | 0.503 | 0.588 | 0.650
OV | 0.641 | 0.516 | 0.606 | 0.620 | 0.356 | 0.554 | 0.312 | 0.505 | 0.353
SV | 0.624 | 0.608 | 0.596 | 0.564 | 0.560 | 0.557 | 0.353 | 0.544 | 0.531
Table 4. Attribute-based comparisons of the precision (Red and brown fonts denote the best and sub-optimal results, respectively).

Attributes | Ours | BAENet | MFIHVT | MHT | ADSiamRPN | BS-SiamRPN | DeepHKCF | SiamRPN++ | DaSiamRPN
BC | 0.886 | 0.908 | 0.918 | 0.942 | 0.901 | 0.795 | 0.755 | 0.850 | 0.899
DEF | 0.980 | 0.940 | 0.885 | 0.901 | 0.975 | 0.908 | 0.820 | 0.963 | 0.937
FM | 0.877 | 0.871 | 0.832 | 0.841 | 0.859 | 0.872 | 0.874 | 0.807 | 0.859
IPR | 0.935 | 0.985 | 0.951 | 0.964 | 0.918 | 0.873 | 0.846 | 0.868 | 0.914
IV | 0.887 | 0.745 | 0.939 | 0.850 | 0.791 | 0.876 | 0.560 | 0.881 | 0.621
LR | 0.855 | 0.733 | 0.872 | 0.801 | 0.853 | 0.912 | 0.563 | 0.846 | 0.699
MB | 0.845 | 0.881 | 0.841 | 0.844 | 0.855 | 0.960 | 0.851 | 0.869 | 0.867
OCC | 0.818 | 0.790 | 0.811 | 0.812 | 0.810 | 0.756 | 0.601 | 0.801 | 0.754
OPR | 0.922 | 0.978 | 0.942 | 0.958 | 0.906 | 0.875 | 0.809 | 0.883 | 0.898
OV | 0.855 | 0.864 | 0.895 | 0.851 | 0.488 | 0.859 | 0.918 | 0.846 | 0.488
SV | 0.909 | 0.907 | 0.905 | 0.895 | 0.856 | 0.864 | 0.748 | 0.873 | 0.811
Table 5. The ablation study with all test HSVs (Red fonts denote the best results).

Algorithm | Video Type | AUC | DP@20P
Baseline | false-color | 0.599 | 0.859
Baseline-PSA | HSV | 0.617 | 0.881
Baseline-PSA-TL | HSV | 0.619 | 0.873
Baseline-PSA-SelfKD | HSV | 0.622 | 0.884
Baseline-PSA-HKD | HSV | 0.631 | 0.888
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
