Article

SPTrack: Spectral Similarity Prompt Learning for Hyperspectral Object Tracking

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2975; https://doi.org/10.3390/rs16162975
Submission received: 28 June 2024 / Revised: 29 July 2024 / Accepted: 8 August 2024 / Published: 14 August 2024
(This article belongs to the Special Issue Advances in Hyperspectral Data Processing)

Abstract

Compared to hyperspectral trackers that adopt the “pre-training then fine-tuning” training paradigm, those using the “pre-training then prompt-tuning” training paradigm can inherit the expressive capabilities of the pre-trained model with fewer training parameters. Existing hyperspectral trackers utilizing prompt learning lack an adequate prompt template design, thus failing to bridge the domain gap between hyperspectral data and pre-trained models. Consequently, their tracking performance suffers. Additionally, these networks have poor generalization ability and require re-training for the different spectral bands of hyperspectral data, leading to the inefficient use of computational resources. In order to address the aforementioned problems, we propose a spectral similarity prompt learning approach for hyperspectral object tracking (SPTrack). First, we introduce a spectral matching map based on spectral similarity, which converts 3D hyperspectral data with different spectral bands into single-channel heat maps, thus enabling cross-spectral domain generalization. Then, we design a channel and position attention-based feature complementary prompter to learn blended prompts from spectral matching maps and three-channel images. Extensive experiments are conducted on the HOT2023 and IMEC25 data sets, and SPTrack is found to achieve state-of-the-art performance with minimal computational effort. Additionally, we verify the cross-spectral domain generalization ability of SPTrack on the HOT2023 data set, which includes data from three spectral bands.

1. Introduction

Visual object tracking (VOT), a crucial technology in intelligent transportation [1], human–computer interaction [2], security surveillance [3], and autonomous driving [4] contexts, involves predicting the states of objects in videos using prior knowledge. Benefiting from large-scale benchmark data sets provided by the community [5,6,7], many outstanding trackers [8,9,10] have emerged over the past few decades. Despite achieving promising results, trackers still fail in particularly complex and edge cases, such as partial occlusion, cluttered backgrounds, and motion blur [11,12,13]. In such cases, however, the spectral differences between objects and the background can be used to distinguish the objects [14].
Hyperspectral images (HSIs), with a 3D structure, offer a unique capability to capture spatial and spectral information simultaneously, with each pixel being representative of the spectral reflectance at a distinct wavelength [15,16]. Thanks to appropriate data processing, approaches, and techniques, HSIs have found extensive applications across diverse fields, including anomaly detection [17,18,19], unmixing [20,21,22], classification [23,24,25], and tracking [26,27]. As shown in Figure 1, there are spectral differences between objects in a scene, and the spectral information provided by HSIs enhances the capability to discriminate between objects [28], indicating the potential of HSIs to address challenges in the VOT field. Therefore, hyperspectral data provide more opportunities for VOT and have high research value.
Hyperspectral trackers can be divided into correlation filtering (CF)-based and deep learning (DL)-based trackers. CF-based trackers primarily involve hand-crafted feature extraction and object location. However, hand-crafted characteristics often fall short in accurately representing the intrinsic semantic features of objects. To obtain a more robust object appearance model, researchers have introduced DL-based methods into hyperspectral trackers [29,30,31,32]. Additionally, due to the difficulties related to collecting hyperspectral videos (HSVs) and the high cost of annotation, there is a scarcity of training data sets for hyperspectral object tracking (HOT). This limitation makes it difficult to train HOT models that are both accurate and generalized. To address this issue, a natural idea is to use large-scale data sets to pre-train the network. Therefore, refs. [27,30,31] have reduced high-dimensional HSIs to three-channel images, which allows for the utilization of feature extractors designed for color videos. However, these methods all lead to different degrees of loss of valuable spectral information. State-of-the-art (SOTA) DL-based hyperspectral trackers use the “pre-training then fine-tuning” training paradigm. While this paradigm is simple, powerful, and widely used, it may lead to overfitting on small data sets [33]. The number of spectral bands in HSVs differs according to the spectral range and light sensitivity of the sensors used, resulting in a domain gap between hyperspectral data and the data used to pre-train models. Simple full fine-tuning struggles to bridge this domain gap, and can even lead to catastrophic forgetting [34]. Additionally, full fine-tuning requires adjustments to a large number of training parameters, consuming substantial computational resources. To solve this problem, the “pre-training then prompt-tuning” training paradigm has been introduced in the context of HOT [35,36,37]. These methods utilize HSV data as prompt templates; however, the data format of HSV differs from that of the pre-trained data, which impacts the tracking performance.
Based on the above analyses, DL-based trackers mainly face challenges related to a need for large volumes of data [26] and domain gaps [38]. The prompt learning paradigm effectively alleviates the issue of limited training data through freezing the base model parameters and maximizing the inheritance of the model’s representational capabilities. However, trackers that adopt the prompt learning paradigm face difficulties in bridging the domain gap in terms of band numbers, primarily due to the lack of specifically designed prompt templates. This also results in trackers being unable to adapt to different types of data across various bands, leading to weak network generalization. For instance, the methods considered in [35,37] can only handle hyperspectral data with a certain type of band, and cannot adapt to spectral data from different bands with a single network model. To adapt to different bands of data, new models must be re-trained, consuming a significant amount of computational resources. To solve the above problems, we propose a spectral similarity prompt learning method for HOT (SPTrack). This method can achieve good tracking performance while maintaining low computational costs. In our method, we design a spectral matching map as the prompt template, ingeniously transforming spectral information into a pixel-level spectral similarity map with spatial characteristics. Spectral matching maps can help the network to better learn the spectral information of objects and, through converting hyperspectral data with different bands into a unified data format, enable tracking across spectral domains. Therefore, we design an attention-based feature complementary prompter (FCP) to generate complementary features for spectral matching maps and three-channel images, and use these complementary features as prompt information. The three-channel image and spectral matching map are more similar in data format to the pre-trained data, allowing the generated complementary features to bridge the domain gap between the pre-trained data and the HSV.
The main achievements of this work are outlined as follows:
  • We propose a spectral similarity prompt learning method that only trains the FCP to generate complementary features that are similar in format to the pre-trained data, achieving better alignment with the pre-trained model while consuming fewer computational resources;
  • We design a spectral matching map as the prompt template. Through converting data from different bands into a unified prompt template form, our tracker can adapt to hyperspectral data with different spectral ranges and numbers of bands without the need for re-training or additional network structures;
  • The experimental results show that SPTrack achieves SOTA performance with low computational costs on hyperspectral video data sets captured using three different camera types.
The remainder of this paper is organized as follows: Section 2 reviews the related works. Section 3 discusses the ideas motivating the proposed design and its technical details. Section 4 describes the experimental settings, and Section 5 presents and analyzes the experimental results. Section 6 provides a discussion, and Section 7 concludes the study.

2. Related Work

2.1. Hyperspectral Object Tracking

2.1.1. Object Tracking Based on Generative Models

Early hyperspectral trackers were primarily based on generative models. For example, Banerjee et al. [39] used radiometric theory to calculate the reflectance spectrum of each pixel and then distinguished between objects and backgrounds using the spectral angle mapper (SAM). Subsequently, Nguyen et al. [40] applied radiative transfer theory to estimate reflectance spectra and used particle filters to track the object based on these spectra. However, these generative models only focus on object modeling and neglect background information. Describing the tracked object through a single mathematical model has significant limitations, which can result in poor tracking performance, especially in challenging scenarios.

2.1.2. Object Tracking Based on Discriminative Models

Two main types of trackers have been developed more recently, including DL-based and CF-based trackers. The key to developing CF-based trackers is designing powerful feature extractors. Qian et al. [41] used a convolutional neural network (CNN) to compute features for each band, but this method ignored the relationships between spectral bands. Xiong et al. [42] proposed a method that combines local spectral–spatial gradient histograms and material component distribution to represent hyperspectral material features. In addition, Tang et al. [43] developed a unified HOT model that harnesses both spatial and spectral information. The aforementioned CF trackers do not require training for object tracking and have the advantage of fast inference speed [44]; however, being constrained by their handcrafted design, these extractors are only able to partially capture the intrinsic semantic features. The superiority of deep features in RGB trackers has led to the widespread application of deep learning methods in HOT. Uzkent et al. [45] utilized CNN features and kernelized correlation filters (KCF) to track objects. However, the constrained quantity of training samples hindered their effectiveness. To address this issue, researchers have introduced foundation models trained on large amounts of RGB data for HOT. Li et al. [38] first segmented HSIs into numerous sets of three-channel images, then input them into a SOTA RGB tracker. Similarly, Li et al. [46] employed an attention module to optimize inter-band dependencies in HSIs. Unlike in [46], Li et al. [27] aimed to make the total weights of the grouped three-channel images as close as possible. Tang et al. [30] proposed a method in which the bands with the highest discriminative ability between the object and background are selected. The aforementioned methods only focus on the spectral information of the object part, which results in low feature discrimination capability. To address this issue, Li et al. [47] fused features from multiple three-channel images. Tang et al. [48] introduced a Siamese network that incorporates distinct encoder–decoder modules along with specialized spectral representation modules to delineate spatial and spectral characteristics. Zhao et al. [35] were the first to introduce multi-modal fusion into HOT. They proposed a Transformer-based multi-modal information transfer network, which includes two sub-networks to extract RGB and hyperspectral multi-modal fusion information. Refs. [36,37] introduced the prompt learning paradigm into HOT. The network achieved SOTA performance solely through learning hyperspectral data while freezing the foundation model.
In this study, we introduce a spectral matching map to bridge the domain gap related to the difference in band number between pre-trained data and HSV.

2.2. Transformer-Based Tracking

The Vision Transformer (ViT) [49] was the first model to apply a pure Transformer architecture to the task of image classification. In recent years, some studies [10,50,51] have introduced Transformer-based network frameworks for object tracking, which achieve better accuracy than Siamese network-based frameworks [52]. Benefiting from a ViT structure that is capable of simultaneous feature extraction and relation modeling for both the template and search region images, OSTrack achieved SOTA performance. OSTrack is a one-stream, one-stage framework whose network structure includes a ViT, a fully convolutional head, and an early candidate elimination module that progressively eliminates candidates belonging to the background in the early stages of the ViT. Our SPTrack uses OSTrack [50] as the foundation model, but there are some fundamental differences: (i) OSTrack employs the “pre-training then fine-tuning” training paradigm, while our network adopts the “pre-training then prompt-tuning” training paradigm, which can achieve better accuracy with fewer training resources; (ii) OSTrack utilizes the ViT to extract object spatial features, whereas SPTrack uses the ViT to extract mixed features reflecting the spatial and spectral information of objects; and (iii) we add a trainable FCP to OSTrack, which is designed to learn the combined features of the object's spatial and spectral information.
In this study, we added a trainable FCP to the structure of OSTrack, which is used to learn object spatial and spectral information.

2.3. Visual Prompt Learning

Over time, the “pre-training then fine-tuning” training paradigm has increasingly taken the dominant role in the learning of natural language processing (NLP) models, replacing the fully supervised learning paradigm [53]. Typically, researchers update all model parameters when applying the foundation model to downstream tasks. This strategy requires storing many parameters for each downstream task, which consumes significant computational resources. The prompt learning approach has recently improved performance in many downstream NLP tasks [54,55]. The prompt learning paradigm enhances the utilization of information from the pre-trained model by adding additional prompts to the input. Prompt learning has been shown to be effective in multiple computer vision tasks [56]. Jia et al. [57] introduced learnable parameters at the beginning of each Transformer encoder. During training, they froze the entire Transformer encoder and updated only the parameters related to prompts. Bahng et al. [56] introduced the concept of visual prompting, a method that adapts pre-trained models to downstream tasks by learning image perturbations tailored to each specific task. Chen et al. [58] proposed a plug-and-play bottleneck module for the ViT, which largely improved the adaptation of ViT to target domains. Yang et al. [59] first introduced the concept of prompting into the tracking domain and designed a novel multi-modal prompting tracker. Zhu et al. [60] developed a visual prompt multi-modal tracker, which achieved a balance between precision and speed in multiple downstream multi-modal tasks through learning complementarities between modalities.
In this study, we design a prompt-learning-based network for HOT, aiming to gradually replace the fine-tuning paradigm with prompt-tuning.

3. Methods

First, we present the overall structure of SPTrack and highlight the role of FCP. Then, we describe the process of prompt learning using the spectral matching map.

3.1. Preliminaries

Consider a video sequence with a ground-truth bounding box $B_0$. RGB tracking aims to train a tracker to predict the object's bounding box $B$ in the subsequent search frames $X_{RGB}$. This process can be described as $\mathcal{F}_{RGB}: \{X_{RGB}, B_0\} \rightarrow B$. For visual prompt tracking, an additional spatiotemporally synchronized input stream is introduced, extending the model input to $(X_{RGB}, X_{A})$, in which $A$ denotes the prompt template. Therefore, visual prompt tracking can be expressed as $\mathcal{F}_{FM}: \{X_{RGB}, X_{A}, B_0\} \rightarrow B$, in which $\mathcal{F}_{FM}$ is a foundation tracker. In our network, we use OSTrack [50] as the foundation model.

3.2. Network Structure

3.2.1. Overall Architecture

As shown in Figure 2, SPTrack can be decomposed into a ViT [49] for feature extraction and relation modeling, an FCP for feature complementarity fusion, and a box head for final result prediction.
The network’s input consists of a three-channel image flow and a temporally synchronized, spatially aligned spectral matching map flow. Each flow consists of images with a height of $H$ and a width of $W$. The three-channel image flow consists of the template image $Z_{3C} \in \mathbb{R}^{3 \times H_Z \times W_Z}$ and the search region image $X_{3C} \in \mathbb{R}^{3 \times H_X \times W_X}$. The spectral matching map flow consists of the template image $Z_{A} \in \mathbb{R}^{3 \times H_Z \times W_Z}$ and the search region image $X_{A} \in \mathbb{R}^{3 \times H_X \times W_X}$. First, we split the two pairs of images into flattened template patch sequences $Z_{3C}^{P} \in \mathbb{R}^{N_Z \times (3 \times P^2)}$ and $Z_{A}^{P} \in \mathbb{R}^{N_Z \times (3 \times P^2)}$ and flattened search region patch sequences $X_{3C}^{P} \in \mathbb{R}^{N_X \times (3 \times P^2)}$ and $X_{A}^{P} \in \mathbb{R}^{N_X \times (3 \times P^2)}$, where $P \times P$ is the patch resolution, and $N_Z = H_Z W_Z / P^2$ and $N_X = H_X W_X / P^2$ are the template and search region patch numbers, respectively. Subsequently, the patches are projected into a latent space. After that, learnable position embeddings are added to produce the final template token embeddings $H_{3C}^{Z} \in \mathbb{R}^{N_Z \times D}$ and $H_{A}^{Z} \in \mathbb{R}^{N_Z \times D}$, and search region token embeddings $H_{3C}^{X} \in \mathbb{R}^{N_X \times D}$ and $H_{A}^{X} \in \mathbb{R}^{N_X \times D}$. Then, the token sequences are concatenated to $H_{3C}^{1} = [H_{3C}^{Z}; H_{3C}^{X}]$ and $H_{A}^{1} = [H_{A}^{Z}; H_{A}^{X}]$.
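To make the tokenization concrete, the following is a minimal PyTorch sketch of how the two input flows could be patch-embedded, position-encoded, and concatenated. The module name PatchEmbed, the embedding dimension of 768, and the use of a strided convolution are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project them to D-dimensional tokens."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        # A strided convolution is equivalent to flattening non-overlapping
        # patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

# Illustrative tokenization of one template/search pair for each flow.
B, P, D = 1, 16, 768
embed_3c, embed_a = PatchEmbed(3, D, P), PatchEmbed(3, D, P)
pos_z = nn.Parameter(torch.zeros(1, (192 // P) ** 2, D))   # template position embedding
pos_x = nn.Parameter(torch.zeros(1, (384 // P) ** 2, D))   # search region position embedding

z_3c, x_3c = torch.randn(B, 3, 192, 192), torch.randn(B, 3, 384, 384)
z_a,  x_a  = torch.randn(B, 3, 192, 192), torch.randn(B, 3, 384, 384)

h_3c = torch.cat([embed_3c(z_3c) + pos_z, embed_3c(x_3c) + pos_x], dim=1)  # H_3C^1
h_a  = torch.cat([embed_a(z_a)  + pos_z, embed_a(x_a)  + pos_x], dim=1)    # H_A^1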
The FCPs are integrated into the $l$-th stage of the Transformer encoder (denoted as $\phi^{l}$) in order to acquire the prompts from the two input streams, which can be expressed as follows:

$M^{l} = \phi^{l}\left(H_{3C}^{l}, H_{A}^{l}\right), \quad l = 0, 1, \ldots, L,$

where $M^{l}$ symbolizes the sequence of prompt tokens output from the $l$-th FCP. The blended representation can achieve complementarity between spatial and spectral features, generating more robust prompts.
The three-channel token sequence $H_{3C}^{l}$ is input into the $(l+1)$-th Transformer encoder layer $E^{l+1}$ and, concurrently, $H_{3C}^{l}$ and $H_{A}^{l}$ are processed by the corresponding FCP. The resulting learned prompts $M^{l}$ are incorporated into the three-channel image stream as follows:

$H^{l+1} = E^{l+1}\left(H_{3C}^{l} + M^{l}\right), \quad l = 0, 1, \ldots, L,$

where $L$ represents the total number of Transformer encoder layers, and $H^{l+1}$ is the token sequence fed into the next Transformer encoder layer. The final encoder layer's output is fed into the box head to predict the object's position.
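As a structural illustration, the sketch below shows how the per-layer prompts could be injected into a frozen encoder stream. The class and attribute names (PromptedEncoder, encoder_layers, fcp_blocks) are hypothetical, and the real OSTrack backbone additionally contains the early candidate elimination module, which is omitted here.

import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Frozen ViT encoder layers with trainable FCPs injecting prompt tokens."""
    def __init__(self, encoder_layers, fcp_blocks):
        super().__init__()
        self.encoder_layers = encoder_layers   # E^1 ... E^L (frozen)
        self.fcp_blocks = fcp_blocks           # FCP^0 ... FCP^{L-1} (trainable)
        for p in self.encoder_layers.parameters():
            p.requires_grad = False            # only prompt-related parameters are updated

    def forward(self, h_3c, h_a):
        # The auxiliary stream h_a is kept fixed here; propagating it per layer is one variant.
        for layer, fcp in zip(self.encoder_layers, self.fcp_blocks):
            m = fcp(h_3c, h_a)                 # M^l = FCP^l(H_3C^l, H_A^l)
            h_3c = layer(h_3c + m)             # H^{l+1} = E^{l+1}(H_3C^l + M^l)
        return h_3c                            # fed to the box head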

3.2.2. Vision Transformer Block

As shown in Figure 3, we use a standard Transformer encoder that includes Multi-Head Self-Attention (MSA) and Multi-layer Perceptron (MLP) blocks.
The MLP consists of two layers with a GELU activation function. Layer normalization (LN) is applied before each block, followed by residual connections. Denoting the token sequence input to a Transformer encoder layer as $\mathrm{token}$ and its output as $\mathrm{token}''$, the process can be written as follows:

$\mathrm{token}' = \mathrm{MSA}(\mathrm{LN}(\mathrm{token})) + \mathrm{token}, \qquad \mathrm{token}'' = \mathrm{MLP}(\mathrm{LN}(\mathrm{token}')) + \mathrm{token}'.$
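For reference, a standard pre-norm Transformer encoder block matching the above equations can be sketched in PyTorch as follows; the head count and MLP ratio are typical ViT defaults rather than values stated in the paper.

import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm ViT block: MSA and MLP sub-blocks, each with a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, tokens):                                       # tokens: (B, N, D)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]  # token' = MSA(LN(token)) + token
        tokens = tokens + self.mlp(self.norm2(tokens))               # token'' = MLP(LN(token')) + token'
        return tokens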

3.2.3. Feature Complementary Prompter

In prompt learning, designing an appropriate prompter is crucial. Inspired by [57,60], we design an attention-based FCP that can effectively learn complementary features of the HSI’s spatial and spectral information, generating informative prompt features. Considering that different layers of the Transformer encoder extract semantic information at varying levels, we incorporate FCPs before each Transformer encoder layer to enhance complementarity between foundational and prompt features.
The architecture of the FCP is depicted in Figure 4a. Our FCP has two concurrent input branches, dedicated to processing the three-channel image token sequence $H_{3C}^{l}$ and the spectral matching map token sequence $H_{A}^{l}$, respectively. First, the three-channel image tokens are sent to the channel and position attention module (CPAM) for channel and position attention computation. The process is described as follows:

$H_{3C}' = F_{CPAM}\left(H_{3C}^{l}\right),$
where $H_{3C}'$ is the final refined output. Simultaneously, the spectral matching map tokens undergo a spatial foveation process, which applies $\lambda$-smoothing to a spatial softmax across all spatial dimensions to produce the spatial attention mask $H_{mask}$. This procedure can be outlined as follows:

$H_{mask} = \lambda \cdot \mathrm{softmax}_{(i,j)}\left(H_{A}\right), \quad i = 1, 2, \ldots, H, \; j = 1, 2, \ldots, W,$

where $\lambda$ is a learnable weighting parameter in each block. Then, we apply the spatial attention mask $H_{mask}$ to $H_{A}$ to generate the enhanced embedding $H_{A}'$:

$H_{A}' = H_{A} \odot H_{mask}.$
Then, we add the two enhanced embeddings, input the sum into the CPAM, and generate the multi-feature visual prompts. The process is defined as follows:

$M^{l} = F_{CPAM}\left(H_{3C}' + H_{A}'\right).$
The CPAM is used for the adaptive feature enhancement of both the three-channel image features and the final fused features.
As shown in Figure 4b, the input feature $F$ is first passed through the channel attention module $\mathrm{CA}$ to produce a 1D channel attention map $M_C \in \mathbb{R}^{C \times 1 \times 1}$. Then, $M_C$ is element-wise multiplied with the input feature $F$ to obtain the channel-refined feature $F'$. The refined feature $F'$ is then passed through the position attention module $\mathrm{PA}$ to generate a 2D position attention map $M_P \in \mathbb{R}^{1 \times H \times W}$. Multiplying each element of $M_P$ with the corresponding element of $F'$, we obtain the final output feature $F''$. The overall attention process can be summarized as follows:

$F' = \mathrm{CA}(F) \odot F, \qquad F'' = \mathrm{PA}(F') \odot F',$

where $\odot$ denotes element-wise multiplication. The channel and position attention maps are computed as follows:

$M_C(F) = \sigma\left[\mathrm{MLP}(\rho_{\max}(F)) + \mathrm{MLP}(\rho_{avg}(F))\right], \qquad M_P(F) = \sigma\left(f^{7 \times 7}\left(\left[\rho_{\max}(F); \rho_{avg}(F)\right]\right)\right),$

where $\rho_{\max}(\cdot)$ and $\rho_{avg}(\cdot)$ denote the max and average pooling operations, respectively; $f^{7 \times 7}$ denotes a convolutional operation with a $7 \times 7$ filter; and $\sigma$ denotes the sigmoid function.
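Below is a minimal PyTorch sketch of the CPAM (channel attention followed by position attention) and of how the FCP could compose it with the spatial foveation mask. It assumes the token sequences have been reshaped into 2D feature maps beforehand; the reduction ratio and module names are illustrative choices, not the authors' exact implementation.

import torch
import torch.nn as nn

class CPAM(nn.Module):
    """Channel attention followed by position (spatial) attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                          # shared MLP for max/avg pooled descriptors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7} over [max; avg] maps

    def forward(self, f):                                  # f: (B, C, H, W)
        b, c, _, _ = f.shape
        # Channel attention: M_C = sigmoid(MLP(maxpool(F)) + MLP(avgpool(F)))
        m_c = torch.sigmoid(self.mlp(f.amax(dim=(2, 3))) + self.mlp(f.mean(dim=(2, 3))))
        f1 = f * m_c.view(b, c, 1, 1)                      # F' = M_C(F) ⊙ F
        # Position attention: M_P = sigmoid(conv7x7([maxpool_c(F'); avgpool_c(F')]))
        m_p = torch.sigmoid(self.conv(torch.cat(
            [f1.amax(dim=1, keepdim=True), f1.mean(dim=1, keepdim=True)], dim=1)))
        return f1 * m_p                                    # F'' = M_P(F') ⊙ F'

class FCP(nn.Module):
    """Feature complementary prompter operating on spatially reshaped token maps."""
    def __init__(self, channels):
        super().__init__()
        self.cpam_rgb, self.cpam_fuse = CPAM(channels), CPAM(channels)
        self.lam = nn.Parameter(torch.tensor(1.0))         # learnable λ per block

    def forward(self, h_3c, h_a):                          # both: (B, C, H, W)
        h_3c_ref = self.cpam_rgb(h_3c)                     # H'_3C = CPAM(H_3C)
        b, c, hh, ww = h_a.shape
        mask = self.lam * torch.softmax(h_a.flatten(2), dim=-1).view(b, c, hh, ww)  # spatial foveation
        h_a_ref = h_a * mask                               # H'_A = H_A ⊙ H_mask
        return self.cpam_fuse(h_3c_ref + h_a_ref)          # M^l = CPAM(H'_3C + H'_A)

In SPTrack, the resulting prompt map would then be flattened back into a token sequence before being added to the three-channel stream.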

3.2.4. Head and Loss Function

We use a simple fully convolutional network as the prediction head, which includes three branches. Each branch has the same network structure, consisting of a stack of four Conv–BN–ReLU layers. The three branches predict the object classification score map $S \in [0,1]^{\frac{H_X}{P} \times \frac{W_X}{P}}$, the local offsets $O \in [0,1]^{2 \times \frac{H_X}{P} \times \frac{W_X}{P}}$ that compensate for the quantization error caused by the reduced resolution, and the normalized bounding box size $B_{norm} \in [0,1]^{2 \times \frac{H_X}{P} \times \frac{W_X}{P}}$. The final object bounding box $(x, y, w, h)$ is obtained as follows:

$\left(x_d, y_d\right) = \arg\max_{(x, y)} S_{xy},$

$x = x_d + O\left(0, x_d, y_d\right), \quad y = y_d + O\left(1, x_d, y_d\right), \quad w = B_{norm}\left(0, x_d, y_d\right), \quad h = B_{norm}\left(1, x_d, y_d\right).$
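The decoding step above can be illustrated with the following sketch; the tensor shapes and the single-sample (unbatched) handling are assumptions made for clarity, and mapping the result back to image coordinates is omitted.

import torch

def decode_bbox(score, offset, size):
    """Decode (x, y, w, h) from the score map S, the offset map O, and the size map B_norm.

    score:  (H', W')     classification score map
    offset: (2, H', W')  local offsets compensating for the reduced resolution
    size:   (2, H', W')  normalized bounding box width and height
    """
    h_map, w_map = score.shape
    idx = torch.argmax(score).item()               # (x_d, y_d) = argmax over S_xy
    y_d, x_d = divmod(idx, w_map)
    x = x_d + offset[0, y_d, x_d]                  # x = x_d + O(0, x_d, y_d)
    y = y_d + offset[1, y_d, x_d]                  # y = y_d + O(1, x_d, y_d)
    w = size[0, y_d, x_d]                          # w = B_norm(0, x_d, y_d)
    h = size[1, y_d, x_d]                          # h = B_norm(1, x_d, y_d)
    # (x, y) are on the reduced-resolution grid; scaling back to the image is omitted here.
    return x, y, w, h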
During training, data flow throughout the entire model. We only update a subset of parameters relevant to prompt learning, including spectral matching map patch embedding and FCP. All parameters related to the three-channel image are frozen. We use the same loss function as OSTrack, and the whole loss function of SPTrack is described as follows:
$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{iou} \mathcal{L}_{iou} + \lambda_{L_1} \mathcal{L}_{1},$

where $\lambda_{iou} = 2$ and $\lambda_{L_1} = 5$ are the regularization parameters [61]. $\mathcal{L}_{cls}$ represents the weighted focal loss [62]. For each ground-truth target center and its corresponding low-resolution equivalent $(S_x, S_y)$, the ground-truth heatmap can be generated using a Gaussian kernel as $\tilde{S}_{xy} = \exp\left(-\frac{(x - S_x)^2 + (y - S_y)^2}{2\sigma_S^2}\right)$, where $\sigma_S$ is an object size-adaptive standard deviation [62]. The Gaussian-weighted focal loss $\mathcal{L}_{cls}$ can be formulated as follows:
$\mathcal{L}_{cls} = -\sum_{xy} \begin{cases} \left(1 - S_{xy}\right)^{\alpha} \log\left(S_{xy}\right), & \text{if } \tilde{S}_{xy} = 1, \\ \left(1 - \tilde{S}_{xy}\right)^{\beta} \left(S_{xy}\right)^{\alpha} \log\left(1 - S_{xy}\right), & \text{otherwise}, \end{cases}$
where $\alpha$ and $\beta$ are hyper-parameters; following [62,63], we set $\alpha = 2$ and $\beta = 4$. $\mathcal{L}_{iou}$ represents the generalized IoU (GIoU) loss [64]. Let $B^p = (x_1^p, y_1^p, x_2^p, y_2^p)$ and $B^g = (x_1^g, y_1^g, x_2^g, y_2^g)$ denote the predicted and ground-truth bounding box coordinates, respectively. First, we calculate the area of $B^p$, $A^p = (\hat{x}_2^p - \hat{x}_1^p) \times (\hat{y}_2^p - \hat{y}_1^p)$, and the area of $B^g$, $A^g = (x_2^g - x_1^g) \times (y_2^g - y_1^g)$, where $\hat{x}_1^p = \min(x_1^p, x_2^p)$, $\hat{x}_2^p = \max(x_1^p, x_2^p)$, $\hat{y}_1^p = \min(y_1^p, y_2^p)$, and $\hat{y}_2^p = \max(y_1^p, y_2^p)$. Then, we calculate the intersection $I$ between $B^p$ and $B^g$:

$I = \begin{cases} \left(x_2^I - x_1^I\right) \times \left(y_2^I - y_1^I\right), & \text{if } x_2^I > x_1^I \text{ and } y_2^I > y_1^I, \\ 0, & \text{otherwise}, \end{cases}$

where $x_1^I = \max(\hat{x}_1^p, x_1^g)$, $x_2^I = \min(\hat{x}_2^p, x_2^g)$, $y_1^I = \max(\hat{y}_1^p, y_1^g)$, and $y_2^I = \min(\hat{y}_2^p, y_2^g)$. Next, we compute the area of the smallest enclosing box $B^c$, $A^c = (x_2^c - x_1^c) \times (y_2^c - y_1^c)$, where $x_1^c = \min(\hat{x}_1^p, x_1^g)$, $x_2^c = \max(\hat{x}_2^p, x_2^g)$, $y_1^c = \min(\hat{y}_1^p, y_1^g)$, and $y_2^c = \max(\hat{y}_2^p, y_2^g)$. Finally, we calculate the IoU and GIoU as $IoU = \frac{I}{U}$ and $GIoU = IoU - \frac{A^c - U}{A^c}$, where $U = A^p + A^g - I$. Thus, $\mathcal{L}_{iou}$ can be calculated as follows:
$\mathcal{L}_{iou} = 1 - GIoU.$
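A compact sketch of the overall training objective (Gaussian-weighted focal loss plus the GIoU and L1 terms) is given below; it assumes axis-aligned boxes in (x1, y1, x2, y2) format with x1 < x2 and y1 < y2, and it omits batching details.

import torch

def focal_loss(score, gt_heatmap, alpha=2, beta=4, eps=1e-6):
    """Gaussian-weighted focal loss over the classification score map."""
    pos = gt_heatmap.eq(1).float()
    pos_term = ((1 - score) ** alpha) * torch.log(score + eps) * pos
    neg_term = ((1 - gt_heatmap) ** beta) * (score ** alpha) * torch.log(1 - score + eps) * (1 - pos)
    return -(pos_term + neg_term).sum()

def giou_loss(bp, bg, eps=1e-6):
    """Generalized IoU loss for boxes given as (x1, y1, x2, y2)."""
    ap = (bp[2] - bp[0]) * (bp[3] - bp[1])
    ag = (bg[2] - bg[0]) * (bg[3] - bg[1])
    ix1, iy1 = torch.max(bp[0], bg[0]), torch.max(bp[1], bg[1])
    ix2, iy2 = torch.min(bp[2], bg[2]), torch.min(bp[3], bg[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)      # intersection I
    union = ap + ag - inter                                          # U = A^p + A^g - I
    cx1, cy1 = torch.min(bp[0], bg[0]), torch.min(bp[1], bg[1])      # smallest enclosing box B^c
    cx2, cy2 = torch.max(bp[2], bg[2]), torch.max(bp[3], bg[3])
    ac = (cx2 - cx1) * (cy2 - cy1)
    giou = inter / (union + eps) - (ac - union) / (ac + eps)
    return 1 - giou                                                  # L_iou = 1 - GIoU

def total_loss(score, gt_heatmap, bp, bg, lambda_iou=2.0, lambda_l1=5.0):
    """L = L_cls + lambda_iou * L_iou + lambda_L1 * L_1."""
    l1 = torch.abs(bp - bg).sum()
    return focal_loss(score, gt_heatmap) + lambda_iou * giou_loss(bp, bg) + lambda_l1 * l1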

3.3. Spectral Matching Map Generation

To better encapsulate spectral information from HSIs into prompt templates and align them more effectively with pre-trained data sets, we propose a spectral matching algorithm based on spectral similarity to generate novel visual data as a prompt template. Inspired by hyperspectral object detection, we calculate the spectral similarity between the target pixel and all pixels in each frame of the image to achieve spectral discrimination.
The spectra of the pixels in image $P^n$ of the $n$-th frame are denoted as $\{M_1^n, M_2^n, \ldots, M_l^n\}$, where $l = W \times H$, and $W$ and $H$ represent the width and height of the HSI, respectively. To eliminate background interference, we shrink the initial bounding box so that its width and height are reduced to $80\%$ of their original values. The spectrum of each pixel within the reduced initial bounding box can be represented as $\{T_1, T_2, \ldots, T_k\}$, where $k = w \times h$, and $w$ and $h$ represent $80\%$ of the width and height of the initial bounding box, respectively.
First, we compute the spectral similarity $D_1^n$ between the spectrum of the first pixel in the $n$-th frame, $M_1^n$, and the first object pixel of the initial frame, $T_1$. Common spectral similarity metrics include the Euclidean distance (ED) [65], defined as $ED(M_1^n, T_1)$ in Equation (16); the spectral angle mapper (SAM) [66], defined as $SAM(M_1^n, T_1)$ in Equation (17); and the spectral similarity scale (SSS) [67], defined as $SSS(M_1^n, T_1)$ in Equation (18):

$D_1^n = ED(M_1^n, T_1) = \left\lVert M_1^n - T_1 \right\rVert,$

$D_1^n = SAM(M_1^n, T_1) = \arccos\left(\frac{(M_1^n)^{\top} T_1}{\lVert M_1^n \rVert \, \lVert T_1 \rVert}\right),$

$D_1^n = SSS(M_1^n, T_1) = \sqrt{\left(1 - \rho^2\right)^2 + \left\lVert M_1^n - T_1 \right\rVert^2}, \quad \rho = \frac{(T_1 - \bar{T}_1)^{\top}(M_1^n - \bar{M}_1^n)}{\lVert T_1 - \bar{T}_1 \rVert \, \lVert M_1^n - \bar{M}_1^n \rVert},$
where ρ is the Pearson correlation coefficient.
Then, we sequentially compute the spectral similarity between the spectrum of the first pixel in the $n$-th frame of the HSI and the spectra of all pixels in the reduced bounding box, resulting in $k$ spectral similarity values $D_1^n, D_2^n, \ldots, D_k^n$. We select the minimum value as the spectral similarity score for that pixel. This process is described as follows:

$R_1^n = \min\left(D_1^n, D_2^n, \ldots, D_k^n\right),$

where $R_1^n$ represents the spectral similarity score of the first pixel in the $n$-th HSI. Afterward, we repeat the above operation pixel by pixel to obtain the image $P^n$ of the $n$-th frame of the HSV. This process is described as follows:

$P^n = \left\{R_1^n, R_2^n, \ldots, R_l^n\right\}.$

Finally, we repeat the above operation frame by frame for the HSV with $N$ frames to obtain the final spectral matching map $SMM$. This process is described as follows:

$SMM = \left\{P^1, P^2, \ldots, P^N\right\}.$
The formal description of the procedure is given in Algorithm 1. Figure 5 displays spectral matching maps generated with our proposed algorithm. The figures show that converting HSIs into spectral matching maps allows us to clearly distinguish between the object and background. Transforming spectral information into spectral similarity and spatial morphological features facilitates feature complementarity in the FCP and enables better alignment with pre-trained models.
Algorithm 1 Spectral matching map generation algorithm.
Input: $N$ frames of an HSV $H \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$, and $C$ represent the width, height, and number of channels of the HSI, respectively.
Output: $N$ frames of the spectral matching map $SMM$.
1: Initialization: reduce the ground-truth box's width and height to $80\%$ of their original values, obtaining $w$ and $h$;
2: for $n = 1, 2, \ldots, N$ do
3:    Take the image pixels $M_l^n$, $l = 1, \ldots, W \times H$, and the ground-truth box pixels $T_k$, $k = 1, \ldots, w \times h$;
4:    for $l = 1, 2, \ldots, W \times H$ do
5:       for $k = 1, 2, \ldots, w \times h$ do
6:          Calculate the spectral similarity $D_k^n$ using the ED, SAM, or SSS metric: $D_k^n = ED(M_l^n, T_k)$, $SAM(M_l^n, T_k)$, or $SSS(M_l^n, T_k)$ (Equations (16)–(18));
7:       end for
8:       Select the smallest similarity as the score $R_l^n$: $R_l^n = \min_k\left(D_k^n\right)$;
9:    end for
10: end for
11: Return the spectral matching map $SMM = \left\{\{R_1^1, R_2^1, \ldots, R_l^1\}, \{R_1^2, R_2^2, \ldots, R_l^2\}, \ldots, \{R_1^N, R_2^N, \ldots, R_l^N\}\right\}$.
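The following NumPy sketch illustrates Algorithm 1 for a single frame, with the inner loops vectorized; the metric argument and the small epsilon guards are implementation conveniences rather than part of the original algorithm.

import numpy as np

def spectral_matching_map(frame, target_pixels, metric="SSS", eps=1e-12):
    """Compute one spectral matching map frame (Algorithm 1, inner loops vectorized).

    frame:         (H, W, C) hyperspectral frame
    target_pixels: (K, C) spectra of the pixels inside the shrunken initial box
    """
    H, W, C = frame.shape
    M = frame.reshape(-1, C).astype(np.float64)        # (L, C), L = H * W
    T = target_pixels.astype(np.float64)               # (K, C)

    if metric == "ED":                                  # Euclidean distance, Eq. (16)
        D = np.linalg.norm(M[:, None, :] - T[None, :, :], axis=-1)
    elif metric == "SAM":                               # spectral angle mapper, Eq. (17)
        cos = (M @ T.T) / (np.linalg.norm(M, axis=1, keepdims=True)
                           * np.linalg.norm(T, axis=1)[None, :] + eps)
        D = np.arccos(np.clip(cos, -1.0, 1.0))
    else:                                               # spectral similarity scale, Eq. (18)
        Mc = M - M.mean(axis=1, keepdims=True)
        Tc = T - T.mean(axis=1, keepdims=True)
        rho = (Mc @ Tc.T) / (np.linalg.norm(Mc, axis=1, keepdims=True)
                             * np.linalg.norm(Tc, axis=1)[None, :] + eps)
        ed = np.linalg.norm(M[:, None, :] - T[None, :, :], axis=-1)
        D = np.sqrt((1 - rho ** 2) ** 2 + ed ** 2)

    # R_l^n = min_k D_k^n: keep the best match against any target pixel.
    return D.min(axis=1).reshape(H, W)

In practice this routine is applied frame by frame to build the full spectral matching map SMM for the HSV.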
As shown in Figure 5a, utilizing the object spectra from the initial frame enables the effective matching of subsequent frames. Even in the final frame (i.e., the 105th frame in the figure), the morphological features of the object remain clear.
We compared the spectral matching maps based on ED, SAM, and SSS for the same video sequence frame. From Figure 5a,b, it can be observed that the ED-based spectral matching map exhibits higher discrimination between the object and background compared to the SAM-based map. Conversely, in Figure 5d,e, the spectral matching map based on SAM offers superior distinction between the object and the background when contrasted with the one based on ED. Because the differences between the object and background spectral curves can be minimal under a single-dimensional similarity metric (e.g., ED or SAM), SSS jointly considers the magnitude of the spectral vector, the shape of the spectral curve, and the amount of spectral information, resulting in more distinguishable object–background spectral information. As shown in Figure 5c,f, the spectral matching maps based on different spectral similarities each have their own advantages when dealing with various types of spectral data.
To confirm the advantage of our approach, we carried out experiments across the three varieties of spectral matching maps (ED-based, SAM-based, and SSS-based). More details are provided in Section 4.

4. Experiments

In this section, we first introduce the two public HOT data sets and the four evaluation metrics. Subsequently, we describe the experimental settings and provide a thorough analysis of the obtained results.

4.1. Experimental Settings

4.1.1. Data Sets

In this study, we tested our method on two open-source hyperspectral data sets. One data set is from the HOT2023 challenge (https://www.hsitracking.com/, accessed on 12 July 2023), while the other is IMEC25 [68]. The HOT2023 data set was collected using three XIMEA snapshot cameras covering the near-infrared (NIR), visible (VIS), and red-near-infrared (RedNIR) bands. Each camera is calibrated to produce high-quality images with 25, 16, and 15 bands, respectively. As shown in Table 1, the whole data set comprises 110 sequences for training and 87 for validation, with each set including hyperspectral data and three-channel data. The IMEC25 data set was collected using a XIMEA snapshot camera, which has 25 bands ranging from 680 nm to 960 nm, with a spatial resolution of 409 × 216 pixels. It consists of 135 surveillance video sequences, with 55 for training and 80 for validation.
The validation sets of both data sets, including 87 validation videos from HOT2023 and 80 validation videos from IMEC25, were applied to verify our method. Notably, these video scenes cover challenging factors, such as scale variation, out-of-view, deformation, illumination variation, occlusion, and background clutters.

4.1.2. Evaluation Metrics

In this study, we calculated four object tracking assessment metrics [69]: the success plot, the precision plot, the mean distance precision at a threshold of 20 pixels (DP@20), and the area under the curve (AUC). We utilized the success and precision plots to benchmark tracking effectiveness, gain insights, and present the results visually for analysis and evaluation.
In the success plot, the horizontal and vertical axes represent the overlap threshold $\tau$ and the success rate ($SR$), respectively. The intersection over union ($IoU$) between the predicted bounding box $B_p$ and the ground truth $B_t$ is calculated as follows:

$IoU = \frac{\left| B_t \cap B_p \right|}{\left| B_t \cup B_p \right|},$

where $\cap$ and $\cup$ represent the intersection and union operations of the two rectangular boxes, respectively. The success rate represents the proportion of successfully tracked frames among all frames, which is defined as follows:

$SR_{\tau} = \frac{N_{IoU > \tau}}{N_{total}}.$

When the $IoU$ is larger than $\tau$, a frame is regarded as correctly tracked. $N_{IoU > \tau}$ refers to the number of frames with an $IoU$ greater than $\tau$, and $N_{total}$ refers to the total number of frames.
In the precision plot, the horizontal and vertical axes represent the location error threshold $\xi$ and the precision rate ($PR$), respectively. The center location error ($CLE$) between the centers of the predicted bounding box and the ground truth is calculated as follows:

$CLE = \sqrt{\left(x_1 - x_2\right)^2 + \left(y_1 - y_2\right)^2},$

where $(x_1, y_1)$ and $(x_2, y_2)$ refer to the center coordinates of the predicted bounding box and the ground truth, respectively. The precision rate represents the proportion of successfully tracked frames among all frames, which is defined as follows:

$PR_{\xi} = \frac{N_{CLE < \xi}}{N_{total}}.$

When $CLE < \xi$, a frame is considered successfully tracked. $N_{CLE < \xi}$ refers to the number of frames with a $CLE$ less than $\xi$, and $N_{total}$ refers to the total number of frames.
Note that curves closer to the top-right in the success plot and closer to the top-left in the precision plot indicate better performance. In the experiment, all compared trackers were run on the complete validation set, which includes 87 video sequences, and all results are determined under the one-pass evaluation (OPE). Moreover, we use the AUC value to rank all trackers.
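The success rate and precision rate defined above can be computed as in the following sketch, assuming boxes in (x, y, w, h) format with (x, y) as the top-left corner.

import numpy as np

def iou(b1, b2):
    """IoU of two boxes given as (x, y, w, h)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (b1[2] * b1[3] + b2[2] * b2[3] - inter)

def success_rate(pred_boxes, gt_boxes, tau):
    """SR_tau: fraction of frames whose IoU exceeds the overlap threshold tau."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.mean(ious > tau)

def precision_rate(pred_boxes, gt_boxes, xi=20.0):
    """PR_xi: fraction of frames whose center location error is below xi pixels."""
    cle = np.array([np.hypot((p[0] + p[2] / 2) - (g[0] + g[2] / 2),
                             (p[1] + p[3] / 2) - (g[1] + g[3] / 2))
                    for p, g in zip(pred_boxes, gt_boxes)])
    return np.mean(cle < xi)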

4.1.3. Implementation Details

Our SPTrack is implemented in PyTorch and was trained on two NVIDIA RTX 3090 GPUs on HOT2023. The template image size is 192 × 192 pixels, and the search region image size is 384 × 384 pixels. We trained the model on the training set of 110 video sequences for 25 epochs with a batch size of 8. The model was optimized using AdamW, with a weight decay of $10^{-4}$ and a learning rate of $4 \times 10^{-4}$. We sampled 60,000 pairs of hyperspectral data and spectral matching maps in each training epoch. The data augmentations included horizontal flipping and brightness jittering. Additionally, we selected the validation videos from the IMEC25 data set to evaluate the network's adaptability in remote sensing scenarios.
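For orientation, the optimization settings described above could be configured roughly as follows; the parameter-group selection and the commented training loop are assumptions, since only the optimizer type, learning rate, weight decay, epoch count, and batch size are stated.

import torch

def build_optimizer(model, lr=4e-4, weight_decay=1e-4):
    """AdamW over the trainable (prompt-related) parameters only."""
    prompt_params = [p for p in model.parameters() if p.requires_grad]  # FCPs + prompt patch embedding
    return torch.optim.AdamW(prompt_params, lr=lr, weight_decay=weight_decay)

# Assumed training-loop skeleton: 25 epochs, batch size 8, 60,000 sampled pairs per epoch.
# optimizer = build_optimizer(sptrack)
# for epoch in range(25):
#     for batch in loader:            # pairs of three-channel images and spectral matching maps
#         loss = sptrack.training_step(batch)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()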

5. Results and Analysis

5.1. Quantitative Comparison with State-of-the-Art Trackers

5.1.1. Quantitative Comparison on HOT2023

In this section, we compare SPTrack with four SOTA hyperspectral trackers (i.e., MHT [42], SSAMB [36], SIPHOT [36], and RawTrack [70]) and six SOTA color trackers (i.e., SiamBAN [71], SiamCAR [72], SiamGAT [73], STARK [61], TranST [10], and OSTrack [50]) on the HOT2023 data set. Note that all hyperspectral trackers were tested on the hyperspectral validation set with the optimal weights, while all color trackers were tested on the three-channel validation set with the official weights. Other than SSAMB, SIPHOT, RawTrack, and OSTrack, the results for the other comparative methods were retrieved from the HOT2023 website (https://www.hsitracking.com/result/, accessed on 12 July 2023).
As shown in Figure 6, we analyzed the performance of the trackers using success and precision plots, providing a qualitative and quantitative assessment of their effectiveness. In Figure 6a, the red curve denoting SPTrack is the one closest to the top-right corner. However, its performance at overlap thresholds above 0.6 is not as good as that of RawTrack, indicating that RawTrack produces a larger proportion of predictions whose intersection over union (IoU) with the ground truth box exceeds 0.6. The main reason is likely that RawTrack converts both the pre-trained data and the hyperspectral data into raw images and uniformly performs channel-wise target-aware normalization. As a result, the distribution of the pre-trained data is closer to that of the hyperspectral data, leading to a higher overlap rate between the predicted target box and the ground truth box. In contrast, SPTrack does not require separate pre-training and only uses spectral matching maps for prompt-tuning. This approach saves a significant amount of training resources while achieving overall better results. In Figure 6b, SPTrack is the closest to the top-left, indicating that SPTrack obtained the best tracking performance.
Table 2 presents the AUC and DP@20 metrics for all trackers evaluated on the HOT2023 validation set, with the results arranged according to their AUC values. SPTrack achieved the best overall performance, with an AUC of 0.644 and a DP@20 of 0.841 , while MHT performed the worst. The RawTrack and SIPHOT AUC scores were 0.639 and 0.635 , ranking second and third, respectively. For the DP@20 metric, SPTrack, SIPHOT, and RawTrack produced scores of 0.846 , 0.844 , and 0.840 , respectively, comprising the top three places.
In Table 2, it can be seen that trackers applying hyperspectral data were superior, compared to those using three-channel data. For instance, the AUCs of SPTrack, SIPHOT, and SSAMB surpassed those of OSTrack ( 0.597 ), TranST ( 0.569 ), and SiamGAT ( 0.561 ). Furthermore, we observed that the prompt-tuning paradigm led to better performance than the full fine-tuning paradigm. For example, SPTrack and SIPHOT, which used prompt-tuning, achieved first and third place, respectively. In particular, our proposed SPTrack performed better than SIPHOT due to its superior inheritance of the underlying model’s representational capabilities.
Figure 7 shows the success and precision plots for all compared trackers on the three validation subsets: VIS, RedNIR, and NIR. In Figure 7a, the blue curve represents SPTrack, positioned to the lower-left of the red and green curves, which indicates its inferior performance compared to SIPHOT and RawTrack on the VIS data. In Figure 7b, the red curve, corresponding to SPTrack, is closest to the top-left, which indicates that SPTrack has the best tracking accuracy; however, the predicted bounding boxes for the VIS data are not precise enough. A possible reason is that, during data preprocessing, we apply the same standardization and normalization to the three-channel hyperspectral images of different bands, whereas the spectral matching maps have different data distributions across different bands. This leads to a certain discrepancy in the data distribution between the three-channel images and the spectral matching maps, which affects the tracking performance. In Figure 7c–f, the red curves depict SPTrack, exhibiting the highest proximity to the top-right in the success plots and the top-left in the precision plots. These findings indicate that SPTrack surpasses the other trackers on the NIR and RedNIR data.
To further evaluate the trackers' performance, we conducted a thorough analysis, including comparisons of SPTrack and the other trackers across the various hyperspectral data types, as detailed in Table 3. The table leads to the conclusion that SPTrack is the best tracker on the NIR and RedNIR data. These experimental results indicate that SPTrack presents SOTA performance in the context of HOT.

5.1.2. Quantitative Comparison on IMEC25

In this section, we compare our SPTrack with ten trackers (i.e., CSRDCF [74], CSK [75], HOMG [76], BACF [77], MCCT [78], SiamCAR [72], SiamFC [8], UpdateNet [79], TranST [10], and SSTFT [80]) on the IMEC25 data set.
The AUC and DP@20 performances of all trackers on IMEC25 are shown in Table 4, sorted by their AUC values. SPTrack ranked third in AUC, with a value of 0.588, and first in DP@20, with a value of 0.927, while CSK performed the worst. The HOMG and SSTFT AUC scores were 0.746 and 0.619, ranking first and second, respectively. For the DP@20 metric, SPTrack, SSTFT, and TranST produced scores of 0.921, 0.888, and 0.871, respectively, occupying the top three places. Notably, SPTrack achieved the best performance in the DP@20 metric, indicating that the probability of the object center being located correctly is very high.

5.2. Visual Comparison

In this section, we first take the NIR-basketball3, RedNIR-pool11, RedNIR-rainystreet16, VIS-campus, VIS-cards16, and VIS-pool11 scenes in HOT2023 as examples. The visual results are shown in Figure 8, in which SPTrack performed the best among all trackers.
In the NIR-basketball3 video, as shown in Figure 8a, it can be noted that SIPHOT and SSAMB lost track of the object from frame #24 onwards. From frame #332 onwards, SiamBAN, SiamCAR, and MHT lost track of the object due to its rapid motion, while OSTrack and SiamGAT mistakenly tracked incorrect objects due to interference from similar objects. In the RedNIR-pool11 video, as shown in Figure 8b, at frame #84, only SPTrack, RawTrack, and TranST could track the object due to noise interference and changes in illumination. At frame #145, after image restoration, only SPTrack, SiamGAT, and OSTrack were able to correctly track the target.
In the RedNIR-rainystreet16 video, as shown in Figure 8c, at frame #250, all trackers except for SPTrack, OSTrack, and STARK exhibited varying degrees of drift. In the other cases shown in Figure 8, SPTrack could effectively track the target object, which demonstrates its better performance in treating challenging attributes.
In addition, we conducted a visual comparison with the SOTA tracker HOMG, which achieved the highest AUC on IMEC25. We took the airplane, boat2, car1, doublecar3, human3, and human8 scenes in IMEC25 as examples, and the results are shown in Figure 9. In the airplane and boat2 videos, as shown in Figure 9a,b, respectively, SPTrack had a lower AUC than HOMG, but a higher DP@20. The visual results indicate that the bounding boxes predicted by SPTrack are larger than those of HOMG, but their centers are closer to those of the ground truth bounding boxes. A possible reason is the difference between the annotation conventions of IMEC25 and those of our network's pre-training data: the predicted bounding boxes are overly large relative to the ground truth, and small objects are more sensitive to the IoU threshold, resulting in lower AUC performance. However, in frame #138 of the airplane scene, HOMG shows object drift, while SPTrack was always able to track the object, indicating better tracking performance. In the car1, doublecar3, and human3 videos, as shown in Figure 9c–e, respectively, HOMG lost the object due to occlusion in frame #62 of the car1 video, frame #94 of the doublecar3 video, and frame #150 of the human3 video.

5.3. Ablation Analysis

We performed many ablation experiments on HOT2023 to verify the effectiveness of the ED-, SAM-, and SSS-based spectral matching maps, as well as the effectiveness of FCP. We chose OSTrack as the baseline and defined Δ AUC and Δ DP@20 as the percentage improvement relative to the baseline’s AUC and DP@20, respectively.

5.3.1. With and without Spectral Matching Map

To analyze the effect of different spectral similarities on the spectral matching map, we recorded the performance of our approach with different spectral similarities and different prompters. In Table 5, it can be observed that, compared to the baseline without spectral matching maps or a prompter, there were varying degrees of improvement in both AUCs and DP@20 when using the spectral matching maps. Based on these experimental results, the spectral matching map plays a significant role in improving the accuracy of our method.

5.3.2. With and without Feature Complementary Prompter

An ablation study on FCP is also provided in Table 5. Our method introduces the FCP, instead of the modality-complementary prompter (MCP) [60]. The results demonstrate that, regardless of whether it was based on ED, SAM, or SSS, the spectral matching map with the FCP outperformed the MCP in terms of AUC and DP@20. Notably, despite having one order of magnitude fewer trainable parameters than the baseline and slightly more than MCP, SPTrack still outperformed both the baseline and MCP. Furthermore, the results indicate that FCP is more beneficial for improving accuracy than the spectral matching map.
Both the spectral matching map and FCP improved the tracking performance of SPTrack. Compared to the baseline, using the SAM-based spectral matching map and FCP, the proposed SPTrack achieved the most significant improvements in AUC and DP@20, with increases of 7.203 % and 6.858 % , respectively. These experimental results further validate that performing these two operations in HOT can improve the tracking performance while maintaining low computational costs.

6. Discussion

We discuss the performance of the ED-, SAM-, and SSS-based spectral matching maps on the different spectral data types in HOT2023. In Table 6, the best results for each data type are highlighted in red. It can be observed that the SSS-based model performed best on the VIS data, the SAM-based model performed best on the NIR data, and the ED-based model performed best on the RedNIR data. Considering that NIR has 25 bands, VIS has 16 bands, and RedNIR has 15 bands, it can be observed that the performance of the SAM-based model improved with a higher number of bands, whereas the ED-based model performed better with fewer bands. Furthermore, our proposed SSS-based method simultaneously considers the influence of both ED and SAM. In the table, it can be observed that, on data with fewer bands (such as VIS and RedNIR), the performance of the ED- and SSS-based models was similar.
We also conducted experiments on IMEC25. As IMEC25 has 25 bands, we used SAM-based spectral matching maps, which improved the AUC by 0.006 and 0.007 compared to ED- and SSS-based methods, respectively, and the DP@20 was improved by 0.018 and 0.007 , respectively. The results on the two public data sets validate our conclusions.

7. Conclusions

In this study, a spectral similarity prompt learning method was proposed to address the insufficient generalizability of networks across different spectral bands in the context of HOT prompt learning. The core idea of SPTrack is to train a network with cross-spectral generalization using the smallest possible number of parameters. Through learning a pre-defined prompt template, we substantially enhanced the network's capacity for generalization and adaptability. Furthermore, the effectiveness of the proposed attention-based prompter in extracting additional features related to the spatial and spectral dimensions of the object was demonstrated. Experiments on two public data sets confirmed the superior performance of the proposed SPTrack compared to state-of-the-art models. However, the tracking performance remained unsatisfactory in scenarios characterized by lighting changes and object disappearance, due to variations in the object's spectral information and the absence of spectral data. In the future, we intend to design template strategies that accommodate spectral feature variations, enabling broader applications.

Author Contributions

Conceptualization, G.G. and Z.L. (Zhaoxu Li); methodology, G.G. and Z.L. (Zhaoxu Li); software, G.G.; validation, G.G., Y.W. and X.H.; formal analysis, G.G.; data curation, Y.L.; writing—original draft preparation, G.G.; writing—review and editing, Y.W., M.L., Z.L. (Zhaoxu Li) and Q.L.; funding acquisition, W.A., M.L., Z.L. (Zaiping Lin) and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the China Postdoctoral Science Foundation (GZB20230982, 2023M744321) and the National University of Defense Technology Independent Innovation Science Foundation (22-ZZCX-042).

Data Availability Statement

In this work, the HOT2023 data set was obtained from https://www.hsitracking.com/ (accessed on 12 July 2023), and the IMEC25 data set was obtained from https://github.com/Chenlulu1993/HOMG (accessed on 24 November 2021).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, L.; Yu, F.R.; Wang, Y.; Ning, B.; Tang, T. Big data analytics in intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2018, 20, 383–398. [Google Scholar] [CrossRef]
  2. Nazar, M.; Alam, M.M.; Yafi, E.; Su’ud, M.M. A systematic review of human–computer interaction and explainable artificial intelligence in healthcare with artificial intelligence techniques. IEEE Access 2021, 9, 153316–153348. [Google Scholar] [CrossRef]
  3. Zhou, J.T.; Du, J.; Zhu, H.; Peng, X.; Liu, Y.; Goh, R.S.M. Anomalynet: An anomaly detection network for video surveillance. IEEE Trans. Inf. Forensics Secur. 2019, 14, 2537–2550. [Google Scholar] [CrossRef]
  4. Guo, J.; Kurup, U.; Shah, M. Is it safe to drive? An overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3135–3151. [Google Scholar] [CrossRef]
  5. Fan, H.; Bai, H.; Lin, L.; Yang, F.; Ling, H. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. Int. J. Comput. Vis. 2020, 129, 439–461. [Google Scholar] [CrossRef]
  6. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  7. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  8. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  9. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  10. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 8126–8135. [Google Scholar]
  11. Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep learning for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 3943–3968. [Google Scholar] [CrossRef]
  12. Li, P.; Wang, D.; Wang, L.; Lu, H. Deep visual tracking: Review and experimental comparison. Pattern Recognit. 2018, 76, 323–338. [Google Scholar] [CrossRef]
  13. Kumar, M.; Mondal, S. Recent developments on target tracking problems: A review. Ocean. Eng. 2021, 236, 109558. [Google Scholar] [CrossRef]
  14. Liang, J.; Zhou, J.; Tong, L.; Bai, X.; Wang, B. Material based salient object detection from hyperspectral images. Pattern Recognit. 2018, 76, 476–490. [Google Scholar] [CrossRef]
  15. Gu, Y.; Chanussot, J.; Jia, X.; Benediktsson, J.A. Multiple kernel learning for hyperspectral image classification: A review. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6547–6565. [Google Scholar] [CrossRef]
  16. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
  17. Li, Z.; Wang, Y.; Xiao, C.; Ling, Q.; Lin, Z.; An, W. You only train once: Learning a general anomaly enhancement network with random masks for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5506718. [Google Scholar] [CrossRef]
  18. He, X.; Wu, J.; Ling, Q.; Li, Z.; Lin, Z.; Zhou, S. Anomaly detection for hyperspectral imagery via tensor low-rank approximation with multiple subspace learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5509917. [Google Scholar] [CrossRef]
  19. Li, Z.; An, W.; Guo, G.; Wang, L.; Wang, Y.; Lin, Z. SpecDETR: A Transformer-based Hyperspectral Point Object Detection Network. arXiv 2024, arXiv:2405.10148. [Google Scholar]
  20. Zhang, X.; Zhang, J.; Li, C.; Cheng, C.; Jiao, L.; Zhou, H. Hybrid unmixing based on adaptive region segmentation for hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3861–3875. [Google Scholar] [CrossRef]
  21. Cui, C.; Zhong, Y.; Wang, X.; Zhang, L. Realistic mixing miniature scene hyperspectral unmixing: From benchmark datasets to autonomous unmixing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502515. [Google Scholar] [CrossRef]
  22. Hong, D.; Chanussot, J.; Yokoya, N.; Heiden, U.; Heldens, W.; Zhu, X.X. WU-Net: A weakly-supervised unmixing network for remotely sensed hyperspectral imagery. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 373–376. [Google Scholar]
  23. Li, W.; Wang, J.; Gao, Y.; Zhang, M.; Tao, R.; Zhang, B. Graph-feature-enhanced selective assignment network for hyperspectral and multispectral data classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5526914. [Google Scholar] [CrossRef]
  24. Wang, Y.; Mei, J.; Zhang, L.; Zhang, B.; Zhu, P.; Li, Y.; Li, X. Self-supervised feature learning with CRF embedding for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2628–2642. [Google Scholar] [CrossRef]
  25. Hong, D.; Gao, L.; Wu, X.; Yao, J.; Zhang, B. Revisiting Graph Convolutional Networks with Mini-Batch Sampling for Hyperspectral Image Classification. In Proceedings of the IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Amsterdam, The Netherlands, 24–26 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar]
  26. Chen, Y.; Yuan, Q.; Tang, Y.; Xiao, Y.; He, J.; Zhang, L. SPIRIT: Spectral Awareness Interaction Network with Dynamic Template for Hyperspectral Object Tracking. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5503116. [Google Scholar] [CrossRef]
  27. Li, W.; Hou, Z.; Zhou, J.; Tao, R. SiamBAG: Band Attention Grouping-based Siamese Object Tracking Network for Hyperspectral Videos. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5514712. [Google Scholar] [CrossRef]
  28. Uzair, M.; Mahmood, A.; Mian, A. Hyperspectral face recognition with spatiospectral information fusion and PLS regression. IEEE Trans. Image Process. 2015, 24, 1127–1137. [Google Scholar] [CrossRef] [PubMed]
29. Li, Z.; Ye, X.; Xiong, F.; Lu, J.; Zhou, J.; Qian, Y. Spectral-Spatial-Temporal attention network for hyperspectral tracking. In Proceedings of the IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Amsterdam, The Netherlands, 24–26 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar]
  30. Tang, Y.; Liu, Y.; Ji, L.; Huang, H. Robust hyperspectral object tracking by exploiting background-aware spectral information with band selection network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6013405. [Google Scholar] [CrossRef]
  31. Wang, S.; Qian, K.; Shen, J.; Ma, H.; Chen, P. AD-SiamRPN: Anti-Deformation Object Tracking via an Improved Siamese Region Proposal Network on Hyperspectral Videos. Remote Sens. 2023, 15, 1731. [Google Scholar] [CrossRef]
  32. Liu, Z.; Wang, X.; Zhong, Y.; Shu, M.; Sun, C. SiamHYPER: Learning a hyperspectral object tracker from an RGB-based tracker. IEEE Trans. Image Process. 2022, 31, 7116–7129. [Google Scholar] [CrossRef]
33. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar]
  34. McCloskey, M.; Cohen, N.J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychol. Learn. Motiv. 1989, 24, 109–165. [Google Scholar]
  35. Zhao, C.; Liu, H.; Su, N.; Xu, C.; Yan, Y.; Feng, S. TMTNet: A Transformer-Based Multimodality Information Transfer Network for Hyperspectral Object Tracking. Remote Sens. 2023, 15, 1107. [Google Scholar] [CrossRef]
36. Liu, H.; He, J.; Wang, J.; Su, N.; Zhao, C.; Yan, Y.; Feng, S.; Liu, Z.; Liu, J.; Zhao, Z. Multi-Band Hyperspectral Object Tracking: Leveraging Spectral Information Prompts and Spectral Scale-Aware Representation. In Proceedings of the IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Athens, Greece, 31 October–2 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
37. Xie, S.; Li, J.; Zhao, L.; Hu, W.; Zhang, G.; Wu, J.; Li, X. VP-HOT: Visual Prompt for Hyperspectral Object Tracking. In Proceedings of the IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Athens, Greece, 31 October–2 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  38. Li, Z.; Xiong, F.; Zhou, J.; Lu, J.; Qian, Y. Learning a deep ensemble network with band importance for hyperspectral object tracking. IEEE Trans. Image Process. 2023, 32, 2901–2914. [Google Scholar] [CrossRef]
39. Banerjee, A.; Burlina, P.; Broadwater, J. Hyperspectral video for illumination-invariant tracking. In Proceedings of the IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Grenoble, France, 26–28 August 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1–4. [Google Scholar]
  40. Nguyen, H.V.; Banerjee, A.; Burlina, P.; Broadwater, J.; Chellappa, R. Tracking and identification via object reflectance using a hyperspectral video camera. Mach. Vis. Beyond Visible Spectr. 2011, 1, 201–219. [Google Scholar]
  41. Qian, K.; Zhou, J.; Xiong, F.; Zhou, H.; Du, J. Object tracking in hyperspectral videos with convolutional features and kernelized correlation filter. In Proceedings of the Smart Multimedia: First International Conference, Toulon, France, 24–26 August 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 308–319. [Google Scholar]
  42. Xiong, F.; Zhou, J.; Qian, Y. Material based object tracking in hyperspectral videos. IEEE Trans. Image Process. 2020, 29, 3719–3733. [Google Scholar] [CrossRef] [PubMed]
  43. Tang, Y.; Liu, Y.; Huang, H. Target-aware and spatial-spectral discriminant feature joint correlation filters for hyperspectral video object tracking. Comput. Vis. Image Underst. 2022, 223, 103535. [Google Scholar] [CrossRef]
  44. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  45. Uzkent, B.; Rangnekar, A.; Hoffman, M.J. Tracking in aerial hyperspectral videos using deep kernelized correlation filters. IEEE Trans. Geosci. Remote Sens. 2018, 57, 449–461. [Google Scholar] [CrossRef]
  46. Li, Z.; Xiong, F.; Zhou, J.; Wang, J.; Lu, J.; Qian, Y. BAE-Net: A band attention aware ensemble network for hyperspectral object tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2106–2110. [Google Scholar]
  47. Li, Z.; Xiong, F.; Lu, J.; Zhou, J.; Qian, Y. Material-guided siamese fusion network for hyperspectral object tracking. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 4–10 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2809–2813. [Google Scholar]
  48. Tang, Y.; Huang, H.; Liu, Y.; Li, Y. A Siamese network-based tracking framework for hyperspectral video. Neural Comput. Appl. 2023, 35, 2381–2397. [Google Scholar] [CrossRef]
  49. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  50. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357. [Google Scholar]
  51. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  52. Cui, Y.; Song, T.; Wu, G.; Wang, L. Mixformerv2: Efficient fully transformer tracking. Adv. Neural Inf. Process. Syst. 2024, 36, 58736–58751. [Google Scholar]
  53. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  54. Liu, X.; Ji, K.; Fu, Y.; Tam, W.L.; Du, Z.; Yang, Z.; Tang, J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv 2021, arXiv:2110.07602. [Google Scholar]
  55. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
  56. Bahng, H.; Jahanian, A.; Sankaranarayanan, S.; Isola, P. Exploring visual prompts for adapting large-scale models. arXiv 2022, arXiv:2203.17274. [Google Scholar]
  57. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 709–727. [Google Scholar]
  58. Chen, S.; Ge, C.; Tong, Z.; Wang, J.; Song, Y.; Wang, J.; Luo, P. AdaptFormer: Adapting vision transformers for scalable visual recognition. Adv. Neural Inf. Process. Syst. 2022, 35, 16664–16678. [Google Scholar]
  59. Yang, J.; Li, Z.; Zheng, F.; Leonardis, A.; Song, J. Prompting for multi-modal tracking. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3492–3500. [Google Scholar]
  60. Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9516–9526. [Google Scholar]
  61. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457. [Google Scholar]
  62. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  63. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  64. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
65. Schlamm, A.; Messinger, D. Improved detection and clustering of hyperspectral image data by preprocessing with a Euclidean distance transformation. In Proceedings of the IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Lisbon, Portugal, 6–9 June 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1–4. [Google Scholar]
  66. Kruse, F.A.; Lefkoff, A.; Boardman, J.; Heidebrecht, K.; Shapiro, A.; Barloon, P.; Goetz, A. The spectral image processing system (SIPS)-interactive visualization and analysis of imaging spectrometer data. Remote Sens. Environ. 1993, 44, 145–163. [Google Scholar] [CrossRef]
  67. Sweet, J.N. The spectral similarity scale and its application to the classification of hyperspectral remote sensing data. In Proceedings of the IEEE Workshop on Advances in Techniques for Analysis of Remotely Sensed Data, Greenbelt, MD, USA, 27–28 October 2003; IEEE: Piscataway, NJ, USA, 2003; pp. 92–99. [Google Scholar]
  68. Chen, L.; Zhao, Y.; Yao, J.; Chen, J.; Li, N.; Chan, J.C.W.; Kong, S.G. Object tracking in hyperspectral-oriented video with fast spatial-spectral features. Remote Sens. 2021, 13, 1922. [Google Scholar] [CrossRef]
  69. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
70. Li, Z.; Guo, G.; He, X.; Xu, Q.; Wang, W.; Ling, Q.; Lin, Z.; An, W. RawTrack: Toward Single Object Tracking on Mosaic Hyperspectral Raw Data. In Proceedings of the IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Athens, Greece, 31 October–2 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  71. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6668–6677. [Google Scholar]
  72. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6269–6277. [Google Scholar]
  73. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 9543–9552. [Google Scholar]
  74. Lukezic, A.; Vojir, T.; Cehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6309–6318. [Google Scholar]
  75. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 702–715. [Google Scholar]
  76. Chen, L.; Zhao, Y.; Chan, J.C.W.; Kong, S.G. Histograms of oriented mosaic gradients for snapshot spectral image description. ISPRS J. Photogramm. Remote Sens. 2022, 183, 79–93. [Google Scholar] [CrossRef]
  77. Kiani Galoogahi, H.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1135–1143. [Google Scholar]
  78. Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4844–4853. [Google Scholar]
  79. Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.V.D.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019. [Google Scholar]
  80. Wang, Y.; Liu, Y.; Ma, M.; Mei, S. A spectral–spatial transformer fusion method for hyperspectral video tracking. Remote Sens. 2023, 15, 1735. [Google Scholar] [CrossRef]
Figure 1. An example of spectral curves for different objects in an HSI: (a) an RGB image with different objects; (b) spectral curves of two pixels from different objects over the wavelength range of 680–960 nm.
Figure 2. Overall network architecture. The three-channel patches and the spectral matching map patches are first fed into the linear projection. Learnable 1D position embeddings are then added independently to the patch embeddings to produce the final token embeddings.
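To make the tokenization step in Figure 2 concrete, the following is a minimal PyTorch-style sketch of ViT-like patchification with independent learnable 1D position embeddings for the two input streams. The module name, input resolution, patch size, and embedding dimension are illustrative assumptions, not the exact SPTrack implementation.

```python
import torch
import torch.nn as nn

class DualStreamTokenizer(nn.Module):
    """Illustrative tokenizer: patchify the three-channel image and the
    single-channel spectral matching map, then add separate learnable
    1D position embeddings to each stream (all sizes are assumptions)."""

    def __init__(self, img_size=256, patch_size=16, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Linear projection realized as a strided convolution, as in ViT.
        self.proj_rgb = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.proj_smm = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Independent learnable 1D position embeddings for the two streams.
        self.pos_rgb = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.pos_smm = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, rgb, smm):
        tok_rgb = self.proj_rgb(rgb).flatten(2).transpose(1, 2) + self.pos_rgb
        tok_smm = self.proj_smm(smm).flatten(2).transpose(1, 2) + self.pos_smm
        return tok_rgb, tok_smm

# Example: a 256 x 256 search region and its spectral matching map.
tokens_rgb, tokens_smm = DualStreamTokenizer()(
    torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256)
)
```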
Figure 3. Structure of the standard Transformer block.
Figure 4. The structure of the FCP. The three-channel input flow and the spectral matching map input flow are fed into the FCP to learn the fused features.
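As a rough illustration of the FCP shown in Figure 4, the sketch below reweights the concatenated features of the two flows with a channel-attention branch and a position (spatial) attention branch before a 1 × 1 fusion convolution. The layer choices, reduction ratio, and kernel sizes are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FeatureComplementaryPrompter(nn.Module):
    """Sketch of an FCP-style block: channel attention and position attention
    applied to the concatenated three-channel and spectral-matching-map
    features, followed by a 1x1 fusion (hypothetical sizes)."""

    def __init__(self, dim=768, reduction=16):
        super().__init__()
        self.channel_att = nn.Sequential(           # squeeze-and-excite style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * dim, 2 * dim // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * dim // reduction, 2 * dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.position_att = nn.Sequential(          # single-channel spatial mask
            nn.Conv2d(2 * dim, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, feat_rgb, feat_smm):
        x = torch.cat([feat_rgb, feat_smm], dim=1)  # (B, 2C, H, W)
        x = x * self.channel_att(x)                 # reweight channels
        x = x * self.position_att(x)                # reweight spatial positions
        return self.fuse(x)                         # blended prompt, (B, C, H, W)
```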
Figure 5. Spectral matching maps: (a–c) the NIR-car31 video sequence, where the first image is a three-channel image with the initial bounding box and the 105th frame is the final frame of the sequence; (d–f) the VIS-rider4 video sequence, where the first image is a three-channel image with the initial bounding box and the 378th frame is the final frame of the sequence. The red boxes indicate the targets.
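The single-channel hotmaps in Figure 5 are obtained by comparing each pixel spectrum of the hyperspectral cube with a reference target spectrum. The sketch below shows how such a map could be computed with the three similarity measures ablated in Tables 5 and 6, namely ED, SAM, and SSS (cf. refs. [65,66,67]); the min-max normalization and the particular SSS formulation used here are assumptions.

```python
import numpy as np

def spectral_matching_map(cube, ref, measure="SAM"):
    """Turn an (H, W, B) hyperspectral cube into a single-channel hotmap by
    measuring how close each pixel spectrum is to a reference spectrum `ref`
    of shape (B,). Higher output values mean higher spectral similarity."""
    H, W, B = cube.shape
    px = cube.reshape(-1, B).astype(np.float64)
    ref = ref.astype(np.float64)
    if measure == "ED":            # Euclidean distance between spectra
        d = np.linalg.norm(px - ref, axis=1)
    elif measure == "SAM":         # spectral angle mapper
        cos = (px @ ref) / (np.linalg.norm(px, axis=1) * np.linalg.norm(ref) + 1e-12)
        d = np.arccos(np.clip(cos, -1.0, 1.0))
    elif measure == "SSS":         # one common spectral similarity scale variant
        de = np.linalg.norm(px - ref, axis=1) / np.sqrt(B)
        r = np.array([np.corrcoef(p, ref)[0, 1] for p in px])
        d = np.sqrt(de ** 2 + (1.0 - r) ** 2)
    else:
        raise ValueError(f"unknown measure: {measure}")
    d = d.reshape(H, W)
    # Min-max normalize and invert so that target-like pixels appear bright.
    return 1.0 - (d - d.min()) / (d.max() - d.min() + 1e-12)
```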
Figure 6. Success and precision plots for trackers on the total HOT2023 validation set: (a) success plot on the three types of data, and (b) precision plot on the three types of data.
Figure 7. Success and precision plots for trackers on three types of HOT2023 validation sets: (a) success plot on the VIS data; (b) precision plot on the VIS data; (c) success plot on the NIR data; (d) precision plot on the NIR data; (e) success plot on the RedNIR data; and (f) precision plot on the RedNIR data.
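The success and precision plots in Figures 6 and 7 follow the one-pass evaluation protocol of the OTB benchmark [69]. The sketch below computes the two summary scores used throughout the experiments, the success AUC over IoU thresholds and DP@20 (the fraction of frames with a center error of at most 20 pixels), from predicted and ground-truth boxes; it is a generic implementation of the protocol, not the authors' evaluation code.

```python
import numpy as np

def iou(b1, b2):
    """IoU of axis-aligned boxes given as (x, y, w, h) arrays of shape (N, 4)."""
    x1 = np.maximum(b1[:, 0], b2[:, 0])
    y1 = np.maximum(b1[:, 1], b2[:, 1])
    x2 = np.minimum(b1[:, 0] + b1[:, 2], b2[:, 0] + b2[:, 2])
    y2 = np.minimum(b1[:, 1] + b1[:, 3], b2[:, 1] + b2[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = b1[:, 2] * b1[:, 3] + b2[:, 2] * b2[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def auc_and_dp20(pred, gt):
    """Success AUC (mean success rate over IoU thresholds in [0, 1]) and DP@20
    (fraction of frames whose center location error is within 20 pixels)."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    auc = np.mean([np.mean(overlaps > t) for t in thresholds])
    c_pred = pred[:, :2] + pred[:, 2:] / 2
    c_gt = gt[:, :2] + gt[:, 2:] / 2
    dist = np.linalg.norm(c_pred - c_gt, axis=1)
    dp20 = np.mean(dist <= 20)
    return auc, dp20
```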
Figure 8. Visual comparisons of SPTrack with 10 algorithms on HOT2023. The results are visualized as three-channel images.
Figure 9. Visual comparisons of SPTrack with the SOTA tracker HOMG on IMEC25. The results are visualized as three-channel images.
Table 1. Total numbers of the three sequence types in HOT2023.
                  VIS    NIR    RedNIR    Total
Training set       55     40        15      110
Validation set     46     30        11       87
Table 2. Overall performance of the compared algorithms on the three video types of HOT2023.
Methods   MHT     SiamBAN   SiamCAR   STARK   SiamGAT   TranST   OSTrack   SSAM    BSIPHOT   RawTrack   SPTrack (Ours)
AUC       0.453   0.532     0.554     0.557   0.561     0.569    0.597     0.623   0.635     0.638      0.640
DP@20     0.717   0.756     0.762     0.768   0.770     0.777    0.802     0.837   0.844     0.840      0.857
The top three results are marked in red, green, and blue, respectively.
Table 3. Overall performance of the compared algorithms on each type of video in HOT2023.
Methods          MHT     SiamBAN   SiamCAR   STARK   SiamGAT   TranST   OSTrack   SSAM    BSIPHOT   RawTrack   SPTrack (Ours)
VIS     AUC      0.507   0.534     0.565     0.561   0.552     0.576    0.601     0.612   0.625     0.622      0.616
        DP@20    0.760   0.763     0.787     0.783   0.760     0.808    0.818     0.829   0.842     0.834      0.858
NIR     AUC      0.425   0.567     0.600     0.594   0.620     0.617    0.646     0.698   0.709     0.723      0.726
        DP@20    0.756   0.808     0.834     0.805   0.858     0.816    0.852     0.932   0.928     0.938      0.929
RedNIR  AUC      0.317   0.441     0.387     0.439   0.445     0.389    0.451     0.461   0.471     0.482      0.509
        DP@20    0.427   0.573     0.504     0.550   0.566     0.522    0.599     0.611   0.623     0.598      0.657
The top three results are marked in red, green, and blue, respectively.
Table 4. Overall performance of different algorithms on IMEC25.
Methods   CSK     CSRDCF   BACF    MCCT    UpdateNet   TranST   SiamCAR   SiamFC   SSTFT   HOMG    SPTrack (Ours)
AUC       0.222   0.527    0.540   0.552   0.530       0.547    0.481     0.582    0.619   0.746   0.588
DP@20     0.429   0.775    0.797   0.797   0.859       0.871    0.777     0.832    0.888   0.823   0.927
The top three results are marked in red, green, and blue, respectively.
Table 5. The AUC and DP@20 performance under different spectral matching maps and prompters.
SMM    Prompter   AUC     ΔAUC     DP@20   ΔDP@20   FPS    Params
w.o.   w.o.       0.597   n/a      0.802   n/a      27.4   92.83 M
ED     MCP [60]   0.612   2.513%   0.826   2.993%   24.2   0.84 M
ED     FCP        0.638   6.868%   0.846   5.486%   22.8   1.49 M
SAM    MCP [60]   0.610   2.178%   0.820   2.244%   24.2   0.84 M
SAM    FCP        0.640   7.203%   0.857   6.858%   22.8   1.49 M
SSS    MCP [60]   0.614   2.848%   0.817   1.870%   24.2   0.84 M
SSS    FCP        0.630   5.528%   0.834   3.990%   22.8   1.49 M
w.o. denotes "without"; SMM denotes the spectral matching map; Params denotes the number of trainable parameters.
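The parameter counts in Table 5 reflect prompt tuning with a frozen pre-trained backbone, so only the prompter weights are updated. A generic sketch of how the trainable-parameter figure can be obtained is given below; the parameter-name keyword is a hypothetical convention, not the paper's actual naming.

```python
import torch.nn as nn

def freeze_backbone_and_count(model: nn.Module, prompt_keyword: str = "prompter"):
    """Freeze every parameter whose name does not contain `prompt_keyword`
    (the keyword is an assumption), then report trainable vs. total counts."""
    for name, param in model.named_parameters():
        param.requires_grad = prompt_keyword in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Example usage: trainable, total = freeze_backbone_and_count(tracker_model)
# With only the prompter trainable, `trainable` is on the order of 1 M,
# versus tens of millions for full fine-tuning, as reflected in Table 5.
```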
Table 6. The AUC and DP@20 performance across three spectral bands under three different spectral matching maps on HOT2023.
Methods          ED      SAM     SSS
VIS     AUC      0.615   0.616   0.618
        DP@20    0.845   0.858   0.844
NIR     AUC      0.710   0.726   0.684
        DP@20    0.898   0.929   0.867
RedNIR  AUC      0.543   0.509   0.535
        DP@20    0.707   0.657   0.699
The best AUC result has been marked in red.