Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue

Li, Shuman; Lin, Zhipeng; Wang, Haotian; Yang, Wenjing; Liu, Hengzhu

doi:10.3390/rs16193684

Open AccessArticle

Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue

by

Shuman Li

,

Zhipeng Lin

,

Haotian Wang

^*,

Wenjing Yang

and

Hengzhu Liu

Department of Intelligent Data Science, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(19), 3684; https://doi.org/10.3390/rs16193684

Submission received: 13 August 2024 / Revised: 28 September 2024 / Accepted: 1 October 2024 / Published: 2 October 2024

Download

Browse Figures

Versions Notes

Abstract

The rapid development of remote sensing technology has provided new sources of data for marine rescue and has made it possible to find and track survivors. Due to the requirement of tracking multiple survivors at the same time, multi-object tracking (MOT) has become the key subtask of marine rescue. However, there exists a significant gap between fine-grained objects in realistic marine rescue remote sensing data and the fine-grained object tracking capability of existing MOT technologies, which mainly focuses on coarse-grained object scenarios and fails to track fine-grained instances. Such a gap limits the practical application of MOT in realistic marine rescue remote sensing data, especially when rescue forces are limited. Given the promising fine-grained classification performance of recent text-guided methods, we delve into leveraging labels and attributes to narrow the gap between MOT and fine-grained maritime rescue. We propose a text-guided multi-class multi-object tracking (TG-MCMOT) method. To handle the problem raised by fine-grained classes, we design a multi-modal encoder by aligning external textual information with visual inputs. We use decoding information at different levels, simultaneously predicting the category, location, and identity embedding features of objects. Meanwhile, to improve the performance of small object detection, we also develop a data augmentation pipeline to generate pseudo-near-infrared images based on RGB images. Extensive experiments demonstrate that our TG-MCMOT not only performs well on typical metrics in the maritime rescue task (SeaDronesSee dataset), but it also effectively tracks open-set categories on the BURST dataset. Specifically, on the SeaDronesSee dataset, the Higher Order Tracking Accuracy (HOTA) reached a score of 58.8, and on the BURST test dataset, the HOTA score for the unknown class improved by 16.07 points.

Keywords:

multi-object tracking; text-guided; multi-class multi-object tracking; maritime search and rescue based on UAV remote sensing; fine-grained object tracking

1. Introduction

In recent years, marine rescue, an emerging task in remote sensing tracking areas, has attracted increasing attention due to the importance of preserving the lives of human beings and the development of remote sensing techniques for data collection. To be specific, marine rescue aims to provide fine-grained tracking of people and boats over a large area, which can be formulated as a multi-object tracking (MOT) problem. However, different from land rescue, maritime rescue exhibits unique challenges, including the unique environment, far distances, limited resources, and high physical demands [1]. Specifically, maritime search and rescue missions face challenges, such as dealing with a variety of object categories, each with different sizes, shapes, and motion characteristics [2]; the rapid movement of drones carrying cameras causing motion blur of objects [3]; rapid changes in maritime lighting conditions [4]; and the varying size of objects due to changes in distance. Consequently, developing MOT techniques in the field of marine rescue requires new paradigms.

To be specific, the challenges of adapting existing MOT methods to marine rescue tasks fall into three folds: (1) The size and viewpoint of objects to be detected. Compared with the current typical multi-object tracking (MOT) methods, the objects to be tracked in a maritime rescue scene are small in size and are predominantly captured from a downward angle [5,6]. (2) Open-set tracking due to the limit of pretrained data. Typical MOT methods [7,8,9,10] mainly rely on object detection models that are typically pretrained on the COCO dataset [11] (the most common object detection models available today can detect a maximum of categories in the COCO dataset). However, in reality, including in the task of marine rescue, many unseen categories exist, which are not included in the COCO dataset (as shown in Figure 1). (3) Fine-grained tracking. In marine rescue, much more detailed tracking and identification are required. For example, the person category is usually considered an integral class in traditional MOT tasks, while the marine rescue task will consider it a combination of two subclasses, i.e., swimmer and swimmer with life jacket.

To address the above challenges, researchers have recently attempted to bridge MOT models with the maritime rescue task. One of the most impactful activities is the challenge organized by the Workshop on Maritime Computer Vision (MaCVi) [13,14]. To support better adaptation in the marine rescue missions, Zhang et al. [15] jointly trained a detection model using the SeaDronesSee dataset (an unmanned aerial vehicle remote sensing dataset specifically designed for maritime search and rescue) [5] and the MOBDrone dataset (a dataset specifically designed to detect people falling into the water) [16] and integrated this model with the MOT model named ByteTrack [8] for multi-object tracking in marine rescue.

Unfortunately, although previous MOT-based marine rescue (MOTMR) methods have made remarkable progress, two major gaps still exist between these techniques and a realistic marine rescue scene:

During the SeaDronesSee dataset testing process, the evaluator does not even categorize objects, treating them all as a single class. By contrast, in the task of realistic maritime rescue, prioritizing the rescue of people in the water and those without life jackets is crucial [5] due to the limitations of rescue resources and forces (for instance, in Figure 2, the objects in the figure are classified as “person”, but due to their consistent tracking trajectories, they will be considered correct trajectories during testing). The above-mentioned gap between the competition evaluation and the realistic scenario leads to the fact that all existing MOTMR methods do not consider the different states of the same category, such as swimmer and swimmer with life jacket, nor do they distinguish between people on board and people in the water.
All existing MOTMR methods [13,14,15] still rely on the COCO dataset to pretrain the detection branch such that the open-set problem caused by the data shift is not migrated.

To overcome the above-mentioned challenges, this paper develops a novel Text-Guided Multi-Class Multi-Object Tracking model, named TG-MCMOT, for trustworthy maritime rescue and open-set detection and tracking. To solve the main challenge, i.e., the fine-grained tracking challenge, we consider a multi-modal setup by introducing external textual information to guide the visual tracker. We note that, unlike the traditional MOT scene, external textual information is usually available in the case of rescue, e.g., some descriptions characterizing people needing to be rescued. Meanwhile, to better support rescue as an instance-level task, we separate image features from semantic-level text features to extract instance-level features, which differ from the paradigm of traditional detection models [17]. In addition, to overcome the object size challenge in marine rescue, we develop a data preprocessing method to explore the role of multi-spectral data in maritime multi-object tracking. Consequently, our proposed model can track small objects in the vast ocean background under different lighting conditions. With these special designs, our proposed TG-MCMOT will help improve rescue efficiency and increase survivor detection rates.

Overall, our contributions are as follows:

We design a text-guided multi-class multi-object tracking model (TG-MCMOT) that realizes a multi-class multi-object tracking task by introducing the text information of categories—decoding information at different levels—and simultaneously predicting the category, location, and identity embedding features of objects.
In maritime rescue missions, we design a data preprocessing method that introduces pseudo-near-infrared (PNIR) information into RGB images. This technique is used to filter out the background and lead to improved detection and tracking performance of small objects captured by drones, without the need for specialized near-infrared cameras.
Our TG-MCMOT demonstrates superior performance on the SeaDronesSee dataset compared to traditional multi-object tracking methods. It can effectively distinguish between fine-grained categories such as swimmer and swimmer with life jacket, as well as deal with transitions between these categories.
Our TG-MCMOT shows its effectiveness on the BURST dataset, where it can track not only common but also uncommon categories. Notably, the model achieves state-of-the-art performance in tracking uncommon categories.

The rest of this paper is organized as follows: Section 2 introduces related works. Section 3 presents our proposed multi-object tracking method for maritime rescue. Section 4 provides experiments and results. Section 5 provides a detailed analysis of the experimental results. Section 6 is the conclusion.

2. Related Works

2.1. Typical Multi-Object Tracking

Current multi-object tracking (MOT) methods are mostly based on MOTChallenge datasets, such as MOT15, MOT16, MOT17, and MOT20 [18,19,20]. These datasets focus on crowded pedestrians. These MOT methods are typically divided into two categories: tracking-by-detection (TBD) methods and joint detection and tracking (JDT) methods. TBD methods usually employ advanced detection algorithms (such as Fast R-CNN [21], SSD [22], and the YOLO family [23,24,25,26,27,28]) to detect objects in each frame, and they then utilize a feature extraction network to extract relevant features [29,30,31] from these objects. Subsequently, correlation algorithms (such as the Hungarian algorithm [32], network flow [33], and graph learning [34,35,36]) are used to connect object detections across all frames, generating tracking trajectories. Classic TBD methods include the SORT series [7,37,38], ByteTrack [8], etc. JDT methods simultaneously generate bounding boxes for objects and extract features for association. Common classic methods for this purpose include Tracktor++ [9] and FairMOT [10]. There are also some datasets that involve pedestrian and vehicle tracking, such as KITTI [39] and BDD100K [40]. However, these datasets and methods focus on objects from a horizontal perspective, and pedestrians and vehicles are too small relative to the entire image.

2.2. Multi-Object Tracking in Aerial Imagery

In recent years, some researchers have also focused on multi-object tracking (MOT) in aerial remote sensing images. Du et al. [41] proposed the UAVDT dataset, which is a drone-captured dataset, and Zhu et al. [42] proposed a large drone-captured dataset called VisDrone. These datasets can be used for object detection, single-object tracking (SOT), and multi-object tracking (MOT). For these two datasets, Liu et al. [43] and Mazzeo et al. [44] utilized specific approaches to construct a multi-object tracking model for drone views separately. Liu et al. [43] used a feature update module and a motion filter module, while Mazzeo et al. [44] combined YOLOv8 and strongSORT [38]. In addition to the above datasets, there are several other aerial image datasets available, such as AerialMPT [45] and DLR-ACD [46]. Azimi et al. [47] proposed a multi-object tracking method called AerialMPTNet, which achieved good results on these datasets. However, it is worth noting that these datasets are generally captured on land, and methods based on these datasets may not completely consider the unique environment and objects in the ocean. In this context, certain objects may be partially submerged below the surface of the water, and due to strong light reflection, this may cause interference with the drone’s imaging and thus affect the accuracy of object detection and tracking. Consequently, in view of the special circumstances of the marine environment, further research and development of ocean-appropriate multi-object tracking methods are required.

2.3. Multi-Object Tracking in Maritime Rescue

In response to the special environment of the ocean and special objects such as swimmer and swimmer with life jacket in maritime rescue scenarios, Varga et al. proposed a large open-water people dataset called SeaDroneSee [5]. This dataset is an unmanned aerial vehicle (UAV) remote sensing dataset that bridges the gap between onshore and offshore UAV-captured images and offers a number of valuable applications, including object detection [48,49], single-object tracking (SOT), and multi-object tracking (MOT). In order to encourage researchers to apply computer vision to the maritime field and stimulate their interest in the SeaDroneSee dataset, Kiefer et al. organized the Workshop on Maritime Computer Vision (MaCVi), which has been held twice so far [13,14].

Many researchers have participated in the MaCVi Challenge and contributed MOT methods to maritime rescue. Song et al. [13] proposed MoveSORT, which uses YOLOv7 [27] as the detector and implements tracking using DeepSORT-related [37] technologies. They also introduced an enhanced correlation coefficient maximization module to improve DeepSORT. Li et al. [13] applied the well-known MOT model named ByteTrack [8] to the SeaDroneSee dataset and adjusted the threshold in ByteTrack to suit oceanic scenarios. Somers et al. [13] improved StrongSORT [38] by using spatio-temporal information to associate objects across different frames. Yang et al. [14,50] proposed Metadata-Guided MOT (MG-MOT), which is motion-based and makes full use of key drone metadata to effectively combine short-term tracking with long-term tracking. Stadler et al. [14] used the classic tracking-by-detection architecture, with VarifocalNet [51] used as the detector and ByteTrack [8] used as the association method, while also considering camera motion compensation. Huang et al. [14] achieved multi-object tracking using only the CBNetV2 Swin-B [52] high-performance detection model and a self-supervised appearance model, omitting traditional motion information.

The above methods treat all objects in water as one category, without adequately distinguishing between the four categories in the SeaDroneSee-MOT dataset: swimmer, swimmer with life jacket, boat, and life jacket. Moreover, the SeaDroneSee dataset also provides multi-spectral images. To our knowledge, no researchers have yet used SeaDroneSee multi-spectral images to assist MOT.

2.4. Multi-Class Multi-Object Tracking

Dave et al. [12] proposed a dataset called TAO (Tracking Any Object), which annotates 833 categories and provides a large-scale dataset for Multi-Class Multi-Object Tracking (MCMOT). Based on the TAO dataset, Athar et al. [53] proposed a Benchmark for Unifying object Recognition, Segmentation, and Tracking in video (BURST). The labeling process of the BURST dataset is similar to that of the TAO dataset, which does not provide labels for each frame in the video. In addition, the BURST dataset streamlined and merged 833 categories reported in the TAO dataset, ultimately retaining 482 categories. Among these categories, 78 categories from the COCO dataset are called “common”, and the remaining 404 categories are called “long tail” or “uncommon”. Athar et al. [53] also proposed a method called STCN Tracker, which uses the video object segmentation method named STCN to pass on the object mask over time. Wu et al. [54] proposed a general object visual foundation model called GLEE, which is used for object detection, instance segmentation, and object tracking.

3. Method

In this section, we will first provide an overview of the Text-Guided Multi-Class Multi-Object Tracking (TG-MCMOT) model and then introduce a data preprocessing method specifically designed for maritime rescue scenarios and a variant of TG-MCMOT for maritime rescue.

3.1. TG-MCMOT

To efficiently cater to the urgent matters in maritime rescue, we carefully considered the specific categories in the UAV remote sensing SeaDronesSee-MOT dataset: swimmer, swimmer with life jacket, boat, and life jacket. As far as we know, no other dataset has such detailed classifications as SeaDronesSee, making it extremely challenging to train a new tracker from scratch. Inspired by prompt learning [55], we can use specific prompts such as category names to apply a pretrained model to the SeaDronesSee dataset and other multi-class multi-object tracking datasets such as BURST [53], without making significant parameter adjustments or requiring large amounts of data. To use text prompts to track objects with fine classifications, we improve the well-known text–image object detector Grounding DINO [17]. However, text is semantic in nature, and image features fused with text features mainly contain classification features. In addition, tracking also requires instance features for each object. Therefore, on this basis, we developed the TG-MCMOT (Text-Guided Multi-Class Multi-Object Tracker).

3.1.1. Overview

The TG-MCMOT follows the encoder–decoder architecture. As shown in Figure 3, text information of the category names is first processed through a text feature extractor to generate text information. Meanwhile, the input image is transformed into image information by an image feature extractor, which is then fused with the text information through a feature encoder. Unlike Grounding DINO, the feature decoder of TG-MCMOT is capable of decoding not only object class and bounding box (or mask prompt) but also instance-level identity embedding features.

The text feature extractor uses a pure language model BERT [56], while the image feature extractor uses Swin Transformer [57]. The encoder and decoder will be described in detail below.

3.1.2. Encoder

We follow the well-known pre-trained language model architecture, Bidirectional Encoder Representations from Transformers (BERT) [56] for text encoding. In the multi-class multi-object tracking task, the objects that need to be tracked in each video may not belong to the same category, but it may involve multiple categories. In the proposed TG-MCMOT method, the categories of interest that need to be tracked in each video are specified by the text. The BERT architecture can encode multiple sentences simultaneously and separate different sentences within the same text sequence, allowing downstream models to know which sentence each token belongs to. Therefore, in the proposed TG-MCMOT method, the names of the multiple object categories of interest in the video are considered individual “sentence”, i.e., connected with the English full stop “.”, for example, “swimmer.swimmer with life jacket.boat.life jacket.”. The details of the text encoding are shown in Figure 4. BERT introduces a text self-attention mask matrix, as shown in Figure 4, to delineate the boundaries of each sentence. Therefore, we focus only on the relative positions of words within each sentence, rather than the order of the sentences themselves.

The primary function of TG-MCMOT encoder is to fuse category text features and image visual features using a cross-attention mechanism. Its structure is similar to the encoder of the object detection model Grounding DINO [17], but the specific fusion steps are different. Grounding DINO first performs self-attention to text and image information separately, and it then fuses the information together, whereas our proposed TG-MCMOT first fuses text and image information and then applies self-attention to fused features, as shown in Figure 5. Self-attention introduces positional information from the image and contextual information from the text into fused features.

The text contextual information includes the text position encoding and the text self-attention mask matrix. The text position encoding is a sinusoidal encoding that marks the positions of words within each sentence (category name) in a text sequence, indicating the location of each word in the sentence. The text self-attention mask matrix is used to decouple different category names within the same text sequence, preventing dependencies between different categories. The image position information is generated simultaneously with the extraction of image features by the image feature extractor, indicating the position information of image features at different scales. This includes position encodings for pixels at various scales, starting pixel index markers, shapes, and valid position mask markers. When introducing position information, image features utilize the multi-scale deformable attention module proposed in Deformable DETR [58], which learns cross-scale features by sampling a set of “reference points”.

In the process of fusing image and text features, the feature fusion stage occurs before the introduction of positional information. However, both decoupled categorical text information and cross-scale image positional information are important for cross-modal information fusion. Moreover, as shown in Figure 5, a single-feature fusion encoding and the introduction of positional information are not sufficient to extract the various levels of image features required for multi-object tracking tasks (such as depth features for detection tasks, abstract semantic-level features, and appearance, texture, and other instance-level features for tracking). Therefore, drawing on the experience of Grounding DINO, the encoding process shown in Figure 5 was repeated six times in the TC-MCMOT encoder to facilitate cross-fusion of different category texts and different levels of image features. During the feature encoding process, the input text features and image features, apart from being the output of the text feature extractor and the image feature extractor for the first time, come from the output of the previous process at all other times.

3.1.3. Decoder

Multi-Object Tracking (MOT) involves two core subtasks: object detection and identity recognition. Multi-Class Multi-Object Tracking (MCMOT) adds complexity to these tasks, where the detection task must also deal with unknown categories. Therefore, TG-MCMOT follows Grounding DINO, using object queries for classification and localization to effectively address the detection task. In addition, TG-MCMOT introduces tracking queries to generate object identity embedding features.

The queries are considered anchors, and each pixel point at each scale of the image is the potential center of an anchor. To select those anchors that are more likely to contain objects, a contrast embedding module is used to calculate the attention scores of the image features output by the encoder to the text encoding that has not been fused with the image, and then the positions of the pixels with the highest attention to the text are recorded. The image feature vectors at these positions are the classification part of the object query. The positioning part of the object query is the sinusoidal encoding of the initial reference box at these positions. The initial reference box is the sum of the initial proposal box and the initial offset. The initial proposal box is centered on each pixel point

(x, y)

, and the width and height of the proposal box are defined in four sizes according to four different scales of image features. The initial offset of the proposal box is calculated using some linear layers based on the image features. In addition to appearance features, object category features and location features are also part of object identity features. Therefore, the tracking query will be initialized as the object query first, but the learning and optimization in the later learning process will be different from that of the object query.

As shown in Figure 6, the image features output by the encoder and the original text features pass through a contrast embedding module to select the image content for decoding at 900 positions. This content, which serves as classification and localization information, is added together to form the object query for the detection decoding module.

The detection decoding module, depicted as the light-blue part in Figure 6, includes a multi-head self-attention module, a text cross-attention module, and a multi-scale deformable attention module. The outputs of these modules are all used to update the image content information, and the inputs of these modules are shown in Table 1.

Half the TG-MCMOT decoder follows the decoder of Grounding DINO [17]. To accomplish the task of object identity recognition, TG-MCMOT also adds a tracking decoding module in the decoder, as shown in the orange part of Figure 6. Since text information is semantic-level categorical information, interacting with text information during the decoding process may result in the loss of instance-level features required for tracking. Therefore, the TG-MCMOT tracking decoding module lacks a text cross-attention module compared to the detection decoding module.

Since TG-MCMOT has six encoding layers in its encoder module, six decoding layers were also designed in the decoder module to decode features at different levels, with each following the decoding process shown in Figure 6. Content information, tracking information, initial reference boxes, and updated reference boxes from each decoding layer were recorded.

According to the decoder output, the information of the 900 anchors can be predicted as follows:

Object Category: The content information output by the last layer of the decoder, after passing through a contrast embedding module, is guided by the original text features to obtain attention scores for the position of each anchor’s text category token in the text sequence. That is, the decoder does not predict the specific category, but the position of this category in the text sequence encoding, and the content at this position is the encoding of the text of the category name in the vocabulary.
Object Bounding Box: The reference box input to the last layer of the decoder and the offset output from the last layer are added to obtain the predicted bounding box of the object.
Object Identity Embedding: The tracking information output by each decoder layer is transformed into six different levels of object identity embedding features through a series of linear feature transformations.

3.1.4. Segmentation Module

TG-MCMOT also supports generating masks for objects, enabling the Multi-Class Multi-Object Tracking and Segmentation (MCMOTS) task. The segmentation module of TG-MCMOT utilizes the well-known Segment Anything Model (SAM [59]), and unlike other methods that employ SAM as the segmentation module [60,61,62,63,64], SAM in TG-MCMOT is capable of being co-trained with other modules. Specifically, when extended to the segmentation task, TG-MCMOT uses the object bounding boxes output by the decoder as box prompts for the SAM model to guide the generation of masks for the objects.

TG-MCMOT predicts bounding boxes for 900 queries. To reduce the computational cost of the model, only the bounding boxes corresponding to queries that are more likely to contain objects are selected as prompts for mask generation. During the training process, a matching algorithm is used to select some predicted bounding boxes based on the Ground Truth (GT) labels of the objects. The algorithm (Algorithm 1) will be introduced in the following. In the inference process, the top

K_{top}

predicted detection boxes with the highest category location response scores were selected as prompts for the segmentation module.

The segmentation module plays a crucial role in providing pixel-level details that are more refined than bounding boxes for objects, allowing for a clear distinction of the contours and internal structures of each object in the image. Extending a segmentation branch to our TG-MCMOT model is vital for future applications that require precise understanding of image content, meticulous editing, or advanced analysis in complex scenes.

3.1.5. Training and Inference

For a single frame, the output of TG-MCMOT for object category is the category position prediction response scores for 900 queries, denoted

\hat{S} c - p

to facilitate batch training, where the TG-MCMOT uniformly sets the length of the sequence token encoding to 256 so

\hat{S} c - p

is a vector with size

900 \times 256

. The output of TG-MCMOT for object bounding boxes is the center coordinates and width and height of the predicted bounding boxes for 900 queries, denoted

\hat{B}

, which is a vector with size

900 \times 4

.

We selected queries that were closest to the GT through a matching Algorithm 1. Based on the object categories in the GT labels and the text decoding guide, a true category position response of length

n u m_t o k e n s

(256) was generated for each object. The true category position responses corresponding to the N objects in the current image are denoted as

S_{c - p}

, which is a vector with size

N \times 256

. Object bounding boxes in GT labels are recorded as

B

, which is a vector with size

N \times 4

.

Algorithm 1 The matching algorithm for queries and GT labels

Require: The output of TG-MCMOT:

{\hat{S}}_{c - p}

and

\hat{B}

, GT labels of objects:

S_{c - p}

and

B

Ensure: One-to-one matching between queries and GT labels

1:: $positive_class = μ {(1 - {\hat{S}}_{c - p})}^{γ} log \frac{1}{{\hat{S}}_{c - p}}$
2:: $negative_class = (1 - μ) {({\hat{S}}_{c - p})}^{γ} log \frac{1}{1 - {\hat{S}}_{c - p}}$
3:: ${\cos t}_{Class} = positive_class ⨂ {S_{c - p}}^{⊤} - negative_class ⨂ {S_{c - p}}^{⊤}$
4:: ${\cos t}_{l_{1}} = {∥ \hat{B} - B ∥}_{1}$
5:: ${\cos t}_{GIoU} = L_{giou} (\hat{B}, B)$ , $L_{giou}$ denotes GIoU loss [65]
6:: $\cos t = w_{class} {\cos t}_{Class} + w_{l_{1}} {\cos t}_{l_{1}} + w_{giou} {\cos t}_{GIoU}$
7:: Match 900 queries and N real objects based on cost, and output a one-to-one corresponding “query index”-to-“real object index”

Based on the “query index”-to-“real object index” pairs output by Algorithm 1, we extended N GT category position responses to 900 category position responses: the category position response of the query corresponding to object

i (i \in [1, \dots, N])

was set to the GT category position response of object i, and the category position responses of the other queries were set to 0. The extended GT category position responses are denoted

S_{c - p 900}

, with a size of

900 \times 256

. Since the token encoding length

n u m_t o k e n s

of a text sequence is uniformly set to 256, for short sequences, the latter part of the encoding is meaningless. Therefore, a text information mask is needed to truncate this part of the token encoding. For simplicity, the truncation encoding step is not reflected in the following classification loss calculation. The classification loss of TG-MCMOT is defined as follows:

L_{cls} = \frac{1}{N} {(1 - p_{same})}^{γ} p_{μ} L_{bce} ({\hat{S}}_{c - p}, S_{c - p 900}),

(1)

where

L_{bce}

denotes Binary Cross-Entropy (BCE) loss.

p_{same}

is the probability that the predicted value exactly matches the GT label:

p_{same} = {\hat{S}}_{c - p} S_{c - p 900} + (1 - {\hat{S}}_{c - p}) (1 - S_{c - p 900}),

(2)

p_{μ}

is the probability weight when the predicted value responds to

μ

:

p_{μ} = μ S_{c - p 900} + (1 - μ) (1 - S_{c - p 900}) .

(3)

We only calculate the localization loss for those queries that match the objects in the GT labels according to Algorithm 1. For object i, its GT bounding box label is denoted

b_{i}

, and the query’s bounding box that matches it is recorded as

{\hat{b}}_{i}

. The bounding box prediction loss of TG-MCMOT is defined as follows:

L_{bbox} = w_{l_{1}}^{'} L_{l_{1}} + w_{giou}^{'} L_{GIoU},

(4)

where bounding box loss is the weighted sum of

l_{1}

loss (

L_{l_{1}}

) and GIoU loss (

L_{GIoU}

), with weights of

w_{l_{1}}^{'}

and

w_{giou}^{'}

, respectively.

l_{1}

loss is defined as follows:

L_{l_{1}} = \frac{1}{N} \sum_{i = 1}^{N} | | {\hat{b}}_{i} - b_{i} {| |}_{1} .

(5)

GIoU loss is defined as follows:

L_{GIoU} = \frac{1}{N} \sum_{i = 1}^{N} L_{giou} ({\hat{b}}_{i}, b_{i}) .

(6)

In MCMOT tasks, different categories can have significant differences in appearance and other features, so object identity embedding features are primarily used to distinguish between different instances of the same category. Within the same frame, the same category may correspond to multiple different objects, and identity embedding needs to differentiate these objects. Therefore, the optimization goal of the TG-MCMOT identity embedding module is to make identities within the same frame as distinct as possible. The features obtained from multi-layer encoding and decoding are more for distinguishing semantic features of different categories. Therefore, we retained six layers of object identity embedding features during the decoding process. We selected the object identity embedding features output from the fourth decoder layer. According to the output of Algorithm 1, it obtains the fourth layer object identity embedding feature

{\hat{e}}_{i}

corresponding to the query for object i. The identity embedding loss of TG-MCMOT is calculated as follows:

L_{distinguish} = \frac{1}{N (N - 1)} \sum_{i = 1, j = 1}^{N} (F_{\cos} ({\hat{e}}_{i}, {\hat{e}}_{j}) - N),

(7)

where

F_{\cos}

represents cosine similarity, and the cosine similarity between different objects should be as small as possible (i.e., the identity embeddings of the two objects should be as different as possible).

The segmentation module of the TG-MCMOT uses predicted boxes of the queries matched with GT labels, obtained from Algorithm 1, as prompts to generate object masks. The predicted mask for object i is denoted

{\hat{m}}_{i}

, and the GT label mask is

m_{i}

. The segmentation loss of TG-MCMOT is calculated as follows:

L_{mask} = \frac{1}{N} \sum_{i = 1}^{N} L_{mse} ({\hat{m}}_{i}, m_{i}),

(8)

where

L_{mse}

represents the Mean Squared Error (MSE) loss.

The total loss of the TG-MCMOT is calculated as follows:

L_{TG - MCMOT} = λ_{cls} L_{cls} + λ_{bbox} L_{bbox} + λ_{distinguish} L_{distinguish} + λ_{mask} L_{mask},

(9)

where

λ_{cls}

,

λ_{bbox}

,

λ_{distinguish}

, and

λ_{mask}

are the weights of each subloss, and these hyperparameters work together across the TG-MCMOT model to balance various tasks. The segmentation task is optional, and when no segmentation module is added, there is no need to calculate the last term of Equation (9).

In the inference stage, the top

K_{top}

queries with the highest category position response scores are selected. Then, queries with a category position response score higher than

τ_{l o w}

are retained, and the category number (class_id) is predicted based on these scores (

{\hat{s}}_{c - p}^{(i)}

,

i \in [1, \dots, \hat{N}]

, where

\hat{N}

is the number of queries that meet the criteria). Class numbers that cannot be obtained from the category decoding guidance were set to 9999, which can be considered “categories not present in this image”. In other words, the result corresponding to this query is a false positive (FP).

TG-MCMOT uses retained queries to perform online association based on their corresponding category numbers, detection boxes (or masks), and object identity embedding features.

The association process sequentially considers the following criteria for associating detected objects with existing trajectories:

The similarity of object identity embedding features.
The intersection over Union (IoU) score of high score detection boxes (or masks).
The IoU score of low score detection boxes (or masks).

During each association process, the model avoids assigning the same track_id to objects that belong to different categories. The high and low score thresholds for the category position response are denoted

τ_{h i g h}

and

τ_{l o w}

, respectively.

3.2. TG-MCMOT for Maritime Rescue

When employing the TG-MCMOT model for maritime search and rescue operations, it can use text guides to distinguish the four categories present in the SeaDronesSee-MOT dataset: swimmer, swimmer with life jacket, boat, and life jacket.

Compared to large-scale datasets for MCMOT tasks, maritime rescue visual datasets are relatively scarce. In addition, objects in the maritime rescue scene often appear in the vast ocean, where they are typically small and captured from a top-down perspective. The SeaDronesSee dataset also provides some multispectral data. It has been observed, as shown in Figure 7, that the near-infrared (NIR) spectrum channel in multispectral images (with a wavelength of about 842 nanometers) can effectively “filter out” the background and “retain” the objects. Therefore, when applying TG-MCMOT to maritime rescue, we use near-infrared spectrum images to assist RGB images, enhancing the model’s ability to detect and track objects against complex maritime backgrounds.

3.2.1. Pseudo-near-InfraRed Image Generation

As shown in Figure 8a, in maritime rescue scenarios, the background is often the vast water surface, which presents different colors such as green, cyan, gray, or even close to black depending on the lighting conditions. Compared to the background, the size of objects (especially people) is very small. By analyzing multi-spectral information at several different wavelengths (475 nm, 560 nm, 668 nm, 717 nm, and 842 nm) in the SeaDronesSee dataset [5], we found that the near-infrared wavelength (842 nm) has the best effect on background filtering and object retention.

However, since near-infrared spectral images require specific cameras, and ordinary RGB images do not contain near-infrared channel information, we adopted a data preprocessing method to generate Pseudo-Near-InfraRed (PNIR) images for RGB images. The Pseudo-Near-InfraRed (PNIR) images we are referring to are not real infrared band information but are specific color and brightness information extracted through image processing techniques that mimic the near infrared band. The main steps for generating PNIR images are shown in Algorithm 2. Its output is a “black and white image” of the same size as the RGB image, with values of 0 or 1, as shown in Figure 8b. In the PNIR image, the background information is essentially “filtered out”, while the positions with objects retain bright spots.

Algorithm 2 Generating Pseudo Near InfraRed Images from RGB images

Require: RGB image
Ensure: Pseudo Near Infrared (PNIR) image

1:: Load RGB image
2:: HSV ← RGB
3:: Define the upper and lower bounds of the HSV color space for the region of interest: ${HSV}_{low}$ and ${HSV}_{high}$ .
4:: PNIR image ← The pixel values between ${HSV}_{low}$ and ${HSV}_{high}$ in the HSV image are set to 1, and the remaining values are set to 0.

Algorithm 2 can be considered a “filtering algorithm”, which enhances the location of small objects by adjusting the contrast between the large-area ocean background and the region of interest, providing location markers for areas that may contain objects.

{HSV}_{low}

and

{HSV}_{high}

are two hyperparameters used to select the region of interest. HSV is a color representation method, where H stands for hue, S for saturation, and V for brightness value. These two hyperparameters are determined based on the significant difference between the background and the region of interest.

3.2.2. TG-MCMOT with PNIR Module

When the TG-MCMOT is applied to maritime rescue scenarios, Pseudo-Near-InfraRed (PNIR) images are introduced. PNIR channel information is used in both feature encoding and feature decoding. PNIR images are first processed through a max pooling layer to adjust them to the same size as RGB image features. Subsequently, a

1 \times 1

convolution is applied to adjust the feature dimensions to 256. The resulting PNIR features are consistent in size and dimension with the image features.

Since the generated PNIR images are binary, they primarily contain object location features and do not include category-related appearance information. Therefore, when using PNIR information for feature encoding, PNIR features are only used for fusion with RGB image features. Specifically, during the encoding process, text features and RGB image features are first fused. Then, RGB image features are fused with PNIR image features, as shown in Figure 9. The model learns attention from RGB image features to PNIR image features, aiming to help the model obtain more location information about objects from RGB image data. The pixels with a value of 0 in the PNIR image are almost all background information, it is necessary to shield this information when fusing with RGB features.

In the decoding process, we added PNIR information, as shown in Figure 10. Specifically, a PNIR cross-attention was added between the multi-head self-attention and the text cross-attention during the object query decoding process. This integration helps the object query learn object position information.

3.2.3. Cross-Frame Identity Loss

In popular MCMOT tasks based on datasets such as TAO [12] and BURST [53], the main challenge is to identify and track a wide array of categories. These categories may have significantly different appearance features. The TG-MCMOT calculates an in-frame object distinguishability loss

L_{distinguish}

to ensure that different objects within the same frame have distinct identities.

However, in maritime rescue scenes for MOT tasks, even different categories (such as swimmer and swimmer with life jacket) may have a very similar appearance. Learning only the identity feature differences within a single frame is not enough for cross-frame object association. Therefore, information from the previous frame must be included with the current frame to compute a cross-frame object identity embedding loss

L_{cross}

.

We recorded the current frame as Frame t (

t \in [1, \dots, L_{video}]

, where

L_{video}

is the length of the video). When

t > 1

, calculate the cross-frame identity embedding loss based on the information from the Frame

t - 1

, i.e., do not calculate the cross-frame identity embedding loss for the first frame of the video. For Frame

t - 1

(

t > 1

), we recorded a dictionary

{Dict}_{t - 1}

: Key is the

t r a c k_{i d}^{(i)}

in the GT label of object i (

i \in [1, \dots, N_{t - 1}]

, where

N_{t - 1}

is the actual number of objects in frame

t - 1

); the value is the object identity embedding feature

{\hat{e_{t - 1}}}^{(i)}

corresponding to the object i. Similarly, for Frame t, we also recorded a dictionary

{Dict}_{t}

: Key is the

t r a c k_{i d}^{(i^{'})}

in the GT label of object

i^{'}

(

i^{'} \in [1, \dots, N_{t}]

, where

N_t

is the actual number of objects in frame t); the value is the object identity embedding feature

{\hat{e_{t - 1}}}^{(i^{'})}

corresponding to the object

i^{'}

. We calculated the cross-frame identity embedding loss of the current frame (Frame t,

t > 1

) using Algorithm 3. The cross-frame identity embedding loss of the first frame is 0.

When the TG-MCMOT is used in maritime rescue scenarios, the identity embedding loss is divided into two parts: one is used to calculate the cross-frame loss

L_{cross}

for objects belonging to the same identity in adjacent frames, and one is used to calculate the loss

L_{distinguish}

for objects with different identities in the same frame.

Algorithm 3 Calculating

L_{cross}

Require:

{Dict}_{t}

,

N_{t}

;

{Dict}_{t - 1}

,

N_{t - 1}

Ensure:

L_{cross}

1:: $L_{cross} = 0$
2:: $k e y_l i s t_{t} \leftarrow [$ key for key in ${Dict}_{t}]$
3:: $v a l u e_l i s t_{t} \leftarrow [$ value for value in ${Dict}_{t}$ .values()]
4:: $k e y_l i s t_{t - 1} \leftarrow [$ key for key in ${Dict}_{t - 1}]$
5:: $v a l u e_l i s t_{t - 1} \leftarrow [$ value for value in ${Dict}_{t - 1}$ .values()]
6:: $n u m = 0$
7:: for $i^{'} = 1$ to $N_{t}$ do
8:: ${track_id}^{(i^{'})} \leftarrow k e y_l i s t_{t} [i^{'}]$
9:: for $i = 1$ to $N_{t - 1}$ do
10:: ${track_id}^{(i)} \leftarrow k e y_l i s t_{t} [i]$
11:: if ${track_id}^{(i)} = =$ ${track_id}^{(i^{'})}$ then
12:: ${\hat{e_{t}}}^{(i^{'})} \leftarrow v a l u e_l i s t_{t} [i^{'}]$
13:: ${\hat{e_{t - 1}}}^{(i)} \leftarrow v a l u e_l i s t_{t - 1} [i]$
14:: $l o s s \leftarrow F_{\cos} ({\hat{e_{t}}}^{(i^{'})}, {\hat{e_{t - 1}}}^{(i)})$
15:: $l o s s \leftarrow 1 - l o s s$
16:: $L_{cross} \leftarrow L_{cross} + l o s s$
17:: $n u m \leftarrow n u m + 1$
18:: break
19:: end if
20:: end for
21:: end for
22:: if $n u m > 0$ then
23:: $L_{cross} \leftarrow \frac{1}{n u m} * L_{cross}$
24:: else
25:: $L_{cross} \leftarrow L_{cross}$
26:: end if

4. Experiments and Results

In this section, we first introduce SeaDronesSee [5] and BURST [53] datasets, evaluation metrics, and experimental setup. In Section 4.4, we report not only the performance of the TG-MCMOT on the maritime rescue dataset SeaDronesSee but also its performance on the conventional MCMOT dataset BURST. These comprehensive results allow us to assess the capabilities of the TG-MCMOT in both specialized maritime rescue scenarios and general MCMOT situations.

4.1. Datasets

We evaluate our TG-MCMOT using two datasets for evaluation: the SeaDronesSee dataset, which is specifically designed for maritime rescue missions, and the BURST dataset, which contains a large number of categories.

SeaDronesSee dataset. The SeaDronesSee [5] is a large-scale drone-captured remote sensing dataset specifically collected for search-and-rescue tasks in maritime environments, providing 22 videos (with 21 videos actually usable) for MOT tasks. It contains 54,105 frames and 403,192 annotated instances. The dataset is divided into a training set (approximately 4/7 of all images), a validation set (approximately 1/7 of all images), and a test set (approximately 2/7 of all images), with labels for the test set not publicly disclosed.

However, when reporting results on the SeaDronesSee dataset, all objects are considered to be the same category, meaning the accuracy of classification is not a priority. In maritime rescue scenarios, it is essential to assign different rescue priorities to different categories. Therefore, we used a dataset with many categories to help evaluate the TG-MCMOT.

BURST dataset. The BURST [53] is a multi-category dataset that annotated 482 categories. Of these, 78 categories that originate from the COCO dataset [11] are referred to as “common” categories, while the remaining 404 categories are called “long tail” or “uncommon” categories. BURST includes 2909 videos, with 500 videos in the training set, 988 videos in the validation set, and 1421 videos in the test set. BURST does not offer labels for each frame. It has about 200,000 annotated images in total and provides mask annotations for objects.

4.2. Evaluation Metrics

Following the SeaDronesSee dataset, we used the following metrics to evaluate the performance of the TG-MCMOT:

HOTA: Higher-Order Tracking Accuracy (HOTA) [66] is used to simultaneously measure detection (or segmentation) performance and object identity embedding performance. It is the geometric mean of Detection Accuracy (DetA) and Association Accuracy (AssA), i.e., $HOTA = \sqrt{DetA \times AssA}$ .
MOTA: Multiple Object Tracking Accuracy (MOTA) [67] considers False Positives (FPs), False Negatives (FNs), and ID Switches (IDSs).
IDF1: IDF1 [68] is the ratio of correctly identified detections to the average number of ground truth and computed detections and can be used to evaluate the quality of object identity embedding.
MT: Mostly Tracked (MT) is the number of objects that are tracked for at least $80 %$ of their lifespan.
ML: Mostly Lost (ML) is the number of objects that are lost for at least $80 %$ of their lifespan.

When evaluating the performance of the TG-MCMOT on the SeaDronesSee dataset, all categories are considered as one category.

On the BURST dataset, we calculated the HOTA separately for each class, followed by averaging to obtain the final metric. In order to facilitate easier performance analysis of different object classes, we averaged the HOTA scores for each of the three different object classes:

The “common” set, which includes 78 categories from the COCO dataset.
The “uncommon” set, which includes the remaining 404 categories.
The “all” set, which includes a total of 482 categories from the BURST dataset.

We used the

{HOTA}_{com}

,

{HOTA}_{unc}

, and

{HOTA}_{all}

to represent these three indicators, respectively. The DetA and AssA metrics follow the same calculation and representation as the HOTA. In addition, we also calculated the Average Precision (AP) for each category separately, and then we took the average to obtain the final mean mAP. The mAP score was also calculated separately for common categories, uncommon categories, and all categories.

4.3. Implementation Details

We used the object detector Grounding DINO [17] to initialize the image feature extractor, feature encoder, and the detection part of the feature decoder in the TG-MCMOT. For the text feature extractor, we used the pure language model BERT [56]. On the BURST dataset, we initialized the TG-MCMOT segmentation module using the SAM model [59]. We froze the parameters of the image feature extractor, feature encoder, and text feature extractor during fine-tuning. Specific settings are shown in Table 2.

4.4. Results

4.4.1. Results on SeaDronesSee Dataset

We compare the TG-MCMOT with other state-of-the-art (SOTA) MOT methods on SeaDronesSee test set, as shown in Table 3. The results do not specifically distinguish between the categories of objects in the water. It can be observed that the TG-MCMOT achieved the highest HOTA score compared to other SOAT MOT methods, indicating that it has well balanced the tasks of object detection and object identity recognition. The TG-MCMOT obtained the highest IDF1 score, partly due to the fine-grained categorization of objects, preventing different categories from being incorrectly associated with each other. The TG-MCMOT had the fewest number of identity switches (IDSs) and detected the most correct objects, that is, it had the smallest number of false negatives (FNs). The TG-MCMOT also achieved the highest MT score, with 270 object identities having tracked for over 80% of their trajectory information, demonstrating that it can provide more complete trajectory predictions.

4.4.2. Results on BURST Dataset

To evaluate the performance of TG-MCMOT in terms of classification, we also tested the model on the BURST test set and validation set.

In Table 4, we present comparative results of the TG-MCMOT model’s performance against other MCMOT methods. The TG-MCMOT achieved the best scores in HOTAall, DetAall, AssAall, HOTAunc, DetAunc, and AssAunc. Scores in the uncommon category indicate that the TG-MCMOT exhibited superior overall performance in tracking unknown categories and outperformed other methods in both detection and association capabilities.

In Table 5, the results reported on the BURST validation set demonstrate that the TG-MCMOT outperformed a simple baseline method based on bounding box association across all metrics. Specifically, in the uncommon categories, the TG-MCMOT achieved a significant improvement over the baseline method: the

{HOTA}_{unc}

increased by 16.07 points, representing an improvement of 446.39%. The mAP results indicate that the TG-MCMOT method effectively balances precision and recall. Compared to the widely trained GLEE [54], which is designed for object localization and identification, the TG-MCMOT exhibited slightly lower performance on known categories (common), with a certain gap remaining in localization accuracy (mAP). However, the TG-MCMOT achieved better performance in tracking unknown categories, with an improvement of 0.57 points in HOTAunc. Additionally, the model shows an overall enhancement in higher-order tracking accuracy across all categories, with an increase of 0.32 points in HOTAall.

5. Analysis and Discussion

In this section, we provide additional analysis of the experimental results reported in Section 4.4, including ablation studies and visualizations. We first analyze and discuss the application of our TG-MCMOT to maritime rescue missions, including the effectiveness of the Pseudo-Near-InfraRed (PNIR) module and the effectiveness of object identity embeddings, and we analyze the effectiveness of TG-MCMOT in object classification through visualization. Given that the BURST dataset encompasses a wider range of data and categories, we also conducted further ablation studies on the BURST dataset and engage in analysis and discussion.

5.1. SeaDronesSee Dataset

5.1.1. Ablation Studies

Effectiveness analysis of Pseudo-Near-InfraRed images. To track small objects in the vast ocean background, we generated Pseudo-Near-InfraRed (PNIR) images for RGB images, using PNIR information to assist our TG-MCMOT in learning features for the object. The Pseudo-Near-InfraRed (PNIR) image is designed to filter out the vast ocean background when tracking small objects. In our approach, PNIR served as an auxiliary tool specifically for assisting in the localization of small objects. Incorporating PNIR information can be seen as introducing position markers for areas where objects are likely to be present. The results of the TG-MCMOT on the SeaDronesSee validation set with and without the PNIR module are reported in Table 6. It can be observed that after using the PNIR images, the number of false negatives (FNs) for the TG-MCMOT was significantly reduced from 17,058 to 8161, and the number of false positives (FPs) was also reduced. The results show that preprocessing RGB images into PNIR images and applying PNIR image information to the model greatly help to enhance the model’s detection capabilities for objects in the water. This improvement is critical for maritime search and rescue operations, where the ability to accurately detect small objects against a complex background is critical to mission success.

Effectiveness analysis of object identity embedding. Due to the high speed of the drone’s movement causing significant displacement of objects between adjacent frames, our TG-MCMOT needs to rely on object appearance features for tracking. Results using and not using the object identity embedding module are reported in Table 7. The results show that our TG-MCMOT can significantly reduce the number of identity switches (IDSs) even without using the drone’s flight data. Figure 11a illustrates an example of tracking failure when objects are small and the drone moves quickly, with many object identities changing between two consecutive frames. When considering the appearance features of objects during the tracking process, it can be observed in Figure 11b that the TG-MCMOT maintained the same object identities between two consecutive frames. This demonstrates that incorporating object appearance features into the TG-MCMOT model can enhance tracking performance, especially in conditions where rapid movement and small object size pose challenges to maintaining consistent identities across frames.

5.1.2. Visualizations

Since the SeaDronesSee dataset treats all objects as one category during evaluation, we analyzed the effectiveness of the TG-MCMOT in object classification through visual results. As shown in Figure 12, the TG-MCMOT can categorize objects into different classes: boat (indicated by red boxes), swimmer with life jacket (indicated by green boxes), and swimmer (indicated by yellow boxes). It can maintain object identity over time. Due to the rapid movement of the drone, in the video DJI_0057 depicted in Figure 12, objects move from one side of the frame to the other within just 50 frames. Thanks to the identity embedding feature learning module of the TG-MCMOT, object identities were preserved.

As shown in Figure 13, within a few consecutive frames, a swimmer with life jacket removed the life jacket, becoming a swimmer. It can be observed that the TG-MCMOT continued to classify the object as swimmer with life jacket during the process of removing the life jacket. Once the life jacket was completely removed, the object category changed to swimmer. The TG-MCMOT can track the object and adjust its classification according to the situation demonstrating the effectiveness of incorporating text category information. This information assists the model in learning semantic features of different subclasses within the same general category (person), such as swimmer with life jacket and swimmer.

Figure 12 and Figure 13 illustrate that text category information is highly beneficial for the TG-MCMOT. It allows the model not only to distinguish between different objects but also to adapt to changes in object categories over time, such as the removal of a life jacket. This adaptability is crucial for accurate tracking and classification in dynamic environments, particularly in applications such as maritime search and rescue, where the distinction between different object states can be significant.

5.1.3. Discussion on Other Methods

The SeaDronesSee dataset provides precise drone flight data such as speed, altitude, and angle, which are valuable for enhancing the performance of MOT algorithms. In the MaCVi challenge, some researchers have made full use of this flight data. The results of methods that used drone flight data on SeaDronesSee test set are reported in Table 8. These methods, which use the flight data of the drone, achieved high scores, indicating that incorporating such detailed motion information can significantly improve tracking performance. However, from a practical point of view, it is challenging to transfer the superior performance of these methods to other tasks that do not provide drone flight data. Therefore, we do not discuss these methods in this paper.

5.2. BURST Dataset

5.2.1. Ablation Studies

Effectiveness analysis of encoding strategy. Our TG-MCMOT is similar to the detector Grounding DINO [17] during the feature encoding step. However, Grounding DINO first performs self-attention to text and image information separately and then fuses the information, whereas the TG-MCMOT first fuses text and image information and then performs self-attention to fused features. The results of the two different encoding sequences on the BURST validation set are reported in Table 9. The encoding order that fuses features first and then performs self-attention outperformed the order that performed self-attention first and then fuses on all metrics. This demonstrates that the feature fusion strategy of the TG-MCMOT is effective, incorporating image information into text self-attention and using text information in image self-attention. This approach highlights the benefits of integrating text and visual modalities early in the encoding process, allowing the model to capture a richer representation of the data. By fusing information before self-attention, the TG-MCMOT can better leverage the semantic and contextual cues provided by both text and image, leading to improved performance in multi-object tracking tasks.

Effectiveness analysis of decoding strategy. The detection task requires semantic-level features, primarily for classification and localization, while identity embedding requires instance-level features, mainly to differentiate between different individuals. In our TG-MCMOT model, the primary role of text is to guide the model in classification. However, the tracking task requires distinguishing between different entities within the same category. Therefore, in TG-MCMOT decoding, only object queries are used to predict categories and bounding boxes interact with text features through cross-attention, while tracking queries are used to predict identity embedding features do not interact with text. The results of the model with and without interaction between tracking queries and text information on the BURST validation set are reported in Table 10. When tracking queries did not interact with text information, the TG-MCMOT achieved better performance. This is because text information is semantic, and the challenge in the MCMOT is to distinguish between different instances in the same category. The features that tracking queries aim to learn are instance-level features used to differentiate between objects. Introducing text information may instead result in a loss of instance-level features.

Selection of object identity embedding features. The TG-MCMOT used features from the intermediate layers of the decoder to generate object identity embedding features. With a total of six decoder layers, the results of using these six-layer output features and the encoder output features, for generating object identity embedding features, are reported in Table 11. The results show that using intermediate layer features (from the third and fourth layers) yielded significantly better results than using features from shallow or deeper layers. Among the intermediate layers, the features of the fourth layer were found to be more effective. Therefore, we extracted the object identity embedding features from the fourth layer of the decoder. This selection is based on the observation that intermediate layers often capture a balance between high-level semantic information and fine-grained details, which is beneficial for learning discriminative features necessary for accurate tracking and identity preservation across frames.

Effectiveness analysis of association strategy. During the inference of the TG-MCMOT, the model first used object identity embedding features for data association and then employed bounding boxes for secondary association. At each association step, the model avoided associating objects of different categories into the same trajectory. When the segmentation branch was incorporated, the TG-MCMOT replaced the object bounding boxes used in the secondary association with the object masks. The results of the TG-MCMOT model using different association strategies on BURST validation set are reported in Table 12. Compared to not using category restrictions, avoiding the association of different categories into the same trajectory in each association step improved the accuracy of the association, thus retaining more detection results, which led to higher values for all metrics. Adding the segmentation branch, association via mask IoU could achieve better association accuracy (AssA) than association via box IoU. However, since representing objects with masks allows a pixel to be assigned to only one object, and we do not specifically design a handling scheme for overlaps between masks of different objects, simply assigning overlapping pixels to the object with the higher predicted score can lead to a slight loss in overall model performance (HOTA) and detection performance (DetA).

Effectiveness analysis of training strategy. During training of the TG-MCMOT model, we simultaneously trained the object detection part (classification and localization) and the identity embedding part of the decoder. To assess the effectiveness of joint training, we trained both parts separately, and the results on BURST validation set are shown in Table 13. The results show that neither training the detection part first nor the identity embedding part first yielded results as good as joint training. This is because of the joint training process, where the two tasks can promote each other, leading to improved performance over separate training programs.

5.2.2. Visualizations

Figure 14 presents some visualization results of the TG-MCMOT on the BURST validation set. It includes tracking results for common categories (person) and uncommon categories, with uncommon categories featuring examples such as army tank, hippopotamus, polar bear, and walrus. These visualizations demonstrate the model’s ability to track a diverse range of objects, both common and uncommon.

The visualization results serve as a qualitative evaluation that complements the quantitative metrics reported in the experiments, providing a more intuitive understanding of the model’s tracking capabilities and its effectiveness in addressing various tracking challenges.

6. Conclusions

To further distinguish between different fine-grained categories in maritime rescue multi-object tracking tasks and to apply multi-object tracking methods to a wider range of categories, we propose a Text-Guided Multi-Class Multi-Object Tracking Method (TG-MCMOT). The TG-MCMOT draws on the famous detector Grounding DINO, using an encoder–decoder architecture to fuse text features of object categories with image features. Through text-guided object queries, it decodes the category and location of objects, and extracts object identity embedding features through tracking queries. When used for maritime tasks, the TG-MCMOT also generates Pseudo-Near-InfraRed (PNIR) images for RGB images through a data preprocessing method to filter the background, and it introduces PNIR information into the encoding and decoding process to help TG-MCMOT track small objects in the ocean. On the BURST dataset with many categories, our TG-MCMOT can track both common and uncommon category objects, especially achieving SOTA results in tracking uncommon categories. On the maritime rescue UAV remote sensing dataset called SeaDronesSee, our TG-MCMOT can distinguish between fine-grained object categories such as swimmer and swimmer with life jacket, and it can accurately identify when the object category changes. Experimental results show that the TG-MCMOT, using category text information, can effectively complete multi-class multi-object tracking tasks, which is of great significance for maritime search and rescue missions when rescue forces are limited. In addition, such an ability of TG-MCMOT to excel in tracking less frequently occurring objects is a testament to its robust feature extraction and fusion techniques, allowing it to generalize well across a wide range of objects.

Author Contributions

Conceptualization, S.L. and H.W.; methodology, S.L.; software, S.L.; validation, S.L.; formal analysis, S.L. and Z.L.; investigation, S.L. and Z.L.; resources, W.Y. and H.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, Z.L., H.W., W.Y. and H.L.; visualization, S.L.; supervision, W.Y. and H.L.; project administration, W.Y. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the General Program of the National Natural Science Foundation of China: 62372459; the National Natural Science Foundation of China: 91948303-1; the National Key R&D Program of China: 2021ZD0140301; the Postgraduate Scientific Research Innovation Project of Hunan Province: QL20210018.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
MOT	Multi-Object Tracking
MOTMR	MOT-based Marine Rescue
MCMOT	Multi-Class Multi-Object Tracking
TG-MCMOT	Text-Guided Multi-Class Multi-Object Tracker
TBD	Tracking-by-detection
JDT	Joint detection and tracking
SOT	Single-Object Tracking
NIR	Near Infrared
PNIR	Pseudo-Near-InfraRed
HOTA	Higher-Order Tracking Accuracy
MOTA	Multiple Object Tracking Accuracy
FP	False Positive
FN	False Negative
IDS	ID Switch
MT	Mostly Tracked
ML	Mostly Lost
SOTA	State-of-the-art

References

Lygouras, E.; Santavas, N.; Taitzoglou, A.; Tarchanidis, K.; Mitropoulos, A.; Gasteratos, A. Unsupervised human detection with an embedded vision system on a fully autonomous UAV for search and rescue operations. Sensors 2019, 19, 3542. [Google Scholar] [CrossRef] [PubMed]
Yoneyama, R.; Dake, Y. Vision-based maritime object detection covering far and tiny obstacles. IFAC-PapersOnLine 2022, 55, 210–215. [Google Scholar] [CrossRef]
Huang, W.; Zhou, X.; Dong, M.; Xu, H. Multiple objects tracking in the UAV system based on hierarchical deep high-resolution network. Multimed. Tools Appl. 2021, 80, 13911–13929. [Google Scholar] [CrossRef]
Yang, D.; Solihin, M.I.; Ardiyanto, I.; Zhao, Y.; Li, W.; Cai, B.; Chen, C. A streamlined approach for intelligent ship object detection using EL-YOLO algorithm. Sci. Rep. 2024, 14, 15254. [Google Scholar] [CrossRef] [PubMed]
Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. Seadronessee: A maritime benchmark for detecting humans in open water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2260–2270. [Google Scholar]
Zhang, X.; Feng, Y.; Zhang, S.; Wang, N.; Mei, S.; He, M. Semi-Supervised Person Detection in Aerial Images with Instance Segmentation and Maximum Mean Discrepancy Distance. Remote Sens. 2023, 15, 2928. [Google Scholar] [CrossRef]
Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–21. [Google Scholar]
Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Dave, A.; Khurana, T.; Tokmakov, P.; Schmid, C.; Ramanan, D. Tao: A large-scale benchmark for tracking any object. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 436–454. [Google Scholar]
Kiefer, B.; Kristan, M.; Perš, J.; Žust, L.; Poiesi, F.; Andrade, F.; Bernardino, A.; Dawkins, M.; Raitoharju, J.; Quan, Y.; et al. 1st workshop on maritime computer vision (macvi) 2023: Challenge results. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 265–302. [Google Scholar]
Kiefer, B.; Žust, L.; Kristan, M.; Perš, J.; Teršek, M.; Wiliem, A.; Messmer, M.; Yang, C.Y.; Huang, H.W.; Jiang, Z.; et al. 2nd Workshop on Maritime Computer Vision (MaCVi) 2024: Challenge Results. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 869–891. [Google Scholar]
Zhang, Y.; Tao, Q.; Yin, Y. A Lightweight Man-Overboard Detection and Tracking Model Using Aerial Images for Maritime Search and Rescue. Remote Sens. 2023, 16, 165. [Google Scholar] [CrossRef]
Cafarelli, D.; Ciampi, L.; Vadicamo, L.; Gennaro, C.; Berton, A.; Paterni, M.; Benvenuti, C.; Passera, M.; Falchi, F. MOBDrone: A drone video dataset for man overboard rescue. In Proceedings of the International Conference on Image Analysis and Processing, Lecce, Italy, 23–27 May 2022; Springer: Berlin/Heidelberg, Germanu, 2022; pp. 633–644. [Google Scholar]
Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv 2015, arXiv:1504.01942. [Google Scholar]
Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
Feng, W.; Lan, L.; Luo, Y.; Yu, Y.; Zhang, X.; Luo, Z. Near-online multi-pedestrian tracking via combining multiple consistent appearance cues. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1540–1554. [Google Scholar] [CrossRef]
Feng, W.; Lan, L.; Zhang, X.; Luo, Z. Learning sequence-to-sequence affinity metric for near-online multi-object tracking. Knowl. Inf. Syst. 2020, 62, 3911–3930. [Google Scholar] [CrossRef]
Lan, L.; Wang, X.; Zhang, S.; Tao, D.; Gao, W.; Huang, T.S. Interacting tracklets for multi-object tracking. IEEE Trans. Image Process. 2018, 27, 4585–4597. [Google Scholar] [CrossRef]
Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
Dehghan, A.; Tian, Y.; Torr, P.H.; Shah, M. Target identity-aware network flow for online multiple target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1146–1154. [Google Scholar]
Feng, W.; Lan, L.; Buro, M.; Luo, Z. Online Multiple-Pedestrian Tracking with Detection-Pair-Based Graph Convolutional Networks. IEEE Internet Things J. 2022, 9, 25086–25099. [Google Scholar] [CrossRef]
Liang, T.; Lan, L.; Zhang, X.; Peng, X.; Luo, Z. Enhancing the association in multi-object tracking via neighbor graph. Int. J. Intell. Syst. 2021, 36, 6713–6730. [Google Scholar] [CrossRef]
Lan, L.; Tao, D.; Gong, C.; Guan, N.; Luo, Z. Online Multi-Object Tracking by Quadratic Pseudo-Boolean Optimization. In Proceedings of the IJCAI, New York, NY, USA, 9–15 July 2016; pp. 3396–3402. [Google Scholar]
Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645. [Google Scholar]
Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
Liu, S.; Li, X.; Lu, H.; He, Y. Multi-object tracking meets moving UAV. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 8876–8885. [Google Scholar]
Mazzeo, P.L.; Manica, A.; Distante, C. UAV Multi-object Tracking by Combining Two Deep Neural Architectures. In Proceedings of the International Conference on Image Analysis and Processing, Udine, Italy, 11–15 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 257–268. [Google Scholar]
Kraus, M.; Azimi, S.M.; Ercelik, E.; Bahmanyar, R.; Reinartz, P.; Knoll, A. AerialMPTNet: Multi-pedestrian tracking in aerial imagery using temporal and graphical features. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 2454–2461. [Google Scholar]
Bahmanyar, R.; Vig, E.; Reinartz, P. MRCNet: Crowd counting and density map estimation in aerial and ground imagery. arXiv 2019, arXiv:1909.12743. [Google Scholar]
Azimi, S.M.; Kraus, M.; Bahmanyar, R.; Reinartz, P. Multiple pedestrians and vehicles tracking in aerial imagery using a convolutional neural network. Remote Sens. 2021, 13, 1953. [Google Scholar] [CrossRef]
Varga, L.A.; Koch, S.; Zell, A. Comprehensive Analysis of the Object Detection Pipeline on UAVs. Remote Sens. 2022, 14, 5508. [Google Scholar] [CrossRef]
Zhang, Y.; Yin, Y.; Shao, Z. An Enhanced Target Detection Algorithm for Maritime Search and Rescue Based on Aerial Images. Remote Sens. 2023, 15, 4818. [Google Scholar] [CrossRef]
Yang, C.Y.; Huang, H.W.; Jiang, Z.; Kuo, H.C.; Mei, J.; Huang, C.I.; Hwang, J.N. Sea you later: Metadata-guided long-term re-identification for uav-based multi-object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 805–812. [Google Scholar]
Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8514–8523. [Google Scholar]
Liang, T.; Chu, X.; Liu, Y.; Wang, Y.; Tang, Z.; Chu, W.; Chen, J.; Ling, H. Cbnet: A composite backbone network architecture for object detection. IEEE Trans. Image Process. 2022, 31, 6893–6906. [Google Scholar] [CrossRef]
Athar, A.; Luiten, J.; Voigtlaender, P.; Khurana, T.; Dave, A.; Leibe, B.; Ramanan, D. Burst: A benchmark for unifying object recognition, segmentation and tracking in video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1674–1683. [Google Scholar]
Wu, J.; Jiang, Y.; Liu, Q.; Yuan, Z.; Bai, X.; Bai, S. General object foundation model for images and videos at scale. arXiv 2023, arXiv:2312.09158. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
Cheng, Y.; Li, L.; Xu, Y.; Li, X.; Yang, Z.; Wang, W.; Yang, Y. Segment and track anything. arXiv 2023, arXiv:2305.06558. [Google Scholar]
Yang, J.; Gao, M.; Li, Z.; Gao, S.; Wang, F.; Zheng, F. Track anything: Segment anything meets videos. arXiv 2023, arXiv:2304.11968. [Google Scholar]
Rajič, F.; Ke, L.; Tai, Y.W.; Tang, C.K.; Danelljan, M.; Yu, F. Segment anything meets point tracking. arXiv 2023, arXiv:2307.01197. [Google Scholar]
Cheng, H.K.; Oh, S.W.; Price, B.; Schwing, A.; Lee, J.Y. Tracking anything with decoupled video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1316–1326. [Google Scholar]
Ren, T.; Liu, S.; Zeng, A.; Lin, J.; Li, K.; Cao, H.; Chen, J.; Huang, X.; Chen, Y.; Yan, F.; et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv 2024, arXiv:2401.14159. [Google Scholar]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef]
Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 17–35. [Google Scholar]

Figure 1. Wordcloud of COCO categories (left) and TAO [12] categories (right). Wordclouds are weighted by the number of instances and colored according to their categories.

Figure 2. Scenario of inaccurate object classification. All objects are classified as “person”, but their trajectories remain consistent over time.

Figure 3. The framework of TG-MCMOT. In TG-MCMOT model, the input consists of image and category text, which are processed by an image feature extractor and a text feature extractor to obtain image and text features, respectively. Subsequently, the encoder fuses the text and image features and outputs the fused features. These fused features are then passed through the decoder to complete classification, localization, and identity feature embedding tasks simultaneously.

Figure 4. The text encoding details. The text self-attention mask matrix is used to delineate the boundaries of each sentence.

Figure 5. The encoding process of TG-MCMOT.

Figure 6. The decoding process of TG-MCMOT.

Figure 7. The visualization of multispectral image.

Figure 8. RGB images of maritime rescue scenes and their corresponding PNIR images.

Figure 9. PNIR information is fused in the encoding process of TG-MCMOT for maritime rescue.

Figure 10. PNIR information is used in the decoding process of TG-MCMOT for maritime rescue.

Figure 11. Visualizations of TG-MCMOT with or without identity embedding module. The same bounding box color represents the same identity. Best viewed in color. “w/” stands for “with”, while “w/o” stands for “without”.

Figure 12. Visualizations of TG-MCMOT for distinguishing different categories: there is a 50 frame difference between the left and right sides. The same bounding box color represents the same category, and the same bounding box number represents the same identity. Best viewed in color.

Figure 13. Visualizations of the process by which the object takes off its life jacket. The same bounding box color represents the same category. Best viewed in color.

Figure 14. Visualizations of TG-MCMOT on BURST validation set. The same mask color represents the same identity. The object category is marked in the upper left corner of the image. Best viewed in color.

Table 1. The inputs of detection decoding module.

Module	Query (Q)	Key (K)	Value (V)
Multi-head self-attention	Object queries	Object queries	Content information
Text cross-attention	Content information combined with localization information	Text features	Text features
Multi-scale deformable attention *	Content information combined with localization information	-	Image features

* In the multi-scale deformable attention module, image position information is also introduced, learning features at different scales through the reference box.

Table 2. Specific settings of TG-MCMOT.

Settings	Parameters
Optimizer	AdamW
Epochs	15
Learning rate	$0.0001$
Batch size	8 (SeaDronesSee), 4 (BURST)
Data augmentation	resizing, cropping, mirroring
Hyperparameters of Algorithm 1	$μ = 0.25$ , $γ = 2.0$ , $w_{class} = 1.0$ , $w_{l_{1}} = 5.0, w_{giou} = 2.0$
Hyperparameters of $L_{bbox}$ (Equation (4))	$w_{l_{1}}^{'} = 5.0$ , $w_{giou}^{'} = 2.0$
Hyperparameters of $L_{TG - MCMOT}$ (Equation (9))	$λ_{cls} = 2.0$ , $λ_{bbox} = 1.0$ , $λ_{distinguish} = 1.0$ ,
Hyperparameters of $L_{TG - MCMOT}$ (Equation (9))	$λ_{cross} = 1.0$ (only SeaDronesSee), $λ_{mask} = 5.0$ (only BURST)
$n u m_t o k e n s$	256
$K_{top}$	50
$τ_{h i g h}$	$0.2$
$τ_{l o w}$	$0.1$
Hyperparameters of Algorithm 2	${HSV}_{low} = (120, 43, 60)$ , ${HSV}_{high} = (180, 255, 255)$

Table 3. Comparison of different methods on SeaDronesSee test set *.

Method	HOTA↑	MOTA↑	IDF1↑	MT↑	ML↓	FN↓	FP↓	IDS↓
FairMOT [10]	-	36.5	43.8	28	49	20,867	3788	447
ByteTrack [8]	57.1	67.8	66.6	258	104	21,050	8842	-
Tracktor [9]	46.1	47.8	49.9	175	157	35,765	11,960	1435
TG-MCMOT (ours)	58.8	59.4	69.9	270	98	19,130	19,025	31

* The best results are shown in bold.

Table 4. Comparison of different methods on BURST test set *.

Method	${HOTA}_{all} ↑$	${DetA}_{all} ↑$	${AssA}_{all} ↑$	${HOTA}_{com} ↑$	${DetA}_{com} ↑$	${AssA}_{com} ↑$	${HOTA}_{unc} ↑$	${DetA}_{unc} ↑$	${AssA}_{unc} ↑$
STCN Tracker [53]	4.5	5.4	4.6	17.1	19.6	16.7	2.0	2.6	2.2
Le_Tracker	14.9	12.3	20.1	36.1	30.7	47.8	10.8	8.7	14.6
TG-MCMOT (ours)	20.2	19.7	22.1	31.5	30.6	34.4	17.9	17.6	19.7

* The best results are shown in bold.

Table 5. Comparison of different methods on BURST validation set *.

Method	${HOTA}_{all} ↑$	${mAP}_{all} ↑$	${HOTA}_{com} ↑$	${mAP}_{com} ↑$	${HOTA}_{unc} ↑$	${mAP}_{unc} ↑$
STCN Tracker [53]	5.5	0.9	17.5	0.7	2.5	0.6
Box Tracker [53]	8.2	1.4	27.0	3.0	3.6	0.9
GLEE [54]	22.6	12.6	36.4	18.9	19.1	11.0
TG-MCMOT (ours)	22.92	5.01	35.98	7.28	19.67	4.45

* The best results are shown in bold.

Table 6. Results of TG-MCMOT with or without PNIR module on SeaDroneSeeSee validation set *.

Method	HOTA↑	MOTA↑	IDF1↑	MT↑	ML↓	FN↓	FP↓	IDS↓
w/o PNIR images	46.21	52.72	55.6	359	40	17,058	5406	35
w/ PNIR images (ours)	51.61	71.81	57.14	376	50	8161	5204	47

* The best results are shown in bold. “w/” stands for “with”, while “w/o” stands for “without”.

Table 7. Results of TG-MCMOT with or without identity embedding module on SeaDroneSeeSee validation set *.

Method	HOTA↑	MOTA↑	IDF1↑	MT↑	ML↓	FN↓	FP↓	IDS↓
w/o identity embedding	49.35	70.40	53.37	373	53	7476	5857	733
w/ identity embedding (ours)	51.61	71.81	57.14	376	50	8161	5204	47

* The best results are shown in bold. “w/” stands for “with”, while “w/o” stands for “without”.

Table 8. Results of methods using drone flight data on SeaDroneSeeSee test set.

Method	HOTA↑	MOTA↑	IDF1↑	MT↑	ML↓	FN↓	FP↓	IDS↓
MoveSORT [13]	67	80	77	311	71	10,009	8761	44
MG-MOT [14]	69.5	77.1	85.9	153	27	12,351	9531	18
Yang et al. [14]	69.3	78.0	84.4	165	20	10,391	10,643	16

Table 9. Results of TG-MCMOT using different encoding sequences on BURST validation set *.

Method	${HOTA}_{all} ↑$	${DetA}_{all} ↑$	${AssA}_{all} ↑$	${HOTA}_{com} ↑$	${DetA}_{com} ↑$	${AssA}_{com} ↑$	${HOTA}_{unc} ↑$	${DetA}_{unc} ↑$	${AssA}_{unc} ↑$
Self-attention first [17]	13.38	14.57	13.70	22.02	23.20	23.19	11.11	12.42	11.33
Fusing first (ours)	22.92	23.97	23.63	35.98	35.14	39.01	19.67	21.19	19.80

* The best results are shown in bold.

Table 10. Results of TG-MCMOT with or without interaction between tracking queries and text on BURST validation set *.

Method	${HOTA}_{all} ↑$	${DetA}_{all} ↑$	${AssA}_{all} ↑$	${HOTA}_{com} ↑$	${DetA}_{com} ↑$	${AssA}_{com} ↑$	${HOTA}_{unc} ↑$	${DetA}_{unc} ↑$	${AssA}_{unc} ↑$
w/ interaction	15.89	15.89	17.29	28.12	25.45	33.19	12.85	13.51	13.34
w/o interaction (ours)	22.92	23.97	23.63	35.98	35.14	39.01	19.67	21.19	19.80

* The best results are shown in bold. “w/” stands for “with”, while “w/o” stands for “without”.

Table 11. Results of TG-MCMOT using different layers to generate object identity embedding features on BURST validation set *.

Feature layers	${HOTA}_{all} ↑$	${DetA}_{all} ↑$	${AssA}_{all} ↑$	${HOTA}_{com} ↑$	${DetA}_{com} ↑$	${AssA}_{com} ↑$	${HOTA}_{unc} ↑$	${DetA}_{unc} ↑$	${AssA}_{unc} ↑$
the encoder output	16.73	17.18	17.74	30.03	28.87	33.45	13.42	14.27	13.83
the 1st layer	17.97	18.78	18.64	31.94	31.66	34.61	14.49	15.58	14.66
the 2nd layer	18.34	18.52	19.57	32.58	31.45	36.03	14.80	15.31	15.47
the 3th layer	19.74	19.76	21.15	34.70	33.30	38.24	16.02	16.39	16.90
the 4th layer	22.47	22.72	23.84	35.45	34.21	39.14	19.24	19.87	20.03
the 5th layer	17.84	18.34	18.73	32.17	31.38	35.34	14.27	15.09	14.60
the 6th layer	17.92	18.33	18.96	32.13	31.29	35.41	14.38	15.11	14.86

* The best results are shown in bold.

Table 12. Results of TG-MCMOT using different association strategies on BURST validation set *.

Association Strategy	${HOTA}_{all} ↑$	${DetA}_{all} ↑$	${AssA}_{all} ↑$	${HOTA}_{com} ↑$	${DetA}_{com} ↑$	${AssA}_{com} ↑$	${HOTA}_{unc} ↑$	${DetA}_{unc} ↑$	${AssA}_{unc} ↑$
feature + box	16.73	16.36	18.58	27.62	25.74	31.92	14.02	14.02	15.26
feature + mask	16.24	15.76	18.29	26.00	24.40	30.12	13.82	13.61	15.35
feature + box + class	22.92	23.97	23.63	35.98	35.14	39.01	19.67	21.19	19.80
feature + mask + class	22.47	22.72	23.84	35.45	34.21	39.14	19.24	19.87	20.03

* The best results are shown in bold.

Table 13. Results of TG-MCMOT using different training strategy on BURST validation set *.

Method	${HOTA}_{all} ↑$	${DetA}_{all} ↑$	${AssA}_{all} ↑$	${HOTA}_{com} ↑$	${DetA}_{com} ↑$	${AssA}_{com} ↑$	${HOTA}_{unc} ↑$	${DetA}_{unc} ↑$	${AssA}_{unc} ↑$
Train detection first	18.68	19.14	19.68	31.39	31.31	33.55	15.52	16.11	16.23
Train embedding first	19.35	19.67	20.46	33.71	32.79	36.71	15.78	16.40	16.41
Joint training (ours)	22.92	23.97	23.63	35.98	35.14	39.01	19.67	21.19	19.80

* The best results are shown in bold.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, S.; Lin, Z.; Wang, H.; Yang, W.; Liu, H. Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue. Remote Sens. 2024, 16, 3684. https://doi.org/10.3390/rs16193684

AMA Style

Li S, Lin Z, Wang H, Yang W, Liu H. Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue. Remote Sensing. 2024; 16(19):3684. https://doi.org/10.3390/rs16193684

Chicago/Turabian Style

Li, Shuman, Zhipeng Lin, Haotian Wang, Wenjing Yang, and Hengzhu Liu. 2024. "Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue" Remote Sensing 16, no. 19: 3684. https://doi.org/10.3390/rs16193684

APA Style

Li, S., Lin, Z., Wang, H., Yang, W., & Liu, H. (2024). Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue. Remote Sensing, 16(19), 3684. https://doi.org/10.3390/rs16193684

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Text-Guided Multi-Class Multi-Object Tracking for Fine-Grained Maritime Rescue

Abstract

1. Introduction

2. Related Works

2.1. Typical Multi-Object Tracking

2.2. Multi-Object Tracking in Aerial Imagery

2.3. Multi-Object Tracking in Maritime Rescue

2.4. Multi-Class Multi-Object Tracking

3. Method

3.1. TG-MCMOT

3.1.1. Overview

3.1.2. Encoder

3.1.3. Decoder

3.1.4. Segmentation Module

3.1.5. Training and Inference

3.2. TG-MCMOT for Maritime Rescue

3.2.1. Pseudo-near-InfraRed Image Generation

3.2.2. TG-MCMOT with PNIR Module

3.2.3. Cross-Frame Identity Loss

4. Experiments and Results

4.1. Datasets

4.2. Evaluation Metrics

4.3. Implementation Details

4.4. Results

4.4.1. Results on SeaDronesSee Dataset

4.4.2. Results on BURST Dataset

5. Analysis and Discussion

5.1. SeaDronesSee Dataset

5.1.1. Ablation Studies

5.1.2. Visualizations

5.1.3. Discussion on Other Methods

5.2. BURST Dataset

5.2.1. Ablation Studies

5.2.2. Visualizations

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI